Web authoring and credibility is one of the serious challenges on the Internet. With the continuous expansion of Internet services, especially when they are used to communicate important and sensitive information, there is a need to verify the content and authors of Web pages. In this paper, we evaluate the elements required to assess the credibility of Websites and pages based on Website, Web page, and author credibility metrics. A case study of selected Websites in Jordan is used to assess the proposed credibility metrics. Results showed that there are many metrics for measuring trust in a Website or a Web page. Results also showed the need for a clear standard to evaluate the authenticity and credibility of Website content.
Keywords: information retrieval, Website credibility, trust rank, content authentication
Research papers and publications are important indicators of the ability of an author or an academic community to conduct research projects in the different fields of human science. In general, the number of publications, and its growth over time, is a direct indicator of the volume of research activity of a particular author or university. Nonetheless, the number of publications alone has been shown to be a limited indicator of the impact of those publications. The number of citations of a particular paper is more relevant and important than the raw number of publications. This is why early citation indices such as the H-index and G-index gave more weight to the number of citations than to the number of publications.
The changing nature and huge size of the Web have shed light on information retrieval systems. It has become increasingly difficult to retrieve the Web pages a user needs: searching for a query should return a minimum number of irrelevant pages while supporting the desired search features, such as file type, domain, required words, and so on. To address this, programs called spiders have been built to retrieve the desired Web pages automatically.
Crawlers, or spiders, are automated tools that parse Websites and retrieve their pages and contents. Users' needs are dynamic, and over time they may need to reuse Web pages they have downloaded before. Two types of crawlers address this: a batch crawler, which does not allow duplication and instead returns the last snapshot of the page the user downloaded, and an incremental crawler, which allows duplicate occurrences of Web pages and treats crawling as a continuous process.
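The batch-crawler behavior described above can be sketched as a minimal breadth-first frontier loop. This is an illustrative sketch, not the implementation of any cited system; `fetch` is a placeholder for the real download-and-parse step.

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    # Breadth-first frontier loop. `fetch(url)` stands in for the real
    # download-and-parse step and returns the outlinks found on the page.
    frontier = deque(seed_urls)
    seen = set(seed_urls)          # batch mode: each page fetched at most once
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        pages.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

An incremental crawler would instead re-enqueue pages periodically and drop the `seen` check, accepting repeated visits to keep its snapshot fresh.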
There are several parameters for measuring crawler performance:
1. The importance of the page, which is measured by keywords (unique terms or their frequency), similarity to a user query description, similarity to seed pages (via the cosine similarity of the relevant ones), classifier scores given based upon the classifier's existing knowledge, retrieval-system ranking, which uses many crawlers, and the popularity of the link, which uses the PageRank or HITS algorithm.
ISBN: 978-0-9853483-3-5 ©2013 SDIWC
2. Precision and recall.
Unreliable information has misled Internet users who rely on the Web as a major source of knowledge. Search engines focus on retrieving the Web pages that are most popular and most relevant to the user query without taking into consideration the credibility of those pages.
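The cosine similarity used above for comparing a page against seed pages or a query description can be illustrated with a short sketch over raw term-frequency vectors. Real retrieval systems would typically use TF-IDF weights; this version only shows the measure itself.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Raw term-frequency vectors; production systems would use TF-IDF.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical texts score 1.0 and texts with no shared terms score 0.0, which is what makes the measure usable as a crawler relevance signal.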
Many studies and algorithms focus on measuring page rank, the relevance of results to the user query, and the behavior of users accessing the Web, using data mining techniques.
This research aims to study the credibility of Web pages by translating credibility guidelines into several measurements.
We will study the credibility from three perspectives:
1. Domain or Website: measured in terms of domain age, number of indexed pages in various search engines such as Google, inlinks, outlinks, number of broken links, Website size, number of authenticated pages, trust rank, popularity, traffic, number of materials and publications, number of contacts, freshness, and age.
2. Web page/file: each web page or file will be measured in terms of freshness, popularity, trust, inlinks, outlinks, and age.
3. Author: authors will be measured by, for example, the number of citations and the number of indexed pages in Google.
2 RELATED WORKS
The increasing number of Web users has led to dynamic changes in confidence in the Web in many areas. Users recognize the credibility of Web pages as a measurement of their quality, and many are concerned with finding adequate and trusted sources of information in order to gain the desired knowledge. Researchers have studied many elements of credibility, such as freshness and publication dates (P-dates) of Web pages; these are extremely important for verifying the quality of Web content, where older Web pages are supposed to be a strong indication of confidence for users. Yet most search engines rely on the relevance between the user query and Web page content when retrieving results, without considering novelty. Another issue studied to indicate the importance of credibility is finding the citations of published papers, which helps users evaluate academic research and judge the strength or weakness of an author; obtaining the citations of an author is a challenging issue.
In , the authors study the freshness of a Web page in terms of two elements, the page's own freshness and the freshness of its inlinks, and then apply a temporal correlation.
In , the authors developed a temporal ranking algorithm; their work beats the PageRank algorithm by 17.8% and 13.5% in terms of NDCG. The limitation of their work is that not all Web pages of a given Website are indexed or archived.
In , the author defined five elements of trusted Web pages. Three of them, expertise, experience, and impartiality, express the relation between a user and a topic, while affinity and track record express the relation between two users. The authors developed the Hoonoh ontology, which stands on "who" and "know", to highlight the relations involved in surfing for trusted information. They built a search engine using the Hoonoh ontology to help users seek trusted information on the Web and to provide them with worthwhile suggestions and directions regarding their search query.
In , the authors developed a supervised machine learning method to investigate P-dates, using linguistic information and layout information extracted from the Document Object Model (DOM) tree of Web pages as learning features. Experiments show that the developed model achieves a better F1 score for English and Chinese Web pages on three types of dates: first, last, and latest. A page ranking model was then improved by combining the P-dates with relevance scores between the user query and the page content and with scores of the page's importance.
In a preprocessing phase, the Web page is taken as input and extracted as a series of units, each consisting of a temporal element and text content; the output is represented as a DOM tree. In the training phase, each P-date is assigned a score. In the postprocessing phase, P-dates are extracted using heuristic rules based upon the following elements:
1. Linguistic information, including temporal elements, the count of numerical characters, the count of alphabetic characters, and words that point to the meaning of publication, such as "updated", "published", and so on.
2. The location of the unit on the Web page, for example before the title, after the title, at the bottom, or at the end.
3. The format of information, such as font type, alignment, and font size.
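A hypothetical heuristic in the spirit of these rules: score each extracted unit by whether it contains a date pattern and a publication cue word, then take the date from the best-scoring unit. The regular expression and cue list here are illustrative placeholders, not those of the cited work.

```python
import re

# Illustrative patterns, not the ones from the cited work.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}\s+\w+\s+\d{4})\b")
CUE_WORDS = ("published", "updated", "posted")

def score_unit(text):
    # A unit scores higher when a date pattern co-occurs with a cue word.
    score = 1 if DATE_RE.search(text) else 0
    if any(cue in text.lower() for cue in CUE_WORDS):
        score += 1
    return score

def extract_pdate(units):
    # Return the date string from the highest-scoring unit, if any.
    best = max(units, key=score_unit, default="")
    match = DATE_RE.search(best)
    return match.group(0) if match else None
```

The supervised model in the cited work learns such cues from labeled data instead of hard-coding them, which is what lets it also weigh location and formatting features.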
Then the page ranking is calculated according to the following formula:
rank(i) = α · sim(i, q) + β · f(i) · PageRank(i)
The limitation of this work is that the implicit P-date of the Web page is not considered.
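The ranking formula above is a weighted combination of query relevance and freshness-scaled popularity, and can be transcribed directly. The weight values below are placeholders, since the tuned α and β are not given here.

```python
def rank(sim, freshness, pagerank, alpha=0.5, beta=0.5):
    # rank(i) = alpha * sim(i, q) + beta * f(i) * PageRank(i)
    # alpha/beta trade query relevance against freshness-scaled popularity;
    # 0.5/0.5 are illustrative defaults, not the paper's tuned weights.
    return alpha * sim + beta * freshness * pagerank
```

Note that freshness f(i) multiplies PageRank rather than being added on its own, so a popular but stale page is demoted proportionally.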
In , the authors proposed an approach for obtaining the citations of an author by using his/her name and some vocabulary extracted from the titles of the published articles. Their approach is applied using Google Scholar, with a filter on the data as a preprocessing phase. Their work gives an average sensitivity of 98% and specificity of 72% over traditional search.
The limitation of their work relates to the accuracy obtained with the vocabulary filter. They recommended applying other types of word filtering, such as handling plurals and misspellings, or implementing a clustering technique as a preprocessing phase.
In , the authors proposed a system to help users judge the credibility of Web search results and search for credible Web pages, providing them with brief knowledge of a given topic. Conventional Web search engines present only titles, snippets, and URLs, which give few clues for judging the credibility of search results.
Moreover, the ranking algorithms of conventional Web search engines are often based on relevance and popularity. They implemented three functions: 1) computing and visualizing credibility scores of Web pages, 2) using users' credibility feedback to estimate a credibility decision model of users, and 3) re-ranking Web pages based upon that feedback.
In , the author proposed an approach for measuring the credibility of Web articles using Wikipedia articles, for two reasons: first, their public use by students and researchers, and second, Wikipedia is a free online encyclopedia. 200 articles were selected for testing; the key sentences of each article were extracted and assigned a score that considers natural language processing elements such as text similarity and word count, and credibility was also measured using the PageRank algorithm. The key sentences were tested using Google. According to the author, the summary findings are:
1. Google does not retrieve credible search results based on the key sentences of an article.
2. Google returns untrusted and unrelated Web pages.
3. The key sentences retrieve credible Web pages when there is an exact match, but perform poorly with partial matching of the Web pages retrieved by Google.
4. Credibility differs when a key sentence uses different words or synonyms, or contains more or fewer words.
5. The key sentences may not be clear.
6. Some key sentences depend on the trustworthiness of the author, because they are used in a specific domain.
The following is a list of studies that developed trust metrics, each focusing on one perspective and neglecting the rest.
In this paper, we will integrate some of these metrics and assign a score for each website.
1. Compete Rank: an online project that provides users with the traffic of a Website and its usage, through the number of visitors.
2. Search Engine Optimization (SEO) scores: researchers developed a formula that uses the content of the Website, such as the number of links, images, and unique terms, to measure its credibility.
3. Alexa Traffic: an online project that provides users with the traffic of a Website, universally and locally, and the top 100 Websites linking to it.
4. Wayback Machine (WBM): it has archived more than 150 billion Web pages since 1996 and provides important metrics such as the number of indexed Web pages of a given domain, the domain age, and how frequently a given Web page is updated.
5. PageRank: illustrated in the related works section.
6. The number of indexed pages of a given domain, which is an indicator of its credibility.
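One simple way to integrate such metrics into a single score per Website, as this paper proposes, is a weighted average of min-max-normalised values. The metric names and weights below are hypothetical; the paper does not fix a particular weighting.

```python
def credibility_score(metrics, weights):
    # Weighted average of min-max-normalised metric values.
    # `metrics` maps name -> (value, min_seen, max_seen) over the dataset;
    # `weights` maps name -> hypothetical importance weight.
    total = wsum = 0.0
    for name, (value, lo, hi) in metrics.items():
        w = weights.get(name, 0.0)
        norm = (value - lo) / (hi - lo) if hi > lo else 0.0
        total += w * norm
        wsum += w
    return total / wsum if wsum else 0.0
```

Normalising each metric to [0, 1] first keeps a large-range metric such as traffic from swamping small-range metrics such as domain age.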
3 WEBSITE CREDIBILITY
Website credibility is an indicator of how much to trust or believe what a Website says (i.e., its content). It consists of two elements. The first is trustworthiness, where terms such as unbiased, truthful, good, and honest are attributed to the Website. The second includes terms related to the level of expertise, such as experienced, intelligent, powerful, and knowledgeable. It is also agreed that credibility is a "perceived quality".
The aim of this paper is to highlight metrics for assessing the credibility of Websites in order to provide users with important clues about a particular Website. We used a case study of several Websites from Jordan selected from three sectors: universities, banks, and e-government. Tables 1 and 2 show a sample of credibility-related metrics measured for several Websites of universities, banks, and e-government entities in Jordan. These sectors were selected because their Websites should provide highly credible information, and the entities that own them are liable for announcing any possible incorrect information.
Results showed that universities obtain higher trust ranks than banks and e-government Websites, due to several factors such as the large number of possible audience members, the age of those Websites, their popularity, etc.
Table 1: Metrics related to credibility measured for several Jordanian universities (columns: university, visits, SEO score, Alexa traffic in Jordan, sites linking in, age in days, indexed pages in Google, trust rank).
Table 2: Metrics related to credibility measured for several Jordanian banks and e-government Websites (columns: site, visits, SEO score, Alexa traffic in Jordan, sites linking in, age in days, pages in Google, trust rank).
4 RESULTS AND DISCUSSION
We used data mining prediction to evaluate which metric(s) have a significant impact on calculating credibility for a particular Website. It should be mentioned, however, that experiments in this area are still immature; the trust rank metric may be calculated from some specific attributes while ignoring several others that should also be considered in the future. To convert the trust rank metric to a categorical attribute, we divided its values heuristically into three levels: values less than 1 are labeled low, values between 1 and 3 medium, and values above 3 high.
Figure 1 shows the result of applying the J48 prediction algorithm to the collected dataset, using the trust rank metric as the class label. The figure shows that trust rank depends solely on one attribute, the number of indexed pages in Yahoo. This may indicate that the Website calculating trust rank actually takes its data from the Yahoo page count. As explained earlier, future formulas should take all relevant attributes into consideration rather than focusing on one attribute, which may bias the results. Figure 2 shows the accuracy of the predicted rank; recall and precision are high.
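The root-attribute selection that J48 (WEKA's C4.5 implementation) performs can be approximated with a simple decision-stump search: for each attribute, find the single threshold split that best predicts the class label. This toy version uses plain accuracy rather than J48's information gain ratio, and the data in the example is illustrative, not the study's dataset.

```python
def stump_accuracy(values, labels, threshold):
    # Accuracy of predicting the majority label on each side of a split.
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    correct = 0
    for side in (left, right):
        if side:
            correct += max(side.count(l) for l in set(side))
    return correct / len(labels)

def best_root_attribute(data, labels):
    # Pick the attribute whose best single-threshold split predicts the
    # class label most accurately -- a toy stand-in for J48's root choice,
    # which uses information gain ratio rather than accuracy.
    best_attr, best_acc = None, 0.0
    for attr, values in data.items():
        for t in values:
            acc = stump_accuracy(values, labels, t)
            if acc > best_acc:
                best_attr, best_acc = attr, acc
    return best_attr
```

When one attribute separates the classes perfectly, as the Yahoo index count did in Figure 1, this search selects it as the root and no other attribute appears in the tree.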
Figure 1: J48 trust rank prediction results.
Figure 2: Trust rank prediction performance metrics.
Tables 1 and 3 show that current trust rank metrics highly, and possibly solely, depend on popularity- and traffic-related metrics. While popularity should of course be an important criterion for trust in a Website, since a high number of visitors means the Website is known and trusted, it should not be the only or the major criterion.
In this paper, we focused only on trust rank metrics related to the whole Website in general. However, our preliminary investigations showed a triangle of three factors that may impact a Website's trust rank: the Website itself, the Web pages it contains, and the authors who write in it. Each of these three may have unique attributes that define its own trust metric, which may in turn impact the trust rank of the others. For example, authors with high trust ranks usually write or post on Websites that also have high trust ranks, and vice versa.
In the second experiment, we studied the effect of the class labels assigned to the trust rank values taken from the original trust rank Website (http://www.seomastering.com/trust-rank-checker.php).
Table 3: Trust rank class labels
Range | Label
More than 5 |
Less than 3 |
Figure 3 also agrees with our previous finding that the Yahoo backlinks metric is a major factor in deciding the trust rank metric. It also shows other parameters related to traffic (i.e., the Alexa and visitor metrics).
Figure 3: Trust rank prediction
Figure 4 shows the decision tree for trust rank based on the domain type. For the three Jordanian domains that were picked (i.e., universities, ministries, and banks), results show that universities have the highest rank values in comparison to the other two domains. Results also showed that, this time, popularity or PageRank value was the first attribute in distinguishing trust rank by domain.
Figure 4: Trust rank based on domain types
In general, results confirmed two major points regarding the trust rank metric:
There is a clear, high dependence of trust rank on popularity metrics. While popularity should be a major factor, it should not be the sole basis for judging trustworthiness. It is possible that, since those metrics are easy to collect and less subjective, they are the first to be considered.
While the trust rank checker Website claims to base its formula on several other factors, our results and statistics could not substantiate that claim.
In this paper, we evaluated metrics related to the credibility and authenticity of Websites and pages. These metrics are indicators of the level of confidence and trust users should have in the Websites they visit and in their content. Results showed that the issue is very complex: while we listed several important metrics to evaluate, the process of evaluating credibility can still be far more complicated. Results also showed that credibility is an integral process among three major dimensions of a Website: credibility related to the Website itself, credibility related to its Web pages and their content, and credibility related to the authors of the Website's and pages' content.