Website Data And Collection Computer Science Essay


Clickstream data is essentially information about a visitor's path through a website and how long they spend on each page they view. From this, analysts can determine visitor behaviour patterns and website performance and engagement.

There are several issues with this information regarding the collection mechanisms.

Almost all OLAP tools calculate the time a visitor spends on a page by computing the difference between the time that page was viewed and the time the next page was viewed. This works for every page view in a session except the last: there is no way to tell how long a visitor spent on the final page of their session. The data collection mechanism will usually just terminate a session after thirty minutes of inactivity.

The same logic applies to the first page viewed if it is the only page requested in a session. There is no way to tell how long the visitor spent on that first page before leaving. They may have found exactly what they wanted and expected, then left; or they may have disliked some aspect of the site and left immediately. There is no way to tell with current standard data collection mechanisms.
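The time-on-page calculation described above can be sketched as follows. The session data, function name and timestamps are all hypothetical; the point is that the final page's duration comes out as unknown because there is no subsequent request to subtract from.

```javascript
// Sketch of time-on-page derived from consecutive page-view timestamps
// in a single session (all data here is hypothetical).
function timeOnPage(pageViews) {
  // pageViews: array of { url, timestamp } in milliseconds, sorted by time
  return pageViews.map((view, i) => ({
    url: view.url,
    seconds: i < pageViews.length - 1
      ? (pageViews[i + 1].timestamp - view.timestamp) / 1000
      : null, // last page: no later request, so duration is unknowable
  }));
}

const session = [
  { url: "/home", timestamp: 0 },
  { url: "/products", timestamp: 45000 },
  { url: "/checkout", timestamp: 105000 },
];
console.log(timeOnPage(session));
// the final entry's seconds is null - the tool simply cannot measure it
```

The same null result appears for a single-page session: its only view is also its last, so no duration can be computed at all.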

Most data collection mechanisms use cookies to uniquely identify visitors. This is a reasonable method of identifying individual machines, but unless a visitor has provided information regarding their identity (usually by signing up or logging in) there is no sure way to identify the person sitting behind the machine. A related weakness is that cookies cannot physically follow people around. For example, one person may regularly visit a website from home, from work and from an internet cafe. Each location would be logged as a separate unique visitor by any data collection mechanism, when in reality it is the same person accessing the site.

Users deleting cookies can cause similar issues. The one sure way to link sessions to each other is to encourage visitors to sign up or log in; only after such information has been provided can we be confident of a visitor's identity (and even then there is no guarantee that the visitor is not sharing credentials with someone else).

Page views are generally tracked by URL, but as pages become more dynamic, more nonsensical fields appear in URL query strings. This is especially true of Amazon, given its highly dynamic nature. This poses a problem when analysing page data, since every unique URL is generally reported on separately.
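One common workaround is to normalise URLs before reporting, keeping only the query parameters known to distinguish real pages and discarding session or tracking noise. The sketch below is illustrative; the parameter whitelist and URLs are hypothetical.

```javascript
// Sketch: collapse dynamic URLs to a canonical form before reporting,
// keeping only parameters known to identify distinct pages.
// The "id" whitelist and the URLs are hypothetical.
function canonicalUrl(rawUrl, keepParams = ["id"]) {
  const url = new URL(rawUrl);
  const kept = new URLSearchParams();
  for (const name of keepParams) {
    if (url.searchParams.has(name)) {
      kept.set(name, url.searchParams.get(name));
    }
  }
  const query = kept.toString();
  return url.origin + url.pathname + (query ? "?" + query : "");
}

// Two requests for the same logical page, differing only in
// session/referrer noise, now report as a single URL:
console.log(canonicalUrl("http://example.com/product?id=42&sess=a9f3&ref=home"));
console.log(canonicalUrl("http://example.com/product?id=42&sess=0b77&ref=mail"));
// both print: http://example.com/product?id=42
```

Without a step like this, every combination of incidental parameters would be reported as a separate page.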

Data Collection Mechanisms

The two main competing mechanisms for collecting web data are the more traditional web server log files (using parsers for analysis) and the newer JavaScript page tagging method. Other, less popular methodologies exist, such as network data collection, i.e. packet sniffers, which operate by gathering low-level network traffic. These are unconventional, however, and their limitations are similar to those of log files, so they are not discussed in detail.

I will try to contrast the more popular methodologies in the following section.

There are two primary types: page tagging (client-side data collection) and web log files (server-side data collection).

Log Files

Log files accumulate records of the requests made to web servers by visitors' web browsers. When a visitor makes a request to a server for a web page or image, each request is recorded in the web server log file. Typically, log files will also record such things as error messages, status messages and database transaction details. The following is an example of a log file generated by an IIS web server.
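A hypothetical illustration of such an entry, in the W3C extended format commonly used by IIS (the field set and all values here are made up for illustration):

```
#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes cs(User-Agent)
2010-11-02 14:27:05 192.0.2.10 GET /index.html 200 10432 Mozilla/4.0+(compatible;+MSIE+8.0)
```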

Log files have several advantages in the type of data they can provide, some of which cannot be tracked via page tags. These include:

Log files contain information on visits from search engine spiders. Although these visits should not be reported as part of human activity, knowing the usage patterns of spiders can be valuable when engaging in search engine optimisation (SEO). This data can be used to optimise the content of the site for those spiders.

Log files make it possible to compare the number of file downloads that completed successfully with the number that were not fully completed.
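One way this comparison might be made, sketched below, is to compare the bytes actually sent (which the server logs) against the file's full size. The entry shape, field names and file sizes are all hypothetical.

```javascript
// Sketch: classify logged download requests as complete or abandoned
// by comparing bytes sent against the file's known size.
// All entries, field names and sizes here are hypothetical.
function downloadStats(entries, fileSizes) {
  let completed = 0;
  let abandoned = 0;
  for (const entry of entries) {
    if (entry.bytesSent >= fileSizes[entry.url]) {
      completed++;       // the whole file was transferred
    } else {
      abandoned++;       // connection dropped or download cancelled
    }
  }
  return { completed, abandoned };
}

const fileSizes = { "/files/manual.pdf": 1048576 }; // 1 MiB
const entries = [
  { url: "/files/manual.pdf", bytesSent: 1048576 }, // full download
  { url: "/files/manual.pdf", bytesSent: 20480 },   // abandoned part-way
];
console.log(downloadStats(entries, fileSizes)); // { completed: 1, abandoned: 1 }
```

A page tag, by contrast, can usually only record that the download link was clicked.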

The log file data remains on the company's own servers, allowing historical data to be re-analysed with new programs; a hosted solution must accumulate data before any analysis can be undertaken.

Server error code data is recorded automatically in most log files and can provide useful information about site functionality and expose design flaws that would be difficult to detect through other means.

Log files themselves have several other advantages over page tagging including:

Log files are "built in" to the protocol, so almost every web server already produces them; the raw data is therefore already available for analysis, and many ISPs provide a free licence for a log parsing and analysis tool. Any other data collection method requires some software or hardware change to implement.

Log files are a server-side data collection mechanism and thus require no additional DNS lookups or requests to external data collection servers on the client side, which can slow down page loads or result in uncounted page views if a connection to the external server cannot be established.

These advantages are severely overshadowed by the flaws and limitations of log files. The main issues are highlighted below:

Log files can grow incredibly large, so a large amount of disk space is necessary to allow for this growth, and a large amount of computing power is needed to analyse them.

Data may not be centralised. If a load balancing system is in place, whereby one website is served by multiple machines, then each machine will produce a separate log file residing on a separate disk. These must be combined before any analysis can take place, which requires even more processing.

Reports are usually generated in batches, meaning that they report on historical data rather than in real time, which is a major issue with log file analysis.

It is impossible to track events that occur within HTML controls using log file analysis; this ability appeared with the addition of JavaScript tags to web pages.

Log file analysers identify (unique) visitors by their IP addresses. This was useful in the 1990s, when ISPs assigned static IP addresses, but now most connections use a dynamic IP. This makes it impossible to link sessions without using another mechanism, e.g. a cookie left on the client's machine which identifies that visitor as unique to the site; but even cookies have their limitations in linking sessions.
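A sketch of how a cookie-based visitor ID can be combined with the usual thirty-minute inactivity timeout to group hits into sessions. The cookie values, timestamps and function name are hypothetical.

```javascript
// Sketch: group page views into sessions per visitor cookie,
// starting a new session after 30 minutes of inactivity.
// All identifiers and timestamps are hypothetical.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000;

function sessionize(hits) {
  // hits: [{ visitorId, timestamp }] sorted by timestamp
  const sessions = new Map(); // visitorId -> array of sessions
  for (const hit of hits) {
    const visitorSessions = sessions.get(hit.visitorId) || [];
    const current = visitorSessions[visitorSessions.length - 1];
    const lastHit = current && current[current.length - 1];
    if (lastHit && hit.timestamp - lastHit.timestamp <= SESSION_TIMEOUT_MS) {
      current.push(hit); // within the timeout: same session
    } else {
      visitorSessions.push([hit]); // first hit or long gap: new session
    }
    sessions.set(hit.visitorId, visitorSessions);
  }
  return sessions;
}

const hits = [
  { visitorId: "cookie-abc", timestamp: 0 },
  { visitorId: "cookie-abc", timestamp: 10 * 60 * 1000 }, // same session
  { visitorId: "cookie-abc", timestamp: 55 * 60 * 1000 }, // 45 min gap: new session
];
console.log(sessionize(hits).get("cookie-abc").length); // 2
```

An IP-only analyser would have to use the address in place of the cookie ID, conflating everyone behind one proxy and splitting one visitor across dynamic IPs.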

Proxy servers, used by most major companies and ISPs, can create obstacles for server-side data collection mechanisms: they may prevent complete data from ever reaching the web server to be logged. Manion (2010) illustrates this point well with the following example: "If 3,000 people in a proxy group viewed a web page, the web server would only log it as one request because the proxy server requests the web page only once, and then distributes the web page to the 3,000 users in the proxy group. The result is an incomplete picture of visitor behaviour."

Similar issues are caused when a visitor's browser caches a page and the visitor uses the browser navigation buttons. Not counting cached pages can severely skew some site metrics.

Web server log analysis based systems sometimes go to extraordinary lengths to filter out machine-generated traffic. The constantly changing signature of such traffic requires an enormous ongoing effort to keep filters current. Machine-generated traffic can place the same load on web servers as human traffic and can make it difficult to understand what actual visitors are really doing. Although the appearance of robots in logs is also listed as an advantage, it is primarily useful when engaging in SEO; it is not particularly useful for analysts and can cause a lot of problems.

Page tags

Page tagging is based upon the client-side data collection methodology and has lately become the de facto standard in web analytics (Tumurcuoglu et al., 2010).

Data is gathered via a script component in the page, usually written in JavaScript, although any scripting language could be used.

Scripts execute code directly on the client's machine via interpretation by the browser's script engine. These scripts send information directly from the client to a data collection server. This allows much more detailed information to be gathered about the user and their configuration/environment, e.g. OS, browser etc.

Usually the scripts send the data to a third-party web server whose sole purpose is to collect such data and make it available for analysis. These Software as a Service (SaaS) analytics services provide the scripts for web developers to use in their web pages, in order to streamline the process of adding tags to every page in a website. The scripts can, however, be modified to suit individual needs. Examples of major SaaS analytics applications are Google Analytics, SiteCatalyst, Yahoo! Web Analytics and adCenter Analytics (beta).

If it is important that data remains on-site, page tagging may still be used. It is a common misconception that page tagging solutions are only offered by SaaS vendors; some vendors, e.g. Unica NetInsight, offer the option of page-tag-based data collection. This allows companies to retain ownership of their data if they are unwilling to share it, but may be a more expensive option than a hosted service.

A simple HTML script tag such as the following hypothetical example would be all that is necessary to collect detailed data on a user.

<script type="text/javascript" src=""></script>

This example would have limitations. Typically it would be placed in the body of the document and execute as the page loads, usually requesting an image with various parameters of interest appended to the query string. In order to perform more detailed event tracking, such as a visitor's progress through a form, JavaScript functions must be inserted into the event handlers of the form's controls. The following example shows how this would be done with a Google Analytics function.

<a href="" onClick="recordOutboundLink(this, 'Outbound Links', '');return false;"></a>

Page tagging provides many advantages over log files. These are discussed in the following section.

Since the script is embedded in the page, it will run every time the page is viewed in a browser; even if the page is retrieved from a cache on the local machine or a proxy. This overcomes one of the major limitations of log-file (and all server-side) mechanisms and allows for a more complete picture of visitor behaviour.

Page tagging can report on any event which occurs within the HTML controls on a web page, including those which do not involve a request to the web server, such as partial form completion and mouse events such as onClick, onMouseOver, onFocus etc. Such interactions are impossible to capture with server-side data collection mechanisms. Page tagging also provides the ability to collect data on interactions within Flash objects, e.g. video players.

It is a common misconception that if a visitor has JavaScript disabled in their browser, their visit to a web site may go unrecorded. In fact, the best client-side tracking applications rely on JavaScript only for their ability to track unique users. Cookies are sent with requests for web resources even when JavaScript is turned off, so it should still be possible to combine page views into sessions in this scenario.

More control can be exercised over which data elements are collected within page tags; specific variables such as customer purchase value, discounts received or any other data can be much more easily collected using page tags. If an organisation wished to capture such data using log files, it would need to be appended to a URL query string (a probable security concern).

Using a page tagging solution from a SaaS vendor offers additional advantages:

The page tagging service manages the process of assigning unique cookies to visitors for identification purposes.

It is possible for companies or departments without access to their web server to perform analysis of their user data; the only requirement is that they can somehow access the web site code and install tracking tags.

Large volumes of historical data can be kept and the concern of scaling up the amount of physical storage required for this data can be left to the vendor.

Analyses can be performed from any computer connected to the internet, even if the web site server goes offline or fails. Web server logs typically require access to the web server to perform any kind of analysis.

The disadvantages of page tagging are few, and there are often workarounds for difficult situations. There are, however, some major issues which can influence the data collected.

All pages that are to be tracked need to have the tag placed on them individually, which may take a great deal of time and effort and allows for the possibility of pages going untagged through oversight. If at a later date you wish to change software, every tag in every page would need to be replaced with the proprietary tag from your vendor of choice, again taking considerable time and effort. For this reason some consider page tagging solutions to involve vendor lock-in, but this is not necessarily true: it is possible to use tags from multiple vendors simultaneously on one page.

Cookies are used by most page tagging software to identify visitors. SaaS product vendors usually assume the responsibility of assigning cookies to visitors. However, this means that, unless a clever mechanism is used to make cookies appear to come with a web response (first party, e.g. Google Analytics), assigned cookies will be third-party cookies, i.e. originating from a server other than the web server to which the request was made. Third-party cookies are more likely to be rejected by browsers, causing inaccuracies in the collected data. Even if a visitor initially accepts cookies, there is no guarantee that a cookie will still be present when the user returns to the site, as the user could delete their cookies. However, most internet users are non-technical, and blocking or deleting cookies is of little concern to them, so this is not a major issue. It is important to note that this affects any mechanism relying on cookies and does not purely affect page tagging methods.

Most tagging solutions only allow for the tracking of the start of a download so it is unknown whether or not the download was completed successfully. This information is usually logged on a web server.

There are some situations for which page tagging is not feasible. Static content such as PDFs can be requested from a web server, but it is impossible to include scripts in these files. Some pages may be dynamically generated by an application and thus harder to alter to include tags. It is, however, possible to track the events which lead to these "difficult to track" pages, so this is not a major issue.

Hybrid Methodology

There are some situations in which more than one method of data collection is required. For example, JavaScript tagging could be used on a web site to analyse visitor behaviour; however, if it were a requirement to analyse the behaviour of search engine robots on the website, it would be necessary to use log files, since robots do not execute JavaScript and thus leave no data to be analysed in the usual way.

As another example, consider the situation where there is a "dud" link for a file download and the page has a high proportion of exits. If only page tags were used, it would be difficult to tell that visitors were unable to download the file, as the click on the link may still be recorded. But the log file would be littered with "404 - file not found" errors, which would complete the picture of visitor behaviour. Using a hybrid solution, an analyst should be able to spot such things immediately.
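Spotting the broken download in this example amounts to counting 404 responses per URL on the log-file side of the hybrid solution. A minimal sketch, using a simplified, hypothetical log format:

```javascript
// Sketch: count 404 responses per requested URL from raw log lines.
// The log format here is simplified and hypothetical:
// "<ip> <method> <url> <status>"
function count404s(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    const [, , url, status] = line.split(" ");
    if (status === "404") {
      counts.set(url, (counts.get(url) || 0) + 1);
    }
  }
  return counts;
}

const log = [
  "192.0.2.1 GET /downloads/report.pdf 404",
  "192.0.2.2 GET /index.html 200",
  "192.0.2.3 GET /downloads/report.pdf 404",
];
console.log(count404s(log));
// Map { '/downloads/report.pdf' => 2 } - the "dud" link stands out
```

The page-tag data alone would show only the clicks on the link, with no hint that the requests were failing.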

From the above summary it is obvious that page tags offer solutions to most of the disadvantages of log file analysis, and vice versa, which is why I believe hybrid systems using both approaches to complement each other can instil higher confidence in the accuracy of the data collected and offer more choice in the type of data available for analysis. Additionally, hybrid solutions offer a medium for comparing data for verification purposes.