The Internet as a Means of Political Communication
Analyzing Web Site Usage

The only way to analyze the use of any web site is through its log files, which, for each page served, record the time of the request, the address of the computer making the request, and several internal status codes. By analyzing these log files, it is possible to get a general picture of how a site is being used. Files can be compared by the number of times they are served, or "hit." Requests from an individual computer can be tracked to see where a user went within the site over time, and repeat visitors can be identified to some extent. Unfortunately, none of these methods allows us to make very precise judgments about the nature of site users. Several issues make it impossible to track the use of a web site with complete accuracy, especially at the level of the individual user.16
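 As a rough illustration of the hit counting described above, the following sketch parses a few log lines and tallies successful requests per page. It assumes the server writes the NCSA Common Log Format, which the text does not specify; the sample lines, addresses, and page names are invented for the example.

```python
import re
from collections import Counter

# Assumed layout: NCSA Common Log Format. The sample lines are invented.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE_LOG = """\
192.0.2.10 - - [12/Mar/1997:10:01:00 -0500] "GET /index.html HTTP/1.0" 200 2048
192.0.2.10 - - [12/Mar/1997:10:02:30 -0500] "GET /issues.html HTTP/1.0" 200 5120
198.51.100.7 - - [12/Mar/1997:10:03:00 -0500] "GET /index.html HTTP/1.0" 200 2048
198.51.100.7 - - [12/Mar/1997:10:03:05 -0500] "GET /missing.html HTTP/1.0" 404 0
"""


def parse_log(text):
    """Return one dict per line that matches the assumed format."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def hits_per_page(entries):
    """Count successful (status 200) requests for each path."""
    return Counter(e["path"] for e in entries if e["status"] == "200")


entries = parse_log(SAMPLE_LOG)
hits = hits_per_page(entries)
print(hits.most_common())
```

 Counting only status 200 responses is a choice made for this sketch; an analyst might reasonably include or exclude other status codes.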
 The first problem that impedes such analysis is the way in which web browsers store, or cache, information. To reduce network traffic, most browsers store copies of recently visited documents on the local hard drive. If the user returns to a page that is still cached, the browser will read the page off the disk instead of sending a request to the remote web server. While this speeds up the web surfing experience and prevents information "traffic jams," it makes it extremely difficult to track a user's movement within a site, particularly if he or she is likely to return frequently to pages that have already been visited. Thus it is impossible to follow exactly in a user's footsteps as he or she navigates through a site. It is also impossible to judge the amount of time that a user spends reading any given page. Because of caching, the time between one request and the next is not necessarily indicative of a user's time on that page: the user may have revisited cached pages in the interval, or may have left the computer and come back minutes later to load a new page without ever reading the first one.
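 To make the timing problem concrete, the sketch below computes the gap between consecutive logged requests from one address. The request data and page names are invented. The gap is at best an upper bound on reading time, since the log cannot show cached page views or time spent away from the computer.

```python
from collections import defaultdict
from datetime import datetime

# Invented requests as they might appear in a log: (address, timestamp, path).
REQUESTS = [
    ("192.0.2.10", "1997-03-12 10:01:00", "/index.html"),
    ("192.0.2.10", "1997-03-12 10:02:30", "/issues.html"),
    ("192.0.2.10", "1997-03-12 10:20:30", "/contact.html"),
]


def gaps_by_address(requests):
    """Map each address to (path, seconds until its next logged request)."""
    by_addr = defaultdict(list)
    for addr, stamp, path in requests:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_addr[addr].append((when, path))

    gaps = {}
    for addr, visits in by_addr.items():
        visits.sort()  # chronological order
        gaps[addr] = [
            (path, (later - earlier).total_seconds())
            for (earlier, path), (later, _next_path) in zip(visits, visits[1:])
        ]
    return gaps


gaps = gaps_by_address(REQUESTS)
print(gaps)
```

 Here the 18-minute gap after /issues.html could mean careful reading, a return to the cached /index.html, or an empty chair; the log alone cannot distinguish these cases.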
 The nature of computer addresses also makes it difficult to identify individual users. Because the web server logs only the network address of the computer requesting a file, it has no way to distinguish among different people accessing the site from the same computer.17 This problem manifests itself in a number of ways. The most basic is the situation of a multi-user household or office in which two or more people use the same computer to access the Internet. For example, if three co-workers each access information from the same web server, the server views them as the same user, because the address of the requesting computer is the same.
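 The co-worker example can be sketched as follows. The names, addresses, and pages are invented; the "person" column represents ground truth that a real server never records, and is included only to show how far the apparent visitor count can fall below the actual one.

```python
# Ground truth the server never sees: who was sitting at each computer.
# All names, addresses, and paths are invented for illustration.
ACTUAL_REQUESTS = [
    ("alice", "203.0.113.5", "/index.html"),
    ("bob", "203.0.113.5", "/issues.html"),
    ("carol", "203.0.113.5", "/index.html"),
    ("dave", "198.51.100.7", "/index.html"),
]


def logged_view(requests):
    """Drop the person column, leaving only what a server log would record."""
    return [(addr, path) for _person, addr, path in requests]


def apparent_visitors(log):
    """Count distinct addresses, the only visitor identity the log offers."""
    return len({addr for addr, _path in log})


actual_people = len({person for person, _addr, _path in ACTUAL_REQUESTS})
apparent = apparent_visitors(logged_view(ACTUAL_REQUESTS))
print(actual_people, apparent)
```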
 The same problem arises on a larger scale in many corporate computing settings, in which all of a company's Internet traffic is routed through one central computer, known as a firewall or proxy server, for security reasons. To keep the amount of traffic passing through it to a minimum, many such systems cache copies of frequently requested Internet documents. When a user inside the corporate network requests one of these documents, it is served from the local proxy server instead of from the web server where the information originates, and no hit is logged. Requests that do reach the web server carry the proxy's address rather than the user's. The most notable use of proxy servers is by large Internet service providers, such as America Online and WebTV, which use them extensively to provide more consistent service to their customers. All AOL web browsing is funneled through its proxy servers, and thus it is impossible to track any of the service's more than eight million members18, who make up a significant portion of the web surfing community.
 A third, related problem occurs when users must use a modem to dial into their service provider's network. Each time a user connects, he or she is assigned an address from the provider's pool of available addresses. Because each provider has a finite number of assignable addresses, smaller than its number of users, the addresses are reassigned and recycled with each new connection. Thus, multiple dial-in users can be assigned the same address over the course of a day. In the case of large, national service providers, these users could even be geographically distant.
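 The recycling of dial-in addresses can be sketched with a toy address pool. The class name, the user labels, the addresses, and the policy of handing out the lowest free address are all assumptions made for illustration; real providers may allocate addresses differently.

```python
class AddressPool:
    """Toy model of a dial-in provider's pool (hypothetical, for illustration)."""

    def __init__(self, addresses):
        self.free = sorted(addresses)
        self.in_use = {}  # user -> address currently held
        self.history = {addr: [] for addr in addresses}  # address -> users over time

    def connect(self, user):
        # Assumed policy: hand out the lowest free address.
        # Raises IndexError if every address is in use.
        addr = self.free.pop(0)
        self.in_use[user] = addr
        self.history[addr].append(user)
        return addr

    def disconnect(self, user):
        addr = self.in_use.pop(user)
        self.free.append(addr)
        self.free.sort()


pool = AddressPool(["10.0.0.1", "10.0.0.2"])
pool.connect("boston_user")
pool.connect("denver_user")
pool.disconnect("boston_user")
reused = pool.connect("seattle_user")
print(reused, pool.history[reused])
```

 In a server log, the requests from the Boston and Seattle users would appear to come from a single visitor, since both held the same address at different times of day.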
 Thus, it is virtually impossible to gain a completely accurate understanding of how any web site is being used, especially at the level of individual users' movements. However, it is safe to assume that these trends in proxy serving and address sharing do not have a significant effect on the proportion of hits received by different pages on the same site, nor on the relative number of hits received by two different sites. If we assume that the actions of users who are behind proxy servers are relatively similar to those of users with direct access to the sites, and there is no reason to believe that they behave any differently on the whole, then we can safely compare the number of raw hits between different pages on the same site. While we cannot paint a completely accurate picture of site usage, we can compare the aggregate and daily totals for individual pages and groups of pages, and draw conclusions about the nature of site usage from those relative comparisons, rather than trying to infer usage levels from raw numbers of hits.
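 The reasoning behind relative comparison can be checked with a small numerical sketch. The page names and request counts are invented, and the code assumes, as the argument above does, that caches and proxies absorb the same percentage of requests for every page. Under that assumption the raw totals shrink but each page's share of the hits is unchanged.

```python
# Invented "true" request counts, including requests the server never sees.
TRUE_REQUESTS = {"/index.html": 600, "/issues.html": 300, "/contact.html": 100}


def logged_hits(true_requests, percent_reaching_server):
    """Hits the server would log if the same percentage of requests
    for every page got past browser and proxy caches (an assumption)."""
    return {
        page: count * percent_reaching_server // 100
        for page, count in true_requests.items()
    }


def shares(hits):
    """Each page's fraction of the total hits."""
    total = sum(hits.values())
    return {page: count / total for page, count in hits.items()}


logged = logged_hits(TRUE_REQUESTS, 50)
true_shares = shares(TRUE_REQUESTS)
logged_shares = shares(logged)
for page in TRUE_REQUESTS:
    print(page, TRUE_REQUESTS[page], logged[page],
          round(true_shares[page], 2), round(logged_shares[page], 2))
```

 If caching rates differed sharply from page to page, the logged shares would drift from the true ones, which is why the comparison rests on the assumption of similar behavior across users.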

©1997 David W. MacLeay