Analyzing Web Site Usage
The only way to analyze the use of any web site is through its log
files, which, for each page served, record the time of the request, the address
of the computer making the request, and several internal status codes. By
analyzing these log files, it is possible to get a general picture of how a
site is being used. Files can be compared by the number of times they are
served, or "hit." Requests from an individual computer can be tracked to see
where a user went within the site over time, and repeat visitors can be
identified to some extent. Unfortunately, none of these methods allows us to
make very precise judgments about the nature of site users. There are a
number of issues which make it impossible to track with complete accuracy the
use of a web site, especially at the level of the individual user.16
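As a rough illustration of the kind of hit counting described above, the following sketch parses a few log lines and tallies successful requests per page. It assumes the NCSA Common Log Format, which the text does not name; the sample entries, addresses, and the function names parse_line and count_hits are invented for the example.

```python
import re
from collections import Counter

# Assumed layout: NCSA Common Log Format (host, identity, user, time,
# request line, status code, size). The source does not name a format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

def count_hits(lines):
    """Count successful (status 200) requests per requested path."""
    hits = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "200":
            hits[entry["path"]] += 1
    return hits

# Made-up sample entries, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/1997:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/1997:13:56:02 -0700] "GET /news.html HTTP/1.0" 200 5120',
    '198.51.100.7 - - [10/Oct/1997:14:01:11 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.7 - - [10/Oct/1997:14:01:40 -0700] "GET /missing.html HTTP/1.0" 404 210',
]

if __name__ == "__main__":
    for path, n in count_hits(SAMPLE_LOG).most_common():
        print(path, n)
```

Counting only status 200 responses is one possible choice; a real analysis might also need to decide how to treat redirects, images, and failed requests.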
The first problem that impedes such analysis is the way in which web browsers
store, or cache, information. In order to reduce the amount of network traffic,
most browsers store copies of recently visited documents on the local hard
drive. If the user returns to a site that is still cached, the browser will
read the page off the disk instead of sending a request to the remote web
server. While this speeds up the web surfing experience and prevents
information "traffic jams," it makes it extremely difficult to track a user's
movement within a site, particularly if he or she is likely to be returning
frequently to pages that have already been visited. Thus it is impossible to
follow exactly in a user's footsteps as he or she navigates throughout a site.
It is also impossible to judge the amount of time that a user spends reading
any given page. Because of caching, the time between one request and the next
is not necessarily indicative of a user's time on that page. The user could
even have left the computer and then come back minutes later and loaded a new
page without ever reading the first one.
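The limits of timing data can be seen in a small sketch that computes the gap between successive requests from one address. The timestamps use the Common Log Format style, which is an assumption, and the request list and the function name gaps_between_requests are hypothetical. Each gap is at best an upper bound on reading time, and any revisit served from the browser cache never appears in the list at all.

```python
from datetime import datetime

# Assumed timestamp style: Common Log Format, e.g. 10/Oct/1997:13:55:36 -0700
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def gaps_between_requests(requests):
    """Given (timestamp, path) pairs logged for one address, return a list of
    (path, seconds until the next logged request) in time order."""
    parsed = sorted(
        (datetime.strptime(stamp, TIME_FORMAT), path) for stamp, path in requests
    )
    gaps = []
    for (start, path), (following, _next_path) in zip(parsed, parsed[1:]):
        gaps.append((path, (following - start).total_seconds()))
    return gaps

# Made-up requests from a single address, for illustration only.
REQUESTS = [
    ("10/Oct/1997:13:55:36 -0700", "/index.html"),
    ("10/Oct/1997:13:56:02 -0700", "/news.html"),
    ("10/Oct/1997:14:20:02 -0700", "/contact.html"),
]

if __name__ == "__main__":
    for path, seconds in gaps_between_requests(REQUESTS):
        print(path, seconds)
```

The 24-minute gap after /news.html could mean careful reading, a cached return to /index.html, or an empty chair; the log cannot distinguish among them.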
The nature of computer addresses also makes it difficult to identify
individual users. Because the web server logs only the network (IP) address of the
computer requesting a file, it has no way to differentiate between different
people accessing the site from the same computer.17 This problem manifests itself in a number
of ways. The most basic is the situation of a multi-user household or office in
which two or more people use the same computer to access the Internet. For
example, if three co-workers each access information from the same web server,
the server views them as the same user, because the address of the requesting
computer is the same.
The same problem appears on a larger scale in many corporate computing settings,
in which all of a company's Internet traffic is routed through one central
computer, known as a firewall or proxy server, for security reasons. In order to
keep the amount of traffic going through the firewall to a minimum, many such
systems cache copies of frequently requested Internet documents. When a user
inside the corporate network requests one of these documents, it is served from
the local proxy server instead of from the Internet web server where the
information originates, and no hit is logged. The most notable use of
proxy servers is by large Internet service providers, such as America Online
and WebTV, which use proxy servers extensively to provide more consistent
service to their customers. All AOL web browsing is funneled through its proxy
servers, and thus it is impossible to track any of the service's more than
eight million members18, who comprise a
significant portion of the web surfing community.
A third related problem occurs when users must use a modem to dial into their
service provider's network. Each time a user connects, he or she is assigned an
address from the provider's pool of available addresses. Because each provider
has a finite number of assignable addresses smaller than its number of users,
the addresses are reassigned and recycled with each new connection. Thus,
multiple dial-in users can be assigned to the same address throughout the
course of a day. In the case of large, national service providers, these users
could even be geographically distant from one another.
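The effect of address sharing on visitor counts can be sketched as follows. The list pairs each connection with the person behind it, which is ground truth that a real server never sees; the names, addresses, and the function people_per_address are all invented for illustration.

```python
from collections import defaultdict

# (person, address) pairs. The "person" column is hypothetical ground truth;
# a web server log would contain only the address.
CONNECTIONS = [
    ("alice", "192.0.2.10"),   # three co-workers on one office computer
    ("bob", "192.0.2.10"),
    ("carol", "192.0.2.10"),
    ("dave", "203.0.113.5"),   # dial-in user, morning session
    ("erin", "203.0.113.5"),   # same pooled address reassigned later that day
]

def people_per_address(connections):
    """Group the people behind each logged address."""
    seen = defaultdict(set)
    for person, address in connections:
        seen[address].add(person)
    return dict(seen)

if __name__ == "__main__":
    groups = people_per_address(CONNECTIONS)
    actual = len({person for person, _address in CONNECTIONS})
    print("apparent visitors (distinct addresses):", len(groups))
    print("actual visitors:", actual)
```

In this made-up case the log suggests two visitors where there were five, and nothing in the log itself reveals the discrepancy.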
Thus, it is virtually impossible to gain a completely accurate understanding
of how any web site is being used, especially at the level of individual users'
movements. However, it is safe to assume that these trends in proxy serving and
address sharing do not have a significant effect on the proportion of hits
received by different pages on the same site, nor on the relative number of hits
between two sites. If we assume that the actions of users who are behind proxy
servers are relatively similar to those of users with direct access to the sites --
there is no reason to believe that they behave any differently on the whole --
then we can safely compare the number of raw hits between different pages on
the same site. While we cannot paint a completely accurate picture of site
usage, we can compare the aggregate and daily totals for individual pages and
groups of pages, and draw conclusions about the nature of site usage from those
relative comparisons, rather than trying to infer usage levels from raw numbers
of hits.
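The reasoning behind relative comparison can be illustrated with a short sketch. It takes a set of made-up "true" request counts, removes a fixed fraction of every page's requests to stand in for hits absorbed by caches and proxies, and compares each page's share before and after. The 40 percent loss rate, the page names, and the function hit_shares are assumptions chosen for the example, and the uniform loss across pages is exactly the assumption stated above, not a measured fact.

```python
def hit_shares(hits):
    """Return each page's share of the total hits as a fraction between 0 and 1."""
    total = sum(hits.values())
    if total == 0:
        return {}
    return {page: count / total for page, count in hits.items()}

# Made-up counts of all requests users actually made, for illustration only.
TRUE_REQUESTS = {"/index.html": 600, "/news.html": 300, "/contact.html": 100}

# Assumption: 40 percent of requests for every page are served from caches or
# proxies and never reach the log. Integer arithmetic keeps the counts exact.
LOGGED_HITS = {page: count * 6 // 10 for page, count in TRUE_REQUESTS.items()}

if __name__ == "__main__":
    true_shares = hit_shares(TRUE_REQUESTS)
    logged_shares = hit_shares(LOGGED_HITS)
    for page in TRUE_REQUESTS:
        print(page, LOGGED_HITS[page], true_shares[page], logged_shares[page])
```

The raw totals fall from 1,000 to 600, but each page's share is unchanged. If cached pages were lost at different rates, for example a heavily revisited home page, the shares would shift as well, which is why the uniformity assumption matters.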