Web Server Log File Analysis -- Basics


After reading this material, you will be able to:

Describe a Typical Entry in a Web Server Log
Explain Hit, View, and Visit
Evaluate Results from a Log File Analysis Tool

It is easy to fall into the “hits” trap. Hit counters have been around for some time now and hits are an easy statistic to quote. When managing a web site you don’t need to know all the details of log files, but understanding what information is available to you and making your analysis consistent over time is important.

The type and depth of log file analysis you perform depends on the type of Web site you manage. If you are simply providing information, knowing who is visiting your site and how many visitors you have is interesting, but not vital to the company. If, however, your site is an e-commerce site with a significant impact on the bottom line, you may want to know how users are moving through the site, especially just before they decide to make a purchase.

Information in Server Logs

Web server logs are plain text (ASCII) files, independent of server platform. There are some differences between server software, but traditionally there are four types of server logs:

  1. Transfer (access) log
  2. Error log
  3. Referer log
  4. Agent log

The first two types of log files are standard. The referer and agent logs may or may not be “turned on” at the server, or they may be added to the transfer log file to create an “extended” log file format. Each HTTP transaction, whether completed or not, is recorded in the logs, and some transactions are recorded in more than one log. For example, most (but not all) HTTP errors are recorded in both the transfer log and the error log. Let’s take a look at the type of information collected in an “extended format” transfer log file.

Transfer (access) log

The following is an example of a single line in an extended-format transfer log. This typically displays as one long line of ASCII text, with fields separated by tabs and spaces (useful for importing it into a spreadsheet program).

1Cust216.tnt1.santa-monica.ca.da.uu.net - - [08/May/1999:12:13:03 -0700] 
GET /gen/meeting/ssi/next/ HTTP/1.0 200 9887 http://www.slac.stanford.edu/  
Mozilla/3.01-C-MACOS8 (Macintosh; I; PPC)  GET /gen/meeting/ssi/next/ - HTTP/1.0

Let’s look at each section of this entry.

IP Address or DNS

1Cust216.tnt1.santa-monica.ca.da.uu.net

This is the address of the computer making the HTTP request. The server records the IP address and then, if configured, looks up the corresponding host name through the Domain Name System (DNS). However, with all the dynamically assigned IP addresses these days, you don’t learn as much as you’d expect from the domain name. In this case the visitor seems to be a customer of an ISP, which is located in Santa Monica, California.

RFC931 (or identification)

-

Rarely used, the field was designed to identify the requestor. If this information is not recorded, a hyphen (-) holds the column in the log.

Authuser

-

Lists the authenticated user, if authentication is required for access. This authentication is sent as clear text, so it is not really intended for security. This field is usually filled by a hyphen (-).

Time Stamp

[08/May/1999:12:13:03 -0700]

The date, time, and offset from Greenwich Mean Time (GMT) are recorded for each hit. The offset is written as hours and minutes, so -0700 means seven hours behind GMT. The date and time format is: DD/Mon/YYYY:HH:MM:SS. The example above shows that the transaction was recorded at 12:13 pm on May 8, 1999 at a location 7 hours behind GMT.
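As a minimal sketch, the time stamp field can be turned into a date object with Python’s standard library. The function name parse_access_time is only an illustration, and the sketch assumes English month abbreviations (the %b directive depends on the locale in use).

```python
from datetime import datetime

# Layout of the access-log time stamp once the square brackets are removed,
# for example "08/May/1999:12:13:03 -0700".
ACCESS_LOG_TIME = "%d/%b/%Y:%H:%M:%S %z"


def parse_access_time(field):
    """Turn a bracketed access-log time stamp into a timezone-aware datetime."""
    return datetime.strptime(field.strip("[]"), ACCESS_LOG_TIME)


stamp = parse_access_time("[08/May/1999:12:13:03 -0700]")
print(stamp.isoformat())                        # 1999-05-08T12:13:03-07:00
print(stamp.utcoffset().total_seconds() / 3600)  # -7.0 (hours from GMT)
```

Because the offset is kept with the value, two parsed time stamps can be subtracted directly, even if the entries were recorded with different offsets.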

By comparing time stamps between entries, we can also determine how long a visitor spent on a given page. From the following excerpts from the log, we see that the visitor spent about a minute on the page before moving to the detailed page.

[08/May/1999:12:20:53 -0700] GET /gen/meeting/ssi/next/index.html HTTP/1.0 200 9887 
http://www.slac.stanford.edu/gen/meeting/ssi/
[08/May/1999:12:21:50 -0700] GET / HTTP/1.0 200 13516 
http://www.slac.stanford.edu/detailed.html

HTTP Request

GET /gen/meeting/ssi/next/ HTTP/1.0

Three types of HTTP requests commonly appear in the log. GET is the standard request for a document or program. POST tells the server that data is following. HEAD is used by link checking programs, not browsers, and retrieves only the response headers for a document, not the document itself. The specific level of HTTP protocol is also recorded.

Status Code

200

There are four classes of codes:

  1. Success (200 series)
  2. Redirect (300 series)
  3. Failure (400 series)
  4. Server Error (500 series)

A status code of 200 means the transaction was successful. Common 300-series codes are 302, for a redirect from http://www.mydomain.com to http://www.mydomain.com/, and 304, for a conditional GET. A 304 occurs when the browser asks whether the version of the file or graphic already in its cache is still the current version, and the server directs the browser to use the cached version. The most common failure codes are 401 (failed authentication), 403 (forbidden request to a restricted subdirectory), and the dreaded 404 (file not found). Server errors are red flags for the server administrator.
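The four classes above can be told apart by the first digit of the code. The following is a small sketch of that idea; the function name status_class and the sample list of codes are made up for illustration.

```python
from collections import Counter

# Class names as used in this section, keyed by the first digit of the code.
STATUS_CLASSES = {2: "Success", 3: "Redirect", 4: "Failure", 5: "Server Error"}


def status_class(code):
    """Map an HTTP status code (number or text) to its class name."""
    return STATUS_CLASSES.get(int(code) // 100, "Other")


# Hypothetical status fields taken from five log entries.
codes = ["200", "304", "404", "200", "500"]
tally = Counter(status_class(code) for code in codes)
print(tally["Success"], tally["Redirect"], tally["Failure"], tally["Server Error"])
# prints: 2 1 1 1
```

A tally like this gives a quick check on site health: a rising share of Failure or Server Error entries is worth a closer look.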

Transfer Volume

9887

For GET HTTP transactions, the last field is the number of bytes transferred. For other commands this field will be a hyphen (-) or a zero (0).
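The fields described so far can be separated with a regular expression. This is a rough sketch, not a complete parser: it assumes the request is enclosed in quotation marks, as it is in the excerpts in the exercises, and the names COMMON_LOG and parse_common_fields are illustrative.

```python
import re

# host, RFC931, authuser, [time stamp], "request", status, bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_common_fields(line):
    """Return the common-log fields of one entry as a dict, or None if no match."""
    match = COMMON_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request holds the method, the path, and the protocol level.
    parts = fields["request"].split()
    fields["method"] = parts[0] if parts else None
    fields["path"] = parts[1] if len(parts) > 1 else None
    fields["protocol"] = parts[2] if len(parts) > 2 else None
    # A hyphen in the transfer volume field is treated as zero bytes.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


sample = ('1Cust216.tnt1.santa-monica.ca.da.uu.net - - '
          '[08/May/1999:12:13:03 -0700] '
          '"GET /gen/meeting/ssi/next/ HTTP/1.0" 200 9887')
entry = parse_common_fields(sample)
print(entry["host"], entry["method"], entry["path"], entry["status"], entry["bytes"])
```

Anything that follows the transfer volume on the line is ignored by this pattern, so the same function also accepts extended-format entries.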

The transfer volume statistic marks the end of the common log file. The remaining fields make up the referer and agent logs, added to the common log format to create the “extended” log file format. Let’s look at these fields.

Referer URL

http://www.slac.stanford.edu/

The referer URL indicates the page where the visitor was located when making the next request. The actual request is shown in the last field of the entry

GET /gen/meeting/ssi/next/ - HTTP/1.0

and is duplicated from the HTTP Request, the fifth field in this log.

If you were looking at just the referer log, not integrated into the transfer log, it would be made up of just two fields. The left field is the starting URL and the right field is the page the reader requested from there. Transfers within your site would also show in the transfer log. For example, movement from one page to another within a web site might show in the referer log as:

http://www.slac.stanford.edu/ -> /gen/meeting/ssi/next/

The visitor went from the top-level page to information about the next SSI conference through a link on the page.

User Agent

Mozilla/3.01-C-MACOS8 (Macintosh; I; PPC)

The user agent is information about the browser, version, and operating system of the reader. The general format is:

Browser name/version (operating system) 

The confusion comes from the word “Mozilla,” which was the original code name for Netscape. Now almost all browsers compatible with Netscape identify themselves as Mozilla. The following are entries from a recent agent log:

Mozilla/4.05 [en]C-PBI-NC404 (Win95; U)
Mozilla/4.51 [en] (WinNT; I)
W3C_Validator/1.22 libwww-perl/5.43
Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)
Mozilla/2.0 (compatible; MSIE 2.1; AOL 3.0; Mac_68K)
Mozilla/4.0 (compatible; MSIE 4.01; AOL 4.0; Mac_PPC)
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)
Mozilla/4.06 [en]C-gatewaynet (Win98; I) 
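Entries like these can be sorted roughly by program. The sketch below is a heuristic only, not a full parser: it treats “compatible; MSIE” as the marker for Internet Explorer and any other Mozilla entry as Netscape, which will not hold for every browser. The function name describe_agent is illustrative.

```python
import re


def describe_agent(agent):
    """Give a rough label for a user agent string."""
    if not agent.startswith("Mozilla/"):
        # No Mozilla token: a validator, robot, or other non-browser program.
        return "non-browser client"
    msie = re.search(r"compatible; MSIE (\d+\.\d+)", agent)
    if msie:
        return "Internet Explorer " + msie.group(1)
    version = re.match(r"Mozilla/(\d+\.\d+)", agent)
    if version:
        return "Netscape " + version.group(1)
    return "Mozilla-compatible browser"


for agent in (
    "Mozilla/4.51 [en] (WinNT; I)",
    "W3C_Validator/1.22 libwww-perl/5.43",
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)",
):
    print(describe_agent(agent))
# prints: Netscape 4.51, non-browser client, Internet Explorer 5.0 (one per line)
```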

The one entry not identified by Mozilla is a robot used to test a specific page. Search engine robots are often similarly identified in the agent log. This is one way to find out who is indexing your site.

You can also find out how a reader is reaching your site from a referer log. The following is a typical entry from a user who entered your site from a category-based index, in this case Yahoo. From this information you can detect how your site is categorized.

http://dir.yahoo.com/Science/Physics/High_Energy_and_Particle_Physics/Research/
Accelerators/Stanford_Linear_Accelerator_Center__SLAC_/ -> / 

The following is an entry from a search engine, in this case Alta Vista. From this data you can see the key words used by the visitor to find your site. 

http://www.altavista.com/cgi-bin/query?pg=q&user=yahoo&q=computer+programming+videos+&stq=20&c9k ->
/comp/edu/classes.html
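The key words can be pulled out of a search engine referer with a URL parser. This is a sketch under one assumption: that the search terms travel in a query parameter named q, as in the Alta Vista entry above. Other search engines may use a different parameter name, and search_terms is an illustrative name.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(referer, param="q"):
    """Return the key words found in a search-engine referer URL."""
    query = parse_qs(urlsplit(referer).query)
    values = query.get(param, [])
    # parse_qs turns "+" into spaces, so the terms can be split on whitespace.
    return values[0].split() if values else []


referer = ("http://www.altavista.com/cgi-bin/query"
           "?pg=q&user=yahoo&q=computer+programming+videos+&stq=20&c9k")
print(search_terms(referer))   # ['computer', 'programming', 'videos']
```

A referer with no query string, such as a link from another page on your own site, simply yields an empty list.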

Error Log

The following is a typical example of an error log transaction:

[Wed Aug 4 00:02:21 1999] HTTPd: send aborted for adsl-209-233-19-101.dsl.snfc21.pacbell.net, 
URL: /_vti_bin/_vti_aut/author.exe

Compared to the access log, the error log is designed for human eyes to decode. It starts with a time stamp, in an entirely different format than the access log, and is followed by a textual description of the error. The time stamp is in the following format:

[DAY MON DATE HR:MIN:SEC YEAR]

An entry is made to the error log when there are server startups and shutdowns, access failures (such as a 404 error), lost connections, timeouts, and cancellations by the visitor. Apart from the address of the requesting computer, there is no information about the user.
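Since the error log uses its own time stamp layout, it needs its own format string. The following sketch splits an entry into its time stamp and message; it assumes the bracketed layout shown above and English day and month abbreviations, and split_error_entry is an illustrative name.

```python
from datetime import datetime

# Layout of the error-log time stamp, for example "Wed Aug 4 00:02:21 1999".
ERROR_LOG_TIME = "%a %b %d %H:%M:%S %Y"


def split_error_entry(line):
    """Split an error-log entry into its time stamp and its message text."""
    stamp, _, message = line.partition("]")
    when = datetime.strptime(stamp.lstrip("["), ERROR_LOG_TIME)
    return when, message.strip()


line = ("[Wed Aug 4 00:02:21 1999] HTTPd: send aborted for "
        "adsl-209-233-19-101.dsl.snfc21.pacbell.net, "
        "URL: /_vti_bin/_vti_aut/author.exe")
when, message = split_error_entry(line)
print(when.isoformat())   # 1999-08-04T00:02:21
print(message)
```

Unlike the access log, this time stamp carries no offset from GMT, so the result is a plain local time.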

Hits, Views, and Visits

Hit counters continue to be popular features on web pages, but they, in fact, have little value. First, most hit counters can be adjusted to start at any number, so any number you see on a hit counter may be artificial. Second, just what is defined as a hit?

In fact, requesting a single web page can result in multiple hits to the server. If the requested URL does not have a trailing slash (such as http://www.mydomain.com), the server will first redirect to the URL with a slash (such as http://www.mydomain.com/). Then the page of HTML will show as a hit, and each graphic on the page will also record as a hit in the log. So a page of HTML with six graphics in a navigation bar could record eight individual hits in the log: one redirect, one page, and six graphics. This set of eight hits is described as a view: all the hits necessary to display a web page.

The next analysis is to look through the views to recreate the user’s visit to the site: how they got to your site, where they went in the site, and how long they spent.
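The difference between hits, views, and visits can be sketched in a few lines. This is one simple way to count, not the method used by any particular analysis tool: a view is taken to be a successful (200) request for anything that is not a graphic, and the list of graphic file extensions, the sample paths, and the name summarize_visit are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# File extensions counted as graphics rather than pages (an assumed list).
GRAPHIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def summarize_visit(hits):
    """Summarize one visitor's (time, path, status) hits, sorted by time."""
    if not hits:
        return {"hits": 0, "views": 0, "duration": timedelta(0)}
    views = [
        path for when, path, status in hits
        if status == 200 and not path.lower().endswith(GRAPHIC_SUFFIXES)
    ]
    # Duration runs from the first hit to the last hit; the time spent
    # reading the final page is not recorded in the log.
    return {
        "hits": len(hits),
        "views": len(views),
        "duration": hits[-1][0] - hits[0][0],
    }


start = datetime(1999, 5, 8, 12, 20, 53)
# One page request: a redirect, the page itself, and six graphics.
page_hits = [
    (start, "/gen/meeting/ssi/next", 302),
    (start, "/gen/meeting/ssi/next/", 200),
] + [(start, "/images/nav%d.gif" % n, 200) for n in range(1, 7)]
# A second page requested 57 seconds later.
later = [(start + timedelta(seconds=57), "/detailed.html", 200)]

summary = summarize_visit(page_hits + later)
print(summary["hits"], summary["views"], summary["duration"])  # 9 2 0:00:57
```

Nine hits collapse into two views and one visit of just under a minute, which is why a raw hit count says so little on its own.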


Exercises

1. Interpreting Transaction Log File Data

Take a look at the following excerpt from a real extended-format web server log file.

Transaction #1

dejh.ipm.ac.ir - - [08/May/1999:00:47:07 -0700] 
"GET /spires/form/hepfnal.html HTTP/1.0" 200 3529 
"http://www-spires.slac.stanford.edu/spires/forms.html" 
"Mozilla/4.05 [en] (Win95; I)" 
GET /spires/form/hepfnal.html - "HTTP/1.0"

Transaction #2

202.41.102.153 - - [08/May/1999:02:11:25 -0700] 
"POST /cgi-bin/form-mail.pl HTTP/1.1" 200 649 "http://www.slac.stanford.edu/spires/find/hepnames/wwwupd?ID=RCV&NODE=PBI.ERNET.IN" 
"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" 
POST /cgi-bin/form-mail.pl - "HTTP/1.1"

Transaction #3

oeias1-p2.telepac.pt - - [08/May/1999:03:16:08 -0700] 
"GET /BFROOT/Images/BABAR2.gif HTTP/1.1" 404 360 "http://www.slac.stanford.edu/BFROOT/old-www/Physics/Workshops/wkshp_home.html" 
"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" 
GET /BFROOT/Images/BABAR2.gif - "HTTP/1.1"

  1. How many visitors are reflected in these web transactions? Explain.
  2. On what date did these transactions take place?
  3. How many minutes passed between the first and last entry?
  4. Were all three transactions successful? If not, explain.
  5. Which transaction requested the largest file? What size was the file?
  6. What browsers are being used to access these pages?
  7. What platforms are being used to access these pages?
  8. Can you determine the path the visitor from 202.41.102.153 took through the website?

2. Hits, Views, and Visits

Download the file “sample.log.” This file, in plain ASCII text, is real extended transfer log data from a single visitor to a website we manage. This visit is made up of ninety records. Open the file in a tool of your choice (any text editor or spreadsheet program will open it) and answer the following questions. As supplied, the entries are sorted by time stamp.

  1. How many hits could this visit record to a simple hit counter?
  2. How many unique pages were viewed during the visit?
  3. Were any pages viewed more than once?
  4. How long did the visitor spend on the site during that single visit?
  5. From this data can you easily track the path the visitor took through your site?
  6. As the manager of a web site, which of these bits of data is most useful to know?

3. Analyzing Data From Web Log Analysis Tools

Large web sites can have log files with more than 100,000 daily transactions. Analyzing this data by hand is a difficult task. Consider the following data, collected using WebTrends Professional on one week of data from the Stanford Linear Accelerator Center’s Unix web server (the web site contained over 420,000 pages in June 1999).

General Statistics (total activity)

Date & Time This Report was generated      Sunday August 15, 1999 – 15:457:49
Timeframe                                  05/02/99 00:30:34 - 05/09/99 00:29:34
Number of Hits for Home Page               15,166
Number of Successful Hits for Entire Site  682,664
Number of Page Views (Impressions)         152,904
Number of User Sessions                    51,399
User Sessions from United States           58.35%
International User Sessions                28.16%
User Sessions of Unknown Origin            13.47%
Average Number of Hits Per Day             97,523
Average Number of Page Views Per Day       21,843
Average Number of User Sessions Per Day    7,342
Average User Session Length                00:15:25

Summary of Activity for Report Period (excludes errors)

Average Number of Users per day on Weekdays     8,870
Average Number of Hits per day on Weekdays      119,419
Average Number of Users for the entire Weekend  3,523
Average Number of Hits for the entire Weekend   42,783
Most Active Day of the Week                     Mon
Least Active Day of the Week                    Sat
Most Active Day Ever                            May 03, 1999
Number of Hits on Most Active Day               123,856
Least Active Day Ever                           May 09, 1999
Number of Hits on Least Active Day              444

  1. How many transactions (hits) were recorded for this site for the week? What is the daily range (low and high, excluding errors)?
  2. How many actual page views were reflected in these hits?
  3. How many visits were reflected in these views?
  4. How many views per day, on average?
  5. How long does the average visit last?
  6. If the server needs to be shut down for servicing, what would be the best day of the week to perform that work?
  7. Does the number of hits on the least active day fit with the remaining statistics? Explain.

Review Questions

1. Site visitors are identified in which field of the Access Log?

a)_____ Authuser
b)_____ HTTP Request
c)_____ IP Address or DNS
d)_____ Status Code

2. Which of the following indicates a successful transaction in the web log?

a)_____ 200
b)_____ 302
c)_____ 404
d)_____ 500

3. The referer URL shows:

a)_____ the browser, version, and operating system of the reader
b)_____ number of bytes transferred
c)_____ page where the visitor was located when making the next request
d)_____ address of the computer making the HTTP request

4. Errors are recorded only in the error log.

_____ True  
_____ False

5. Which of the following is probably the highest number?

a)_____ Hits 
b)_____ Views
c)_____ Visits
d)_____ Hits = Views = Visits

6. Which of the following is the best description of a site visit?

a)_____ A list of all requests made to a web server on a given date
b)_____ All the pages and graphics needed to view a given page on the web
c)_____ The set of recorded transactions by a given IP address or DNS entry, from site entry to exit.
d)_____ Only the image downloads necessary to use a page effectively.

Answers


mcdunn
02/23/01 08:41:38 AM