Describe a Typical Entry in a Web Server Log
Explain Hit, View, and Visit
Evaluate Results from a Log File Analysis Tool
It is easy to fall into the “hits” trap. Hit counters have been around for some time now and hits are an easy statistic to quote. When managing a web site you don’t need to know all the details of log files, but understanding what information is available to you and making your analysis consistent over time is important.
The type and depth of log file analysis you perform depends on the type of Web site you manage. If you are simply providing information, knowing who and how many are visiting your site is interesting information, but not vital to the company. If, however, your site is an e-commerce site with a significant impact on the bottom line, you may want to know how users are moving through the site, especially just before they decide to make a purchase.
Web server logs are plain text (ASCII) files, independent of server platform. There are some differences between server software, but traditionally there are four types of server logs: the transfer (access) log, the error log, the referer log, and the agent log.
The first two types of log files are standard. The referer and agent logs may or may not be “turned on” at the server or may be added to the transfer log file to create an “extended” log file format. Each HTTP protocol transaction, whether completed or not, is recorded in the logs, and some transactions are recorded in more than one log. For example, most (but not all) HTTP errors are recorded in the transfer log and the error log. Let’s take a look at the type of information collected in an “extended format” transfer log file.
The following is an example of a single line in a common transfer log. This typically displays as one long line of ASCII text, separated by tabs and spaces (useful for importing it into a spreadsheet program).
1Cust216.tnt1.santa-monica.ca.da.uu.net - - [08/May/1999:12:13:03 -0700] GET /gen/meeting/ssi/next/ HTTP/1.0 200 9887 http://www.slac.stanford.edu/ Mozilla/3.01-C-MACOS8 (Macintosh; I; PPC) GET /gen/meeting/ssi/next/ - HTTP/1.0
Let’s look at each section of this entry.
1Cust216.tnt1.santa-monica.ca.da.uu.net
This is the address of the computer making the HTTP request. The server records the IP address and then, if configured to do so, looks up the host name through the Domain Name System (DNS). However, with so many dynamically assigned IP addresses these days, you don’t learn as much as you’d expect from the domain name. In this case the visitor seems to be a customer of an ISP located in Santa Monica, California.
-
Rarely used, the field was designed to identify the requestor. If this information is not recorded, a hyphen (-) holds the column in the log.
-
Lists the authenticated user, if authentication is required for access. This authentication is sent via clear text, so it is not really intended for security. This field is usually filled by a hyphen (-).
[08/May/1999:12:13:03 -0700]
The date, time, and offset from Greenwich Mean Time (written as hours and minutes, so -0700 means seven hours behind GMT) are recorded for each hit. The date and time format is DD/Mon/YYYY:HH:MM:SS. The example above shows that the transaction was recorded at 12:13 pm on May 8, 1999 at a location 7 hours behind GMT.
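The time stamp field can be read by a program as well as by eye. The following is a minimal Python sketch, assuming the bracketed layout shown in the example entry; the function name `parse_access_time` is made up for illustration.

```python
from datetime import datetime, timezone

# Layout of the access-log time stamp: DD/Mon/YYYY:HH:MM:SS followed by the GMT offset
ACCESS_LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_access_time(field: str) -> datetime:
    """Parse a bracketed access-log time stamp into a timezone-aware datetime."""
    return datetime.strptime(field.strip("[]"), ACCESS_LOG_TIME_FORMAT)


stamp = parse_access_time("[08/May/1999:12:13:03 -0700]")
print(stamp.isoformat())                           # local time with its offset
print(stamp.astimezone(timezone.utc).isoformat())  # the same instant expressed in GMT
```

Keeping the offset attached to the parsed value means entries from servers in different time zones can still be compared directly.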
By comparing time stamps between entries, we can also determine how long a visitor spent on a given page. From the following excerpts from the log, we see that the visitor spent about a minute on the page before moving to the detailed page.
[08/May/1999:12:20:53 -0700] GET /gen/meeting/ssi/next/index.html HTTP/1.0 200 9887 http://www.slac.stanford.edu/gen/meeting/ssi/
[08/May/1999:12:21:50 -0700] GET / HTTP/1.0 200 13516 http://www.slac.stanford.edu/detailed.html
GET /gen/meeting/ssi/next/ HTTP/1.0
One of three types of HTTP requests is recorded in the log. GET is the standard request for a document or program. POST tells the server that data is following. HEAD is used by link checking programs, not browsers, and returns only the HTTP response headers for the resource, without the document itself. The version of the HTTP protocol is also recorded.
200
There are four classes of status codes: success (200 series), redirect (300 series), failure (400 series), and server error (500 series).
A status code of 200 means the transaction was successful. Common 300-series codes are 302, for a redirect from http://www.mydomain.com to http://www.mydomain.com/, and 304, for a conditional GET. A 304 occurs when the browser asks whether the version of the file or graphic already in its cache is still current, and the server directs the browser to use the cached version. The most common failure codes are 401 (failed authentication), 403 (forbidden request to a restricted subdirectory), and the dreaded 404 (file not found). Server errors are red flags for the server administrator.
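When scanning many entries, it helps to sort status codes into their classes automatically. The following is a small Python sketch, assuming the four classes correspond to the 200, 300, 400, and 500 series; the function name and the class labels are chosen here for illustration.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to one of the four classes by its first digit."""
    classes = {2: "success", 3: "redirect", 4: "failure", 5: "server error"}
    # Anything outside the 200-500 series is reported as "other".
    return classes.get(code // 100, "other")


for code in (200, 302, 304, 401, 403, 404, 500):
    print(code, status_class(code))
```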
9887
For GET HTTP transactions, the last field of the common log format is the number of bytes transferred. For other commands this field will be a hyphen (-) or a zero (0).
The transfer volume statistic marks the end of the common log file. The remaining fields make up the referer and agent logs, added to the common log format to create the “extended” log file format. Let’s look at these fields.
http://www.slac.stanford.edu/
The referrer URL indicates the page where the visitor was located when making the next request. The actual request is shown in the last field of the entry
GET /gen/meeting/ssi/next/ - HTTP/1.0
and is duplicated from the HTTP Request, the fifth field in this log.
If you were looking at just the referer log, not integrated into the transfer log, it would be made up of just two fields. The left field is the starting URL and the right field is where the reader went from the URL. Transfers within your site would also show in the transfer log. For example, movement from one page to another within a web site might show in the referrer log as:
http://www.slac.stanford.edu/ -> /gen/meeting/ssi/next/
The visitor went from the top-level page to information about the next SSI conference through a link on the page.
Mozilla/3.01-C-MACOS8 (Macintosh; I; PPC)
The user agent is information about the browser, version, and operating system of the reader. The general format is:
Browser name/version (operating system)
The confusion comes from the word “Mozilla,” which was the original code name for Netscape. Now almost all browsers compatible with Netscape identify themselves as Mozilla. The following are entries from a recent agent log:
Mozilla/4.05 [en]C-PBI-NC404 (Win95; U)
Mozilla/4.51 [en] (WinNT; I)
W3C_Validator/1.22 libwww-perl/5.43
Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)
Mozilla/2.0 (compatible; MSIE 2.1; AOL 3.0; Mac_68K)
Mozilla/4.0 (compatible; MSIE 4.01; AOL 4.0; Mac_PPC)
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)
Mozilla/4.06 [en]C-gatewaynet (Win98; I)
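Agent strings like the ones listed above can be pulled apart with a regular expression. The following Python sketch assumes the general "name/version (details)" layout described earlier; the pattern and the function name `describe_agent` are illustrative only and will not cover every agent string found in real logs.

```python
import re

# Leading name/version token, then an optional parenthesized comment with system details
AGENT_PATTERN = re.compile(
    r"^(?P<name>[^/\s]+)/(?P<version>\S+)(?:.*?\((?P<comment>[^)]*)\))?"
)


def describe_agent(agent: str) -> dict:
    """Split an agent string into its name, version, and comment details."""
    match = AGENT_PATTERN.match(agent)
    if match is None:
        return {"name": agent, "version": None, "details": [], "claims_mozilla": False}
    comment = match.group("comment") or ""
    details = [part.strip() for part in comment.split(";") if part.strip()]
    return {
        "name": match.group("name"),
        "version": match.group("version"),
        "details": details,
        "claims_mozilla": match.group("name") == "Mozilla",
    }


for agent in ("Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)",
              "W3C_Validator/1.22 libwww-perl/5.43"):
    print(describe_agent(agent))
```

Entries whose name is not Mozilla are worth a second look, and the word "compatible" in the details marks a browser that only borrows the Mozilla name.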
The one entry not identified by Mozilla is a robot used to test a specific page. Search engine robots are often similarly identified in the agent log. This is one way to find out who is indexing your site. You can also find out how a reader is reaching your site from a referrer log. The following is a typical entry from a user who entered your site from a category-based index, in this case Yahoo. From this information you can detect how your site is categorized.
http://dir.yahoo.com/Science/Physics/High_Energy_and_Particle_Physics/Research/Accelerators/Stanford_Linear_Accelerator_Center__SLAC_/ -> /
The following is an entry from a search engine, in this case AltaVista. From this data you can see the keywords used by the visitor to find your site.
http://www.altavista.com/cgi-bin/query?pg=q&user=yahoo&q=computer+programming+videos+&stq=20&c9k -> /comp/edu/classes.html
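The search words can be extracted from such an entry automatically. The following is a Python sketch, assuming the two fields are separated by `->` and that the search engine carries the words in a query parameter named `q`, as in the entry above. Other search engines may use a different parameter name, and both function names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_referer_entry(entry: str) -> tuple:
    """Split 'referring URL -> requested path' into its two fields."""
    source, _, target = entry.partition("->")
    return source.strip(), target.strip()


def search_keywords(referer_url: str, param: str = "q") -> list:
    """Return the search words carried in the given query parameter, if any."""
    query = parse_qs(urlsplit(referer_url).query)
    words = []
    for value in query.get(param, []):
        words.extend(value.split())  # '+' has already been decoded to a space
    return words


entry = ("http://www.altavista.com/cgi-bin/query?pg=q&user=yahoo"
         "&q=computer+programming+videos+&stq=20&c9k -> /comp/edu/classes.html")
referer, requested = split_referer_entry(entry)
print(search_keywords(referer), "led to", requested)
```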
The following is a typical example of an error log transaction:
[Wed Aug 4 00:02:21 1999] HTTPd: send aborted for adsl-209-233-19-101.dsl.snfc21.pacbell.net, URL: /_vti_bin/_vti_aut/author.exe
Compared to the access log, the error log is designed for human eyes to decode. It starts with a time stamp, in an entirely different format than the access log, and is followed by a textual description of the error. The time stamp is in the following format:
[DAY MON DATE HR:MIN:SEC YEAR]
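Although the error log is meant for human readers, its time stamp can still be parsed. The following Python sketch assumes each line starts with a bracketed time stamp in the layout above and treats everything after it as free text; the function name `parse_error_line` is illustrative, and the month and day names are assumed to be in English.

```python
import re
from datetime import datetime

ERROR_LINE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s*(?P<message>.*)$")
ERROR_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # DAY MON DATE HR:MIN:SEC YEAR


def parse_error_line(line: str):
    """Return (datetime, message) for one error-log line, or None if it does not match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group("stamp"), ERROR_TIME_FORMAT)
    return when, match.group("message")


line = ("[Wed Aug 4 00:02:21 1999] HTTPd: send aborted for "
        "adsl-209-233-19-101.dsl.snfc21.pacbell.net, URL: /_vti_bin/_vti_aut/author.exe")
when, message = parse_error_line(line)
print(when.isoformat(), "|", message)
```

Unlike the access log, this time stamp carries no GMT offset, so the parsed value is in the server's local time.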
An entry is made to the error log when there are server startups and shutdowns, access failures (such as a 404 error), lost connections, timeouts, and cancellations by the visitor. There is no information about the user.
Hit counters continue to be popular features on web pages, but they, in fact, have little value. First, most hit counters can be adjusted to start at any number, so any number you see on a hit counter may be artificial. Second, just what is defined as a hit? In fact, requesting a single web page can result in multiple hits to the server. If the requested URL does not have a trailing slash (such as http://www.mydomain.com), the server will first redirect to the URL with a slash (such as http://www.mydomain.com/). Then the page of HTML will show as a hit, and each graphic on the page will also record as a hit in the log. So a page of HTML with six graphics in a navigation bar could record eight individual hits in the log: one redirect, one page, and six graphics. This set of eight hits is described as a view, all the hits necessary to display a web page. The next analysis is to look through the views to recreate the user’s visit to the site: how they got to your site, where they went in the site, and how long they spent.
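The difference between hits, views, and visits can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: graphics are recognized by file extension, a view is any successful request that is not a graphic, and a gap of more than 30 minutes from the same host starts a new visit. The host name, paths, and timeout are hypothetical and do not come from a real log or tool.

```python
from datetime import datetime, timedelta

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # assumption: graphics identified by extension
VISIT_TIMEOUT = timedelta(minutes=30)                # assumption: a longer gap starts a new visit


def summarize(records):
    """Count hits, views, and visits in (host, time, path, status) records."""
    hits = len(records)
    # A view is counted once per successfully delivered page, ignoring its graphics.
    views = sum(1 for _host, _when, path, status in records
                if status == 200 and not path.lower().endswith(IMAGE_SUFFIXES))
    visits = 0
    last_seen = {}
    for host, when, _path, _status in sorted(records, key=lambda r: r[1]):
        if host not in last_seen or when - last_seen[host] > VISIT_TIMEOUT:
            visits += 1
        last_seen[host] = when
    return hits, views, visits


start = datetime(1999, 5, 8, 12, 0, 0)
# One page request as described above: a redirect, the HTML page, and six graphics.
records = [("host.example", start, "/gen", 302),
           ("host.example", start, "/gen/", 200)]
records += [("host.example", start + timedelta(seconds=1), f"/images/nav{i}.gif", 200)
            for i in range(6)]
print(summarize(records))  # (8, 1, 1): eight hits, one view, one visit
```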
Take a look at the following excerpt from a real extended-format web server log file.
Transaction #1
dejh.ipm.ac.ir - - [08/May/1999:00:47:07 -0700] "GET /spires/form/hepfnal.html HTTP/1.0" 200 3529 "http://www-spires.slac.stanford.edu/spires/forms.html" "Mozilla/4.05 [en] (Win95; I)" GET /spires/form/hepfnal.html - "HTTP/1.0"
Transaction #2
202.41.102.153 - - [08/May/1999:02:11:25 -0700] "POST /cgi-bin/form-mail.pl HTTP/1.1" 200 649 "http://www.slac.stanford.edu/spires/find/hepnames/wwwupd?ID=RCV&NODE=PBI.ERNET.IN" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" POST /cgi-bin/form-mail.pl - "HTTP/1.1"
Transaction #3
oeias1-p2.telepac.pt - - [08/May/1999:03:16:08 -0700] "GET /BFROOT/Images/BABAR2.gif HTTP/1.1" 404 360 "http://www.slac.stanford.edu/BFROOT/old-www/Physics/Workshops/wkshp_home.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" GET /BFROOT/Images/BABAR2.gif - "HTTP/1.1"
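Entries in this quoted layout can be split into named fields with a regular expression. The following Python sketch assumes the field order shown in the three transactions above and ignores the duplicated request at the end of each line; the pattern and the function name `parse_entry` are illustrative, and logs from other servers may need adjustments.

```python
import re

EXTENDED_LOG = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line: str):
    """Return the fields of one extended-format entry as a dict, or None."""
    match = EXTENDED_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Split the request into its method and path, dropping the protocol version.
    method, _, rest = fields["request"].partition(" ")
    fields["method"] = method
    fields["path"] = rest.rsplit(" ", 1)[0]
    return fields


line = ('oeias1-p2.telepac.pt - - [08/May/1999:03:16:08 -0700] '
        '"GET /BFROOT/Images/BABAR2.gif HTTP/1.1" 404 360 '
        '"http://www.slac.stanford.edu/BFROOT/old-www/Physics/Workshops/wkshp_home.html" '
        '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" '
        'GET /BFROOT/Images/BABAR2.gif - "HTTP/1.1"')
entry = parse_entry(line)
print(entry["host"], entry["method"], entry["path"], entry["status"])
```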
Download “sample.log.” This file, in plain ASCII text, is real extended transfer log data from a single visitor to a website we manage. This visit is made up of ninety records. Open the file in a tool of your choice (any text editor or spreadsheet program will open it) and answer the following questions. As supplied, the entries are sorted by time stamp.
Large web sites can have log files with more than 100,000 daily transactions. Analyzing this data by hand is a difficult task. Consider the following data, collected using WebTrends Professional on one week of data from the Stanford Linear Accelerator Center’s Unix web server (the web site contained over 420,000 pages in June 1999).
| General Statistics (total activity) | |
| Date & Time This Report was generated | Sunday August 15, 1999 – 15:457:49 |
| Timeframe | 05/02/99 00:30:34 - 05/09/99 00:29:34 |
| Number of Hits for Home Page | 15,166 |
| Number of Successful Hits for Entire Site | 682,664 |
| Number of Page Views (Impressions) | 152,904 |
| Number of User Sessions | 51,399 |
| User Sessions from United States | 58.35% |
| International User Sessions | 28.16% |
| User Sessions of Unknown Origin | 13.47% |
| Average Number of Hits Per Day | 97,523 |
| Average Number of Page Views Per Day | 21,843 |
| Average Number of User Sessions Per Day | 7,342 |
| Average User Session Length | 00:15:25 |
| Summary of Activity for Report Period (excludes errors) | |
| Average Number of Users per day on Weekdays | 8,870 |
| Average Number of Hits per day on Weekdays | 119,419 |
| Average Number of Users for the entire Weekend | 3,523 |
| Average Number of Hits for the entire Weekend | 42,783 |
| Most Active Day of the Week | Mon |
| Least Active Day of the Week | Sat |
| Most Active Day Ever | May 03, 1999 |
| Number of Hits on Most Active Day | 123,856 |
| Least Active Day Ever | May 09, 1999 |
| Number of Hits on Least Active Day | 444 |
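A few ratios help put these totals in perspective. The following Python sketch uses only figures copied from the General Statistics table and assumes the report period is exactly seven days; the variable names are chosen for illustration.

```python
# Figures copied from the General Statistics table above.
successful_hits = 682_664
page_views = 152_904
user_sessions = 51_399
days = 7  # assumption: the report covers one full week

hits_per_view = successful_hits / page_views
views_per_session = page_views / user_sessions

print(f"hits per page view:     {hits_per_view:.2f}")
print(f"page views per session: {views_per_session:.2f}")
# Whole-number daily averages (rounded down), for comparison with the table.
print(successful_hits // days, page_views // days, user_sessions // days)
```

The ratios show why raw hit counts overstate activity: each page view accounts for several hits, and each session for several page views.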
1. Site visitors are identified in which field of the Access Log?
a)_____ Authuser
b)_____ HTTP Request
c)_____ IP Address or DNS
d)_____ Status Code
2. Which of the following indicates a successful transaction in the web log?
a)_____ 200
b)_____ 302
c)_____ 404
d)_____ 500
3. The referer URL shows:
a)_____ the browser, version, and operating system of the reader
b)_____ number of bytes transferred
c)_____ page where the visitor was located when making the next request
d)_____ address of the computer making the HTTP request
4. Errors are recorded only in the error log.
_____ True
_____ False
5. Which of the following is probably the highest number?
a)_____ Hits
b)_____ Views
c)_____ Visits
d)_____ Hits = Views = Visits
6. Which of the following is the best description of a site visit?
a)_____ A list of all requests made to a web server on a given date
b)_____ All the pages and graphics needed to view a given page on the web
c)_____ The set of recorded transactions by a given IP address or DNS entry, from site entry to exit.
d)_____ Only the image downloads necessary to use a page effectively.
mcdunn
02/23/01
08:41:38 AM