NAME
visitors - a fast web server log analyzer
SYNOPSIS
visitors [options] <filename> [<filename> ...]
DESCRIPTION
Visitors generates access statistics from specified web log files.
The resulting reports contain a number of useful informations and
statistics:
· Requested pages
· Requested images
· Referers by number of visits and age
· Unique visitors in each day
· Page views per visit
· Pages accessed by the Google crawler (and the date of google’s last
access on every page)
· Pages accessed by the AdSense crawler (and the date of adsense’s last
access on every page)
· Percentage of visits originated from Google searches for every day
· User navigation patterns (web trails)
· Keyphrases used in Google searches
· Human languages used in google searches
· User agents
· Weekdays and Hours distributions of accesses
· Weekdays/Hours combined bidimensional map
· Month/Day combined bidimensional map
· Visual path analysis with Graphviz
· Operating systems, browsers and domains popularity
· Visitors screen resolution and color depth
· 404 errors
The web log files don’t need to follow a strict format, except: the
date MUST be included between [ and ] chars, the client hostname MUST
be the first entry in the log, referers and requests MUST be included
between double quote chars. Out of the box Apache log file will work
without problems.
It’s possible to use Visitors with IIS log files converting them using
the iis2apache.pl utility distributed with Visitors (The utility is the
same you can find at http://www.jammed.com/~jwa/hacks/ and is
distributed under the GPL license).
Note that logfile can be a - character to use the standard input.
Available options:
-A --all
Activate all the optional reports. This option is equivalent to
-GKUWRDOB. Note that --trails is not implicitly included in
this option because it also requires --prefix. See the
--trails option documentation for details.
-T --trails
Enable the Web Trails feature. The report will show what are
the more frequent moves between pages of your site. This option
requires the --prefix option to work.
-G --google
Activate two reports about pages accessed by the Google and
Adsense web crawlers. Pages are shown ordered accordingly to
the last time the Google web crawler requested the page. The
first page shown is the latest that was accessed.
-K --google-keyphrases
Activate a report that shows common search keyphrases used to
found your web site from Google.
-Z --google-keyphrases-age
Activate a report that shows common the lastest keyphrases used
to found your site from Google.
-H --google-human-language
Activate a report that shows common human languages used to
serach from Google. This feature uses the ’hl’ variable of the
Google referer URL.
-U --user-agents
Show information about common user agents.
-W --weekday-hour-map
Activate the generation of a combined weekdays/hours
bidimensional map that shows information about traffic in every
168 different hours of a 7 days week. Brighter colors mean
higher traffic. This is ideal to figure what’s the best moment
on a week for a maintenance downtime, what’s the target of the
site, if people are accessing it from work or from home, and so
on. The map is generated as pure html inside the report.
-M --month-day-map
Activate the generation of a combined month/day bidimensional
map that shows information about traffic in every 365 different
days of the year. Brighter colors mean higher traffic. This is
useful in order to figure with a quick look traffic trends and
days with particuarly high or low traffic. The map is generated
as pure html inside the report.
-R --referers-age
Shows referers ordered by age. The ’age’ of a referer is the
date it appeared the first time. In the report, newer referers
are on top. This report is useful to check for new external
links.
-D --domains
Activate the generation of information about Top Level Domains
popularity. This information may be useful to guess the amount
of visits from different countries. Note that Visitors will not
resolve numerical IP addresses if they are not already resolved
in the log file. All the unresolved IP addresses will be shown
in this report under the entry Unresolved IP.
-O --operating-systems
Activate the report about Operating Systems popularity, sorted
by number of accesses. All the common operating systems are
listed in the report, while unknown operating systems will be
summed in the unknown entry.
-B --browsers
Activate the report about Browsers popularity, sorted by number
of accesses. All the common browsers are listed in the report,
while unknown browsers will be summed in the unknown entry.
Browsers are listed by family (for example Internet Explorer,
Opera, and so on), and not by specific version.
-X --error404
Activate the generation of missing documents (404 error)
report. This report will show files requested, but missing,
ordered by number of requests. The report is useful in order to
discover if for some mistake there is some file missing in the
web site, but often you will see bizarre requests performed by
users or internet worms and security scans.
-Y --pageviews
Activate the generation of a report that shows (and
approximation) of the percentage of pages viewed per unique
visit. The goal of this report is to understand the usage
pattern of the site and the level of interest of the visitors.
For example, in a site that provides a number of pages with
interesting contents, the percentage of visitors performing a
single page view per visit is probably searching for something
else.
-S --robots
Activate the generation of a report that shows user agents of
clients requesting the file robots.txt, with the exception of
the MSIE Crawler requests. The result is a list of web robots
and spieders that accessed your web site, ordered by number of
requests of robots.txt.
--screen-info
Activate the screen resolution and color depth reports. Note
that for this report to work you have to insert on your HTML
pages the javascript code you can find in the README file in
the visitors tarball.
--stream
Enable the Stream Mode (see the STREAM MODE DETAILS section for
more information). Shortly: when in stream mode Visitors will
process all the log files specified (possibly none, that’s
valid in this mode) as usual, producing the report. Then the
stream mode is entered and Visitors will start to read from
standard input for a continuous stream of web logs, updating
the statistics incrementally as new data is available. A new
report is produced periodically if new data arrived,
accordingly to the --update-every option (default is to update
the statistics every ten minutes). It’s possible to ask
Visitors to reset the statistics after some period of time
using the --reset-every option. This allows to have a snapshot
of what is going on in the last five minutes, hour, day or
week. Note that --stream requires --output-file because
Visitors needs to overwrite the report for every update, so
can’t output to standard output as usually. If you plan to use
the stream mode, also check the --tail option.
--update-every seconds
By default in Stream Mode statistics are updated every 10
minutes. This option specifies a different period in seconds.
--reset-every seconds
By default in Stream Mode statistics are never reset, but
continuously updated incrementally. This option specifies to
reset statistics after the given amount of time in seconds.
This is useful to have a snapshot of the web site usage.
-f --output-file file
Write output to file instead of stdout.
-m --max-lines number
Set the max number of entries that should be shown in reports
like referers, keyphrases and so on. This option sets all the
reports max number of entries for all the reports at once.
-r --max-referers number
Set the max number of entries in the referer report.
-p --max-pages number
Set the max number of entries in the accessed pages report.
-i --max-images number
Set the max number of entries in the accessed images report.
-x --max-error404 number
Set the max number of entries in the missing documents report.
-u --max-useragents number
Set the max number of entries in the user agents report.
-t --max-trails number
Set the max number of entries in the web trails report.
-g --max-googled number
Set the max number of entries in the crawled pages report
(google bot).
--max-adsensed number
Set the max number of entries in the crawled pages report
(adsense bot).
-k --max-google-keyphrases number
Set the max number of entries in the Google keyphrases report.
-a --max-referers-age number
Set the max number of entries in the referers by date report.
-d --max-domains number
Set the max number of entries in the domains report.
-P --prefix string
Prefixes specify to visitors how a link should look like to be
classified as internal to your site. This option is required
for --trails and will also have the nice effect to avoid that
internal links are shown in the referers report. If you are
analyzing statistics for http://www.your.site.com/, just use:
--prefix http://www.your.site.com
If your site is reachable using more hostnames you should
specify all these, like in the following example:
--prefix http://www.your.site.com --prefix http://your.site.com
-o --output html|text
Output module. You can use text or html. The default is html.
-V --graphviz
This option enables the Graphviz mode: Visitors will analyze
the log file and create a graph describing the access patterns
of your web site. The information used to create the graph is
the same as the web trails report (that you can enable with
--trails), but as a graph it can be more readable for non
trivial sites. An example on how to use this feature:
% visitors access.log --prefix http://www.hping.org \
--graphviz > graph.dot
% dot /tmp/graph.dot -Tpng > graph.png
On Debian systems, the dot command is included in the graphviz
package. The generated graph will have edges of different
colors, from blue to red to specify a low to high level of
popularity of a given movement from one page to another of the
web site. This option requires one or more --prefix options in
order to work, just like the --trails option.
-V --graphviz-ignorenode-google
Don’t put the google node on the generated graph. Only useful
with --trails
-V --graphviz-ignorenode-external
Don’t put the external referer node on the generated graph.
Only useful with --trails
-V --graphviz-ignorenode-noreferer
Don’t put the node indicating requests without referer on the
generated graph. Only useful with --trails
--tail When this option is specified Visitors will emulate the Unix
command tail -f --max-unchanged-stats=1 -q. You can specify the
log file names to monitor for changes, once new data is
appended in any of the specified file, visitors will output the
new data to the standard output. This option is useful
conjunction to the Stream Mode (--stream). Files can be log-
rotated because Visitors in Tail Mode will always try to reopen
the file to check for changes.
--time-delta delta
If your web server is in a different timezone than most of your
visitors or yourself, you will notice a shift in the reports
regarding time and days of week. By default, Visitors will
generate output using the host’s locale. You can use the
--time-delta option in order to adjust the output. Positive
values will shift on the right (toward future) from the given
number of hours, negative values will shift on the left (toward
past). In the future this option may have support to directly
specify the output timezone.
--filter-spam
Filter referer spam using a keyword-based filter (see
blacklist.h for more information on keywords). If you don’t
know what referer spam is check this Wikipedia page:
http://en.wikipedia.org/wiki/Referer_spam
--ignore-404
When this option is turned on log lines with 404 errors are
just used to generate the 404 errors report and not used for
other reports.
--grep pattern
Process only log lines matching the specified pattern.
Patterns are matched using the glob-style matching (the one
used by the unix shell):
* Matches any sequence of characters in string,
including a null string.
? Matches any single character in string.
[chars] Matches any character in the set given by chars. If
a sequence of the form x-y appears in chars, then any
character between x and y, inclusive, will match.
\x Matches the single character x. This provides a way
of avoiding the special interpretation of the
characters *?[]\ in pattern.
For default matching is performed in a case sensitive way, but case
insensitive matching may be forced prefixing the pattern with the
string cs:, so for example the pattern cs:firefox will match all the
log lines containing the string firefox, FireFox, FIREFOX and so on.
--exclude pattern
Works exactly like --grep, but only lines NOT matching the
specified pattern are processed. Note that --grep and --exclude
can be used multiple times, and are processed sequentially.
For example visitors --grep firefox --exclude download will
process only lines including the string firefox but not
including the string download.
--debug Show additional information on errors. For example invalid
lines are printed on the standard error if found. Mainly useful
for developers and error reporting.
-h --help
Show usage and copyright information.
-v --version
Show program version.
EXAMPLES
The simplest usage, to be used interactively when you have a web log to
check (for example over ssh in your web server), just use:
% visitors access.log | less
That will produce a human readable output in text only. To generate
html web stats with much more information you may use instead this:
% visitors --output text -A -m 30 access.log -o html > report.html
If you want information on the usage patterns for your site you must
provide the url prefix of your web site, and specify the --trails
option. The next example produces an HTML report with usage patterns
information.
% visitors -A -m 30 access.log --trails \
--prefix http://www.hping.org > report.html
Note that it’s ok to specify multiple file names, or to provide the
input using the standard input like in the following two examples:
% visitors /var/log/apache/access.log.*
% zcat access.log.*.gz | visitors -
STREAM MODE DETAILS
The usual way to run Visitors is to specify some option to control the
report generation, and the name of log files. For example to generate
a report from two Apache’s access log files you can write:
% visitors -A access.log.1 access.log.2 > report.html
Visitors will analyze the log files, and will output the report.
Sometimes it can be more interesting to have web statistics updated
continuously, almost in real time, as new data is available. In order
to provide this feature Visitors implements a mode called Stream Mode
that reads a stream of logs from the standard input. The following
command line shows how to use it (but check the --stream option
documentation for more information).
% tail -f /var/log/apache/access.log | \
visitors --stream -A --update-every 60 \
--output-file /tmp/report.html
Visitors will incrementally update the statistics as new logs are
available and will update the html report every 60 seconds. As you can
see in this mode is required to specify the report file name using the
--output-file option because Visitors needs to overwrite the report to
update it. Note that instead of the tail command in the above example
it is possible to use instead Visitors in Tail Mode (an emulation for
the tail program):
% visitors --tail /var/log/apache/access.log | \
visitors --stream -A --update-every 60 \
--output-file /tmp/report.html
It’s possible to generate real time statistics about the last N seconds
of web traffic, where N is configurable and can be from few seconds to
one week or more, using the --reset-every option. The following example
generates statistics updated every 30 seconds about the last hour of
traffic:
% visitors --tail /var/log/apache/access.log | \
visitors --stream -A --update-every 30 --reset-every 3600 \
--output-file /tmp/report.html
AUTHORS
Visitors was written by Salvatore Sanfilippo <antirez@invece.org>.
COPYING
Copyright (C) 2004,2005 Salvatore Sanfilippo <antirez@invece.org>.
Visitors is distributed under the GNU General Public License.
This manual page was written (based on the original HTML documentation)
by Romain Francoise <rfrancoise@debian.org> for the Debian GNU/Linux
system, but may be used by others. Salvatore Sanfilippo updated this
man page starting from Visitors 0.5, this manual page is now part of
the Visitors tarball.