NAME
mailliststat - Display useful statistics on email messages
SYNOPSIS
mailliststat [-hvq] [-i file] [-o file] [-r|w|u file] [-t|T text] [-m
mode] [-n XX] [-g xxxx]
DESCRIPTION
MailListStat is a program that prints some "useful" statistical info on
email messages. Its main use is with email conferences (mailing
lists). It currently displays both tables and graphs. You can select
either TEXT or HTML output.
OPTIONS
-h print help text and exit
-q be quiet (print only errors to stderr)
-v turn on verbose mode - in this mode it will print more info to
stderr - indication of progress (will print every 10th, 20th,
..., 90th, 100th, 200th, ..., 900th, 1000th, 2000th ... message
being processed) and warnings about malformed headers found
-i file
name of input file (if not specified, use stdin). This file
should be in MBOX format. It should exist and be readable.
-o file
name of output file (if not specified, use stdout). If it exists,
it will be overwritten.
-r file
read input from cache file instead of mailbox. You can read
input either from mailbox or cache file, not both!
-w file
write cache file (no stats produced). You can either produce
text output or write cache file, not both! When writing cache
file, output-related options are ignored.
-u file
update cache file = read cache, read input, write cache. For use
with .procmailrc/.forward
-t text
name of the mailing list these statistics are computed for. If
specified, it is simply appended to the title of the statistics,
so the title will look like "Statistics from 16.8.2001 to
7.9.2001 for text", where text is whatever you pass as this
parameter (it could be the name of the mailing list or just its
email address, e.g. mobil@mobil.sk).
-T text
title text (only this will be printed as title); this can be
used to suppress the normal title text (date of oldest/newest
msg) and completely replace it with your text.
-m mode
select mode of output (text, html, html2).
-n XX show TOP XX tables (default TOP 10). By default, mailliststat
displays tables of the TOP 10 people, subjects, quote ratios and
so on. This parameter sets how many rows these tables have.
-g xxxx
graphs to show (Day, Week, Month, Year, Xnone) - specify first
letter (e.g. -g dmy).
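The progress rule described for -v (every 10th up to the 90th message,
then every 100th up to the 900th, and so on) can be illustrated with a
small Python sketch. This is only a reading of the description above,
not code from mailliststat; the function name reports_progress is
hypothetical.

```python
def reports_progress(n):
    """Return True if message number n would trigger a progress line.

    Assumed rule: n must be a multiple of the largest power of ten
    that does not exceed n, and n must be at least 10.
    """
    if n < 10:
        return False
    step = 10 ** (len(str(n)) - 1)  # 10 for 10..99, 100 for 100..999, ...
    return n % step == 0


if __name__ == "__main__":
    # List the message numbers below 1100 that would be reported.
    print([n for n in range(1, 1100) if reports_progress(n)])
```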
EXIT STATUS
0 Everything went OK and no error occurred.
1 Error in sscanf() while reading & parsing cache file. It means
that the format of cache file is invalid. Try to create the
cache file again.
2 Invalid command-line option. You have specified an invalid
command-line parameter.
3 Cannot open input/output file. Please check that you have typed
the correct filename and that you have read permission for the
input file and write permission for the destination directory
(because the output file must be created). If the output file
exists, it is overwritten.
4 Not enough memory is available for dynamically allocated
variables. This could be caused by user limits, because
mailliststat requires only a few MB of memory (the amount
depends on the number of messages processed and the number of
different subjects and authors).
5 Error compiling regex. This error should not occur in publicly
released versions.
USAGE
Input
The input should be a mailbox file in standard MBOX format. If the
file is in a different format, the results are unpredictable. It
should contain at least one email message, otherwise no stats can be
computed.
Warning: Make sure the input files contain no special messages (such as
the one with the "DON’T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"
subject), because they will be analysed too. Many programs (POP3/IMAP
daemons, email readers) put their special messages into the mailbox.
Such a message is ignored only when reporting the oldest message found.
Output
Statistics are written to the output file (or to stdout if none is
specified). All diagnostic messages are written to stderr. The output
consists of several kinds of statistical data - tables, graphs and
summaries. The title has two formats depending on the -t parameter. If
it is not specified, the title looks like "Statistics from 16.8.2001 to
7.9.2001", where the first date is that of the oldest message found in
the input and the second is that of the newest one. With, for example,
-t mobil@mobil.sk it will look like "Statistics from 16.8.2001 to
7.9.2001 for mobil@mobil.sk". The problem is that the date of the
oldest or newest message is often wrong (because of bad date/time
settings on the message author’s PC), so you can specify the entire
title using the command-line option -T. When it is used, only your text
is printed as the title, nothing more. You could put there, for
example, "Statistics for mobil@mobil.sk".
The -g option specifies which graphs to show - hours of Day, days of
Week, days of Month, months of Year. Pass the first letters as the
argument to -g (so -g dw prints just hours of Day and days of Week).
Use -g x to disable all graphs. For example, you probably don’t want
the months-of-Year graph when presenting stats for a single month, but
you probably do want it for full-year stats.
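As an illustration of how a -g argument maps to graphs, here is a
minimal Python sketch. It is not taken from the mailliststat source;
the names GRAPHS and parse_graph_option are hypothetical, and raising
an error on an unknown letter is an assumption (this page does not say
how mailliststat reacts to one).

```python
# Letter -> graph, as listed in the description of -g.
GRAPHS = {
    "d": "hours of Day",
    "w": "days of Week",
    "m": "days of Month",
    "y": "months of Year",
}


def parse_graph_option(arg):
    """Return the list of graphs selected by a -g argument such as "dmy"."""
    letters = arg.lower()
    if "x" in letters:
        return []  # -g x disables all graphs
    unknown = [c for c in letters if c not in GRAPHS]
    if unknown:
        # Assumption: unknown letters are treated as an error here.
        raise ValueError("unknown graph letter(s): " + "".join(unknown))
    # Keep a fixed display order regardless of the order typed.
    return [GRAPHS[c] for c in "dwmy" if c in letters]


if __name__ == "__main__":
    print(parse_graph_option("dw"))
```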
HTML output
You can choose between 2 modes of output - TEXT and HTML. In HTML
mode, mailliststat produces its output as an HTML page. When you
specify HTML2 mode, only the body of the HTML document is produced (no
header/footer) - this lets you supply a different HTML header/footer
when calling mailliststat as CGI or when using the PHP wrapper. The
output consists of HTML tables and bar graphs. Almost every aspect of
its appearance can be configured by modifying the CSS style sheet.
Please note that the files style_mls.css and bar.gif must be present in
the same directory as the produced HTML file. You can, however, modify
both to best suit your needs. Everything should be clear after reading
the comments in the CSS file and looking at the produced HTML source.
I was unsure what type of graphs to produce. I also tried horizontal
bar graphs; if you want to try them, just uncomment the relevant part
of the code in PrintGraphHtml() in mls_text.c.
Cache file support
Instead of producing statistics in text format, you can save all the
generated values/results into a "cache" file. Retrieving information
from this file is very fast, so it is useful for integration with web
pages. You can update the cache file as soon as new mail is received,
and users can view current stats by running mailliststat as a CGI
script. The advantage over static stats is that the user can choose
options and the output is generated in a moment!
To update a cache file, use the -u option. It works like this: first,
the stats are loaded from the cache file (which doesn’t have to exist),
then the new message(s) are read from stdin (or from the -i file) and
added to the stats. Finally, the updated stats are written back to the
cache file. The process is really quick, because usually only one
message is added at a time. This is useful mainly for updating cache
files when a new message arrives. In the "examples/" subdir, you can
find examples of integration with your .forward and .procmailrc files.
By running MLS more than once, you can generate cache files for
individual months and also for whole years (see examples). Then use a
PHP script to present the list of these cache files to the user.
The format of cache files changed in version 1.3 because new stats
were added. The file now contains version info, so mailliststat can
tell you when a cache file has to be re-created with the new version.
Unfortunately, you also have to re-create cache files if you want new
email clients to be recognized in old (already processed) messages.
Note that email client detection was buggy in 1.2.2 (a lot of clients
were not recognized).
PHP wrapper
I have also written a PHP wrapper for mailliststat to make it more
"interactive". It has one major advantage over plain HTML output from
mailliststat: the user can choose the number of TOP items to show. It
works by running mailliststat with the appropriate command-line
options. It’s safe, because the only item taken from the user is topXX,
which is checked using a regexp, so running arbitrary code is not
possible. You can also alter mailliststat output - for example, change
@ in email addresses to (at) to deter spammers.
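The two ideas behind the wrapper (validate the single user-supplied
value, then post-process the output) can be sketched in Python. This is
not the PHP wrapper itself: the function names, the 1-3 digit limit and
the fallback to 10 are assumptions made for illustration. Only the -r,
-m and -n options come from this page.

```python
import re


def safe_top_count(user_value, default=10):
    """Accept user_value only if it is 1 to 3 decimal digits.

    Assumption: anything else silently falls back to the default.
    """
    if user_value is not None and re.fullmatch(r"[0-9]{1,3}", user_value):
        return max(1, int(user_value))
    return default


def build_command(cache_file, user_value):
    """Build an argument list (no shell involved) for reading a cache file."""
    return ["mailliststat", "-r", cache_file, "-m", "html2",
            "-n", str(safe_top_count(user_value))]


def obfuscate_addresses(output):
    """Replace @ with (at) in the generated output."""
    return output.replace("@", "(at)")


if __name__ == "__main__":
    print(build_command("list.cache", "25"))
    print(obfuscate_addresses("marki@nexin.sk"))
```

Passing the arguments as a list rather than a single shell string means
that even an unexpected value could not be interpreted by a shell; the
digit check is then a second line of defence.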
You can use a normal MBOX file as input, but I recommend using a cache
file, because then the stats are produced in a moment. To see how long
the page took to generate, look at the last line of the HTML source.
However, there is a minor speed problem: it takes longer when you ask
for many topXX items (like 999). The cause is the regexp that searches
for @. It has to search the whole mailliststat output at once, and when
the output is large, that takes a while (1.1 seconds on my 2.1 GHz
Pentium 4). I have added an option that uses the Perl-compatible regex
function (preg_replace) instead of the POSIX one (ereg_replace), if
available. This results in MUCH faster execution (50 ms instead of
1.1 s).
NOTES
How is it all computed?
OK, so let’s start from the beginning - the format of an MBOX file.
It’s a plain text file containing email messages delimited by one empty
line. Each message starts with a line like this: "From abc@a.sk Thu Aug
16 15:48:58 2001". After this line come a few headers, one empty line
and the message text. Storing email in this format is quite common -
your incoming mail is usually saved in MBOX format, and so are your
folders in mail readers like elm(1), pine(1), mutt(1)...
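The layout described above can be illustrated with a short Python
sketch that splits MBOX text into messages. It is a simplification for
illustration, not mailliststat's parser: it treats every line starting
with "From " as the start of a new message, and the names split_mbox
and SAMPLE are hypothetical.

```python
def split_mbox(text):
    """Split MBOX text into (from_line, header_lines, body_lines) tuples."""
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):  # note: "From:" headers do not match
            if current is not None:
                messages.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)

    parsed = []
    for lines in messages:
        from_line, rest = lines[0], lines[1:]
        # The first empty line separates the headers from the message text.
        if "" in rest:
            cut = rest.index("")
            headers, body = rest[:cut], rest[cut + 1:]
        else:
            headers, body = rest, []
        parsed.append((from_line, headers, body))
    return parsed


SAMPLE = """From abc@a.sk Thu Aug 16 15:48:58 2001
From: Abc <abc@a.sk>
Subject: Hello

First message text.

From def@b.sk Fri Aug 17 09:00:00 2001
From: def@b.sk
Subject: Re: Hello

> First message text.
Second message text.
"""

if __name__ == "__main__":
    for from_line, headers, body in split_mbox(SAMPLE):
        print(from_line, "|", len(headers), "headers |", len(body), "body lines")
```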
Who is the author of an email message? The author is taken from the
From: header field, and everything except the actual email address
(such as the full name) is stripped off using a quite simple regular
expression (regexp).
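For comparison, the same reduction of a From: header to a bare address
can be sketched in Python. Note the substitution: instead of a simple
regexp like the one MLS uses, this sketch relies on the standard
library's email.utils.parseaddr; the function name author_address is
hypothetical.

```python
from email.utils import parseaddr


def author_address(from_header):
    """Return only the email address part of a From: header value."""
    _display_name, address = parseaddr(from_header)
    return address


if __name__ == "__main__":
    for value in ("Marek Podmaka <marki@nexin.sk>",
                  "mobil@mobil.sk",
                  "abc@a.sk (Full Name)"):
        print(author_address(value))
```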
The subject is taken from the Subject: header field. Any leading Re:
prefixes are stripped off (up to 5 of them). The counted format
(Re[3]:), used for example by The Bat! email client, is also supported.
MIME-decoding is applied to subject lines (see below).
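A minimal Python sketch of this kind of subject normalization follows.
The regular expression and the names RE_PREFIX and normalize_subject
are assumptions made for illustration; MLS may match prefixes
differently (for instance with regard to letter case or spacing).

```python
import re

# One leading "Re:" or counted "Re[3]:" prefix, matched case-insensitively.
RE_PREFIX = re.compile(r"^\s*re(\[\d+\])?:\s*", re.IGNORECASE)


def normalize_subject(subject, max_prefixes=5):
    """Strip up to max_prefixes leading reply prefixes from a subject."""
    for _ in range(max_prefixes):
        stripped = RE_PREFIX.sub("", subject, count=1)
        if stripped == subject:
            break  # nothing left to strip
        subject = stripped
    return subject.strip()


if __name__ == "__main__":
    print(normalize_subject("Re: Re[3]: Hello"))
```

With the subjects normalized this way, a reply and the message it
answers are counted under the same subject.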
The date is simply whatever is in the Date: header. This header is
generated by the email client, so it is the date the message was
created, and it doesn’t have to be present in every message. If it
isn’t, you are warned by a message like "Warning: 1 message(s) not
counted." in the output. Some clients don’t put the full date there
(usually the day of week is missing), and you are warned about that
too. No timezones are considered; the date is taken as-is.
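Taking the date "as-is" can be illustrated in Python with
email.utils.parsedate, which returns the fields as written and ignores
the timezone offset. This is a sketch, not the parser MLS uses; the
function names are hypothetical, and the d.m.yyyy format simply mirrors
the title examples on this page.

```python
from email.utils import parsedate


def message_date(date_header):
    """Return (year, month, day, hour) as written in the header, or None."""
    parsed = parsedate(date_header)
    if parsed is None:
        return None  # missing or unparseable: the message is not counted
    year, month, day, hour = parsed[:4]
    return year, month, day, hour


def title_date(date_header):
    """Format a date the way the statistics title shows it, e.g. 16.8.2001."""
    fields = message_date(date_header)
    if fields is None:
        return None
    year, month, day, _hour = fields
    return f"{day}.{month}.{year}"


if __name__ == "__main__":
    print(title_date("Thu, 16 Aug 2001 15:48:58 +0200"))
```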
Message size is everything between the end of the message header and
the beginning of the next message (or end of file). So only the actual
size of the message text (body) is counted, not the headers.
Email clients are taken from the X-Mailer:, User-Agent: or
X-Newsreader: headers, and some grouping is done to prevent different
versions of the same mailer from taking over the whole TOP 10. There is
also a work-around for the Pine mailer (MLS also searches the
Message-ID: header).
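One possible way to group versions of the same mailer is to cut the
header value at the first thing that looks like a version number. The
Python sketch below does that; the rule, the regular expression and the
names are all assumptions for illustration, and the grouping MLS
actually performs (including its Pine work-around) may differ.

```python
import re

# Headers are checked in the order they are listed on this page.
CLIENT_HEADERS = ("X-Mailer", "User-Agent", "X-Newsreader")

# Assumed rule: a version starts at the first digit, optionally preceded
# by whitespace, an opening bracket and a "v".
VERSION_START = re.compile(r"\s*[\(\[]?v?\d")


def client_group(header_value):
    """Reduce a mailer string to a version-independent group name."""
    match = VERSION_START.search(header_value)
    name = header_value[:match.start()] if match else header_value
    name = name.rstrip(" /-")
    return name or header_value.strip()


def pick_client(headers):
    """Return the client group from a dict of headers, or None if absent."""
    for header in CLIENT_HEADERS:
        if header in headers:
            return client_group(headers[header])
    return None


if __name__ == "__main__":
    print(pick_client({"X-Mailer": "Microsoft Outlook Express 5.50.4133.2400"}))
```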
What is quoting? Why is mine 95%?
What is quoting? When you reply to a message, you can insert part of
the original message in your reply; you are quoting the author of the
original message. Every line of original text is usually prefixed with
> or MP>, where MP are the initials of the original sender’s name (for
example, The Bat! uses this second format).
And what is the "quote ratio"? It’s the size of the quoted text divided
by the total size of the message, expressed in percent. It’s included
in the stats because many people reply to a message by adding one line
of text and leaving, say, 10 pages of original text in place, which
pushes the quote ratio above 90%! In the days of FIDONET, there were
conferences where a quote ratio higher than 50% was forbidden. Try to
think about it when replying to a message in a mailing list where more
than 300 people will download and read it.
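The quote ratio defined above can be sketched in a few lines of Python.
The rule used to recognize a quoted line (optional whitespace, up to
three letters of initials, then >) is an assumption chosen to cover the
two formats mentioned; it is not necessarily the test MLS applies.

```python
import re

# A quoted line starts with ">" or with initials followed by ">" (e.g. "MP>").
QUOTE_LINE = re.compile(r"^\s*[A-Za-z]{0,3}>")


def quote_ratio(body):
    """Return quoted size divided by total size of the body, in percent."""
    total = 0
    quoted = 0
    for line in body.splitlines(keepends=True):
        size = len(line.encode("utf-8"))  # sizes are counted in bytes
        total += size
        if QUOTE_LINE.match(line):
            quoted += size
    return 100.0 * quoted / total if total else 0.0


if __name__ == "__main__":
    reply = "> original line one\n> original line two\nI agree.\n"
    print(round(quote_ratio(reply), 1), "%")
```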
And now all the stats
First come the TOP 10 tables (or TOP XX when using the -n XX
parameter). The first table shows the people who have written the most
messages, how many they wrote and what percentage of the total message
count that is. The last row shows "other" - the number of messages
written by everyone not listed above and the percentage it represents.
The second and third tables are similar - they also rank authors, but
not by the number of messages written. Authors are sorted by the total
(or average) size of all their messages, excluding quoting (size of the
message minus how much was quoted in it). The next table shows the most
successful subjects and how many messages were posted with each
subject. The following table shows the most used email clients. The
last table shows the people with the highest quote ratio, computed as
the sum of quoted text in all of a person’s messages divided by the
total size of those messages. Its last row shows an average - the sum
of quoted text in all messages divided by the total size of all
messages.
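The structure of the first table (top authors by message count, plus a
final "other" row) can be sketched in Python as follows. The function
name top_table and the tuple layout are hypothetical; this only
illustrates the arithmetic described above.

```python
from collections import Counter


def top_table(authors, top_n=10):
    """Return rows of (author, message_count, percent_of_total).

    authors is a list with one entry per message. The last row is
    "other": everything written by authors not listed above it.
    """
    counts = Counter(authors)
    total = sum(counts.values())
    if total == 0:
        return []
    rows = [(name, n, 100.0 * n / total)
            for name, n in counts.most_common(top_n)]
    other = total - sum(n for _name, n, _pct in rows)
    rows.append(("other", other, 100.0 * other / total))
    return rows


if __name__ == "__main__":
    sample = ["a@a.sk"] * 5 + ["b@b.sk"] * 3 + ["c@c.sk"] * 2
    for name, count, percent in top_table(sample, top_n=2):
        print(f"{name:10} {count:3} {percent:6.2f} %")
```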
The next part of the stats consists of graphs. They show how many
messages were written during different hours of the day, days of the
month and days of the week. From these you can see, for example, when
(and how much) people sleep :) or whether they work during working
hours or just write tons of messages...
The next part contains info about messages which are BEST in something -
the message with the highest quote ratio, the longest message and some
details about the most successful subject.
At the end, there is a final summary - the total number of messages,
their total and average size and the number of different authors and
subjects.
MIME (Multipurpose Internet Mail Extensions)
What is it? The original email standard permitted only 7-bit ASCII
messages. Over time, the need arose to send international text or even
binary files. MIME defines how these can be encoded into a 7-bit form
suitable for emailing and how to decode them back into human-readable
form. In an email message, the text (body of the message) can be
MIME-encoded, but so can some headers - for example the subject and
the From field. MLS tries to find out whether subject lines are
MIME-encoded and, if so, tries to decode them to present them to you in
human-readable form. You can read more about MIME in RFC 1521 and 1522.
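To show what decoding a MIME-encoded subject involves, here is a small
Python sketch using the standard library's email.header module. It is
not how MLS decodes headers; in particular, this sketch also converts
the character set to Unicode, which MLS lists as not yet done (see
BUGS/TODO). The function name decode_subject is hypothetical.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Decode MIME encoded-words in a Subject: value into readable text."""
    return str(make_header(decode_header(raw_subject)))


if __name__ == "__main__":
    # "=?charset?Q?...?=" is quoted-printable; "=E9" is one encoded byte.
    print(decode_subject("=?iso-8859-1?Q?Caf=E9?="))
    print(decode_subject("Plain subject"))
```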
Inspiration
I was inspired by a similar DOS program used a few years ago in FIDONET
and the Slovak ULTRANET. It was created by Ivan Friedlander.
BUGS/TODO
· doesn’t support header fields split across multiple lines (you
can use formail(1) to join them into one line before using MLS)
· charset conversion in MIME-decoding
· more stats
VERSION
This man page is written for mailliststat version 1.3.
AUTHOR
mailliststat (MailListStat) is written by Marek -Marki- Podmaka
<marki@nexin.sk>.
SEE ALSO
Visit http://freshmeat.net/projects/mls for more information and latest
version of mailliststat.
COPYING
MailListStat - print useful statistics on email messages Copyright (C)
2001-2003 Marek Podmaka <marki@nexin.sk>
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA