NAME
harvestman - multithreaded desktop webcrawler written in Python
SYNOPSIS
harvestman [options] [-C configfile]
DESCRIPTION
HarvestMan is a desktop WebCrawler written completely in the Python
programming language. It allows you to download a whole website from
the Internet and mirror it to the disk for browsing offline. HarvestMan
has many customizable options for the end-user. HarvestMan works by
scanning a web page for links that point to other web pages or files.
It downloads the files and copies them to the disk. HarvestMan
maintains the directory structure of the remote website when it mirrors
the website to the disk. Every html file is scanned like this
recursively, till the whole website is downloaded.
Once the download is complete, the links in downloaded html files are
localized to point to the files on the disk. This makes sure that when
the user browses the downloaded pages, he does not need to connect to
the Internet again. If any file failed to get downloaded for some
reason, HarvestMan will convert its relative Internet address to point
to the complete Internet address, so that the user will be connected to
the Internet when he clicks on the link, and does not get a dead-link
error. (404 error).
From version 1.2, HarvestMan uses two families of threads, the "Fetchers"
and the "Getters", for downloading. The Fetchers are threads which have
the responsibility of crawling webpages and finding links and the
Getters are threads which download those links (the non-html files).
HarvestMan, as of the latest version, is a console application. It can be
launched by running the HarvestMan script (HarvestMan.py) if you are
using the source code, or the HarvestMan executable if you are using
the binary distribution (available on Win32 platforms). It prints informational
messages to the console while it is working. These messages can be used
to debug the program and locate any errors.
HarvestMan works by reading its options either from the command line or
from a configuration file. The configuration file is named "config.xml"
by default.
A major change from HarvestMan 1.5 onwards is that the
configuration is now in an XML file called "config.xml". You can also
use the convertconfig.py script, present in HarvestMan/tools/ of your
installation to convert your configuration from text to XML and vice
versa. For full details, see the Changes.txt file and the website
at http://harvestmanontheweb.com.
HarvestMan writes a binary project file using the python pickle
protocol. This project file is saved under the HarvestMan base
directory with the extension .hbp. This is a complete record of all the
settings which were used to start HarvestMan and can be read back later
using the --projectfile option to restart a HarvestMan project.
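As a rough illustration of the idea only (not HarvestMan's actual file layout), a settings dictionary can be written to a binary file and read back with the standard pickle module. The helper names save_project and load_project and the file name demo.hbp are assumptions made for this sketch.

```python
import os
import pickle
import tempfile


def save_project(settings, path):
    """Write the settings mapping to a binary project file."""
    with open(path, "wb") as fh:
        pickle.dump(settings, fh)


def load_project(path):
    """Read a settings mapping back from a project file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Hypothetical settings; only the variable names come from this document.
    settings = {"project.name": "demo", "project.url": "http://www.foo.com/bar"}
    path = os.path.join(tempfile.gettempdir(), "demo.hbp")
    save_project(settings, path)
    print(load_project(path) == settings)
```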
MODES OF OPERATION
HarvestMan has two major modes of operation: a fully
multithreaded mode, also called fast mode, and a single-threaded slow mode.
Fast Mode
Fast Mode is the most useful mode of HarvestMan. In this mode,
HarvestMan launches multiple threads for each url link, and
stores them in an internal queue. Also, HarvestMan will launch a
separate download thread for each non-html file encountered.
This process is very fast and you can download websites very
quickly using this mode as multiple downloads occur at the same
time.
This mode is the default. You can use this mode if you have a
relatively large bandwidth, and a reliable connection to the
Internet.
Since HarvestMan is network-bound, using multiple threads speeds
up the download.
Slow Mode
In the Slow Mode, download of websites happens in a single
thread, the main program thread. Each download will have to
wait for the previous one to get completed, so this is a
relatively slow process. You can use this mode, if you have an
unreliable Internet connection or a relatively small bandwidth,
which does not support opening of multiple sockets at the same
time.
This mode is disabled by default. You can enable it by setting
the variable system.fastmode in the configuration file to zero
(see Mode Selection under ADVANCED OPTIONS).
If you see a lot of "Socket" type errors when you launch a
HarvestMan project by using the default mode (fastmode), switch
to this mode. This would give you a very reliable download,
though a slow one.
USAGE
As said earlier, HarvestMan reads its options from a configuration file
or from the command line. The configuration file by default is named
"config.xml". You can pass another configuration file name to the
program by using the command line options -configfile/-C.
From version 1.1, HarvestMan can also read back previous
project files by using the command line option -projectfile.
We will first discuss the structure of the configuration file and how
it can be used to create a HarvestMan project. For more information on
the command line arguments, run the program with the -help or -h
option.
CONFIGURATION FILE
The configuration file is a simple text file with many options which
are a pair of variable/value strings separated by tabs or spaces. Each
variable/value pair appears in a separate line. Comments can be added by
placing the hash character ’#’ before any line.
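A minimal sketch of reading the plain-text form described above, assuming one variable/value pair per line separated by tabs or spaces, with lines starting with '#' treated as comments. parse_config is a hypothetical helper, not part of HarvestMan, and the sample values are invented.

```python
def parse_config(text):
    """Parse 'variable value' lines into a dict, skipping comments and blanks."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of tabs/spaces; the value itself may contain spaces.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        options[name] = value
    return options


sample = """
# basic options
project.name     myproject
project.url      http://www.foo.com/bar/index.html
project.basedir  ~/websites
"""
print(parse_config(sample))
```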
HarvestMan has three basic options and some 50 advanced options.
BASIC OPTIONS
HarvestMan needs three basic configuration options to work. These are
described below:
project.name: This is the project name of the current download.
HarvestMan creates a directory of this name in its base directory
(described below) where it keeps all the downloaded files. The project
name needs to be a non-empty string. (Spaces are allowed.)
project.url: This is the starting url for the program from where it
starts download. HarvestMan supports the WWW/HTTP/HTTPS/FTP protocols
in this url. If a url does not begin with any of these, it will be
considered as an HTTP url. For example, http://www.python.org,
www.yahoo.com, cnn.com
project.basedir: This is the base directory for the program where it
creates the project directories and stores all downloaded files. If
this directory does not exist, HarvestMan will attempt to create it.
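The rule for project.url (a url that names no protocol is treated as HTTP) can be sketched as follows. normalize_start_url is a hypothetical name, and the real program may normalize urls differently.

```python
def normalize_start_url(url):
    """Prefix urls that name no supported protocol with http://."""
    if url.lower().startswith(("http://", "https://", "ftp://")):
        return url
    # Covers both "www.yahoo.com" and bare names such as "cnn.com".
    return "http://" + url


for candidate in ("http://www.python.org", "www.yahoo.com", "cnn.com"):
    print(candidate, "->", normalize_start_url(candidate))
```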
ADVANCED OPTIONS
For precisely configuring your download, HarvestMan supports about 30
advanced options. You will need to use many of them, if you would like
to control your download exactly the way you want. The following
section describes each of these settings and what they do. Read on.
The Fetchlevel setting
From Version 1.2, there is a change in this setting. Read on.
This is one of the most useful options to tweak in a HarvestMan
project. The option is controlled by the variable
download.fetchlevel in the configuration file.
Make sure you read the following documentation very carefully.
When you are downloading files from a website, you would prefer
to limit your download to certain areas of the Internet. For
example, you might want to download all links pointed by the url
http://www.foo.com/bar (a hypothetical example), that come under
the www.foo.com web server. Or you might want to download all
links under the directory path http://www.foo.com/pics and no
more. You can use this option to do exactly that.
The option download.fetchlevel has 5 possible values that range
from 0 - 4.
A value of 0 limits the download to a directory path from where
you start your download. For example, if your starting url was
http://www.foo.com/bar/index.html, this option makes sure that
all links downloaded will be belonging to the directory url path
http://www.foo.com/bar and below it. Any web links pointing to
directories outside or other web servers would be ignored.
A value of 1 limits the download to the starting server, but
does not limit it to paths below the starting directory.
For example, if your starting url was
http://www.foo.com/bar/index.html, this option would also
download files from the http://www.foo.com/other/index.html
page, since it belongs to the starting server.
A value of 2 performs the next level fetching. It allows all
paths in the starting server, and also all urls external to the
starting server, but linked directly from pages in the starting
server. For example, if your starting url
http://www.foo.com/bar/index.html contained a link to
http://www.foo2.com/bar2/index.html (an external server),
HarvestMan will try to download this link also. But all urls
linked from this link, i.e. from
http://www.foo2.com/bar2/index.html, would be ignored.
A value of 3 performs a fetching similar to above, but the
difference is that it does not get files which are linked
outside the directory of the starting url, but gets the external
links which are linked one level from the starting url. For
example, if your starting url
http://www.foo.com/bar/index.html contained a link to
http://www.foo2.com/bar2/index.html (an external server),
HarvestMan will try to download this link also. But a url like
http://www.foo.com/other/index.html (a link outside the
starting url’s directory) will be ignored.
A value of 4 places no restriction on the fetching process. It
will allow all web pages to be downloaded, including web pages
linked from external server links, encountered in the starting
url’s page. Setting this option will mostly result in the
crawler trying to crawl the entire Internet, assuming that your
starting url has links to other outside servers. Set this
option, only if you are very sure of what you are doing. Any
value above 4 has no special meaning, and would behave just like
above.
For most downloads, this value can be specified between 0 and 2.
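The five fetch levels can be summarized as a single decision function. This is a sketch of the rules as described above, not HarvestMan's own code. The function names are hypothetical, a path without a trailing slash is assumed to name a file, and for level 3 "one level from the starting url" is assumed to mean that the linking page lies in the starting directory.

```python
import posixpath
from urllib.parse import urlsplit


def _server(url):
    return urlsplit(url).netloc.lower()


def _directory(url):
    """Directory part of the url path, always ending with '/'."""
    path = urlsplit(url).path or "/"
    directory = path if path.endswith("/") else posixpath.dirname(path)
    return directory.rstrip("/") + "/"


def in_start_dir(url, start_url):
    """True when url is on the starting server, in or below the starting directory."""
    return (_server(url) == _server(start_url)
            and _directory(url).startswith(_directory(start_url)))


def allowed(url, parent_url, start_url, fetchlevel):
    """Decide whether url, found on the page parent_url, may be downloaded."""
    same_server = _server(url) == _server(start_url)
    inside = in_start_dir(url, start_url)
    if fetchlevel <= 0:
        return inside
    if fetchlevel == 1:
        return same_server
    if fetchlevel == 2:
        # External urls only when linked directly from the starting server.
        return same_server or _server(parent_url) == _server(start_url)
    if fetchlevel == 3:
        # Assumption: external urls only when linked from the starting directory.
        return inside or (not same_server and in_start_dir(parent_url, start_url))
    return True  # level 4 and above: no restriction
```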
The Depth Setting
This is another setting that gives you control over your
download. It is denoted by the variable control.depth in the
configuration file.
This value specifies the distance of any url from the starting
url’s directory in terms of the directory path offset. This is
applicable only to the directories (links) in the starting
server, below the starting url’s directory. The default value is
10.
If a directory is found whose offset is more than this value,
any links under it will not be downloaded.
You can specify zero depths in which case the download will be
limited to files just below the directory of the starting url.
Examples: If the starting url is
http://www.foo.com/bar/foo.html, then the url
http://www.foo.com/bar/images/graphics/flowers/flower.jpg has a
depth of 3 relative to the starting url.
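The depth in the example above can be computed as the number of directories between the starting url's directory and the directory of the target url. url_depth is a hypothetical helper written for this sketch; it returns None for urls that lie outside the starting directory.

```python
import posixpath
from urllib.parse import urlsplit


def url_depth(url, start_url):
    """Directory offset of url below the starting url's directory."""
    start_dir = posixpath.dirname(urlsplit(start_url).path) or "/"
    url_dir = posixpath.dirname(urlsplit(url).path) or "/"
    relative = posixpath.relpath(url_dir, start_dir)
    if relative == ".":
        return 0  # same directory as the starting url
    parts = relative.split("/")
    if ".." in parts:
        return None  # not below the starting directory
    return len(parts)


print(url_depth(
    "http://www.foo.com/bar/images/graphics/flowers/flower.jpg",
    "http://www.foo.com/bar/foo.html",
))
```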
The External Depth Setting
This option also helps you to control downloads. It is denoted
by variable control.extdepth in the configuration file.
This value specifies the distance of a url from its base server
directory. This is applicable to urls which belong to external
servers and to urls outside the directory of the starting url.
If a directory is found whose distance from the base server path
is more than this value, any files under it will be ignored.
Note that this option does not support the notion of zero depth.
A valid value for this has to be greater than or equal to one.
Examples: The url http://www.foo.com/bar/images.html has an
external depth of 1 relative to the base server directory,
http://www.foo.com.
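Under the same reading, the external depth is the number of directories between the server root and the file. external_depth is a hypothetical helper, and it assumes that a path without a trailing slash names a file.

```python
from urllib.parse import urlsplit


def external_depth(url):
    """Count the directories between the server root and the file."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Drop the last segment when the path names a file rather than a directory.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


print(external_depth("http://www.foo.com/bar/images.html"))
```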
The External Servers Setting
This option tells the program whether to follow links belonging
to outside web servers. This is denoted by the variable
control.extserverlinks. By default, the program ignores external
server links.
The option has lower precedence than the download.fetchlevel
setting. If download.fetchlevel is set to a value of 2 or
above, this setting is ignored.
The External Directories Setting
This option tells the program whether to follow links and download
files belonging to outside directories, i.e. directories external to
the directory of the starting url. This is denoted by the option
control.extpagelinks in the configuration file.
The default value is 1 (Enabled). The download.fetchlevel
setting has precedence over this value. If download.fetchlevel
is set to a value of 1 or more, this setting is conveniently
ignored.
The Images Setting
Tells the program whether to download images linked to
pages. Enabled by default. This option is denoted by the
variable download.images in the configuration file.
The Html Setting
Tells the program whether to download html files. Enabled by
default. Denoted by the variable download.html.
Maximum limit of External Servers
You can put a check on the number of external servers from which
you want to download files, by setting this option to a
non-zero value. It takes precedence over the download.fetchlevel
setting. This option is controlled by the variable
control.maxextservers in the configuration file. The default
value is zero which means that this option is ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on External Directories
You can put a check on the number of external directories from
which you want to download files, by setting this option to
a non-zero value. It takes precedence over the
download.fetchlevel setting. This option is controlled by the
variable control.maxextdirs, in the configuration file.
The default value is zero which means that this option is
ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on Number of Files
You can precisely control the number of total files you want to
download by setting this option. It is denoted by the variable,
control.maxfiles. The default value is 3000.
Default download of images
This option tells the program to always fetch images linked from
pages, though they might be belonging to external
servers/directories or might be violating the depth rules.
This option takes precedence over the
control.extpagelinks/control.extserverlinks settings and the
control.depth/control.extdepth settings.
The download.images setting has a higher precedence than this
setting.
This option is enabled by default. Denoted by the variable
download.linkedimages.
Default download of style sheets (.css files)
Same as the above option, except that this option checks for
stylesheet (css) links. This has higher precedence than
control.extpagelinks/control.extserverlinks and the
control.depth/control.extdepth settings. Enabled by default.
This option is denoted by the variable
download.linkedstylesheets.
Maximum thread setting
This option sets the maximum number of separate threads (trackers)
launched by the program at a time. This is not an exact
setting: it does not mean that this many connections are running
at any given moment, only that the program cannot launch more
threads than this limit.
This option makes sense only in multithreaded downloads, i.e,
only when the program is running in fastmode. In slowmode, this
setting has no effect.
Denoted by the variable system.maxtrackers. The default value is
10.
Separate threads for file download
This option controls the multithreaded download of non-html
files in the fastmode. In fastmode, separate download threads
are launched to retrieve non-html files. If you disable this
option, these files will be downloaded in the main thread of the
downloader thread.
By default, this option is enabled. You can tweak it by the
variable system.usethreads.
Mode Selection
As described in the beginning, there are two modes for
HarvestMan, the fast one and the slow one. This option allows
you to choose your mode of operation.
The variable for this option is system.fastmode. The default
value is 1, which means that the program uses fastmode. To
disable fastmode, and switch to slowmode, set this variable to
zero.
Size of the thread pool
This value controls the size of the thread pool used to download
non-html files when the program runs in fastmode and
system.usethreads is enabled. The default value is 10.
This option is controlled by the variable system.threadpoolsize.
It makes sense only if the program is running in fastmode and
the system.usethreads option is enabled.
Timeout value for a thread
This specifies the timeout value for a single download thread.
The default value is 200 seconds. Threads which overrun this
value are eventually killed and cleaned up.
This option is controlled by the variable system.threadtimeout.
This value is ignored when you are running the program in
slowmode, without using multiple threads.
Robot Exclusion Protocol
The Robot Exclusion Principle control flag. This tells the
spider whether to follow rules specified by the robots.txt file
on some web servers. Enabled by default.
We advise you to always enable this option, since it shows good
Internet etiquette and respect for the download rules laid down
by webmasters of sites. Disable it after reading any legalities
laid down by the website, according to your discretion. We are
not responsible for any eventuality that arises from a user
violating these rules. (See LICENSE.txt file.)
The variable for this value is control.robots.
Proxy Server Support
HarvestMan is written taking into account corporate users (like
the authors!) who connect to Internet from behind
firewalls/proxies. Such users should set this option to the IP
address/name of their proxy server with the proxy port appended
to it.
The variables for this option are network.proxyserver and
network.proxyport. Set the first one to the ip address/name of
your proxy server and the second one to its port number.
Default values: proxy and 80.
Note: If you are creating the configuration file using the
script provided for that purpose, the proxy server string would
be encrypted and does not appear in plain text in the
configuration file.
Proxy Authentication Support
HarvestMan also supports proxies that require user
authentication.
The variables for this are network.proxyuser and
network.proxypasswd.
Note: If you are creating the configuration file using the
script provided for that purpose, these values would be
encrypted and do not appear in plain text in the configuration
file.
Intranet Crawling
This option is disabled from version 1.3.9 onwards since
HarvestMan can now intelligently figure out whether a url is in
the intranet or internet by trying to resolve the host name in
the url. Hence the option is not required anymore.
From version 1.3.9, we can mix urls in the internet/intranet in
the same project.
Renaming of Dynamically Generated Files
Dynamically generated files (images/html) will usually have file
extensions that bear no connection to their actual content. You
will not be able to open these files correctly, especially on
the Windows platform which depends on file extensions to launch
applications. This option will tell HarvestMan to try to rename
these files by looking at their content. HarvestMan will also
appropriately rename any link which points to these files.
This option right now works well only for gif/jpeg/bmp files.
Disabled by default.
The variable for this option is download.rename.
Console Message Settings
HarvestMan prints out a lot of informational messages to the
console while it is running. These can be controlled by the
project.verbosity variable in the configuration file. This value
ranges from 0 to 5.
The default value is 2.
Here is each value and a description of its meaning to the
program.
0: Minimal messages, displays only the Program
Information/Copyright.
1: Basic messaging, displays above, plus information on the
current project including the statistics.
2: More messaging, displays above, plus information on each url
as it is being downloaded.
3: Extended messaging, displays above, plus information on each
thread that is downloading a certain file. Also displays thread
killing/joining information and directory creation, file
saving/deletion information.
4: Debug messaging, displays above, plus debugging information
for the programmer. Not recommended for the end-user.
5: Extended debug messaging, displays maximal messages,
including the debug information from the web page parser. (Use
this at your own risk!)
Please note that these guidelines are flexible and can change as
new versions are being developed, especially the behavior of
values from 3 - 5.
Filters
HarvestMan allows the user to refine downloads further by
specifying filtering options for urls. These are of two kinds:
1. Filters for urls (plain vanilla links), which are controlled
by the control.urlfilter variable.
2. Filters for external servers, which are controlled by the
control.serverfilter variable.
The filter strings are a kind of regular expression. They are
internally converted to python regular expressions by the
program.
Writing filter regular expressions
a. URL Filters (for the control.urlfilter setting)
URL filters supported by HarvestMan are of 3 types. These are:
1. Filename extensions 2. Servers/urls 3. Servers/urls +
filename extensions
An example of the first type is *.gif
Examples of the second type are: www.yahoo.com, */advocacy/*,
*/images/sex/*, */avoid.gif, ad.doubleclick.net/*
Examples of the third type are: /images/*.gif,
ad.doubleclick.net/images/*.jpg, yimg.yahoo.com/*.gif
You can build a ’no-pass’ (block) filter by prepending a regular
expression as described above with a ’-’ (minus) sign. (Example:
-*.gif).
You can build a ’go-through’ (allow) filter by prepending a
regular expression as described above with a ’+’ (plus) sign.
(Example: +*.gif).
You can concatenate regular expressions of the block/allow kind
and create custom url filters.
Example: (Block all jpeg images, as well as all urls containing
"/images/" in their path, but always allow the path
"/preferred/images/"):
-*.jpg+*/preferred/images/*-*/images/*
Example: (Block all gif files from the server
"toomanygifs.com"):
-toomanygifs.com/*.gif
Example: (Block all files with the name "bad.jpg" from all
servers.)
-*/bad.jpg
Example: (Block all jpeg/gif/png images but allow pdf/doc/xls
files.):
-*.jpg-*.jpeg-*.gif-*.png+*.pdf+*.doc+*.xls
If there is a collision between the results of an inclusion
filter and an exclusion filter, the program gives precedence to
the decision of the filter which comes first in the filter
expression. If there is still ambiguity, the inclusion filter is
given precedence.
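The way a concatenated filter string could be split and applied is sketched below. This is an illustration under stated assumptions rather than HarvestMan's own conversion: the helper names are hypothetical, each "*" is translated to ".*", patterns are matched anywhere in the url, patterns themselves may not contain '+' or '-', the first matching filter decides, and urls matching no filter are allowed.

```python
import re


def parse_filter(expression):
    """Split a filter string into (allow, compiled_regex) pairs, in order."""
    rules = []
    for sign, pattern in re.findall(r"([+-])([^+-]+)", expression):
        # Translate the wildcard and escape every other character literally.
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
        rules.append((sign == "+", re.compile(regex)))
    return rules


def url_allowed(url, rules):
    """Return the decision of the first rule that matches; allow if none match."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return True


rules = parse_filter("-*.jpg+*/preferred/images/*-*/images/*")
print(url_allowed("http://www.foo.com/preferred/images/a.png", rules))
print(url_allowed("http://www.foo.com/images/a.png", rules))
```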
b. Server filters (for the control.serverfilter setting)
If you are enabling fetching links from external servers, you
can write a server filter in a similar way to url filters. This
also allows you to write no-pass and go-through filters. The
main difference is that in urlfilters, the character "*" is
ignored, whereas in server filters, this matches any character
or sequence of characters.
Example: Block all files from the server adserver.com:
-adserver.com/*
Example: Block all files from the server niceimages.com in the
path /advertising/, but allow all other paths.
-*niceimages.com/*/advertising/*
Note that the control.serverfilter if specified, is checked
before control.urlfilter. So any result of the
control.serverfilter setting takes precedence.
Retrieval of failed links
Tells the program whether to try refetching links that failed to
retrieve at the end. Retry will be attempted by the number of
times specified by this variable’s value.
Retry will be attempted after a gap of 0.5 seconds after the
first attempt for every url that failed due to a non-fatal
error. Also retry will be attempted for all failed links once
again at the end of the mirroring.
This option is controlled by the variable download.retryfailed.
The default value is 1. (Retry will be attempted once for every
failed link, and once again at the end of the download.)
To disable retry, set this variable to zero.
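The per-url part of this behaviour (retry after a short gap, up to the configured count) might look like the sketch below. fetch is a hypothetical callable supplied by the caller, and treating IOError as the "non-fatal" error class is an assumption.

```python
import time


def fetch_with_retry(fetch, url, retries=1, delay=0.5):
    """Call fetch(url); on IOError retry up to `retries` more times."""
    attempts = 0
    while True:
        try:
            return fetch(url)
        except IOError:
            if attempts >= retries:
                raise  # give up; the caller may queue the url for the final pass
            attempts += 1
            time.sleep(delay)
```

Setting retries to zero makes the first failure final, which corresponds to disabling retry.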
Localization of URLs
Tells the program whether to localize (Internet links modified
to file links) the links in all html files downloaded. This
helps user to browse the website as if it were local. HarvestMan
also converts any relative url links to absolute url links, if
their files were not downloaded.
This is enabled by default. It is a good idea to always enable
it.
Note that localization of links is done at the end of the
download.
Controlled by the variable indexer.localise.
From version 1.1.2, this option supports 3 values. A value of
zero of course disables it. A value of 1 will perform
localization by replacing url links with absolute file path
names.
A value of 2 will perform localization by replacing url links
with relative file path names. Relative localization helps you
to browse the downloaded website from different file systems
since the url paths are relative (to directory). Absolute
localization locks your downloaded website to the filesystem of
the machine where you ran HarvestMan. From version 1.1.2, the
default value of this option is 2, i.e it performs a relative
localization by default.
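The difference between the two localization styles comes down to how the replacement path is computed. For relative localization, a sketch using the standard posixpath module is shown here; relative_link and the directory layout are hypothetical.

```python
import posixpath


def relative_link(page_file, target_file):
    """Path of target_file relative to the directory containing page_file."""
    return posixpath.relpath(target_file, posixpath.dirname(page_file))


# The link keeps working if the whole project directory is moved elsewhere.
print(relative_link(
    "/data/proj/www.foo.com/bar/index.html",
    "/data/proj/www.foo.com/images/a.gif",
))
```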
Another variable related to localization has been added in the
1.1.2 release. This allows you to perform JIT (Just In Time)
localization of html files, i.e, immediately after they are
downloaded, instead of at the end of download.
This option is described under JIT Localization below.
URL List File
You can tell HarvestMan to dump a list of crawled urls to a file
by setting this option. The variable for this is
files.urlslistfile and is disabled by default.
Error log file
A file to write error logs into. This by default is
’errors.log’. This file will be created in the project
directory of the current project.
Variable: files.errorfile
Note: From version 1.2, this feature is disabled. Don’t use it.
Message Log File
From version 1.4 (this version), the message log file is named
<project>.log for a project ’project’ and is automatically
created in the project directory of the project. This is not a
configurable option anymore.
Browse Index Page
HarvestMan creates an html project browser page in the Project
Directory and appends the starting (index) files of each project
to this page, at the end of each project. This option can be
enabled or disabled by setting the variable display.browsepage.
By default, this is enabled.
JIT Localization
HarvestMan, from version 1.1.2, has an option to localize HTML
files immediately after they are downloaded, instead of at the
end of the project. This option can be enabled by setting the
variable, indexer.jitlocalise, to a value greater than zero.
By default this is disabled.
Note: From version 1.2, this option is disabled. Don’t use it.
File Integrity Verification
HarvestMan verifies the integrity of downloaded files by
performing an md5 checksum check. From version 1.4 this
option is disabled and is not available in the configuration
file.
Cookie Support
From version 1.2, we have added support for Cookies. The support
is basic and based on RFC 2109. By default cookies in web pages are
saved in a cookie file inside the project directory and read
back for pages which require these cookies. This can be
controlled by the variable download.cookies. The default value
is 1.
For disabling cookies, set this variable to zero (0).
Files Caching
From version 1.2, we support caching/update of downloaded files.
A binary cache file is created for every project. This file
contains an md5 checksum of the file, its location on the disk
and the url from which it was downloaded. Next time the project
is re-started, the program checks the urls against this cache
file. The files are downloaded only if their checksum differs
from the checksum of the cached file, otherwise they are
ignored.
This option is enabled by default. It is controlled by the
variable control.pagecache. To disable caching, set this
variable to zero (0).
From version 1.4, a sub-option named control.datacache is
available. If set to 1 (default), data of each url is also saved
in the cache file. So if you lose your original files, but the
cache is present, HarvestMan can recreate the files of the
project from the cache, if the cache files are not out of date.
You can enable data caching for small projects where the number
of files downloaded is not too large. If the project downloads a
lot of files, say > 5000, you might disable data caching.
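The checksum comparison behind the cache can be illustrated with the standard hashlib module. This is a simplified sketch: the cache here is a plain in-memory dict keyed by url, and is_changed is a hypothetical name; the real cache file format is not shown in this document.

```python
import hashlib


def is_changed(cache, url, data):
    """Return True and update the cache when data differs from the cached checksum."""
    checksum = hashlib.md5(data).hexdigest()
    if cache.get(url) == checksum:
        return False  # same content as last time; nothing to rewrite
    cache[url] = checksum
    return True


cache = {}
print(is_changed(cache, "http://www.foo.com/bar/index.html", b"<html>v1</html>"))
print(is_changed(cache, "http://www.foo.com/bar/index.html", b"<html>v1</html>"))
```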
Number of Simultaneous Network Connections
From version 1.2, the number of simultaneous network connections
can be controlled by modifying a config variable.
For all 1.0 (major) versions and the 1.2 alpha version,
HarvestMan had a global download lock that denied more than one
network connection at a given instant. This slowed down
downloads considerably.
From 1.2 onwards, many simultaneous downloads (network
connections) are possible apart from multiple threads. The
number of simultaneous connections by default is 5. The user can
change this by modifying the variable control.connections in the
config file. If set to a higher value, the download threads
can use more connections at a given instant and download is
faster. If set to a lower value, the threads will have to wait
for a free connection slot if the number of connections reaches
the limit. You can set it to a reasonable value depending on your
network bandwidth. A value below 10 is desirable for low-bandwidth
connections and above 10 for high-bandwidth
connections. If you have a broadband or DSL connection allowing
very high speeds, set this to a relatively large value like 20.
If the number of connections is much less than the
number of url trackers, downloads will suffer. It is a good idea
to keep these two values approximately the same.
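One common way to cap simultaneous connections while many threads run is a counting semaphore. The sketch below shows that general technique with the standard threading module; ConnectionLimiter is a hypothetical class, and nothing here is taken from HarvestMan's source.

```python
import threading


class ConnectionLimiter:
    """Context manager that lets at most max_connections threads proceed at once."""

    def __init__(self, max_connections=5):
        self._slots = threading.BoundedSemaphore(max_connections)

    def __enter__(self):
        # Blocks until a connection slot is free.
        self._slots.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._slots.release()
        return False


limiter = ConnectionLimiter(max_connections=5)


def download(url):
    with limiter:
        # A real worker would open its network connection here.
        return url
```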
Project Timeout
From version 1.2 onwards, HarvestMan allows for a way to exit
projects which hang due to some network or system problems in
threading. The program monitors reads/writes from the url queue
and keeps a time difference value between now and the last
read/write operation on the queue. If no threads are writing
to/reading from the queue, the program exits automatically if
this time difference exceeds a certain timeout value. This value
can be controlled by the variable control.projtimeout in the
config file. Its value by default is 5 minutes (300 seconds).
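The monitoring idea (remember the time of the last queue operation and compare it with the timeout) can be sketched as a small class. QueueWatchdog and its methods are hypothetical names; the clock is injectable only to make the sketch easy to test.

```python
import time


class QueueWatchdog:
    """Tracks the last url queue activity and reports when the timeout is exceeded."""

    def __init__(self, timeout=300.0, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_activity = clock()

    def touch(self):
        """Record a read or write on the url queue."""
        self._last_activity = self._clock()

    def expired(self):
        """True when no queue activity has been seen for longer than the timeout."""
        return self._clock() - self._last_activity > self._timeout
```

A crawler loop would call touch() on every queue read or write and exit once expired() returns True.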
Javascript retrieval
From version 1.2, HarvestMan can fetch javascript source files
(.js files) from webpages. This has been done by using an
enhanced HTML parser that can download javascript files and java
applets.
The variable for this is download.javascript. This option is
enabled by default.
For skipping javascript files, set this option to zero (0).
Java applets retrieval
From version 1.2, HarvestMan can fetch java applets (.class
files) from webpages. This has been done by using an enhanced
HTML parser that can download javascript files and java applets.
The variable for this is download.javaapplet. This option is
enabled by default.
For skipping java applet files, set this option to zero (0).
Keyword(s) Search ( Word Filtering )
This is a new feature from the 1.3 release. HarvestMan accepts
complex boolean regular expressions for word matches inside web
pages. HarvestMan will download only those pages which match the
word regular expressions.
For example, to download only those webpages containing the
words, HarvestMan and Crawler, you create the following regular
expression and pass it as the config option control.wordfilter.
control.wordfilter (HarvestMan & Crawler)
Only the webpages which contain both these words will be
spidered and downloaded. Note that the filter is not applied to
the starting page.
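A heavily simplified sketch of the idea follows. It handles only words joined by '&' (the case in the example above), not the full boolean syntax of the recipe, and it matches whole words case-insensitively, which is an assumption. page_matches is a hypothetical helper.

```python
import re


def page_matches(text, expression):
    """Return True when text contains every word joined by '&' in expression."""
    words = [w.strip() for w in expression.strip().strip("()").split("&")]
    return all(
        re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
        for word in words
        if word
    )


print(page_matches("HarvestMan is a web crawler", "(HarvestMan & Crawler)"))
```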
This feature is based on an ASPN recipe by Anand Pillai
available at the URL
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526
Subdomain Setting
New feature from 1.3.1 release. HarvestMan allows you to control
whether subdomains in a domain are treated as external servers
or not, using the variable control.subdomain. If this is set to
1, then subdomains will be considered as external servers.
If set to 0, which is the default, subdomains in a domain will
not be considered as external servers.
For example, if the starting server is http://www.yahoo.com,
then if this variable is disabled (set to zero), then the
domain, http://in.yahoo.com will be considered as part of this
domain and not as an external server.
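The effect of control.subdomain can be sketched as a host comparison. The helpers are hypothetical, and the base domain is taken naively as the last two labels of the host name, which is an assumption that does not hold for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Naive base domain: the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_external_server(url, start_url, subdomain=0):
    """Apply the subdomain rule: 1 compares full host names, 0 compares base domains."""
    if subdomain:
        return urlsplit(url).hostname != urlsplit(start_url).hostname
    return base_domain(url) != base_domain(start_url)


print(is_external_server("http://in.yahoo.com/", "http://www.yahoo.com", subdomain=0))
print(is_external_server("http://in.yahoo.com/", "http://www.yahoo.com", subdomain=1))
```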
Skipping query forms
To skip server-side or CGI query forms, set the variable
control.skipqueryforms to 1. It is set to 1 (enabled) by
default.
This skips links of the form
http://server.com/formquery?key=value
To download these links set the variable to 0.
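One simple way to recognize such links is to look for a query
string in the URL. The sketch below uses Python's urllib.parse;
the function name is_query_form is hypothetical, and HarvestMan's
actual test may differ.

```python
from urllib.parse import urlsplit

def is_query_form(url):
    """Return True if the URL carries a query string (the part after '?')."""
    return bool(urlsplit(url).query)
```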
Controlling number of requests per server
This is a new feature in version 1.3.2. You can control the
number of simultaneous requests to the same server by editing
the config variable named control.requests. This is set to 10 by
default.
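The idea of capping simultaneous requests per server can be
sketched with one bounded semaphore per host. The class
ServerLimiter below is hypothetical and only illustrates the
technique; it is not how HarvestMan is known to implement
control.requests.

```python
import threading
from collections import defaultdict

class ServerLimiter:
    """Allow at most max_requests concurrent requests per host."""

    def __init__(self, max_requests=10):
        self._lock = threading.Lock()
        # One semaphore per host, created on first use.
        self._slots = defaultdict(
            lambda: threading.BoundedSemaphore(max_requests)
        )

    def _slot(self, host):
        with self._lock:
            return self._slots[host]

    def acquire(self, host, blocking=True):
        # Returns False immediately when blocking is False and the
        # host already has max_requests requests in flight.
        return self._slot(host).acquire(blocking)

    def release(self, host):
        self._slot(host).release()
```

A download thread would call acquire(host) before opening a
connection and release(host) when the request finishes.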
Html cleaning up (Tidy Interface)
From version 1.3.9, HarvestMan has an option to clean up html
pages before sending them to the parser. This removes errors
from web pages so that they are parsed correctly. This in turn
helps to download web sites that otherwise might not get
downloaded because of parser errors in, for example, the
starting html page.
The tidylib source code is included with the HarvestMan
distribution, so you don’t need to install it separately.
This option is enabled by default and is controlled by the
variable "control.tidyhtml".
URL and Website Priorities
From this version onwards, HarvestMan allows the user to specify
priorities for urls and servers.
Every url has a default priority, assigned based on its
"generation". The generation of a url is a number based on the
level at which the url was generated, based on the starting url.
The starting url has a generation 0, all urls generated from it
have a generation 1, and so on.
URLs with a lower generation number are given higher priority
than urls with a higher generation. Also, html/web page urls
get a higher priority than other urls in the same generation.
The user can specify a priority for urls by using the config
variable named "control.urlpriority". This works on the basis of
file extensions, and has a range from -5 to 5, -5 denoting
lowest priority and 5 denoting maximum priority.
For example, to specify that pdf files should have a higher
priority we can make the following entry in the config file.
control.urlpriority pdf+1
If you want to give word documents a higher priority than pdf
files, you can give the following priority specification.
control.urlpriority pdf+1,doc+2
Priority settings are separated by commas.
If you want to put gif images at the lowest priority and jpg
images at the highest priority,
control.urlpriority gif-5, jpg+5
Similar syntax can be used for setting server priorities. The
variable named control.serverpriority can be used to control
this.
Assume that you want to download files from the server
http://yahoo.com with a higher priority when compared to the
server http://www.cnn.com, in the same download project.
control.serverpriority yahoo.com+1, cnn.com-1
There can be other combinations also.
A priority which is less than -5 or greater than 5 is ignored
by the config parser.
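The priority syntax described above can be read with a few lines
of Python. This is only a sketch under the stated rules (comma
separated entries, a signed integer suffix, values outside -5 to 5
ignored); the function name parse_priorities is hypothetical and
the code is not taken from the HarvestMan config parser.

```python
import re

def parse_priorities(spec):
    """Parse a spec such as 'pdf+1,doc+2' into a {name: priority} dict."""
    priorities = {}
    for item in spec.split(","):
        # A name followed by a signed integer, e.g. 'gif-5' or 'yahoo.com+1'.
        match = re.fullmatch(r"\s*(.+?)([+-]\d+)\s*", item)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        # Priorities outside the range -5..5 are ignored.
        if -5 <= value <= 5:
            priorities[name] = value
    return priorities
```

The same function handles both forms shown above, for example
"gif-5, jpg+5" and "yahoo.com+1, cnn.com-1".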
Time Limits
From version 1.4, a project can specify a time limit in which to
complete downloads. When this time limit is reached, HarvestMan
automatically terminates the project by stopping all download
threads and cleaning up.
This option can be specified by using the variable
control.timelimit.
Asynchronous URL Server
From version 1.4, another way of managing downloads is
available. This is an asynchronous url server, which serves
urls to the fetcher threads. Crawler threads send urls to the
server and fetcher threads receive them from it. The server is
based on the asyncore module in Python, hence it offers superior
performance and faster multiplexing of threads than the simple
Queue. The server uses an internal queue to store urls which
also increases performance.
To use this feature, enable the variable network.urlserver.
This option is disabled by default.
The server listens on port 3081 by default. You can change
it by modifying the variable network.urlport in the config file.
Locale Settings
From version 1.4, you can set a specific locale for HarvestMan.
Sometimes when parsing non-English websites, the parser can fail
to report some pages, because the language is not set to the
language of the webpage. In such cases, you can manually change
the language and other settings by changing the locale of
HarvestMan.
The locale can be changed by modifying the variable
system.locale. This is set to the American locale by default on
non-Windows platforms and to the default locale (’C’) on Windows
platforms.
For example, if you see a lot of html parsing errors when
browsing a Russian site, you could try setting the locale to
’russian’.
Maximum File Size
This is a new option from version 1.4. By default, HarvestMan
limits the size of a single file to 1 MB. A url whose file size
is more than this will be skipped. The limit can be changed with
the variable control.maxfilesize.
URL Tree File
From version 1.4, a url tree file, i.e. a file displaying the
relation of parent and child urls in a project, can be saved at
the end of the project. This file can be saved in two formats,
text or html. This option is controlled by the variable named
files.urltreefile. The program figures out which format to use
by looking at the file name extension.
Ad Filtering
This is a new feature from version 1.4. URLs which look like
advertisement graphics, banners or pop-ups will be filtered by
HarvestMan. This works by using regular expressions. The logic
is borrowed from the Internet Junkbuster program. The option is
control.junkfilter.
This option is enabled by default.
OPTIONS
-h, --help
Show help message and exit
-v, --version
Print version information and exit
-p, --project=PROJECT
Set the (optional) project name to PROJECT.
-b, --basedir=BASEDIR
Set the (optional) base directory to BASEDIR.
-C, --configfile=CFGFILE
Read all options from the configuration file CFGFILE.
-P, --projectfile=PROJFILE
Load the project file PROJFILE.
-V, --verbosity=LEVEL
Set the verbosity level to LEVEL. Ranges from 0 to 5.
-f, --fetchlevel=LEVEL
Set the fetch-level of this project to LEVEL. Ranges from 0 to 4.
-N, --nocrawl
Only download the passed url (wget-like behaviour).
-l, --localize=yes/no
Localize urls after download.
-r, --retry=NUM
Set the number of retry attempts for failed urls to NUM.
-Y, --proxy=PROXYSERVER
Enable and set proxy to PROXYSERVER (host:port).
-U, --proxyuser=USERNAME
Set username for proxy server to USERNAME.
-W, --proxypass=PASSWORD
Set password for proxy server to PASSWORD.
-n, --connections=NUM
Limit number of simultaneous network connections to NUM.
-c, --cache=yes/no
Enable/disable caching of downloaded files. If enabled, files
won’t be downloaded unless their timestamp is newer than the
cache timestamp.
-d, --depth=DEPTH
Set the limit on the depth of urls to DEPTH.
-w, --workers=NUM
Enable worker threads and set the number of worker threads to
NUM.
-T, --maxthreads=NUM
Limit the number of tracker threads to NUM.
-M, --maxfiles=NUM
Limit the number of files downloaded to NUM.
-t, --timelimit=TIME
Run the program for the specified time TIME.
-s, --urlserver=yes/no
Enable/disable urlserver running on port 3081.
-S, --subdomain=yes/no
Enable/disable subdomain setting. If this is enabled, servers
with the same base server name such as http://img.foo.com and
http://pager.foo.com will be considered as distinct servers.
-R, --robots=yes/no
Enable/disable Robot Exclusion Protocol.
-u, --urlfilter=FILTER
Use regular expression FILTER for filtering urls.
--urlslist=FILE
Dump a list of urls to file FILE.
--urltree=FILE
Dump a file containing hierarchy of urls to FILE.
FILES
config.xml
SEE ALSO
python(1)
AUTHOR
harvestman was written by Anand Pillai <anandpillai@letterboxes.org>.
For latest info, visit http://harvestmanontheweb.com
This manual page was written by Kumar Appaiah <akumar@ee.iitm.ac.in>,
for the Debian project (but may be used by others).
February 5, 2006