harvestman - multithreaded desktop webcrawler written in Python

NAME

       harvestman - multithreaded desktop webcrawler written in Python

SYNOPSIS

       harvestman [options] [-C configfile]

DESCRIPTION

       HarvestMan is a desktop WebCrawler written completely in the Python
       programming language. It allows you to download a whole website from
       the Internet and mirror it to the disk for browsing offline. HarvestMan
       has many customizable options for the end-user.   HarvestMan  works  by
       scanning  a  web page for links that point to other web pages or files.
       It downloads  the  files  and  copies  them  to  the  disk.  HarvestMan
       maintains the directory structure of the remote website when it mirrors
       the website  to  the  disk.  Every  html  file  is  scanned  like  this
       recursively, till the whole website is downloaded.

       Once the download is complete, the links in downloaded html files are
       localized to point to the files on the disk. This ensures that the
       user does not need to connect to the Internet again when browsing the
       downloaded pages. If any file failed to download for some reason,
       HarvestMan converts its relative Internet address to the complete
       Internet address, so that clicking on the link connects the user to
       the Internet instead of producing a dead-link (404) error.

       From version 1.2, HarvestMan uses two families of threads, the
       "Fetchers" and the "Getters", for downloading. The Fetchers are threads
       which crawl webpages and find links, and the Getters are threads which
       download those links (the non-html files).
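
       The split between the two thread families can be pictured as a
       producer/consumer arrangement. The following is a minimal sketch, not
       HarvestMan's actual code: the names crawl, fetcher, getter and
       find_links, the two queues, and the canned link table are all
       assumptions made for illustration, and no network access takes place.

```python
import queue
import threading


def find_links(page_url):
    """Stand-in for HTML parsing: returns (html_links, file_links)."""
    site = {
        "http://www.foo.com/": (["http://www.foo.com/a.html"],
                                ["http://www.foo.com/logo.gif"]),
        "http://www.foo.com/a.html": ([], ["http://www.foo.com/doc.pdf"]),
    }
    return site.get(page_url, ([], []))


def crawl(start_url, num_fetchers=2, num_getters=2):
    page_q = queue.Queue()   # html pages waiting to be crawled
    file_q = queue.Queue()   # non-html files waiting to be downloaded
    seen = {start_url}
    downloaded = []
    lock = threading.Lock()

    def fetcher():
        # Fetchers crawl pages and hand every new link to the right queue.
        while True:
            url = page_q.get()
            pages, files = find_links(url)
            with lock:
                new_pages = [p for p in pages if p not in seen]
                seen.update(new_pages)
                new_files = [f for f in files if f not in seen]
                seen.update(new_files)
            for p in new_pages:
                page_q.put(p)
            for f in new_files:
                file_q.put(f)
            page_q.task_done()

    def getter():
        # Getters download non-html files; here they only record the url.
        while True:
            url = file_q.get()
            with lock:
                downloaded.append(url)
            file_q.task_done()

    for target, count in ((fetcher, num_fetchers), (getter, num_getters)):
        for _ in range(count):
            threading.Thread(target=target, daemon=True).start()

    page_q.put(start_url)
    page_q.join()   # every page has been crawled
    file_q.join()   # every queued file has been handled
    return sorted(downloaded)
```

       Calling crawl("http://www.foo.com/") walks the canned link table and
       returns the non-html urls that the getter threads handled.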

       HarvestMan, as of the latest version, is a console application. It can
       be launched by running the HarvestMan script (HarvestMan.py) if you are
       using the source code, or the HarvestMan executable if you are using
       the executable (available on Win32 platforms). It prints informational
       messages to the console while it is working. These messages can be used
       to debug the program and locate any errors.

       HarvestMan works by reading its options either from the command line or
       from a configuration file. The configuration file is named "config.xml"
       by default.

       A major change from HarvestMan 1.5 onwards is that the configuration
       is now in an XML file called "config.xml". You can use the
       convertconfig.py script, present in HarvestMan/tools/ of your
       installation, to convert your configuration from text to XML and vice
       versa. For full details, see the Changes.txt file and the website at
       http://harvestmanontheweb.com

       HarvestMan writes a binary project file using the Python pickle
       protocol. This project file is saved under the HarvestMan base
       directory with the extension .hbp. It is a complete record of all the
       settings which were used to start HarvestMan and can be read back later
       using the --projectfile option to restart a HarvestMan project.
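
       As an illustration of the idea, the sketch below pickles a settings
       dictionary to a .hbp file and reads it back. The actual layout of a
       HarvestMan project file is not described here, so the dictionary keyed
       by option name and the helper names save_project and load_project are
       assumptions.

```python
import os
import pickle


def save_project(settings, basedir):
    """Pickle the settings to <basedir>/<project name>.hbp (assumed layout)."""
    path = os.path.join(basedir, settings["project.name"] + ".hbp")
    with open(path, "wb") as handle:
        pickle.dump(settings, handle)
    return path


def load_project(path):
    """Read the settings back from a project file written by save_project."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

       Pickle files can execute code when loaded, so a project file should
       only be read back if it comes from a trusted source.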

MODES OF OPERATION

       HarvestMan has two major modes of operation: a fully multithreaded
       mode, called fast mode, and a single-threaded mode, called slow mode.

       Fast Mode
              Fast Mode is the most useful mode of HarvestMan. In  this  mode,
              HarvestMan  launches  multiple  threads  for  each url link, and
              stores them in an internal queue. Also, HarvestMan will launch a
              separate  download  thread  for  each non-html file encountered.
              This process is very fast and you  can  download  websites  very
              quickly  using this mode as multiple downloads occur at the same
              time.

              This mode is the default. You can use this mode if  you  have  a
              relatively  large  bandwidth,  and  a reliable connection to the
              Internet.

              Since HarvestMan is network-bound, using multiple threads speeds
              up the download.

       Slow Mode
              In Slow Mode, the download of a website happens in a single
              thread, the main program thread. Each download has to wait for
              the previous one to complete, so this is a relatively slow
              process. You can use this mode if you have an unreliable
              Internet connection or a relatively small bandwidth which does
              not support opening multiple sockets at the same time.

              This mode is disabled by default. You can enable it by setting
              the variable system.fastmode in the configuration file to zero
              (see "Mode Selection" under ADVANCED OPTIONS).

              If you see a lot of "Socket" type errors when you launch a
              HarvestMan project in the default mode (fastmode), switch to
              this mode. It gives you a very reliable download, though a slow
              one.

USAGE

       As said earlier, HarvestMan reads its options from a configuration file
       or from the command line. The configuration file is named "config.xml"
       by default. You can pass another configuration file name to the
       program by using the command line options -configfile/-C.

       From version 1.1, HarvestMan can also read back previous project files
       by using the command line option -projectfile.

       We will first discuss the structure of the configuration file  and  how
       it  can be used to create a HarvestMan project. For more information on
       the command line arguments, run  the  program  with  the  -help  or  -h
       option.

CONFIGURATION FILE

       The configuration file is a simple text file with many options, each
       of which is a pair of variable/value strings separated by tabs or
       spaces. Each variable/value pair appears on a separate line. Comments
       can be added by placing the hash character ’#’ before any line.

       HarvestMan has three basic options and some 50 advanced options.

BASIC OPTIONS

       HarvestMan needs three basic configuration options to work.  These  are
       described below:

       project.name:  This  is  the  project  name  of  the  current download.
       HarvestMan creates a directory of  this  name  in  its  base  directory
       (described  below) where it keeps all the downloaded files. The project
       name needs to be a non-empty string. (Spaces are allowed.)

       project.url: This is the starting url from which the program starts
       the download. HarvestMan supports the WWW/HTTP/HTTPS/FTP protocols in
       this url. If a url does not begin with any of these, it is treated as
       an HTTP url. Examples: http://www.python.org, www.yahoo.com, cnn.com.

       project.basedir: This is the base directory for the  program  where  it
       creates  the  project  directories  and stores all downloaded files. If
       this directory does not exist, HarvestMan will attempt to create it.
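
       The rule for urls without a protocol prefix can be sketched as
       follows. This is a minimal illustration of the behaviour described for
       project.url; the function name normalize_start_url is hypothetical and
       HarvestMan's own handling may differ in detail.

```python
def normalize_start_url(url):
    """Prefix 'http://' when the url does not name a known protocol."""
    if url.lower().startswith(("http://", "https://", "ftp://")):
        return url
    # Anything else, such as "www.yahoo.com" or "cnn.com", is taken as HTTP.
    return "http://" + url
```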

ADVANCED OPTIONS

       For precisely configuring your download, HarvestMan supports  about  30
       advanced  options. You will need to use many of them, if you would like
       to control your download  exactly  the  way  you  want.  The  following
       section describes each of these settings and what they do. Read on.

       The Fetchlevel setting
              From Version 1.2, there is a change in this setting. Read on.

              This  is one of the most useful options to tweak in a HarvestMan
              project.   The   option   is   controlled   by   the    variable
              download.fetchlevel in the configuration file.

              Make sure you read the following documentation very carefully.

              When you are downloading files from a website, you will usually
              want to limit your download to certain areas of the Internet.
              For example, you might want to download all links pointed to by
              the url http://www.foo.com/bar (a hypothetical example) that come
              under the www.foo.com web server. Or you might want to download
              all links under the directory path http://www.foo.com/pics and
              no more. You can use this option to do exactly that.

              The option download.fetchlevel has 5 possible values, ranging
              from 0 to 4.

              A value of 0 limits the download to the directory path from
              which you start your download. For example, if your starting url
              was http://www.foo.com/bar/index.html, this option makes sure
              that all downloaded links belong to the directory url path
              http://www.foo.com/bar and below it. Any web links pointing to
              outside directories or to other web servers are ignored.

              A  value  of  1  limits the download to the starting server, but
              does not limit it to paths below the starting directory.

              For example, if your starting url was
              http://www.foo.com/bar/index.html, this option would also
              download files from the http://www.foo.com/other/index.html
              page, since it belongs to the starting server.

              A value of 2 performs next-level fetching. It allows all paths
              in the starting server, and also all urls external to the
              starting server that are linked directly from pages in the
              starting server. For example, if your starting url
              http://www.foo.com/bar/index.html contained a link to
              http://www.foo2.com/bar2/index.html (an external server),
              HarvestMan will try to download this link also. But all urls
              linked from this link, i.e. from
              http://www.foo2.com/bar2/index.html, would be ignored.

              A value of 3 performs fetching similar to the above, with the
              difference that it does not get files which are outside the
              directory of the starting url, but still gets external links
              which are linked one level from the starting url. For example,
              if your starting url http://www.foo.com/bar/index.html contained
              a link to http://www.foo2.com/bar2/index.html (an external
              server), HarvestMan will try to download this link also. But a
              url like http://www.foo.com/other/index.html (a link outside the
              starting url’s directory) will be ignored.

              A value of 4 gives you no control over the fetching process. It
              allows all web pages to be downloaded, including web pages
              linked from external server links encountered in the starting
              url’s page. Setting this option will mostly result in the
              crawler trying to crawl the entire Internet, assuming that your
              starting url has links to other outside servers. Set this
              option only if you are very sure of what you are doing. Any
              value above 4 has no special meaning, and behaves just like 4.

              For most downloads, this value can be specified between 0 and 2.
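
              The five levels can be summarised as a single decision
              function. The sketch below is one reading of the descriptions
              above, not HarvestMan's implementation: the function name
              allowed and its parameters are hypothetical, "linked directly"
              is modelled by passing the url of the page that contained the
              link, and for level 3 an external link is assumed to count
              only when that page lies in the starting directory.

```python
from urllib.parse import urlsplit


def _server(url):
    return urlsplit(url).netloc.lower()


def _directory(url):
    """Directory part of the url path, always ending in '/'."""
    path = urlsplit(url).path or "/"
    return path if path.endswith("/") else path.rsplit("/", 1)[0] + "/"


def allowed(url, parent_url, start_url, fetchlevel):
    """Decide whether url, found on page parent_url, may be downloaded."""
    start_dir = _directory(start_url)
    same_server = _server(url) == _server(start_url)
    under_start_dir = same_server and _directory(url).startswith(start_dir)
    parent_on_server = _server(parent_url) == _server(start_url)
    parent_in_start_dir = (parent_on_server
                           and _directory(parent_url).startswith(start_dir))

    if fetchlevel <= 0:
        # Starting directory and below (negative values treated as 0 here).
        return under_start_dir
    if fetchlevel == 1:
        # Anywhere on the starting server.
        return same_server
    if fetchlevel == 2:
        # Starting server, plus external urls linked directly from it.
        return same_server or parent_on_server
    if fetchlevel == 3:
        # Starting directory, plus external urls linked directly from it.
        return under_start_dir if same_server else parent_in_start_dir
    # 4 and above: no restriction.
    return True
```

              With the starting url http://www.foo.com/bar/index.html, a link
              to http://www.foo.com/other/index.html is rejected at levels 0
              and 3 and accepted at levels 1, 2 and 4.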

       The Depth Setting

              This  is  another  setting  that  gives  you  control  over your
              download.  It is denoted by the variable  control.depth  in  the
              configuration file.

              This  value  specifies the distance of any url from the starting
              url’s directory in terms of the directory path offset.  This  is
              applicable  only  to  the  directories  (links)  in the starting
              server, below the starting url’s directory. The default value is
              10.

              If  a  directory  is found whose offset is more than this value,
              any links under it will not be downloaded.

              You can specify a depth of zero, in which case the download will
              be limited to files just below the directory of the starting
              url.

              Example: If the starting url is http://www.foo.com/bar/foo.html,
              then the url
              http://www.foo.com/bar/images/graphics/flowers/flower.jpg has a
              depth of 3 relative to the starting url.
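
              The depth calculation in the example can be reproduced with a
              few lines of Python. This is a sketch under the assumption that
              depth is simply the number of directory components between the
              starting url's directory and the file; the helper names depth
              and within_depth are hypothetical.

```python
from urllib.parse import urlsplit


def _dir_parts(url):
    """Directory components of the url path, excluding the file name."""
    return [part for part in urlsplit(url).path.split("/")[1:-1] if part]


def depth(url, start_url):
    """Directory offset of url below the starting url's directory.

    Returns None when url is not below that directory on the same server,
    since control.depth does not apply there.
    """
    start, target = _dir_parts(start_url), _dir_parts(url)
    if urlsplit(url).netloc != urlsplit(start_url).netloc:
        return None
    if target[:len(start)] != start:
        return None
    return len(target) - len(start)


def within_depth(url, start_url, max_depth=10):
    """True when url is below the starting directory and not too deep."""
    offset = depth(url, start_url)
    return offset is not None and offset <= max_depth
```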

       The External Depth Setting

              This option also helps you to control downloads. It  is  denoted
              by variable control.extdepth in the configuration file.

              This  value specifies the distance of a url from its base server
              directory. This is applicable to urls which belong  to  external
              servers and to urls outside the directory of the starting url.

              If a directory is found whose distance from the base server path
              is more than this value, any files under it will be ignored.

              Note that this option does not support the notion of zero depth.
              A valid value for this has to be greater than or equal to one.

              Example: The url http://www.foo.com/bar/images.html has an
              external depth of 1 relative to the base server directory,
              http://www.foo.com.
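
              A corresponding sketch for the external depth counts the
              directory components between the server root and the file. The
              function names external_depth and within_extdepth are
              hypothetical, and rejecting limits below one reflects the note
              above that zero depth is not supported.

```python
from urllib.parse import urlsplit


def external_depth(url):
    """Number of directories between the server root and the file."""
    parts = urlsplit(url).path.split("/")[1:-1]
    return len([part for part in parts if part])


def within_extdepth(url, max_extdepth):
    """True when url is no deeper than max_extdepth below its server root."""
    if max_extdepth < 1:
        raise ValueError("control.extdepth must be at least 1")
    return external_depth(url) <= max_extdepth
```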

       The External Servers Setting

              This  option tells the program whether to follow links belonging
              to  outside  web  servers.  This  is  denoted  by  the  variable
              control.extserverlinks. By default, the program ignores external
              server links.

              This option has lower precedence than the download.fetchlevel
              setting. If download.fetchlevel is set to a value of 2 or
              above, this setting is ignored.

       The External Directories Setting

              This option tells the program whether to follow links and
              download files belonging to outside directories, i.e.
              directories external to the directory of the starting url. It is
              denoted by the option control.extpagelinks in the configuration
              file.

              The default value is 1 (Enabled). The download.fetchlevel
              setting has precedence over this value. If download.fetchlevel
              is set to a value of 1 or more, this setting is ignored.

       The Images Setting

              Tells the program whether to download images linked from pages.
              Enabled by default. This option is denoted by the variable
              download.images in the configuration file.

       The Html Setting

              Tells  the  program  whether  to download html files. Enabled by
              default. Denoted by the variable download.html.

       Maximum limit of External Servers

              You can put a check on the number of external servers from which
              you want to download files by setting this option to a non-zero
              value. It takes precedence over the download.fetchlevel
              setting. This option is controlled by the variable
              control.maxextservers in the configuration file. The default
              value is zero, which means that this option is ignored.

              To enable this option, set it to a value greater than zero.

       Maximum limit on External Directories

              You can put a check on the number of external directories from
              which you want to download files by setting this option to a
              non-zero value. It takes precedence over the
              download.fetchlevel setting. This option is controlled by the
              variable control.maxextdirs in the configuration file.

              The  default  value  is  zero  which  means  that this option is
              ignored.

              To enable this option, set it to a value greater than zero.

       Maximum limit on Number of Files

              You can precisely control the number of total files you want  to
              download  by setting this option. It is denoted by the variable,
              control.maxfiles. The default value is 3000.

       Default download of images

              This option tells the program to always fetch images linked from
              pages, even if they belong to external servers/directories or
              violate the depth rules.

              This      option      takes      precedence       over       the
              control.extpagelinks/control.extserverlinks   settings  and  the
              control.depth/control.extdepth settings.

              The download.images setting has a higher precedence than this
              setting.

              This  option  is  enabled  by  default.  Denoted by the variable
              download.linkedimages.

       Default download of style sheets (.css files)

              Same as the above option, except that this option checks for
              stylesheet (css) links. It has higher precedence than the
              control.extpagelinks/control.extserverlinks and the
              control.depth/control.extdepth settings. Enabled by default.

              This      option      is     denoted     by     the     variable
              download.linkedstylesheets.

       Maximum thread setting

              This option sets the maximum number of separate threads
              (trackers) launched by the program at a time. It is not an exact
              setting: it does not mean that this many connections are running
              at any moment, only that the program cannot launch threads
              above this limit.

              This option makes sense only in  multithreaded  downloads,  i.e,
              only  when the program is running in fastmode. In slowmode, this
              setting has no effect.

              Denoted by the variable system.maxtrackers. The default value is
              10.

       Separate threads for file download

              This option controls the multithreaded download of non-html
              files in fastmode. In fastmode, separate download threads are
              launched to retrieve non-html files. If you disable this option,
              these files will be downloaded by the main downloader thread
              instead.

              By  default,  this  option  is  enabled. You can tweak it by the
              variable system.usethreads.

       Mode Selection

              As  described  in  the  beginning,  there  are  two  modes   for
              HarvestMan,  the  fast  one and the slow one. This option allows
              you to choose your mode of operation.

              The variable for this option  is  system.fastmode.  The  default
              value  is  1,  which  means  that  the program uses fastmode. To
              disable fastmode, and switch to slowmode, set this  variable  to
              zero.

       Size of the thread pool

              This value controls the size of the thread pool used to download
              non-html  files  when  the  program   runs   in   fastmode   and
              system.usethreads is enabled. The default value is 10.

              This option is controlled by the variable system.threadpoolsize.
              It makes sense only if the program is running  in  fastmode  and
              the system.usethreads option is enabled.

       Timeout value for a thread

              This  specifies  the timeout value for a single download thread.
              The default value is 200 seconds.  Threads  which  overrun  this
              value are eventually killed and cleaned up.

              This  option is controlled by the variable system.threadtimeout.

              This value is ignored  when  you  are  running  the  program  in
              slowmode, without using multiple threads.

       Robot Exclusion Protocol

              The Robot Exclusion Protocol control flag. This tells the spider
              whether to follow the rules specified by the robots.txt file on
              some web servers. Enabled by default.

              We advise you to always enable this option, since it shows good
              Internet etiquette and respect for the download rules laid down
              by webmasters of sites. Disable it only at your discretion, after
              reading any legal terms laid down by the website. We are not
              responsible for any eventuality that arises from a user
              violating these rules. (See LICENSE.txt file.)

              The variable for this value is control.robots.
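
              To show what honouring robots.txt involves, the sketch below
              uses the urllib.robotparser module from the Python standard
              library. It is not taken from HarvestMan; the function name
              make_robot_checker, the obey_robots flag standing in for
              control.robots, and the user agent string "harvestman" are
              assumptions.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt, user_agent="harvestman"):
    """Build a may_fetch(url) function from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def may_fetch(url, obey_robots=True):
        # With the check disabled, every url is allowed.
        if not obey_robots:
            return True
        return parser.can_fetch(user_agent, url)

    return may_fetch
```

              A crawler would fetch /robots.txt once per server, build the
              checker from its text, and consult it before every download
              from that server.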

       Proxy Server Support

              HarvestMan is written taking into account corporate users  (like
              the    authors!)   who   connect   to   Internet   from   behind
              firewalls/proxies. Such users should set this option to  the  IP
              address/name  of their proxy server with the proxy port appended
              to it.

              The  variables  for  this  option  are  network.proxyserver  and
              network.proxyport.  Set  the first one to the ip address/name of
              your proxy server and the second one to its port number.

              Default values: proxy and 80.

              Note: If you are creating the configuration file using the
              script provided for that purpose, the proxy server string is
              encrypted and does not appear in plain text in the
              configuration file.

       Proxy Authentication Support

              HarvestMan    also    supports   proxies   that   require   user
              authentication.

              The   variables   for    this    are    network.proxyuser    and
              network.proxypasswd.

              Note: If you are creating the configuration file using the
              script provided for that purpose, these values are encrypted
              and do not appear in plain text in the configuration file.
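
              For readers who want to see how such proxy settings are
              typically applied in Python, here is a sketch using the
              standard urllib.request module. It is an illustration only and
              does not show how HarvestMan itself connects; the function
              names proxy_url and make_opener are hypothetical, and their
              parameters merely mirror network.proxyserver, network.proxyport,
              network.proxyuser and network.proxypasswd.

```python
import urllib.request
from urllib.parse import quote


def proxy_url(server, port, user=None, passwd=None):
    """Build a proxy url such as http://user:pass@proxy:80."""
    auth = ""
    if user:
        # Percent-encode credentials so characters like '@' stay unambiguous.
        auth = quote(user, safe="")
        if passwd:
            auth += ":" + quote(passwd, safe="")
        auth += "@"
    return "http://%s%s:%d" % (auth, server, port)


def make_opener(server="proxy", port=80, user=None, passwd=None):
    """Return an opener that sends HTTP and HTTPS requests via the proxy."""
    url = proxy_url(server, port, user, passwd)
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)
```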

       Intranet Crawling

              This option is disabled from version 1.3.9 onwards, since
              HarvestMan can now figure out whether a url is in the intranet
              or the Internet by trying to resolve the host name in the url.
              Hence the option is not required anymore.

              From version 1.3.9, urls in the Internet and the intranet can be
              mixed in the same project.

       Renaming of Dynamically Generated Files

              Dynamically generated files (images/html) will usually have file
              extensions  that bear no connection to their actual content. You
              will not be able to open these files  correctly,  especially  on
              the  Windows platform which depends on file extensions to launch
              applications. This option will tell HarvestMan to try to  rename
              these  files  by  looking at their content. HarvestMan will also
              appropriately rename any link which points to these files.

              This option right now works well only  for  gif/jpeg/bmp  files.
              Disabled by default.

              The variable for this option is download.rename.
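
              One common way to recognise a file by its content is to look at
              its leading "magic" bytes. The sketch below does this for the
              three formats mentioned above. How HarvestMan actually inspects
              content is not stated here, so the approach and the names
              sniff_extension and renamed are assumptions.

```python
import os

# Leading bytes that identify each format, with the extension to use.
SIGNATURES = (
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"BM", ".bmp"),
)


def sniff_extension(data):
    """Return the extension implied by the file content, or None."""
    for magic, extension in SIGNATURES:
        if data.startswith(magic):
            return extension
    return None


def renamed(filename, data):
    """Return filename with its extension corrected to match the content."""
    extension = sniff_extension(data)
    if extension is None:
        return filename
    root, old = os.path.splitext(filename)
    old = old.lower()
    if old == extension or (extension == ".jpg" and old == ".jpeg"):
        return filename   # the extension already matches the content
    return root + extension
```

              Links pointing at a renamed file would then need the same
              substitution, as described above.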

       Console Message Settings

              HarvestMan  prints  out  a  lot of informational messages to the
              console while it is running. These  can  be  controlled  by  the
              project.verbosity variable in the configuration file. This value
              ranges from 0 to 5.

              The default value is 2.

              Here is each value and a  description  of  its  meaning  to  the
              program.

              0:    Minimal    messages,    displays    only    the    Program
              Information/Copyright.

              1: Basic messaging, displays  above,  plus  information  on  the
              current project including the statistics.

              2:  More messaging, displays above, plus information on each url
              as it is being downloaded.

              3: Extended messaging, displays above, plus information on  each
              thread  that is downloading a certain file. Also displays thread
              killing/joining  information  and   directory   creation,   file
              saving/deletion information.

              4:  Debug  messaging, displays above, plus debugging information
              for the programmer. Not recommended for the end-user.

              5:  Extended  debug  messaging,   displays   maximal   messages,
              including  the  debug information from the web page parser. (Use
              this at your own risk!)

              Please note that these guidelines are flexible and can change as
              new  versions  are  being  developed, especially the behavior of
              values from 3 - 5.
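
              The cumulative nature of the levels ("displays above, plus ...")
              amounts to a simple threshold test. The following sketch shows
              that idea only; the names make_logger and log are hypothetical
              and the mapping of individual messages to levels is up to the
              program.

```python
import sys


def make_logger(verbosity=2, stream=sys.stdout):
    """Return log(level, message) that prints when level <= verbosity."""
    def log(level, message):
        if level <= verbosity:
            stream.write(message + "\n")
            return True
        return False
    return log
```

              With the default verbosity of 2, a call such as
              log(2, "Downloading http://www.foo.com/") is printed, while
              log(3, "Thread joined") is suppressed.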

       Filters

              HarvestMan allows  the  user  to  refine  downloads  further  by
              specifying filtering options for urls. These are of two kinds:

              1.  Filters for urls (plain vanilla links), which are controlled
              by the control.urlfilter variable.

              2. Filters for external servers, which  are  controlled  by  the
              control.serverfilter variable.

              The  filter  strings  are a kind of regular expression. They are
              internally  converted  to  python  regular  expressions  by  the
              program.

              Writing filter regular expressions

              a. URL Filters (for the control.urlfilter setting)

              URL filters supported by HarvestMan are of 3 types. These are:

              1. Filename extensions

              2. Servers/urls

              3. Servers/urls + filename extensions

              An example of the first type is *.gif

              Examples of the second type are: www.yahoo.com, */advocacy/*,
              */images/sex/*, */avoid.gif, ad.doubleclick.net/*

              Examples of the third type are: /images/*.gif,
              ad.doubleclick.net/images/*.jpg, yimg.yahoo.com/*.gif

              You can build a ’no-pass’ (block) filter by prepending a regular
              expression as described above with a ’-’ (minus) sign. (Example:
              -*.gif).

              You can build a ’go-through’  (allow)  filter  by  prepending  a
              regular  expression  as  described above with a ’+’ (plus) sign.
              (Example: +*.gif).

              You can concatenate regular expressions of the block/allow  kind
              and create custom url filters.

              Example: (Block all jpeg images, as well as all urls containing
              "/images/" in their path, but always allow the path
              "/preferred/images/"):

              -*.jpg+*/preferred/images/*-*/images/*

              Example:    (Block    all    gif    files    from   the   server
              "toomanygifs.com"):

              -toomanygifs.com/*.gif

              Example: (Block all files  with  the  name  "bad.jpg"  from  all
              servers.)

              -*/bad.jpg

              Example: (Block all jpeg/gif/png images but allow pdf/doc/xls
              files.):

              -*.jpg-*.jpeg-*.gif-*.png+*.pdf+*.doc+*.xls

              If there is a collision between  the  results  of  an  inclusion
              filter  and an exclusion filter, the program gives precedence to
              the decision of the filter  which  comes  first  in  the  filter
              expression. If there is still ambiguity, the inclusion filter is
              given precedence.
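
              The conversion of a filter string into Python regular
              expressions can be sketched as below. This is a simplified
              reading of the rules above, not HarvestMan's parser: the names
              parse_filter and url_allowed are hypothetical, the first
              matching filter decides, a url matched by no filter is allowed,
              and the tokenizer assumes that ’+’ and ’-’ occur only as
              filter prefixes (so hyphenated host names are not handled).

```python
import re


def parse_filter(expression):
    """Split a filter string into (allow, compiled_regex) pairs, in order."""
    rules = []
    for sign, pattern in re.findall(r"([+-])([^+-]+)", expression):
        # Escape everything literally, then let '*' match any run of chars.
        regex = re.compile(re.escape(pattern).replace(r"\*", ".*"))
        rules.append((sign == "+", regex))
    return rules


def url_allowed(url, rules):
    """Apply the rules in order; the first filter that matches decides."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return True   # no filter matched, so the url passes
```

              For the filter -*.jpg+*/preferred/images/*-*/images/* this
              blocks http://www.foo.com/images/a.png but allows
              http://www.foo.com/preferred/images/a.png, because the allow
              filter appears earlier in the expression than the block filter
              for "/images/".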

              b. Server filters (for the control.serverfilter setting)

              If you enable fetching of links from external servers, you can
              write a server filter in a similar way to url filters. This
              also allows you to write no-pass and go-through filters. The
              main difference is that in url filters the character "*" is
              ignored, whereas in server filters it matches any character or
              sequence of characters.

              Example: Block all files from the server adserver.com:

              -adserver.com/*

              Example: Block all files from the server niceimages.com  in  the
              path /advertising/, but allow all other paths.

              -*niceimages.com/*/advertising/*

              Note that control.serverfilter, if specified, is checked before
              control.urlfilter, so any result of the control.serverfilter
              setting takes precedence.

       Retrieval of failed links

              Tells the program whether to try refetching links that failed to
              download. Retry is attempted the number of times specified by
              this variable’s value.

              Retry  will  be  attempted  after a gap of 0.5 seconds after the
              first attempt for every url  that  failed  due  to  a  non-fatal
              error.  Also  retry  will be attempted for all failed links once
              again at the end of the mirroring.

              This option is controlled by the variable  download.retryfailed.
              The  default value is 1. (Retry will be attempted once for every
              failed link, and once again at the end of the download.)

              To disable retry, set this variable to zero.
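
              The two retry stages described above (an immediate retry after
              a short gap, then one more pass at the end) can be sketched as
              follows. This is an illustration, not HarvestMan's code: the
              names fetch_with_retry and download_all are hypothetical, fetch
              is any caller-supplied function, and a "non-fatal error" is
              assumed to surface as OSError.

```python
import time


def fetch_with_retry(url, fetch, retries=1, gap=0.5, sleep=time.sleep):
    """Call fetch(url); on OSError retry up to `retries` times after `gap`."""
    attempts = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            if attempts >= retries:
                raise
            attempts += 1
            sleep(gap)


def download_all(urls, fetch, retries=1, gap=0.5, sleep=time.sleep):
    """Fetch every url, then retry the failures once more at the end."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch_with_retry(url, fetch, retries, gap, sleep)
        except OSError:
            failed.append(url)
    if retries > 0:
        # Final pass over everything that is still missing.
        for url in list(failed):
            try:
                results[url] = fetch(url)
                failed.remove(url)
            except OSError:
                pass
    return results, failed
```

              Passing retries=0 disables both stages, matching a value of
              zero for download.retryfailed.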

       Localization of URLs

              Tells the program whether to localize the links in all
              downloaded html files, i.e. to modify Internet links into file
              links. This helps the user to browse the website as if it were
              local. HarvestMan also converts any relative url links to
              absolute url links if their files were not downloaded.

              This is enabled by default. It is a good idea to  always  enable
              it.

              Note  that  localization  of  links  is  done  at the end of the
              download.

              Controlled by the variable indexer.localise.

              From version 1.1.2, this option supports 3 values. A value of
              zero disables localization. A value of 1 performs localization
              by replacing url links with absolute file path names.

              A  value  of  2 will perform localization by replacing url links
              with relative file path names. Relative localization  helps  you
              to  browse  the  downloaded  website from different file systems
              since the  url  paths  are  relative  (to  directory).  Absolute
              localization  locks your downloaded website to the filesystem of
              the machine where you ran HarvestMan. From version 1.1.2, the
              default value of this option is 2, i.e. it performs relative
              localization by default.
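
              The difference between the three values can be illustrated with
              a small sketch. The function localize_link is hypothetical and
              assumes POSIX-style absolute paths for files on the disk; it is
              not taken from HarvestMan itself.

```python
import posixpath


def localize_link(html_file, target_file, mode=2):
    """Return the link to write into html_file for a downloaded target_file.

    mode 0: localization disabled (returns None, the link is left alone)
    mode 1: absolute file path
    mode 2: path relative to the directory containing html_file
    """
    if mode == 0:
        return None
    if mode == 1:
        return target_file
    return posixpath.relpath(target_file, start=posixpath.dirname(html_file))
```

              A relative link keeps working when the whole mirror directory
              is moved or mounted elsewhere, which is why it is the more
              portable choice.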

              Another variable related to localization was added in the 1.1.2
              release. It allows you to perform JIT (Just In Time)
              localization of html files, i.e. immediately after they are
              downloaded, instead of at the end of the download.

              This option is described under "JIT Localization" below.

       URL List File

              You can tell HarvestMan to dump a list of crawled urls to a file
              by setting this option. The variable for this is
              files.urlslistfile; it is disabled by default.

       Error log file

              A   file   to   write  error  logs  into.  This  by  default  is
              ’errors.log’.   This  file  will  be  created  in  the   project
              directory of the current project.

              Variable: files.errorfile

              Note:  From version 1.2, this feature is disabled. Don’t use it.

       Message Log File

              From version 1.4 (this version), the message log file  is  named
              <project>.log  for  a  project  ’project’  and  is automatically
              created in the project directory of the project. This is  not  a
              configurable option anymore.

       Browse Index Page

              HarvestMan  creates  an html project browser page in the Project
              Directory and appends the starting (index) files of each project
              to this page, at the end of each project. This option can be
              enabled or disabled by setting the variable display.browsepage.
              By default, this is enabled.

       JIT Localization

              HarvestMan,  from  version 1.1.2, has an option to localize HTML
              files immediately after they are downloaded, instead of  at  the
              end  of  the  project. This option can be enabled by setting the
              variable, indexer.jitlocalise, to a value greater than zero.

              By default this is disabled.

              Note: From version 1.2, this option is disabled. Don’t use it.

       File Integrity Verification

              HarvestMan verifies the integrity of downloaded files by
              performing an md5 checksum check. From version 1.4 this option
              is disabled and is not available in the configuration file.

       Cookie Support

              From version 1.2, we have added support for cookies. The support
              is basic and is based on RFC 2109. By default, cookies in web
              pages are saved in a cookie file inside the project directory
              and read back for pages which require these cookies. This can be
              controlled by the variable download.cookies. The default value
              is 1.

              For disabling cookies, set this variable to zero (0).

       Files Caching

              From version 1.2, we support caching/update of downloaded files.
              A binary cache file is created for every project. This file
              contains an md5 checksum of the file, its location on  the  disk
              and  the url from which it was downloaded. Next time the project
              is re-started, the program checks the urls  against  this  cache
              file.  The  files  are downloaded only if their checksum differs
              from the  checksum  of  the  cached  file,  otherwise  they  are
              ignored.
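
              The checksum comparison can be sketched as below. The cache is
              modelled as a plain dictionary and the name is_stale is
              hypothetical; the real cache file format is binary and is not
              described in this manual.

```python
import hashlib


def is_stale(url, new_data, cache):
    """Return True if new_data for url differs from what the cache recorded.

    `cache` is a hypothetical dict mapping url -> {"md5": ..., "path": ...}.
    A url that is missing from the cache is always treated as stale.
    """
    checksum = hashlib.md5(new_data).hexdigest()
    entry = cache.get(url)
    return entry is None or entry["md5"] != checksum
```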

              This  option  is  enabled  by  default.  It is controlled by the
              variable  control.pagecache.  To  disable  caching,   set   this
              variable to zero (0).

              From version 1.4, a sub-option named control.datacache is
              available. If set to 1 (the default), the data of each url is
              also saved in the cache file. So if you lose your original
              files but the cache is present, HarvestMan can recreate the
              files of the project from the cache, provided the cache files
              are not out of date.

              You can enable data caching for small projects where the number
              of files downloaded is not too large. If the project downloads a
              lot of files, say more than 5000, you might disable data caching.

       Number of Simultaneous Network Connections

              From version 1.2, the number of simultaneous network connections
              can be controlled by modifying a config variable.

              For  all  1.0  (major)  versions  and  the  1.2  alpha  version,
              HarvestMan had a global download lock that denied more than  one
              network   connection  at  a  given  instant.  This  slowed  down
              downloads considerably.

              From  1.2  onwards,   many   simultaneous   downloads   (network
              connections)  are  possible  apart  from  multiple  threads. The
              number of simultaneous connections by default is 5. The user can
              change this by modifying the variable control.connections in the
              config file. If set to a higher value, the download threads can
              use more connections at a given instant and the download is
              faster. If set to a lower value, the threads will have to wait
              for a free connection slot whenever the number of connections
              reaches the limit. You can set it to a reasonable value
              depending on your network bandwidth. A value below 10 is
              desirable for low-bandwidth connections and above 10 for
              high-bandwidth connections. If you have a broadband or DSL
              connection allowing very high speeds, set this to a relatively
              large value like 20.

              If the number of connections is much lower than the number of
              url trackers, downloads will suffer. It is a good idea to keep
              these two values approximately the same.
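
              One common way to cap the number of simultaneous connections
              across many threads is a counting semaphore. The sketch below
              shows the idea; the class name ConnectionPool is hypothetical
              and this is not necessarily how HarvestMan implements the
              limit.

```python
import threading


class ConnectionPool:
    """Hypothetical limiter allowing at most `connections` downloads at once."""

    def __init__(self, connections=5):
        self._slots = threading.BoundedSemaphore(connections)

    def download(self, fetch, url):
        # Blocks while all slots are in use; the slot is released on exit,
        # even if fetch() raises.
        with self._slots:
            return fetch(url)
```

              Threads beyond the limit simply wait inside download() until a
              slot becomes free, which is the waiting behaviour described
              above.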

       Project Timeout

              From  version  1.2  onwards, HarvestMan allows for a way to exit
              projects which hang due to some network or  system  problems  in
              threading.  The program monitors reads/writes from the url queue
              and keeps a time difference  value  between  now  and  the  last
              read/write  operation  on  the  queue. If no threads are writing
              to/reading from the queue, the program  exits  automatically  if
              this time difference exceeds a certain timeout value. This value
              can be controlled by the  variable  control.projtimeout  in  the
              config file. Its value by default is 5 minutes (300 seconds).
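
              The idle check can be pictured with the following sketch. The
              class QueueMonitor and its methods are hypothetical names used
              only for illustration.

```python
import time


class QueueMonitor:
    """Hypothetical helper tracking the last read/write on the url queue."""

    def __init__(self, projtimeout=300.0, clock=time.monotonic):
        self.projtimeout = projtimeout
        self._clock = clock
        self._last_activity = clock()

    def touch(self):
        """Record a read from or a write to the queue."""
        self._last_activity = self._clock()

    def timed_out(self):
        """True once the queue has been idle for longer than projtimeout."""
        return self._clock() - self._last_activity > self.projtimeout
```

              Every queue operation would call touch(), and a monitoring loop
              would end the project once timed_out() returns True.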

       Javascript retrieval

              From  version  1.2, HarvestMan can fetch javascript source files
              (.js files) from webpages.  This  has  been  done  by  using  an
              enhanced HTML parser that can download javascript files and java
              applets.

              The variable for this is  download.javascript.  This  option  is
              enabled by default.

              To skip javascript files, set this option to zero (0).

       Java applets retrieval

              From  version  1.2,  HarvestMan  can  fetch  java applets(.class
              files) from webpages. This has been done by  using  an  enhanced
              HTML parser that can download javascript files and java applets.

              The variable for this is  download.javaapplet.  This  option  is
              enabled by default.

              To skip java applet files, set this option to zero (0).

       Keyword(s) Search ( Word Filtering )

              This  is  a new feature from the 1.3 release. HarvestMan accepts
              complex boolean regular expressions for word matches inside  web
              pages. HarvestMan will download only those pages which match the
              word regular expressions.

              For example, to download  only  those  webpages  containing  the
              words,  HarvestMan and Crawler, you create the following regular
              expression and pass it as the config option  control.wordfilter.

              control.wordfilter           (HarvestMan & Crawler)

              Only  the  webpages  which  contain  both  these  words  will be
              spidered and downloaded. Note that the filter is not applied  to
              the starting page.
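
              To give an idea of how such an expression can be evaluated
              against a page, here is a small sketch. It assumes a syntax
              limited to words, "&" (and), "|" (or) and parentheses, with
              case-insensitive whole-word matching; the syntax actually
              accepted by HarvestMan may be richer, and the function name
              matches_wordfilter is hypothetical. The expression is assumed
              to be well formed.

```python
import re


def matches_wordfilter(expression, text):
    """Evaluate a boolean word expression such as '(HarvestMan & Crawler)'.

    '&' binds tighter than '|'. A word is true if it occurs in the text.
    """
    tokens = re.findall(r"[()&|]|[^\s()&|]+", expression)
    words = set(re.findall(r"\w+", text.lower()))
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            pos += 1                      # skip the closing ')'
            return value
        return token.lower() in words

    return parse_or()
```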

              This  feature  is  based  on  an  ASPN  recipe  by  Anand Pillai
              available               at                the                URL
              http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526

       Subdomain Setting

              New feature from 1.3.1 release. HarvestMan allows you to control
              whether subdomains in a domain are treated as  external  servers
              or  not, using the variable control.subdomain. If this is set to
              1, then subdomains will be considered as external servers.

              If set to 0, which is the default, subdomains in a  domain  will
              not be considered as external servers.

              For example, if the starting server is http://www.yahoo.com and
              this variable is disabled (set to zero), then
              http://in.yahoo.com will be considered part of the same domain
              and not an external server.
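
              The effect of the setting can be sketched as follows. The
              helper names are hypothetical, and the base domain is taken
              naively as the last two labels of the host name, which is an
              assumption that does not hold for domains such as
              example.co.uk.

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Naive base domain: the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_external(start_url, url, subdomain=0):
    """Decide whether url lies on a server external to start_url."""
    if subdomain:
        # Subdomains count as separate servers: compare full host names.
        return urlsplit(start_url).hostname != urlsplit(url).hostname
    return base_domain(start_url) != base_domain(url)
```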

       Skipping query forms

              To skip server side or cgi query forms, set this variable to  1.
              The  variable  is  named  control.skipqueryforms and is set to 1
              (enabled) by default.

              This        skips        links        of        the         form
              http://server.com/formquery?key=value

              To download these links set the variable to 0.
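
              A simple way to recognise such links is to check for a query
              string in the url. The sketch below assumes that this is the
              criterion; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def is_query_form(url):
    """True for links carrying a query string, e.g. .../formquery?key=value."""
    return bool(urlsplit(url).query)


def should_skip(url, skipqueryforms=1):
    """Apply the setting: skip query-form links only when it is enabled."""
    return bool(skipqueryforms) and is_query_form(url)
```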

       Controlling number of requests per server

              This  is  a  new  feature  in version 1.3.2. You can control the
              number of simultaneous requests to the same  server  by  editing
              the config variable named control.requests. This is set to 10 by
              default.

       Html cleaning up (Tidy Interface)

              From version 1.3.9, HarvestMan has an option to clean up html
              pages before sending them to the parser. This removes errors
              from web pages so that they are parsed correctly. This in turn
              helps to download web sites that otherwise might not get
              downloaded, for example because of parser errors in the
              starting html page.

              The  tidylib  source  code  is  included  along  with HarvestMan
              distribution, so you don’t need to install it separately.

              This option is enabled by  default  and  is  controlled  by  the
              variable "control.tidyhtml".

       URL and Website Priorities

              From version 1.4 onwards, HarvestMan allows the user to specify
              priorities for urls and servers.

              Every  url  has  a  default  priority,  assigned  based  on  its
              "generation".  The  generation of a url is a number based on the
              level at which the url was generated, based on the starting url.
              The  starting url has a generation 0, all urls generated from it
              have a generation 1, and so on.

              URLs with a lower generation number are given higher priority
              than urls with a higher generation. Also, html (web page) urls
              get a higher priority than other urls in the same generation.
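
              The default ordering can be pictured as a priority queue keyed
              on the generation first and on the kind of url second. The
              sketch below is only an illustration with hypothetical names;
              it does not include user-specified priorities.

```python
import heapq


def default_sort_key(generation, is_webpage):
    """Smaller keys are served first: low generation, web pages first."""
    return (generation, 0 if is_webpage else 1)


def order_urls(urls):
    """urls: iterable of (url, generation, is_webpage) tuples.

    Returns the urls in the order in which they would be served.
    """
    heap = [(default_sort_key(gen, web), index, url)
            for index, (url, gen, web) in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```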

              The user can specify priorities for urls by using the config
              variable named "control.urlpriority". This works on the basis of
              file extensions, and has a range from -5 to 5, -5 denoting
              the lowest priority and 5 denoting the highest priority.

              For example, to specify that pdf  files  should  have  a  higher
              priority we can make the following entry in the config file.

              control.urlpriority         pdf+1

              If you want to give word documents a higher priority than pdf
              files, you can give the following priority specification.

              control.urlpriority         pdf+1,doc+2

              Priority settings are separated by commas.

              If you want to put gif images at the  lowest  priority  and  jpg
              images at the highest priority,

              control.urlpriority         gif-5, jpg+5

              Similar syntax can be used for setting server priorities. The
              variable named control.serverpriority can be used to control
              this.

              Assume   that  you  want  to  download  files  from  the  server
              http://yahoo.com with a higher priority  when  compared  to  the
              server http://www.cnn.com, in the same download project.

              control.serverpriority           yahoo.com+1, cnn.com-1

              There can be other combinations also.

              A priority which is less than -5 or greater than 5 is ignored
              by the config parser.
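
              The priority specifications shown above could be parsed along
              these lines. This is a sketch based only on the examples in
              this section; the function name parse_priorities is
              hypothetical and the real config parser may behave differently
              on malformed input.

```python
import re


def parse_priorities(spec):
    """Parse a spec such as 'gif-5, jpg+5' into {'gif': -5, 'jpg': 5}.

    Entries whose priority falls outside -5..5 are ignored, as are entries
    that do not look like <name><sign><number>.
    """
    priorities = {}
    for entry in spec.split(","):
        match = re.fullmatch(r"\s*(.+?)([+-]\d+)\s*", entry)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        if -5 <= value <= 5:
            priorities[name] = value
    return priorities
```

              The same function covers both variables, since file extensions
              and server names use the same name, sign and number layout.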

       Time Limits

              From version 1.4, a project can specify a time limit in which to
              complete  downloads.  When this time limit is reached HarvestMan
              automatically terminates the project by stopping all download
              threads and cleaning up.

              This   option   can   be   specified   by   using  the  variable
              control.timelimit.

       Asynchronous URL Server

              From version 1.4, another way of managing downloads is
              available. This is an asynchronous url server, which serves urls
              to the fetcher threads. Crawler threads send urls to the server
              and fetcher threads receive them from it. The server is based on
              the asyncore module in Python, hence it offers superior
              performance and faster multiplexing of threads than the simple
              Queue. The server uses an internal queue to store urls, which
              also increases performance.

              To use this feature, enable the variable network.urlserver.
              This option is disabled by default.

              The server listens by default to the port 3081. You  can  change
              it by modifying the variable network.urlport in the config file.

       Locale Settings

              From version 1.4, you can set a specific locale for HarvestMan.
              Sometimes when parsing non-English websites, the parser can fail
              to report some pages, because the language is  not  set  to  the
              language  of the webpage. In such cases, you can manually change
              the language and  other  settings  by  changing  the  locale  of
              HarvestMan.

              The locale can be changed by modifying the variable
              system.locale. This is set to the American locale by default on
              non-Windows platforms and to the default locale (’C’) on Windows
              platforms.

              For example, if you see a lot of html parsing errors when
              browsing a Russian site, you could try setting the locale to,
              say, ’russian’.

       Maximum File Size

              A new option from version 1.4. By default, HarvestMan limits the
              size of a single file to 1 MB. A url whose file size is more
              than this will be skipped. This can be controlled by the
              variable control.maxfilesize.

       URL Tree File

              From version 1.4, a url tree file, i.e. a file displaying the
              relation of parent and child urls in a project, can be saved at
              the end of the project. This file can be saved in two formats,
              text or html. This option is controlled by the variable named
              files.urltreefile. The program figures out which format to use
              by looking at the file name extension.
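
              Choosing the format from the file name might look like the
              sketch below. The exact extensions recognised by HarvestMan are
              not listed here, so treating .htm and .html as html and
              everything else as text is an assumption, and the function name
              is hypothetical.

```python
import os


def tree_format(filename):
    """Pick 'html' for .htm/.html names and 'text' for everything else."""
    extension = os.path.splitext(filename)[1].lower()
    return "html" if extension in (".htm", ".html") else "text"
```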

       Ad Filtering

              A new feature from version 1.4. URLs which look like
              advertisement graphics, banners or pop-ups will be filtered by
              HarvestMan. This works by using regular expressions. The logic
              for this is borrowed from the Internet Junkbuster program. The
              option is control.junkfilter.

              This option is enabled by default.
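
              A regular-expression filter of this kind can be sketched as
              follows. The patterns are invented for illustration and are
              not the rules used by HarvestMan or Junkbuster; the names
              JUNK_PATTERNS and is_junk are hypothetical.

```python
import re

# Hypothetical patterns; the real rules are not listed in this manual.
JUNK_PATTERNS = [re.compile(pattern, re.IGNORECASE) for pattern in (
    r"/ads?/",        # .../ad/... or .../ads/...
    r"/banners?/",    # .../banner/... or .../banners/...
    r"popup",         # anything mentioning a pop-up
)]


def is_junk(url):
    """True if the url matches any of the junk patterns."""
    return any(pattern.search(url) for pattern in JUNK_PATTERNS)
```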

OPTIONS

       -h, --help
              Show help message and exit

       -v, --version
              Print version information and exit

       -p, --project=PROJECT
              Set the (optional) project name to PROJECT.

       -b, --basedir=BASEDIR
              Set the (optional) base directory to BASEDIR.

       -C, --configfile=CFGFILE
              Read all options from the configuration file CFGFILE.

       -P, --projectfile=PROJFILE
              Load the project file PROJFILE.

       -V, --verbosity=LEVEL
              Set the verbosity level to LEVEL. Ranges from 0-5.

       -f, --fetchlevel=LEVEL
              Set the fetch-level of this project to LEVEL. Ranges from 0-4.

       -N, --nocrawl
              Only download the passed url (wget-like behaviour).

       -l, --localize=yes/no
              Localize urls after download.

       -r, --retry=NUM
              Set the number of retry attempts for failed urls to NUM.

       -Y, --proxy=PROXYSERVER
              Enable and set proxy to PROXYSERVER (host:port).

       -U, --proxyuser=USERNAME
              Set username for proxy server to USERNAME.

       -W, --proxypass=PASSWORD
              Set password for proxy server to PASSWORD.

       -n, --connections=NUM
              Limit number of simultaneous network connections to NUM.

       -c, --cache=yes/no
              Enable/disable caching of downloaded files.  If  enabled,  files
              won’t  be  downloaded  unless  their timestamp is newer than the
              cache timestamp.

       -d, --depth=DEPTH
              Set the limit on the depth of urls to DEPTH.

       -w, --workers=NUM
              Enable worker threads and set the number of  worker  threads  to
              NUM.

       -T, --maxthreads=NUM
              Limit the number of tracker threads to NUM.

       -M, --maxfiles=NUM
              Limit the number of files downloaded to NUM.

       -t, --timelimit=TIME
              Run the program for the specified time TIME.

       -s, --urlserver=yes/no
              Enable/disable urlserver running on port 3081.

       -S, --subdomain=yes/no
              Enable/disable  subdomain  setting.  If this is enabled, servers
              with the same base server name such  as  http://img.foo.com  and
              http://pager.foo.com will be considered as distinct servers.

       -R, --robots=yes/no
              Enable/disable Robot Exclusion Protocol.

       -u, --urlfilter=FILTER
              Use regular expression FILTER for filtering urls.

       --urlslist=FILE
              Dump a list of urls to file FILE.

       --urltree=FILE
              Dump a file containing hierarchy of urls to FILE.

FILES

       config.xml

AUTHOR

       harvestman  was  written by Anand Pillai <anandpillai@letterboxes.org>.
       For latest info, visit http://harvestmanontheweb.com

       This manual page was written by Kumar  Appaiah  <akumar@ee.iitm.ac.in>,
       for the Debian project (but may be used by others).

                               February  5, 2006
