NAME

       webcheck - website link checker

SYNOPSIS

       webcheck [OPTION]...  URL

DESCRIPTION

       webcheck  will  check  the  document  at the specified URL for links to
       other documents, follow these links recursively and  generate  an  HTML
       report.

       -i, --internal=PATTERN
              Mark URLs matching the PATTERN (Perl-style regular expression)
              as internal links.  Can be used multiple times.  Note that the
              PATTERN is matched against the full URL.  URLs matching this
              PATTERN will be considered internal, even if they also match
              one of the --external PATTERNs.
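              For example, to also treat a separate download host as part of
              the site (the hostname and pattern here are only illustrative):
                  webcheck -i 'https?://downloads\.example\.com/' \
                      http://www.example.com/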

       -x, --external=PATTERN
              Mark URLs matching the PATTERN (Perl-style regular expression)
              as external links.  Can be used multiple times.  Note that the
              PATTERN is matched against the full URL.
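              For example, to verify links into a bug tracker without
              crawling it (the hostname and pattern are illustrative):
                  webcheck -x 'https?://bugs\.example\.com/' \
                      http://www.example.com/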

       -y, --yank=PATTERN
              Do not check URLs matching the PATTERN (Perl-style regular
              expression).  Similar to the -x flag, except that -y causes
              webcheck to skip the matching link entirely, whereas -x still
              checks the link itself but does not follow its children.  Can
              be used multiple times.  Note that the PATTERN is matched
              against the full URL.
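              For example, to skip logout links and printable page versions
              entirely (both patterns are illustrative):
                  webcheck -y '/logout' -y 'printable=yes' \
                      http://www.example.com/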

       -b, --base-only
              Consider  any URL not starting with the base URL to be external.
              For example, if you run
                  webcheck -b http://www.example.com/foo
              then http://www.example.com/foo/bar will be considered  internal
              whereas http://www.example.com/ will be considered external.  By
              default all the pages on the site will be considered internal.

       -a, --avoid-external
              Avoid external links.  Normally if webcheck is examining an HTML
              page and it finds a link that points to an external document, it
              will check to see if that external document exists.   This  flag
              disables that action.

       --ignore-robots
              Do not retrieve and parse robots.txt files.  By default
              robots.txt files are retrieved and honored.  Use this option
              only if you are sure you want to override the webmaster’s
              decision.
              For more information on robots.txt handling see the NOTES
              section below.

       -q, --quiet, --silent
              Do not print out progress as webcheck traverses a site.

       -d, --debug
              Print  debugging  information  while  crawling  the  site.  This
              option is mainly useful for developers.

       -o, --output=DIRECTORY
              Output directory.  Use this option to specify the directory
              where webcheck will write its reports.  The default is the
              current directory or the directory specified in config.py.  If
              this directory does not exist it will be created for you (if
              possible).
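              For example, to write the report into a dedicated directory
              (the path is illustrative):
                  webcheck -o /var/www/report http://www.example.com/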

       -c, --continue
              Try to continue from a previous run.  When using this option
              webcheck will look for a webcheck.dat file in the output
              directory.  This file is read to restore the state of the
              previous run, allowing webcheck to continue a previously
              interrupted run.  When this option is used, the --internal,
              --external and --yank options are ignored, as are any URL
              arguments.  The --base-only and --avoid-external options
              should be the same as in the previous run.
              Note that this option is experimental and its semantics may
              change in future releases (especially in relation to other
              options).  Also note that the stored files are not guaranteed
              to be compatible between releases.
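              For example, an interrupted run that was started as
                  webcheck -b -o report http://www.example.com/
              could be resumed later, keeping the same classification
              options, with:
                  webcheck -c -b -o report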

       -f, --force
              Overwrite  files  without  asking.   This option is required for
              running webcheck non-interactively.

       -r, --redirects=N
              Redirect depth: the number of redirects webcheck should follow
              when checking a link.  A value of 0 means follow all
              redirects.
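              For example, to follow at most two redirects per link:
                  webcheck -r 2 http://www.example.com/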

       -u, --userpass=URL
              Specify  a URL with username and password information to use for
              basic authentication when visiting the site.
              e.g. http://test:secret@example.com/
              This option may be specified multiple times.
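              For instance, using the example credentials shown above to
              crawl the site:
                  webcheck -u http://test:secret@example.com/ \
                      http://example.com/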

       -w, --wait=SECONDS
              Wait SECONDS between document retrievals.  Usually webcheck
              will process a URL and immediately move on to the next.
              However, on heavily loaded systems it may be desirable to have
              webcheck pause between requests.  This option can be set to
              any non-negative number.
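              For example, to pause two seconds between retrievals:
                  webcheck -w 2 http://www.example.com/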

       -v, --version
              Show version of program.

       -h, --help
              Show short summary of options.

URL CLASSES

       URLs are divided into two classes:

       Internal URLs are retrieved and  the  retrieved  item  is  checked  for
       syntax.   Also, the retrieved item is searched for links to other items
       (of any class) and these links are followed.

       External URLs are only retrieved to test whether they are valid and  to
       gather  some  basic  information  from them (title, size, content-type,
       etc).  The retrieved items are not inspected for links to other  items.

       Apart from their class, URLs can also be yanked (as specified with
       the --yank or --avoid-external options).  Yanked URLs can be either
       internal or external and are not retrieved or checked at all.  URLs
       of unsupported schemes are also considered yanked.
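
       As an illustration of these classes, the following invocation keeps
       a second host internal, treats a bug tracker as external and skips
       mailing list archives entirely (all hostnames and patterns are
       illustrative):
           webcheck -i '//www2\.example\.com/' -x '//bugs\.example\.com/' \
               -y '/pipermail/' http://www.example.com/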

EXAMPLES

       Check the site www.example.com but consider any path with "/webcheck"
       in it to be external.
           webcheck -x /webcheck http://www.example.com/
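
       Check a site non-interactively (for example from cron), overwriting
       a previous report without prompting and suppressing progress output
       (the output path is illustrative):
           webcheck -f -q -o /var/www/webcheck http://www.example.com/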

NOTES

       When checking internal URLs webcheck honors the robots.txt file,
       identifying itself as user-agent webcheck.  Disallowed links will not
       be checked at all, as if the -y option had been specified for that
       URL.  To allow webcheck to crawl parts of a site that other robots
       are disallowed from visiting, use something like:
           User-agent: *
           Disallow: /foo

           User-agent: webcheck
           Allow: /foo

ENVIRONMENT

       <scheme>_proxy
              Proxy URL for <scheme>.
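
       For example, to retrieve http URLs through a local proxy (the proxy
       address is illustrative):
           http_proxy=http://localhost:8080/ webcheck http://www.example.com/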

REPORTING BUGS

       Bug reports should be sent to the current maintainer
       <arthur@ch.tudelft.nl>.  More information on reporting bugs can be
       found on the webcheck homepage:
       http://ch.tudelft.nl/~arthur/webcheck/

COPYRIGHT

       Copyright © 1998, 1999 Albert Hopkins (marduk)
       Copyright © 2002 Mike W. Meyer
       Copyright © 2005, 2006, 2007, 2008 Arthur de Jong
       webcheck  is  free  software;  see  the  source for copying conditions.
       There is NO warranty; not even for MERCHANTABILITY  or  FITNESS  FOR  A
       PARTICULAR PURPOSE.
       The  files  produced  as  output from the software do not automatically
       fall under the copyright of  the  software,  unless  explicitly  stated
       otherwise.