NAME
linkchecker - check HTML documents and websites for broken links
SYNOPSIS
linkchecker [options] [file-or-url]...
DESCRIPTION
LinkChecker features:
o recursive checking and multithreading,
o output in colored or normal text, HTML, SQL, CSV or a sitemap graph in
  GML or XML,
o support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
  local file links,
o restriction of link checking with regular expression filters for URLs,
o proxy support,
o username/password authorization for HTTP and FTP,
o robots.txt exclusion protocol support,
o i18n support,
o a command line interface,
o a (Fast)CGI web interface (requires an HTTP server).
EXAMPLES
The most common use checks the given domain recursively, plus any URL
pointing outside of the domain:
linkchecker http://treasure.calvinsplayground.de/
Beware that this checks the whole site which can have thousands of
URLs. Use the -r option to restrict the recursion depth.
Don't connect to mailto: hosts, only check their URL syntax. All other
links are checked as usual:
linkchecker --ignore-url=^mailto: www.mysite.org
Checking a local HTML file on Unix:
linkchecker ../bla.html
Checking from stdin:
echo "bla.html" | linkchecker --stdin
Checking a local HTML file on Windows:
linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with www.:
linkchecker www.myhomepage.de
You can skip the ftp:// URL part if the domain starts with ftp.:
linkchecker -r0 ftp.linux.org
Generate a sitemap graph and convert it with the graphviz dot utility:
linkchecker -odot -v www.myhomepage.de | dot -Tps > sitemap.ps
OPTIONS
General options
-h, --help
Help me! Print usage information for this program.
-fFILENAME, --config=FILENAME
Use FILENAME as configuration file. By default LinkChecker first
searches /etc/linkchecker/linkcheckerrc and then
~/.linkchecker/linkcheckerrc.
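For example, to run with a configuration file from a non-default
location (the path and domain below are placeholders):
linkchecker --config=/path/to/my-linkcheckerrc www.example.com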
-I, --interactive
Ask for a URL if none is given on the command line.
-tNUMBER, --threads=NUMBER
Generate no more than the given number of threads. The default
number of threads is 10. To disable threading, specify a
non-positive number.
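For example, to check a site with at most 20 threads, or with
threading disabled (www.example.com is a placeholder):
linkchecker -t20 www.example.com
linkchecker -t0 www.example.com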
--priority
Run with normal thread scheduling priority. By default
LinkChecker runs with low thread priority to be suitable as a
background job.
-V, --version
Print version and exit.
--allow-root
Do not drop privileges when running as root user on Unix
systems.
--stdin
Read a list of whitespace-separated URLs to check from stdin.
Output options
-v, --verbose
Log all checked URLs once. Default is to log only errors and
warnings.
--complete
Log all URLs, including duplicates. Default is to log duplicate
URLs only once.
--no-warnings
Don't log warnings. Default is to log warnings.
-WREGEX, --warning-regex=REGEX
Define a regular expression; a warning is printed when it matches
any content of the checked link. This applies only to valid
pages, so that their content can be retrieved.
Use this to check for pages that contain some form of error, for
example "This page has moved" or "Oracle Application Server
error".
--warning-size-bytes=NUMBER
Print a warning if content size info is available and exceeds
the given number of bytes.
--check-html
Check syntax of HTML URLs with local library (HTML tidy).
--check-html-w3
Check syntax of HTML URLs with W3C online validator.
--check-css
Check syntax of CSS URLs with local library (cssutils).
--check-css-w3
Check syntax of CSS URLs with W3C online validator.
--scan-virus
Scan content of URLs for viruses with ClamAV.
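For example, to also validate HTML and CSS syntax with the local
libraries while checking links (www.example.com is a placeholder):
linkchecker --check-html --check-css www.example.com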
-q, --quiet
Quiet operation, an alias for -o none. This is only useful with
-F.
-oTYPE[/ENCODING], --output=TYPE[/ENCODING]
Specify output type as text, html, sql, csv, gml, dot, xml, none
or blacklist. Default type is text. The various output types
are documented below.
The ENCODING specifies the output encoding; the default is that
of your locale. Valid encodings are listed at
http://docs.python.org/lib/standard-encodings.html.
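For example, to print results as CSV encoded in UTF-8
(www.example.com is a placeholder):
linkchecker --output=csv/utf-8 www.example.com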
-FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
Output to a file linkchecker-out.TYPE,
$HOME/.linkchecker/blacklist for blacklist output, or FILENAME
if specified. The ENCODING specifies the output encoding; the
default is that of your locale. Valid encodings are listed at
http://docs.python.org/lib/standard-encodings.html. For the
none output type, the FILENAME and ENCODING parts are ignored.
Otherwise, if the file already exists, it will be overwritten.
You can specify this option more than once. Valid file output
types are text, html, sql, csv, gml, dot, xml, none or
blacklist. Default is no file output. The various output types
are documented below. Note that you can suppress all console
output with the option -o none.
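For example, to write an HTML report to a file of your choosing
while suppressing console output (report.html and www.example.com
are placeholders):
linkchecker -o none --file-output=html/utf-8/report.html www.example.com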
--no-status
Do not print check status messages.
-DSTRING, --debug=STRING
Print debugging output for the given logger. Available loggers
are cmdline, checking, cache, gui, dns and all. Specifying all
is an alias for specifying all available loggers. The option
can be given multiple times to debug with more than one logger.
For accurate results, threading will be disabled during debug
runs.
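For example, to debug the checking and cache loggers in one run
(www.example.com is a placeholder):
linkchecker --debug=checking --debug=cache www.example.com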
--trace
Print tracing information.
--profile
Write profiling data into a file named linkchecker.prof in the
current working directory. See also --viewprof.
--viewprof
Print out previously generated profiling data. See also
--profile.
Checking options
-rNUMBER, --recursion-level=NUMBER
Check recursively all links up to the given depth. A negative
depth enables infinite recursion. Default depth is infinite.
--no-follow-url=REGEX
Check but do not recurse into URLs matching the given regular
expression.
This option can be given multiple times.
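For example, to check an archive area without recursing into it
(the URLs below are placeholders):
linkchecker --no-follow-url="^http://www\.example\.com/archive/" http://www.example.com/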
--ignore-url=REGEX
Only check syntax of URLs matching the given regular expression.
This option can be given multiple times.
-C, --cookies
Accept and send HTTP cookies according to RFC 2109. Only cookies
which are sent back to the originating server are accepted.
Sent and accepted cookies are provided as additional logging
information.
--cookiefile=FILENAME
Read a file with initial cookie data. The cookie data format is
explained below.
-a, --anchors
Check HTTP anchor references. Default is not to check anchors.
This option enables logging of the warning url-anchor-not-found.
-uSTRING, --user=STRING
Try the given username for HTTP and FTP authorization. For FTP
the default username is anonymous. For HTTP there is no default
username. See also -p.
-pSTRING, --password=STRING
Try the given password for HTTP and FTP authorization. For FTP
the default password is anonymous@. For HTTP there is no default
password. See also -u.
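For example, to check a password-protected area (the credentials
and URL below are placeholders):
linkchecker --user=webmaster --password=secret http://www.example.com/private/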
--timeout=NUMBER
Set the timeout for connection attempts in seconds. The default
timeout is 60 seconds.
-PNUMBER, --pause=NUMBER
Pause the given number of seconds between two subsequent
connection requests to the same host. Default is no pause
between requests.
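For example, to be gentle to a slow server by waiting two seconds
between requests and lowering the connection timeout
(www.example.com is a placeholder):
linkchecker --timeout=20 --pause=2 www.example.com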
-NSTRING, --nntp-server=STRING
Specify an NNTP server for news: links. Default is the
environment variable NNTP_SERVER. If no host is given, only the
syntax of the link is checked.
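For example, to check news: links against a specific server (the
server name below is a placeholder):
linkchecker --nntp-server=news.example.com news:comp.lang.python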
CONFIGURATION FILES
Configuration files can specify all options above. They can also
specify some options that cannot be set on the command line. See
linkcheckerrc(5) for more info.
OUTPUT TYPES
Note that by default only errors and warnings are logged. You should
use the --verbose option to get the complete URL list, especially when
outputting a sitemap graph format.
text Standard text logger, logging URLs in keyword: argument fashion.
html Log URLs in keyword: argument fashion, formatted as HTML.
Additionally has links to the referenced pages. Invalid URLs
have HTML and CSS syntax check links appended.
csv Log check result in CSV format with one URL per line.
gml Log parent-child relations between linked URLs as a GML sitemap
graph.
dot Log parent-child relations between linked URLs as a DOT sitemap
graph.
gxml Log check result as a GraphXML sitemap graph.
xml Log check result as machine-readable XML.
sql Log check result as SQL script with INSERT commands. An example
script to create the initial SQL table is included as
create.sql.
blacklist
Suitable for cron jobs. Logs the check result into a file
~/.linkchecker/blacklist which only contains entries with
invalid URLs and the number of times they have failed.
none Logs nothing. Suitable for debugging or checking the exit code.
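For example, the blacklist type combined with -q suits an
unattended cron run (the schedule and URL below are placeholders):
0 4 * * 1 linkchecker -q -Fblacklist http://www.example.com/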
REGULAR EXPRESSIONS
Only Python regular expressions are accepted by LinkChecker. See
http://www.amk.ca/python/howto/regex/ for an introduction to
regular expressions.
The only addition is that a leading exclamation mark negates the
regular expression.
COOKIE FILES
A cookie file contains standard RFC 822-style header data with the
following possible names:
Scheme (optional)
Sets the scheme the cookies are valid for; default scheme is
http.
Host (required)
Sets the domain the cookies are valid for.
Path (optional)
Gives the path the cookies are valid for; default path is /.
Set-cookie (optional)
Set cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. The example below will
send two cookies to all URLs starting with http://example.com/hello/
and one to all URLs starting with https://example.org/:
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"
Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
PROXY SUPPORT
To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or
$ftp_proxy environment variables to the proxy URL. The URL should be of
the form http://[user:pass@]host[:port]. LinkChecker also detects
manual proxy settings of Internet Explorer under Windows systems. On a
Mac use the Internet Config to select a proxy. You can also set a
comma-separated domain list in the $no_proxy environment variable to
ignore any proxy settings for these domains. Setting an HTTP proxy on
Unix for example looks like this:
export http_proxy="http://proxy.example.com:8080"
Proxy authentication is also supported:
export http_proxy="http://user1:mypass@proxy.example.org:8081"
Setting a proxy on the Windows command prompt:
set http_proxy=http://proxy.example.com:8080
PERFORMED CHECKS
All URLs have to pass a preliminary syntax test. Minor quoting mistakes
will issue a warning, all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
HTTP links (http:, https:)
After connecting to the given HTTP server the given path or
query is requested. All redirections are followed, and if
user/password is given it will be used as authorization when
necessary. Permanently moved pages issue a warning. All final
HTTP status codes other than 2xx are errors. HTML page contents
are checked for recursion.
Local files (file:)
A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device
files, unreadable or non-existing files are errors. HTML or
other parseable file contents are checked for recursion.
Mail links (mailto:)
A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail. For each mail
address we check the following things:
1) Check the address syntax, both of the part before and after
the @ sign.
2) Look up the MX DNS records. If we find no MX record,
print an error.
3) Check if one of the mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we get
an answer, print the verified address as an info.
FTP links (ftp:)
For FTP links we do:
1) connect to the specified host
2) try to log in with the given user and password. The default
user is anonymous, the default password is anonymous@.
3) try to change to the given directory
4) list the file with the NLST command
Telnet links (telnet:)
We try to connect and, if user/password are given, log in to
the given telnet server.
NNTP links (news:, snews:, nntp:)
We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.
Ignored links (javascript:, etc.)
An ignored link will only print a warning. No further checking
will be made.
Here is a complete list of recognized, but ignored links. The
most prominent of them are JavaScript links.
- acap: (application configuration access protocol)
- afs: (Andrew File System global file names)
- chrome: (Mozilla specific)
- cid: (content identifier)
- clsid: (Microsoft specific)
- data: (data)
- dav: (dav)
- fax: (fax)
- find: (Mozilla specific)
- gopher: (Gopher)
- imap: (internet message access protocol)
- isbn: (ISBN (int. book numbers))
- javascript: (JavaScript)
- ldap: (Lightweight Directory Access Protocol)
- mailserver: (Access to data available from mail servers)
- mid: (message identifier)
- mms: (multimedia stream)
- modem: (modem)
- nfs: (network file system protocol)
- opaquelocktoken: (opaquelocktoken)
- pop: (Post Office Protocol v3)
- prospero: (Prospero Directory Service)
- rsync: (rsync protocol)
- rtsp: (real time streaming protocol)
- service: (service location)
- shttp: (secure HTTP)
- sip: (session initiation protocol)
- tel: (telephone)
- tip: (Transaction Internet Protocol)
- tn3270: (Interactive 3270 emulation sessions)
- vemmi: (versatile multimedia interface)
- wais: (Wide Area Information Servers)
- z39.50r: (Z39.50 Retrieval)
- z39.50s: (Z39.50 Session)
RECURSION
Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:
1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.
3. The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
with the --recursion-level option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
the --ignore-url option.
6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that directory,
not just a subset like index.htm*.
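Putting such restrictions together, a bounded recursive check could
look like this (the URL is a placeholder):
linkchecker -r2 --ignore-url=^mailto: http://www.example.com/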
NOTES
URLs on the command line starting with ftp. are treated like ftp://ftp.,
URLs starting with www. are treated like http://www.. You can also
give local files as arguments.
If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host. Use the -s and -i
options to prevent this.
Javascript links are currently ignored.
If your platform does not support threading, LinkChecker disables it
automatically.
You can supply multiple user/password pairs in a configuration file.
When checking news: links the given NNTP host doesn't need to be the
same as the host of the user browsing your pages.
ENVIRONMENT
NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
no_proxy - comma-separated list of domains to not contact over a proxy
server
LC_MESSAGES, LANG, LANGUAGE - specify output language
RETURN VALUE
The return value is non-zero when
o invalid links were found, or
o link warnings were found and warnings are enabled, or
o a program error occurred.
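This makes it possible to act on the exit status in a shell script,
for example (the URL is a placeholder):
linkchecker -q http://www.example.com/ || echo "LinkChecker found problems"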
LIMITATIONS
LinkChecker consumes memory for each queued URL to check. With
thousands of queued URLs the amount of consumed memory can become quite
large. This might slow down the program or even the whole system.
FILES
/etc/linkchecker/linkcheckerrc, ~/.linkchecker/linkcheckerrc - default
configuration files
~/.linkchecker/blacklist - default blacklist logger output filename
linkchecker-out.TYPE - default logger file output name
http://docs.python.org/lib/standard-encodings.html - valid output
encodings
http://www.amk.ca/python/howto/regex/ - regular expression
documentation
SEE ALSO
linkcheckerrc(5)
AUTHOR
Bastian Kleineidam <calvin@users.sourceforge.net>
COPYRIGHT
Copyright (C) 2000-2010 Bastian Kleineidam