NAME
linkchecker - check HTML documents and websites for broken links
SYNOPSIS
linkchecker [options] [file-or-url]...
DESCRIPTION
LinkChecker features:
o recursive checking and multithreading,
o output in colored or normal text, HTML, SQL, CSV or a sitemap graph in
  GML or XML,
o support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
  local file links,
o restriction of link checking with regular expression filters for URLs,
o proxy support,
o username/password authorization for HTTP and FTP,
o robots.txt exclusion protocol support,
o i18n support,
o a command line interface,
o a (Fast)CGI web interface (requires an HTTP server).
EXAMPLES
The most common use checks the given domain recursively, plus any URL
pointing outside of the domain:
linkchecker http://treasure.calvinsplayground.de/
Beware that this checks the whole site which can have thousands of
URLs. Use the -r option to restrict the recursion depth.
Don't connect to mailto: hosts, only check their URL syntax. All other
links are checked as usual:
linkchecker --ignore-url=^mailto: www.mysite.org
Checking a local HTML file on Unix:
linkchecker ../bla.html
Checking from stdin:
echo "bla.html" | linkchecker --stdin
Checking a local HTML file on Windows:
linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with www.:
linkchecker www.myhomepage.de
You can skip the ftp:// URL part if the domain starts with ftp.:
linkchecker -r0 ftp.linux.org
Generate a sitemap graph and convert it with the graphviz dot utility:
linkchecker -odot -v www.myhomepage.de | dot -Tps > sitemap.ps
OPTIONS
General options
-h, --help
Help me! Print usage information for this program.
-fFILENAME, --config=FILENAME
Use FILENAME as configuration file. By default LinkChecker first
searches /etc/linkchecker/linkcheckerrc and then
~/.linkchecker/linkcheckerrc.
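For example, to run with a configuration file from a non-default
location (the path and domain below are placeholders):
linkchecker --config=/path/to/my-linkcheckerrc www.example.com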
-I, --interactive
Ask for a URL if none is given on the command line.
-tNUMBER, --threads=NUMBER
Generate no more than the given number of threads. The default
number of threads is 10. To disable threading, specify a
non-positive number.
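For example, to check a site with at most 20 threads, or with
threading disabled (www.example.com is a placeholder):
linkchecker -t20 www.example.com
linkchecker -t0 www.example.com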
--priority
Run with normal thread scheduling priority. By default
LinkChecker runs with low thread priority to be suitable as a
background job.
-V, --version
Print version and exit.
--allow-root
Do not drop privileges when running as root user on Unix
systems.
--stdin
Read a list of whitespace-separated URLs to check from stdin.
Output options
-v, --verbose
Log all checked URLs once. Default is to log only errors and
warnings.
--complete
Log all URLs, including duplicates. Default is to log duplicate
URLs only once.
--no-warnings
Don't log warnings. Default is to log warnings.
-WREGEX, --warning-regex=REGEX
Define a regular expression; a warning is printed when it matches
any content of the checked link. This applies only to valid
pages, so that their content can be retrieved.
Use this to check for pages that contain some form of error, for
example "This page has moved" or "Oracle Application Server
error".
--warning-size-bytes=NUMBER
Print a warning if content size info is available and exceeds
the given number of bytes.
--check-html
Check syntax of HTML URLs with local library (HTML tidy).
--check-html-w3
Check syntax of HTML URLs with W3C online validator.
--check-css
Check syntax of CSS URLs with local library (cssutils).
--check-css-w3
Check syntax of CSS URLs with W3C online validator.
--scan-virus
Scan content of URLs for viruses with ClamAV.
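For example, to also validate HTML and CSS syntax with the local
libraries while checking links (www.example.com is a placeholder):
linkchecker --check-html --check-css www.example.com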
-q, --quiet
Quiet operation, an alias for -o none. This is only useful with
-F.
-oTYPE[/ENCODING], --output=TYPE[/ENCODING]
Specify output type as text, html, sql, csv, gml, dot, xml, none
or blacklist. Default type is text. The various output types
are documented below.
The ENCODING specifies the output encoding; the default is that
of your locale. Valid encodings are listed at
http://docs.python.org/lib/standard-encodings.html.
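For example, to print results as CSV encoded in UTF-8
(www.example.com is a placeholder):
linkchecker --output=csv/utf-8 www.example.com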
-FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
Output to a file linkchecker-out.TYPE,
$HOME/.linkchecker/blacklist for blacklist output, or FILENAME
if specified. The ENCODING specifies the output encoding; the
default is that of your locale. Valid encodings are listed at
http://docs.python.org/lib/standard-encodings.html. For the
none output type, the FILENAME and ENCODING parts are ignored.
Otherwise, if the file already exists, it will be overwritten.
You can specify this option more than once. Valid file output
types are text, html, sql, csv, gml, dot, xml, none or
blacklist. Default is no file output. The various output types
are documented below. Note that you can suppress all console
output with the option -o none.
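For example, to write an HTML report to a file of your choosing
while suppressing console output (report.html and www.example.com
are placeholders):
linkchecker -o none --file-output=html/utf-8/report.html www.example.com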
--no-status
Do not print check status messages.
-DSTRING, --debug=STRING
Print debugging output for the given logger. Available loggers
are cmdline, checking, cache, gui, dns and all. Specifying all
is an alias for specifying all available loggers. The option
can be given multiple times to debug with more than one logger.
For accurate results, threading will be disabled during debug
runs.
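For example, to debug the checking and cache loggers in one run
(www.example.com is a placeholder):
linkchecker --debug=checking --debug=cache www.example.com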
--trace
Print tracing information.
--profile
Write profiling data into a file named linkchecker.prof in the
current working directory. See also --viewprof.
--viewprof
Print out previously generated profiling data. See also
--profile.
Checking options
-rNUMBER, --recursion-level=NUMBER
Check recursively all links up to the given depth. A negative
depth enables infinite recursion. Default depth is infinite.
--no-follow-url=REGEX
Check but do not recurse into URLs matching the given regular
expression.
This option can be given multiple times.
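For example, to check an archive area without recursing into it
(the URLs below are placeholders):
linkchecker --no-follow-url="^http://www\.example\.com/archive/" http://www.example.com/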
--ignore-url=REGEX
Only check syntax of URLs matching the given regular expression.
This option can be given multiple times.
-C, --cookies
Accept and send HTTP cookies according to RFC 2109. Only cookies
which are sent back to the originating server are accepted.
Sent and accepted cookies are provided as additional logging
information.
--cookiefile=FILENAME
Read a file with initial cookie data. The cookie data format is
explained below.
-a, --anchors
Check HTTP anchor references. Default is not to check anchors.
This option enables logging of the warning url-anchor-not-found.
-uSTRING, --user=STRING
Try the given username for HTTP and FTP authorization. For FTP
the default username is anonymous. For HTTP there is no default
username. See also -p.
-pSTRING, --password=STRING
Try the given password for HTTP and FTP authorization. For FTP
the default password is anonymous@. For HTTP there is no default
password. See also -u.
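For example, to check a password-protected area (the credentials
and URL below are placeholders):
linkchecker --user=webmaster --password=secret http://www.example.com/private/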
--timeout=NUMBER
Set the timeout for connection attempts in seconds. The default
timeout is 60 seconds.
-PNUMBER, --pause=NUMBER
Pause the given number of seconds between two subsequent
connection requests to the same host. Default is no pause
between requests.
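For example, to be gentle to a slow server by waiting two seconds
between requests and lowering the connection timeout
(www.example.com is a placeholder):
linkchecker --timeout=20 --pause=2 www.example.com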
-NSTRING, --nntp-server=STRING
Specify an NNTP server for news: links. Default is the
environment variable NNTP_SERVER. If no host is given, only the
syntax of the link is checked.
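For example, to check news: links against a specific server (the
server name below is a placeholder):
linkchecker --nntp-server=news.example.com news:comp.lang.python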
CONFIGURATION FILES
Configuration files can specify all options above. They can also
specify some options that cannot be set on the command line. See
linkcheckerrc(5) for more info.
OUTPUT TYPES
Note that by default only errors and warnings are logged. You should
use the --verbose option to get the complete URL list, especially when
outputting a sitemap graph format.
text Standard text logger, logging URLs in keyword: argument fashion.
html Log URLs in keyword: argument fashion, formatted as HTML.
Additionally has links to the referenced pages. Invalid URLs
have HTML and CSS syntax check links appended.
csv Log check result in CSV format with one URL per line.
gml Log parent-child relations between linked URLs as a GML sitemap
graph.
dot Log parent-child relations between linked URLs as a DOT sitemap
graph.
gxml Log check result as a GraphXML sitemap graph.
xml Log check result as machine-readable XML.
sql Log check result as SQL script with INSERT commands. An example
script to create the initial SQL table is included as
create.sql.
blacklist
Suitable for cron jobs. Logs the check result into a file
~/.linkchecker/blacklist which only contains entries with
invalid URLs and the number of times they have failed.
none Logs nothing. Suitable for debugging or checking the exit code.
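For example, the blacklist type combined with -q suits an
unattended cron run (the schedule and URL below are placeholders):
0 4 * * 1 linkchecker -q -Fblacklist http://www.example.com/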
REGULAR EXPRESSIONS
Only Python regular expressions are accepted by LinkChecker. See
http://www.amk.ca/python/howto/regex/ for an introduction to
regular expressions.
The only addition is that a leading exclamation mark negates the
regular expression.
COOKIE FILES
A cookie file contains standard RFC 822-style header data with the
following possible names:
Scheme (optional)
Sets the scheme the cookies are valid for; default scheme is
http.
Host (required)
Sets the domain the cookies are valid for.
Path (optional)
Gives the path the cookies are valid for; default path is /.
Set-cookie (optional)
Set cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. The example below will
send two cookies to all URLs starting with http://example.com/hello/
and one to all URLs starting with https://example.org/:
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"
Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
PROXY SUPPORT
To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or
$ftp_proxy environment variables to the proxy URL. The URL should be of
the form http://[user:pass@]host[:port]. LinkChecker also detects
manual proxy settings of Internet Explorer under Windows systems. On a
Mac use the Internet Config to select a proxy. You can also set a
comma-separated domain list in the $no_proxy environment variable to
ignore any proxy settings for these domains. Setting an HTTP proxy on
Unix for example looks like this:
export http_proxy="http://proxy.example.com:8080"
Proxy authentication is also supported:
export http_proxy="http://user1:mypass@proxy.example.org:8081"
Setting a proxy on the Windows command prompt:
set http_proxy=http://proxy.example.com:8080
PERFORMED CHECKS
All URLs have to pass a preliminary syntax test. Minor quoting mistakes
will issue a warning, all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
HTTP links (http:, https:)
After connecting to the given HTTP server the given path or
query is requested. All redirections are followed, and if
user/password is given it will be used as authorization when
necessary. Permanently moved pages issue a warning. All final
HTTP status codes other than 2xx are errors. HTML page contents
are checked for recursion.
Local files (file:)
A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device
files, unreadable or non-existing files are errors. HTML or
other parseable file contents are checked for recursion.
Mail links (mailto:)
A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail. For each mail
address we check the following things:
1) Check the address syntax, both of the part before and after
the @ sign.
2) Look up the MX DNS records. If we find no MX record,
print an error.
3) Check if one of the mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we get
an answer, print the verified address as an info.
FTP links (ftp:)
For FTP links we do:
1) connect to the specified host
2) try to log in with the given user and password. The default
user is anonymous, the default password is anonymous@.
3) try to change to the given directory
4) list the file with the NLST command
Telnet links (telnet:)
We try to connect and, if user/password are given, log in to
the given telnet server.
NNTP links (news:, snews:, nntp:)
We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.
Ignored links (javascript:, etc.)
An ignored link will only print a warning. No further checking
will be made.
Here is a complete list of recognized, but ignored links. The
most prominent of them are JavaScript links.
- acap: (application configuration access protocol)
- afs: (Andrew File System global file names)
- chrome: (Mozilla specific)
- cid: (content identifier)
- clsid: (Microsoft specific)
- data: (data)
- dav: (dav)
- fax: (fax)
- find: (Mozilla specific)
- gopher: (Gopher)
- imap: (internet message access protocol)
- isbn: (ISBN (int. book numbers))
- javascript: (JavaScript)
- ldap: (Lightweight Directory Access Protocol)
- mailserver: (Access to data available from mail servers)
- mid: (message identifier)
- mms: (multimedia stream)
- modem: (modem)
- nfs: (network file system protocol)
- opaquelocktoken: (opaquelocktoken)
- pop: (Post Office Protocol v3)
- prospero: (Prospero Directory Service)
- rsync: (rsync protocol)
- rtsp: (real time streaming protocol)
- service: (service location)
- shttp: (secure HTTP)
- sip: (session initiation protocol)
- tel: (telephone)
- tip: (Transaction Internet Protocol)
- tn3270: (Interactive 3270 emulation sessions)
- vemmi: (versatile multimedia interface)
- wais: (Wide Area Information Servers)
- z39.50r: (Z39.50 Retrieval)
- z39.50s: (Z39.50 Session)
RECURSION
Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:
1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.
3. The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
with the --recursion-level option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
the --ignore-url option.
6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that directory,
not just a subset like index.htm*.
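Putting such restrictions together, a bounded recursive check could
look like this (the URL is a placeholder):
linkchecker -r2 --ignore-url=^mailto: http://www.example.com/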
NOTES
URLs on the command line starting with ftp. are treated like ftp://ftp.,
URLs starting with www. are treated like http://www.. You can also
give local files as arguments.
If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host. Use the -s and -i
options to prevent this.
Javascript links are currently ignored.
If your platform does not support threading, LinkChecker disables it
automatically.
You can supply multiple user/password pairs in a configuration file.
When checking news: links the given NNTP host doesn't need to be the
same as the host of the user browsing your pages.
ENVIRONMENT
NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
no_proxy - comma-separated list of domains to not contact over a proxy
server
LC_MESSAGES, LANG, LANGUAGE - specify output language
RETURN VALUE
The return value is non-zero when
o invalid links were found, or
o link warnings were found and warnings are enabled, or
o a program error occurred.
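This makes it possible to act on the exit status in a shell script,
for example (the URL is a placeholder):
linkchecker -q http://www.example.com/ || echo "LinkChecker found problems"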
LIMITATIONS
LinkChecker consumes memory for each queued URL to check. With
thousands of queued URLs the amount of consumed memory can become quite
large. This might slow down the program or even the whole system.
FILES
/etc/linkchecker/linkcheckerrc, ~/.linkchecker/linkcheckerrc - default
configuration files
~/.linkchecker/blacklist - default blacklist logger output filename
linkchecker-out.TYPE - default logger file output name
http://docs.python.org/lib/standard-encodings.html - valid output
encodings
http://www.amk.ca/python/howto/regex/ - regular expression
documentation
SEE ALSO
linkcheckerrc(5)
AUTHOR
Bastian Kleineidam <calvin@users.sourceforge.net>
COPYRIGHT
Copyright (C) 2000-2010 Bastian Kleineidam