NAME
indexer - indexing WWW space.
SYNOPSIS
indexer [ -a ] [ -b ] [ -n number ] [ -e ] [ -m ] [ -q ] [ -o ] [ -r ]
[ -i ] [ -w ] [ -R ] [ -N number ] [ -p seconds ] [ -t tag ] [ -u
pattern ] [ -s status ] [ -y content-type ] [ configfile ]
indexer -C [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
type ] [ configfile ]
indexer -S [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
type ] [ configfile ]
indexer -I [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
type ] [ configfile ]
indexer -h|-?
DESCRIPTION
indexer is part of the mnoGoSearch search engine. Its purpose is to
walk through HTTP, HTTPS, FTP and NEWS servers, as well as the local
file system, recursively fetching all documents and storing metadata
about them in an SQL or built-in database. Since every document is
referenced by its corresponding URL, the metadata collected by indexer
is used later in the search process.
The behaviour of indexer is controlled mainly via the configuration
file indexer.conf(5), which it reads on startup. There is a compiled-in
default for the configuration file name and location, so you don’t need
to specify it every time you run indexer, but you can specify an
alternative configuration file as the last argument.
indexer supports HTML-formatted (text/html MIME type), XML-formatted
(text/xml MIME type) and plain text (text/plain MIME type) documents.
Support for other data types is provided by external programs called
"parsers". A parser should read data of some type from stdin and write
text/html or text/plain data to stdout. See indexer.conf(5) for
details.
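As an illustration of the parser contract described above, here is a
minimal Python sketch of an external parser: it reads a document from
stdin and writes text/plain to stdout. The input type (comma-separated
values) and the function name csv_to_text are assumptions made for this
example only; they are not part of mnoGoSearch, and how a parser is
registered is described in indexer.conf(5).

```python
import csv
import io
import sys


def csv_to_text(data: str) -> str:
    """Convert comma-separated input into plain text, one row per line."""
    rows = csv.reader(io.StringIO(data))
    # Join the cells of each row with single spaces; skip empty rows.
    return "\n".join(
        " ".join(cell.strip() for cell in row) for row in rows if row
    )


if __name__ == "__main__" and not sys.stdin.isatty():
    # Act as a filter only when data is piped in:
    # document on stdin, text/plain on stdout.
    sys.stdout.write(csv_to_text(sys.stdin.read()))
```

The conversion is kept in a separate function so that it can be checked
without going through stdin and stdout.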
You may run indexer regularly from cron(8) to keep metadata up to date.
indexer is also used to manipulate the database. It can clear some data
from the database, output statistics and calculate popularity ranking.
OPTIONS
Indexing
-a Reindex all documents even if not expired.
By default indexer reindexes only those documents that are
"expired", i.e. the time since their last reindexing is greater
than "Period" from the indexer.conf(5) file. This option disables
that check, so all documents are reindexed regardless of their
state. To achieve this, indexer first marks all URLs as "expired".
This has the following side effect: if you start indexer -a,
terminate it (for example, by pressing Ctrl-C) and start it again,
all URLs will still be considered "expired" and will be reindexed
again.
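The expiry rule and the side effect of -a can be modelled with a short
Python sketch. It is only an illustration of the description above: the
function names and the in-memory dictionary are hypothetical, and the
way indexer actually stores the "expired" mark in its database is not
shown here.

```python
from datetime import datetime, timedelta


def is_expired(last_indexed: datetime, period: timedelta, now: datetime) -> bool:
    """A document is expired when more than `period` has passed
    since its last reindexing."""
    return now - last_indexed > period


def mark_all_expired(last_indexed_by_url: dict) -> dict:
    """Model of -a: overwrite every stored timestamp so that each URL
    counts as expired. The change is persistent, which is why an
    interrupted run leaves all URLs expired."""
    return {url: datetime.min for url in last_indexed_by_url}
```

Because the mark is written to the stored state rather than kept in
memory, a later run sees every URL as expired until it has been
reindexed.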
-m Force indexer to reindex documents even if their content has not
been modified. This is achieved by disabling the If-Modified-Since
HTTP header and the MD5 hash check. This is useful if you have
changed Allow, Disallow, MaxHops or other directives in your
indexer.conf(5) file: the new rules for storing document URLs
produce a different set of URLs, and finding those URLs requires
reindexing documents even if they have not changed.
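The two checks that -m disables can be sketched in Python as follows.
This is a simplified model under stated assumptions, not indexer's
implementation: the function names and the force parameter are
hypothetical, and last_modified is assumed to be a timezone-aware UTC
datetime.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime


def request_headers(last_modified: datetime, force: bool = False) -> dict:
    """Build conditional-request headers; with force=True (the -m case)
    no If-Modified-Since header is sent."""
    if force:
        return {}
    # usegmt=True requires a datetime whose tzinfo is timezone.utc.
    return {"If-Modified-Since": format_datetime(last_modified, usegmt=True)}


def content_changed(stored_md5: str, body: bytes, force: bool = False) -> bool:
    """Compare the MD5 of the fetched body with the stored digest;
    with force=True the comparison is skipped."""
    if force:
        return True
    return hashlib.md5(body).hexdigest() != stored_md5
```

With force left at False an unchanged document is skipped twice over:
the server may answer "304 Not Modified", and an identical body yields
an identical digest.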
-n number Reindex only given number of URLs and exit.
-c seconds Limit indexing time to the given number of seconds.
-e Reindex most expired documents first. This option sorts the list
of documents to reindex by last reindexing time, so the most
"expired" documents are reindexed first. In theory the sorting
slows indexer down a bit, although in practice you may not notice
any delay.
The combination of -e and -n number can be useful: for example,
indexer -e -n 100 reindexes just the 100 most expired documents.
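The effect of combining -e with -n number can be pictured with the
following Python sketch. The function name and the dictionary of
timestamps are hypothetical; indexer performs the equivalent selection
against its database.

```python
def most_expired(last_indexed_by_url: dict, limit: int) -> list:
    """Return up to `limit` URLs, oldest last-reindexing time first."""
    # Sorting by timestamp puts the most expired documents at the front.
    ordered = sorted(last_indexed_by_url, key=last_indexed_by_url.get)
    return ordered[:limit]
```

For example, with timestamps given as seconds since some reference
point, most_expired({"a": 30, "b": 10, "c": 20}, 2) selects "b" and
then "c".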
-q Quick startup. This mode is useful if you haven’t added or
modified Server commands. indexer will not insert the URLs given in
Server commands into the database, which speeds up startup.
-k Skip locking (this option affects MySQL and PostgreSQL only).
-i Insert new URLs. New URLs must be specified using the -u or -f
options.
-p seconds Specifies time in seconds to pause after each URL.
-w Turns off warnings before clearing database.
-o Index documents with a smaller depth (hops value) first.
-r Do not try to reduce the load on remote servers by randomising the
URL fetch list before indexing (recommended for a very large number
of URLs).
-b Block starting more than one indexer instance.
-N number Run number threads, if a multithreaded version of
mnoGoSearch was compiled.
-R Calculate popularity rank before program exit.
Subsection control
-t tag
-u pattern
-s status
-g category
-y content-type
Set URL filters on tag , pattern , status , category and
content-type respectively.
tag is a server tag that you can set arbitrarily in the config file
indexer.conf(5).
pattern is an SQL LIKE wildcard for the URL. In short, underscore
( _ ) matches any single character, per cent ( % ) matches any
sequence of characters, and the comparison is case insensitive. For
example, indexer -u %izhcom.ru% will reindex all documents whose
URLs contain the string "izhcom.ru".
status is a filter on the document’s HTTP status obtained during
the last reindexing. For example, -s 0 is a filter for all
documents that have not been indexed before, -s 200 is a filter for
all documents that were retrieved with "HTTP 200 OK" status, and
-s 301 is a filter for all documents that were retrieved with
"HTTP 301 Moved Permanently" status. See the HTTP protocol
specifications for details on HTTP status codes and their meanings.
category is a filter for documents that match a specific category.
Categories are almost like tags, but nested.
content-type is a MIME type; the filter selects documents with that
Content-Type.
You can freely combine any number of -t, -u, -s, -g and -y options.
Filters of the same class (for example tag, pattern or status) are
combined using logical OR, and filters of different classes are
combined using logical AND. That means that if you type
indexer -u %izhcom.ru% -u %udm.net% -t 1 -s 200, the documents to
index will be those with tag 1 and HTTP status 200 whose URLs
contain the string "izhcom.ru" or "udm.net".
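The pattern matching and the OR/AND combination rule described above
can be sketched in Python. This is an in-memory model for illustration
only: indexer applies these filters through its database, and the
function names and document fields used here (url, tag, status) are
assumptions.

```python
import re


def like_match(pattern: str, url: str) -> bool:
    """Emulate a case-insensitive SQL LIKE: % matches any run of
    characters (including none), _ matches exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Everything else is literal, including the dot.
            parts.append(re.escape(ch))
    regex = re.compile("".join(parts), re.IGNORECASE | re.DOTALL)
    # LIKE must match the whole string, hence fullmatch.
    return regex.fullmatch(url) is not None


def selected(doc: dict, patterns=(), tags=(), statuses=()) -> bool:
    """OR within each filter class, AND between classes.
    An empty class places no restriction on the document."""
    url_ok = not patterns or any(like_match(p, doc["url"]) for p in patterns)
    tag_ok = not tags or doc["tag"] in tags
    status_ok = not statuses or doc["status"] in statuses
    return url_ok and tag_ok and status_ok
```

Under this model, the command line from the example above corresponds
to selected(doc, patterns=["%izhcom.ru%", "%udm.net%"], tags=["1"],
statuses=[200]).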
-f filename Read the URLs to be indexed/inserted/cleared from a file.
(With the -a or -C option it supports the SQL LIKE wildcard % ; it
has no effect when combined with the -m option.)
-f - Use STDIN instead of a file to read the URL list.
Logging options
-l Do not log to stdout/stderr.
-v level Verbose level, can be set to 0-5.
Misc.
-C Clear databases.
This erases data previously collected by indexer from the
mnoGoSearch databases. You can use the options -t, -u and -s
described above to select what you want to delete.
WARNING: Use this option with extreme caution!
-S Show statistics.
This option outputs brief statistics on how many documents there
are in the database, their HTTP status, and how many documents are
expired. You can use the options -t, -u and -s described above to
select which documents you want statistics on.
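The kind of summary that -S produces can be approximated with the
following Python sketch, which counts documents per HTTP status and how
many of them are expired. The function name, the input format and the
shape of the result are assumptions for illustration; the actual output
layout of indexer -S is not reproduced here.

```python
from collections import Counter


def summarize(docs, now, period):
    """docs is an iterable of (status, last_indexed) pairs with numeric
    timestamps. Returns {status: (total, expired)}, ordered by status."""
    total = Counter()
    expired = Counter()
    for status, last_indexed in docs:
        total[status] += 1
        # Same rule as for reindexing: older than `period` means expired.
        if now - last_indexed > period:
            expired[status] += 1
    return {status: (total[status], expired[status]) for status in sorted(total)}
```

Status 0 appears in such a summary for documents that have not been
indexed yet, as described for the -s option.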
-I Show referrers.
This option shows the referrers of URLs or, in other words, all
hyperlinks from the document. You can use the options -t, -u and -s
described above to select which documents you want to show
referrers for.
-h
-? Show a help screen with a brief overall description of indexer
options.
BUGS
If you think you’ve found a bug in indexer, please report it to the
mnoGoSearch bug report system at http://www.mnogosearch.org/bugs/
(please post in English only).
COPYRIGHT
Copyright © 1998 - 2004 Lavtech.Com Corp.
(http://www.mnogosearch.org/).
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
SEE ALSO
indexer.conf(5)