NAME
indexer.conf - configuration file for indexer
DESCRIPTION
This is the configuration file for indexer(1). The configuration file
consists of commands and their arguments. All commands are
case-insensitive. You can use # to comment out lines.
VARIABLES
Global parameters
These commands should be used only once and take global effect
for the whole configuration file.
DBType type
Database type. Currently supported values are mysql, pgsql,
msql, solid, mssql, oracle, ibase and sqlite. The value does not
matter when native library support is used, but ODBC users must
specify one of the supported values. If your database type is not
supported, use unknown instead.
DBHost host
SQL host name (Not required for ODBC)
Default: localhost
DBName mnogosearch
SQL database name or ODBC DSN
Default: mnogosearch
DBUser foo
Database username to connect to database
Default: no user
DBPass bar
Database password to connect to database
Default: no password
DBMode single/multi/crc/crc-multi
Word storage mode for SQL databases. Does not apply to the built-in
database. When single is specified, all words are stored in the
same table. multi means that words are stored in different
tables depending on word length. multi mode is usually faster,
but it requires more tables in the database. In crc mode,
mnoGoSearch stores 32-bit integer word IDs calculated by the
CRC32 algorithm instead of the words themselves. crc mode requires
less disk space and is faster than single and multi modes. crc-multi
mode shares its storage structure with crc mode, but stores words in
different tables depending on word length, like multi mode.
Default: single
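As an illustration, the database commands above might be combined as follows. This is only a sketch: the host, database name, user and password are placeholder values, and mysql with crc-multi is an arbitrary choice.

```
# Placeholder connection settings; replace with your own values
DBType mysql
DBHost localhost
DBName mnogosearch
DBUser foo
DBPass bar
# Store CRC32 word IDs, split into tables by word length
DBMode crc-multi
```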
LocalCharset charset
Defines charset for local file system. It is required if you are
using 8 bit characters and is not applicable for 7 bit
characters. This command is to be used once and takes global
effect for the whole configuration file.
Example:
LocalCharset windows-1250
CrossWords yes|no
Enables building of the CrossWords index. Crosswords are words that
are used in a link pointing to the present page. The default value is no.
StopWordFile filename
This command indicates which file contains the list of stopwords to
load. You may specify either an absolute file name or a file name
relative to the mnoGoSearch /etc directory. You may use
several StopWordFile commands.
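For example, two stopword lists could be loaded as sketched below. Both file names are hypothetical and only show the relative and absolute forms.

```
# Relative to the mnoGoSearch /etc directory (hypothetical file name)
StopWordFile stopwords/english.sl
# Absolute file name (hypothetical path)
StopWordFile /usr/local/share/stopwords/german.sl
```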
MinWordLength characters
MaxWordLength characters
With these commands you can change the default length range of
words stored in the database. By default mnoGoSearch stores words
that are longer than 1 and shorter than 32.
Example:
MaxWordLength 35
MaxDocSize bytes
Specifies the maximum size, in bytes, of a document that can be
indexed. The default value is 1048576 (1 MB). This command takes
global effect for the whole config file.
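The value is a plain byte count, so a larger limit can be worked out as a multiple of 1048576 (1024 * 1024). A sketch, where the 4 MB figure is only an illustration:

```
# 4 * 1024 * 1024 = 4194304 bytes (4 MB)
MaxDocSize 4194304
```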
HTTPHeader header
You may add custom HTTP headers to the indexer HTTP request. Do not
use the "If-modified-since" and "Accept-Charset" headers, since they
are composed by indexer itself. "User-Agent:
mnoGoSearch/version" is sent too, although you may override it.
The command has global effect for the whole configuration file.
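A sketch of how custom headers might be declared. It assumes that the whole header line, name and value, is given as the argument; the header values are illustrative.

```
# Ask servers for English content first (illustrative value)
HTTPHeader Accept-Language: en, de
# Override the default User-Agent (hypothetical agent name)
HTTPHeader User-Agent: MySiteIndexer/1.0
```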
ServerTable table_name
This command works only with an SQL database and is not applicable
to the built-in database mode. It loads servers with all their
parameters from the table table_name. For an example of the
structure of such a table, please refer to the file
create/mysql/server.txt. You may use several arguments with this
command:
ServerTable my_servers1 my_servers2 my_servers3
or just a single argument:
ServerTable server
DeleteNoServer yes|no
Use this command to specify whether to delete URLs that have
no corresponding Server command. Default value is yes.
VarDir /path/to/my/var/dir
Specifies a custom path to the directory where indexer stores data
when used with the built-in database and in cache mode. By default
the /var directory of the mnoGoSearch installation is used.
URL Control Configuration
Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
Use this command to allow URLs that match (or, with NoMatch, do not
match) the given argument. The first three optional parameters
describe the type of comparison. Default values are Match, NoCase
and String. Use NoCase or Case to choose case-insensitive or
case-sensitive comparison. Use Regex to choose regular expression
comparison. Use String to choose string comparison with wildcards.
Wildcards are * for any number of characters and ? for one
character. Because * and ? have a special meaning in the String
match type, use Regex to describe documents with ? and * signs in
the URL. String match is much faster than Regex, so use String
where possible. You may use several arguments for one Allow
command and use this command any number of times. It takes
global effect for the config file. Note that mnoGoSearch
automatically adds one Allow regex .* command after reading the
config file. That command means that everything that is not
disallowed is allowed.
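The comparison types described above might be combined as in the following sketch. The patterns are illustrative, and the String example assumes the pattern is compared against the whole URL.

```
# String match with wildcards (the default type), case-insensitive
Allow String *.html *.htm
# Case-sensitive regular expression; Regex is needed because of the ? sign
Allow Case Regex \.php\?id=[0-9]+$
# NoMatch: allow URLs that do not match the expression
Allow NoMatch Regex /private/
```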
Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] <arg> [<arg> ...]
Use this to disallow indexing documents with URLs that match
given argument. The meaning of the first three optional
parameters is exactly the same as with the Allow command. You
can use several arguments for one Disallow command. Takes global
effect for config file.
Example:
#Exclude cgi-bin and non-parsed-headers
Disallow /cgi-bin/ \.cgi /nph
#Exclude some known extensions
Disallow \.b$ \.sh$ \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
#Exclude Apache directory list in different sort order
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
#Exclude ./. and ./.. from Apache and Squid directory list
Disallow /[.]{1,2} /\%2e /\%2f
CheckOnly regexp [regexp [...] ]
Indexer will use the HEAD instead of the GET HTTP method for URLs
that match regexp. This means that the file will only be checked
and will not be downloaded. Useful for zip, exe, arj and similar
files. You can use several arguments for one CheckOnly command. You
can use this command any number of times, but not more than
MAXFILTER in indexer.h. Takes global effect for the config file.
Examples:
#Use HEAD method for some known non-text extensions:
CheckOnly \.b$ \.sh$ \.md5$
CheckOnly \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
CheckOnly \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
CheckOnly \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
CheckOnly \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
CheckOnly \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$
CheckOnly \.vrml$ \.wrl$
CheckOnly \.exe$ \.cab$ \.dll$ \.bin$ \.class$
CheckOnly \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
CheckOnly \.rtf$ \.pdf$ \.cdf$ \.ps$
CheckOnly \.ai$ \.eps$ \.ppt$ \.hqx$
CheckOnly \.cpt$ \.bms$ \.oda$ \.tcl$
CheckOnly \.rpm$
HrefOnly regexp [regexp [...] ]
Indexer scans HTML documents that match regexp as it would scan
any other URLs, except that it will not index the contents. It
will add any URLs it finds in the HTML document to the database.
Useful when indexing mailing list archives with big index pages
which contain mostly URLs. You can use several arguments for one
HrefOnly command. You can use this command any number of times, but
not more than MAXFILTER in indexer.h. Takes global effect for the
config file.
Examples:
#Scan these files for href tags only, but do not index their contents.
HrefOnly mail.*\.html$ thr.*\.html$
MIME types and external parsers
UseRemoteContentType yes|no
This command specifies whether the indexer should get the content
type from HTTP server headers (yes), or from its AddType settings
(no). If set to no, and the indexer could not determine
content-type with its AddType settings,
SyslogFacility facility
Useful only if indexer is compiled with syslog support and if
you do not like the default. The argument is the same as used in
the syslog.conf file (for example: local7, daemon). For the list of
possible facilities see syslog.conf(5). Takes global effect and
should be used only once! Default: depends on compilation.
LogdAddr host[:port]
Use cachelogd at the given host and port, if specified. Required for
cache mode only. Default values are localhost and port 7000.
FollowOutside yes|no
Allow/disallow indexer to walk outside current server. Should
be used carefully (see MaxHops command).
Default: no
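Because following outside links can make the crawl grow quickly, it might be paired with a small MaxHops value, as in this sketch. The URL and the hop limit are illustrative, and the sketch assumes both commands are placed before the Server command they should apply to.

```
# Let indexer leave the start server, but only up to two clicks away
FollowOutside yes
MaxHops 2
Server http://www.example.com/
```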
Period seconds
Reindex period in seconds, 604800 = 1 week. May be used before
every Server command and takes effect till the end of config
file or till next Period command.
Tag number
Use this parameter for your own purposes. For example for
grouping some servers into one group, etc. May be used multiple
times before every Server command and takes effect till the end
of config file or till next Tag command.
MaxHops number
Maximum distance in "mouse clicks" from the start URL given in the
Server command. May be used multiple times before every Server
command and takes effect till the end of config file or till next
MaxHops command.
Default: 256
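Commands such as Period, Tag and MaxHops apply to the Server commands that follow them until they are set again. A sketch with illustrative URLs and values:

```
# Reindex weekly (604800 = 7 * 24 * 60 * 60 seconds), group 1
Period 604800
Tag 1
MaxHops 4
Server http://www.example.com/
# Reindex daily (86400 seconds), group 2; MaxHops 4 still applies
Period 86400
Tag 2
Server http://news.example.com/
```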
MaxNetErrors number
Maximum network errors for each server. If there are too many
network errors on some server (server is down, host unreachable
etc.) indexer will try not to do more than number attempts to
connect to this server. May be used multiple times before
Server command and takes effect till the end of config file or
till next MaxNetErrors command.
Default: 16
TitleWeight number
Weight of the words in the <title>...</title>. Can be set
multiple times before Server command and takes effect till the
end of config file or till next TitleWeight command.
Default: 2
BodyWeight number
Weight of the words in the <body>...</body> of the html
documents and in the contents of the text/plain documents. Can
be set multiple times before Server command and takes effect
till the end of config file or till next BodyWeight command.
Default: 1
DescWeight number
Weight of the words in the <META NAME="Description"
Content="...">. Can be set multiple times before Server command
and takes effect till the end of config file or till next
DescWeight command.
Default: 2
KeywordWeight number
Weight of the words in the <META NAME="Keywords" Content="...">.
Can be set multiple times before Server command and takes effect
till the end of config file or till next KeywordWeight command.
Default: 2
UrlWeight number
Weight of the words in the URL of the documents. Can be set
multiple times before Server command and takes effect till the
end of config file or till next UrlWeight command.
Default: 0
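For instance, words from the title and URL could be weighted more heavily for one server, as sketched below. The weight values and the URL are illustrative only.

```
# Emphasize title words and count URL words for this server
TitleWeight 4
BodyWeight 1
UrlWeight 1
Server http://docs.example.com/
```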
DeleteBad yes|no
Controls whether indexer deletes bad (not found, forbidden etc.)
URLs from the database. Useful if you want to check the integrity
of your server(s): if you set it to no, the bad URLs will
remain in the database. Can be set multiple times before Server
command and takes effect till the end of config file or till
next DeleteBad command.
Default: yes
Robots yes|no
Allows/disallows using robots.txt and <META NAME="robots">
exclusions. Useful if you want to check the integrity of your
server(s). Can be set multiple times before Server command and
takes effect till the end of config file or till next Robots
command.
Default: yes
Section <string> <number>
Here <string> is a section name and <number> is a section ID
between 0 and 255. Use 0 if you don’t want to index some of
these sections. It is better to use different section IDs for
different document parts. In this case, at search time
you’ll be able to give a different weight to each part or even
disallow some sections.
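A sketch of section assignments. The section names body, title and meta.keywords are assumptions made for illustration; this page does not list the valid names.

```
# Distinct IDs allow separate weighting at search time
Section body 1
Section title 2
# ID 0: do not index this section
Section meta.keywords 0
```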
Index yes|no
Controls whether indexer stores words in the database. Useful if you
want to check the integrity of your server(s). Can be set multiple
times before Server command and takes effect till the end of
config file or till next Index command.
Note: Instead of Index no you can use the alternate form NoIndex
Default: yes
Follow yes|no
Allow/disallow indexer to store <a href="..."> into database.
Can be set multiple times before Server command and takes effect
till the end of config file or till next Follow command.
Note: Instead of Follow no you can use the alternate form
NoFollow
Default: yes
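These per-server switches can be changed between Server commands, as in the following sketch. The URLs are illustrative.

```
# Follow links from these pages, but do not store their words
Index no
Server http://lists.example.com/archive/
# Restore word storage, and keep bad URLs in the database
Index yes
DeleteBad no
Server http://www.example.com/
```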
MaxDocSize size
Limits the maximum document size. size is in bytes. If a document
is larger than size, indexer will parse only the first size
bytes of the document.
Default: 1048576 (which is 1 megabyte)
Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
This is used to add support for parsing documents with MIME
types other than text/plain and text/html. It can be done via an
external parser (which should provide output in plain or HTML
text) or just by substituting the MIME type so indexer can
understand it directly.
<from_mime> and <to_mime> are standard MIME types. <to_mime>
should be either text/plain or text/html, because these are the
only types that indexer understands.
We assume the external parser writes its results to stdout (if not,
you have to write a little script and cat the results to stdout).
The optional charset parameter is used to change the charset if needed.
The command line parameter is optional. If there is no command line,
the command only changes the MIME type. The command line may also
contain a $1 parameter, which stands for a temporary file name. Some
parsers cannot operate on stdin, so indexer creates a temporary file
for the parser and passes its name in place of $1.
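The two uses described above might look like the sketch below. The MIME type text/x-log is made up, and the example assumes that pdftotext and deroff are installed; pdftotext writes to stdout when its output file is given as -.

```
# Only substitute the MIME type, no external parser (hypothetical source type)
Mime text/x-log text/plain
# Parser that reads the document from stdin
Mime application/x-troff-man text/plain "deroff"
# Parser that needs a file name: $1 is replaced by the temporary file
Mime application/pdf text/plain "pdftotext $1 -"
```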
CharSet charset
Useful for 8-bit character sets. WWW servers send data in
different character sets. charset is the default character set of
the servers in the following Server command(s). May be used before
every Server command and takes effect till the end of config file or
till next CharSet command.
Currently indexer supports the Cyrillic koi8-r, cp1251, cp866,
iso8859-5 and x-mac-cyrillic, Arabic cp1256, Western iso-8859-1,
and Central European iso-8859-2 and cp1250 character sets.
This parameter is the default character set for "bad" servers that
do not send information about the charset in the header (just
"Content-type: text/html" instead of, for example, "Content-type:
text/html; charset=koi8-r") and do not send charset information
in META tags.
Examples:
CharSet koi8-r
CharSet windows-1250
CharSet ISO-8859-1
ForceIISCharset1251 yes|no
This option is useful for users dealing with Cyrillic content
and broken (or misconfigured) Microsoft IIS web servers, which
tend to report the charset incorrectly. This is a really dirty
hack, but if this option is turned on it is assumed that all
servers that report themselves as ’Microsoft’ or ’IIS’ have content
in the Windows-1251 code page. This command should be used only once
in the configuration file and takes global effect.
Default: no
AuthBasic login:passwd
Use basic http authorization. Can be set before every Server
command and takes effect only for next Server command.
Examples:
AuthBasic somebody:something
If you have password protected directory(ies), but whole server
is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/
ProxyAuthBasic login:passwd
Use HTTP proxy basic authorization. Can be used before every
Server command and takes effect only for the next Server
command. It should also precede the Proxy command.
Example:
ProxyAuthBasic somebody:smth
Proxy your.proxy.host[:port]
Connect via a proxy rather than directly. FTP servers can be indexed
only when using a proxy. If the port is not specified, it is set to
the default value of 3128 (Squid). If the proxy host is not specified,
a direct connection will be performed. Can be set before every
Server command and takes effect till the end of config file or
till next Proxy command.
Examples:
Proxy atoll.anywhere.com
- proxy on atoll.anywhere.com, port 3128
Proxy lota.anywhere.com:8090
- proxy on lota.anywhere.com, port 8090
Proxy
- turn off proxy usage (direct connection)
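Putting the proxy commands together, an FTP server might be indexed through an authenticated proxy as sketched here. The host names and credentials are placeholders.

```
# ProxyAuthBasic comes before Proxy and applies to the next Server only
ProxyAuthBasic somebody:smth
Proxy proxy.example.com:3128
Server ftp://ftp.example.com/pub/
# Turn the proxy off again for the servers that follow
Proxy
Server http://www.example.com/
```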
Server URL
This is the main configuration command. Use it to add the start URL
of a server to be indexed. You may use many Server commands in
the same indexer.conf file.
Examples:
Server http://localhost/
Server http://www.yoursite.com/
Server http://www.yoursite.com/~yourname/
Server ftp://ftp.yourdomain.com/pub/
EXAMPLE
This is a minimal sample indexer config file.
DBHost localhost
DBName udmsearch
DBUser foo
DBPass bar
Server http://localhost/
Disallow /cgi-bin/ \.cgi /nph
Disallow \.b$ \.sh$ \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow /[.]{1,2} /\%2e /\%2f
SEE ALSO
indexer(1), syslog.conf(5)