NAME

       indexer.conf - configuration file for indexer

DESCRIPTION

       This is the configuration file for indexer(1). The configuration file
       consists of commands and their arguments. All commands are
       case-insensitive. You can use # to comment out lines.

VARIABLES

       Global parameters

              These  commands  should be used only once and take global effect
              for the whole configuration file.

       DBType type
              Database type. Currently supported values are mysql, pgsql,
              msql, solid, mssql, oracle, ibase, sqlite. The value does not
              matter for native library support, but ODBC users must specify
              one of the supported values. If your database type is not
              supported, use unknown instead.

       DBHost host
              SQL host name (Not required for ODBC)

              Default: localhost

       DBName mnogosearch
              SQL database name or ODBC DSN

              Default: mnogosearch

       DBUser foo
              Database username to connect to database

              Default: no user

       DBPass bar
              Database password to connect to database

              Default: no password

       DBMode single/multi/crc/crc-multi
              SQL database word storage mode. Does not apply to the built-in
              database. When single is specified, all words are stored in the
              same table. multi means that words are stored in different
              tables depending on word length. multi mode is usually faster,
              but it requires more tables in the database. In crc mode,
              mnoGoSearch stores 32-bit integer word IDs calculated by the
              CRC32 algorithm instead of the words themselves. crc mode
              requires less disk space and is faster than single and multi
              modes. crc-multi mode shares its storage structure with crc
              mode, but stores words in different tables depending on word
              length, like multi mode. The default DBMode value is single.
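
              As a minimal sketch, the global database parameters above might
              be combined as follows. The host, database name and credentials
              are placeholders taken from the defaults and samples in this
              page, not recommended values.

              #Assumed setup: MySQL on the local host, crc-multi word storage
              DBType mysql
              DBHost localhost
              DBName mnogosearch
              DBUser foo
              DBPass bar
              DBMode crc-multi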

       LocalCharset charset
              Defines charset for local file system. It is required if you are
              using  8  bit  characters  and  is  not  applicable  for  7  bit
              characters.  This command is to be used once  and  takes  global
              effect for the whole configuration file.

              Example:
              LocalCharset windows-1250

       CrossWords yes|no
              Enables building of the CrossWords index. Crosswords are words
              that are used in a link to the present page. The default value
              is no.

       StopWordFile filename
              This command indicates which file contains the list of
              stopwords to load. You may specify either an absolute file
              name or a file name relative to the mnoGoSearch /etc directory.
              You may use several StopWordFile commands.
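
              For illustration, several stopword lists might be loaded like
              this. Both file names are hypothetical; the first is resolved
              relative to the mnoGoSearch /etc directory, the second is an
              absolute path.

              #Hypothetical file names, shown only to illustrate the syntax
              StopWordFile stopwords/english.txt
              StopWordFile /usr/local/share/stopwords/german.txt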

       MinWordLength characters
       MaxWordLength characters
              With these commands you can change the default length range of
              words stored in the database. By default mnoGoSearch stores
              words that are longer than 1 and shorter than 32.

              Example:
              MaxWordLength 35

       MaxDocSize bytes
              Specifies the maximum size, in bytes, of a document that can be
              indexed. The default value is 1048576 (1 MB). This command
              takes global effect for the whole config file.

       HTTPHeader header
              You may add custom HTTP headers to the indexer's HTTP requests.
              Do not use the "If-modified-since" and "Accept-Charset"
              headers, since they are composed by indexer itself.
              "User-Agent: mnoGoSearch/version" is sent too, although you may
              override it. The command has global effect for the whole
              configuration file.

       ServerTable table_name
              This command works only with an SQL database and is not
              applicable to the built-in database mode. It loads servers with
              all their parameters from the table table_name. For an example
              of the table structure, please refer to the file
              create/mysql/server.txt. You may use several arguments with
              this command:

              ServerTable my_servers1 my_servers2 my_servers3

              or just a single argument:

              ServerTable server

       DeleteNoServer yes|no
              Use this command to specify whether to delete URLs that have no
              corresponding Server command. The default value is yes.

       VarDir /path/to/my/var/dir
              Specifies a custom path to the directory where indexer stores
              data when used with the built-in database and in cache mode. By
              default the /var directory of the mnoGoSearch installation is
              used.

URL Control Configuration

       Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
              Use this command to allow URLs that match (or do not match) the
              given argument. The first three optional parameters describe
              the type of comparison. Default values are Match, NoCase,
              String. Use NoCase or Case to choose case-insensitive or
              case-sensitive comparison. Use Regex to choose regular
              expression comparison. Use String to choose string comparison
              with wildcards. Wildcards are * for any number of characters
              and ? for one character. Because * and ? have special meaning
              in the String match type, use Regex to describe documents with
              ? and * signs in the URL. String match is much faster than
              Regex, so use String where possible. You may use several
              arguments for one Allow command and use this command any number
              of times. It takes global effect for the config file. Note that
              mnoGoSearch automatically adds one Allow regex .* command after
              reading the config file. That command means that everything
              that is not disallowed is allowed.
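
              The comparison types described above might be written as
              follows. This is a syntax sketch only: the patterns are
              illustrative assumptions, and how these rules interact with
              your other Allow and Disallow commands depends on the rest of
              the file.

              #String comparison with wildcards (the default match type)
              Allow *.html *.htm

              #The same idea as a case-sensitive regular expression
              Allow Case Regex \.html?$

              #Allow every URL that does NOT contain /private/
              Allow NoMatch String */private/*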

       Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] [<arg> ...]
              Use this command to disallow indexing of documents with URLs
              that match the given argument. The meaning of the first three
              optional parameters is exactly the same as with the Allow
              command. You can use several arguments for one Disallow
              command. Takes global effect for the config file.

       Example:
              #Exclude cgi-bin and non-parsed-headers
              Disallow /cgi-bin/ \.cgi /nph

              #Exclude some known extensions
              Disallow \.b$  \.sh$     \.md5$
              Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              Disallow \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
              Disallow \.vrml$ \.wrl$
              Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
              Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
              Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
              Disallow \.rpm$

              #Exclude Apache directory list in different sort order
              Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$

              #Exclude ./. and ./.. from Apache and Squid directory list
              Disallow /[.]{1,2} /\%2e /\%2f

       CheckOnly regexp [regexp [...] ]
              Indexer will use the HEAD HTTP method instead of GET for URLs
              that match regexp. This means that the file will only be
              checked and will not be downloaded. Useful for zip, exe, arj
              and similar files. You can use several arguments for one
              CheckOnly command. You can use this command any number of
              times, but not more than MAXFILTER in indexer.h. Takes global
              effect for the config file.

       Examples:
              #Use HEAD method for some known non-text extensions:
              CheckOnly \.b$ \.sh$     \.md5$
              CheckOnly \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              CheckOnly \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              CheckOnly \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              CheckOnly \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              CheckOnly \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$
              CheckOnly \.vrml$ \.wrl$
              CheckOnly \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              CheckOnly \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              CheckOnly \.rtf$  \.pdf$  \.cdf$  \.ps$
              CheckOnly \.ai$   \.eps$  \.ppt$  \.hqx$
              CheckOnly \.cpt$  \.bms$  \.oda$  \.tcl$
              CheckOnly \.rpm$

       HrefOnly regexp [regexp [...] ]
              Indexer scans HTML documents that match regexp as it would scan
              any other URLs, except that it will not index the contents. It
              will add any URLs it finds in the HTML document to the
              database. Useful when indexing mail list archives with big
              index pages that contain mostly URLs. You can use several
              arguments for one HrefOnly command. You can use this command
              any number of times, but not more than MAXFILTER in indexer.h.
              Takes global effect for the config file.

       Examples:
              #Scan these files for href tags only, but do not index their
              #contents.
              HrefOnly mail.*\.html$ thr.*\.html$

MIME types and external parsers

       UseRemoteContentType yes|no
              This command specifies whether the indexer should get the
              content type from HTTP server headers (yes), or from its
              AddType settings (no). If set to no, and the indexer could not
              determine content-type with its AddType settings,

       SyslogFacility facility
              Useful only if indexer is compiled with syslog support and you
              do not like the default. The argument is the same as used in
              the syslog.conf file (for example: local7, daemon). For the
              list of possible facilities see syslog.conf(5). Takes global
              effect and should be used only once. Default: depends on
              compilation.

       LogdAddr host[:port]
              Use cachelogd at the given host and, if specified, port.
              Required for cache mode only. Default values are localhost and
              port 7000.

       FollowOutside yes|no
              Allow/disallow  indexer  to walk outside current server.  Should
              be used carefully (see MaxHops command).

              Default: no

       Period seconds
              Reindex period in seconds, 604800 = 1 week.  May be used  before
              every  Server  command  and  takes effect till the end of config
              file or till next Period command.

       Tag number
              Use this parameter  for  your  own  purposes.  For  example  for
              grouping some servers into one group, etc.  May be used multiple
              times before every Server command and takes effect till the  end
              of config file or till next Tag command.

       MaxHops number
              Maximum distance, in "mouse clicks", from the start URL given
              in the Server command. May be used multiple times before every
              Server command and takes effect till the end of the config file
              or till the next MaxHops command.

              Default: 256
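
              The "till the end of config file or till next command" scoping
              used by Period, Tag and MaxHops can be sketched as follows. The
              values are illustrative assumptions; the URLs are the sample
              ones used elsewhere in this page.

              #Applies to the first Server only: reindex weekly, group 1,
              #at most 2 clicks away from the start URL
              Period 604800
              Tag 1
              MaxHops 2
              Server http://www.yoursite.com/

              #Overrides for every Server command that follows
              Period 86400
              Tag 2
              MaxHops 256
              Server http://localhost/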

       MaxNetErrors number
              Maximum number of network errors for each server. If there are
              too many network errors on some server (server is down, host
              unreachable, etc.), indexer will make no more than number
              attempts to connect to this server. May be used multiple times
              before the Server command and takes effect till the end of the
              config file or till the next MaxNetErrors command.

              Default: 16

       TitleWeight number
              Weight of the words in the <title>...</title> element. Can be
              set multiple times before the Server command and takes effect
              till the end of the config file or till the next TitleWeight
              command.

              Default: 2

       BodyWeight number
              Weight  of  the  words  in  the  <body>...</body>  of  the  html
              documents and in the contents of the text/plain documents.   Can
              be  set  multiple  times  before Server command and takes effect
              till the end of config file or till next BodyWeight command.

              Default: 1

       DescWeight number
              Weight of the words in the <META NAME="Description"
              Content="..."> tag. Can be set multiple times before the Server
              command and takes effect till the end of the config file or
              till the next DescWeight command.

              Default: 2

       KeywordWeight number
              Weight of the words in the <META NAME="Keywords" Content="...">
              tag. Can be set multiple times before the Server command and
              takes effect till the end of the config file or till the next
              KeywordWeight command.

              Default: 2

       UrlWeight number
              Weight of the words in the URL of the  documents.   Can  be  set
              multiple  times  before Server command and takes effect till the
              end of config file or till next UrlWeight command.

              Default: 0
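
              A possible weighting setup, given as a sketch: the numbers are
              arbitrary assumptions chosen to show the scoping, not tuned
              values.

              #Give more weight to title and URL words for this server
              TitleWeight 4
              UrlWeight 1
              Server http://www.yoursite.com/

              #Back to the documented defaults for the servers that follow
              TitleWeight 2
              UrlWeight 0
              Server http://localhost/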

       DeleteBad yes|no
              Controls whether indexer deletes bad (not found, forbidden,
              etc.) URLs from the database. If you set it to no, "bad" URLs
              remain in the database, which is useful if you want to check
              the integrity of your server(s). Can be set multiple times
              before the Server command and takes effect till the end of the
              config file or till the next DeleteBad command.

              Default: yes

       Robots yes|no
              Allows/disallows using robots.txt and <META NAME="robots">
              exclusions. Useful if you want to check the integrity of your
              server(s). Can be set multiple times before the Server command
              and takes effect till the end of the config file or till the
              next Robots command.

              Default: yes.

       Section <string> <number>
              Here <string> is a section name and <number> is a section ID
              between 0 and 255. Use 0 if you do not want to index some of
              these sections. It is better to use different section IDs for
              different document parts. In that case, at search time you will
              be able to give a different weight to each part or even
              exclude some sections.
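
              A sketch of the syntax; the section names body, title and
              keywords are assumptions made for illustration and may not
              match the names your indexer version recognizes.

              #Hypothetical section names with distinct IDs
              Section body 1
              Section title 2
              #ID 0: do not index this section
              Section keywords 0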

       Index yes|no
              Controls whether indexer stores words into the database. Set it
              to no to prevent storing, which is useful if you want to check
              the integrity of your server(s). Can be set multiple times
              before the Server command and takes effect till the end of the
              config file or till the next Index command.

              Note: Instead of Index no you can use the alternate form NoIndex

              Default: yes

       Follow yes|no
              Allow/disallow indexer to store <a  href="...">  into  database.
              Can be set multiple times before Server command and takes effect
              till the end of config file or till next Follow command.

              Note: Instead of Follow  no  you  can  use  the  alternate  form
              NoFollow

              Default: yes
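
              For example, the following sketch collects links from one
              start page without storing its words, then restores the
              default. The URLs are the sample ones used in this page.

              #Do not store words for this server, but keep storing links
              Index no
              Follow yes
              Server http://www.yoursite.com/~yourname/

              #Restore the default for the servers that follow
              Index yes
              Server http://localhost/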

       MaxDocSize size

              Limits the maximum document size. size is in bytes. If a
              document is larger than size, indexer will parse only the
              first size bytes of the document.

              Default: 1048576 (which is 1 megabyte)

       Mime   <from_mime> <to_mime>[;charset] ["command line [$1]"]

              This is used to add support for parsing documents with mime
              types other than text/plain and text/html. It can be done via
              an external parser (which should provide output in plain text
              or HTML) or just by substituting the mime type so that indexer
              can understand it directly.

              <from_mime> and <to_mime> are standard mime types. <to_mime>
              should be either text/plain or text/html, because these are the
              only types that indexer understands.

              The external parser is assumed to write its results to stdout
              (if it does not, you have to write a little script that cats
              the results to stdout).

              The optional charset parameter is used to change the charset if
              needed.

              The command line parameter is optional. If there is no command
              line, the command only changes the mime type. The command line
              may also contain a $1 parameter, which stands for a temporary
              file name. Some parsers cannot operate on stdin, so indexer
              creates a temporary file for the parser and passes its name in
              place of $1.
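
              The forms described above might look as follows. This is a
              sketch: it assumes the catdoc and ps2ascii converters are
              installed and write plain text to stdout, and the text/x-log
              type is a made-up example.

              #Parser that needs a file name: indexer substitutes $1
              Mime application/msword text/plain "catdoc $1"

              #Parser that reads stdin (no $1)
              Mime application/postscript text/plain "ps2ascii"

              #No command line: only treat one mime type as another
              Mime text/x-log text/plain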

       CharSet charset
              Useful for 8-bit character sets. WWW servers send data in
              different character sets. charset is the default character set
              of the server(s) in the following Server command(s). May be
              used before every Server command and takes effect till the end
              of the config file or till the next CharSet command.

              Currently indexer supports the Cyrillic koi8-r, cp1251, cp866,
              iso8859-5 and x-mac-cyrillic, Arabic cp1256, Western
              iso-8859-1, and Central Europe iso-8859-2 and cp1250 character
              sets.

              This parameter is the default character set for "bad" servers
              that do not send information about the charset in the header
              (just "Content-type: text/html" instead of, for example,
              "Content-type: text/html; charset=koi8-r") and do not send
              charset information in META tags.

       Examples:

              CharSet koi8-r
              CharSet windows-1250
              CharSet ISO-8859-1
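
              Because CharSet applies to the Server commands that follow it,
              servers with different default charsets can be listed like
              this (a sketch; the charset assigned to each sample URL is an
              assumption).

              CharSet koi8-r
              Server http://www.yoursite.com/
              CharSet ISO-8859-1
              Server http://localhost/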

       ForceIISCharset1251 yes|no
              This option is useful for users dealing with Cyrillic content
              and broken (or misconfigured?) Microsoft IIS web servers, which
              tend to report the charset incorrectly. This is a really dirty
              hack, but if this option is turned on it is assumed that all
              servers that are reported as ’Microsoft’ or ’IIS’ have content
              in the Windows-1251 codepage. This command should be used only
              once in the configuration file and takes global effect.

              Default: no

       AuthBasic login:passwd
              Use basic HTTP authorization. Can be set before every Server
              command and takes effect only for the next Server command.

       Examples:

              AuthBasic somebody:something

              If you have password protected directory(ies), but whole  server
              is open, use:

              AuthBasic login1:passwd1
              Server http://my.server.com/my/secure/directory1/
              AuthBasic login2:passwd2
              Server http://my.server.com/my/secure/directory2/
              Server http://my.server.com/

       ProxyAuthBasic login:passwd
              Use HTTP proxy basic authorization. Can be used before every
              Server command and takes effect only for the next Server
              command. It should also appear before the Proxy command.

       Example:
              ProxyAuthBasic somebody:smth

       Proxy your.proxy.host[:port]
              Connect via a proxy rather than directly. You can index ftp
              servers only when using a proxy. If port is not specified, it
              is set to the default value of 3128 (Squid). If the proxy host
              is not specified, a direct connection will be performed. Can be
              set before every Server command and takes effect till the end
              of the config file or till the next Proxy command.

       Examples:
              Proxy atoll.anywhere.com
               - proxy on atoll.anywhere.com, port 3128

              Proxy lota.anywhere.com:8090
               - proxy on lota.anywhere.com, port 8090

              Proxy
               - turn off proxy usage (direct connection)
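
              Putting ProxyAuthBasic, Proxy and Server together in the order
              described above might look like this (a sketch reusing the
              sample host names and credentials from this page).

              #Credentials first, then the proxy, then the server to index
              ProxyAuthBasic somebody:smth
              Proxy lota.anywhere.com:8090
              Server ftp://ftp.yourdomain.com/pub/

              #Switch back to a direct connection for the next server
              Proxy
              Server http://localhost/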

       Server URL
              This is the main configuration command. Use it to add the start
              URL of a server to be indexed. You may use many Server commands
              in the same indexer.conf file.

       Examples:

              Server http://localhost/
              Server http://www.yoursite.com/
              Server http://www.yoursite.com/~yourname/
              Server ftp://ftp.yourdomain.com/pub/

EXAMPLE

       This is a minimal sample indexer config file:

              DBHost         localhost
              DBName         udmsearch
              DBUser         foo
              DBPass         bar
              Server         http://localhost/
              Disallow /cgi-bin/ \.cgi /nph
              Disallow \.b$  \.sh$     \.md5$
              Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              Disallow \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
              Disallow \.vrml$ \.wrl$
              Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
              Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
              Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
              Disallow \.rpm$
              Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
              Disallow /[.]{1,2} /\%2e /\%2f

SEE ALSO

       indexer(1), syslog.conf(5)