indexer - indexing WWW space.

NAME

       indexer - indexing WWW space.

SYNOPSIS

       indexer  [ -a ] [ -b ] [ -n number ] [ -e ] [ -m ] [ -q ] [ -o ] [ -r ]
       [ -i ] [ -w ] [ -R ] [ -N number ] [ -p seconds ]  [  -t  tag  ]  [  -u
       pattern ] [ -s status ] [ -y content-type ] [ configfile ]

       indexer -C [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
       type ] [ configfile ]

       indexer -S [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
       type ] [ configfile ]

       indexer -I [ -R ] [ -t tag ] [ -u pattern ] [ -s status ] [ -y content-
       type ] [ configfile ]

       indexer -h|-?

DESCRIPTION

       indexer is part of the mnoGoSearch search engine. Its purpose is to
       walk through HTTP, HTTPS, FTP and NEWS servers, as well as the local
       file system, recursively fetching documents and storing metadata about
       them in an SQL or built-in database. Every document is referenced by
       its URL, and the metadata collected by indexer is used later during
       searching.

       The behaviour of indexer is controlled mainly by the configuration
       file indexer.conf(5), which it reads on startup. The configuration
       file name and location have compiled-in defaults, so you don’t need to
       specify them every time you run indexer, but you can give an
       alternative configuration file as the last argument.

       indexer supports HTML-formatted (text/html MIME type), XML-formatted
       (text/xml MIME type) and plain text (text/plain MIME type) documents.
       Support for other data types is provided by external programs called
       "parsers". A parser should read data of some type from stdin and write
       text/html or text/plain data to stdout. See indexer.conf(5) for
       details.
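
       As an illustration of the parser contract described above, here is a
       minimal sketch of a parser written in Python. The input format
       (tab-separated rows) and the names to_plain_text and main are
       assumptions made for this example only; they are not part of
       mnoGoSearch. How a parser is registered is described in
       indexer.conf(5).

```python
import sys


def to_plain_text(raw: str) -> str:
    """Turn tab-separated rows into space-separated plain text lines."""
    lines = []
    for row in raw.splitlines():
        cells = [cell.strip() for cell in row.split("\t")]
        # Drop empty cells so they do not produce runs of spaces.
        lines.append(" ".join(cell for cell in cells if cell))
    return "\n".join(lines) + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # Parser contract: source data arrives on stdin, text/plain goes to stdout.
    stdout.write(to_plain_text(stdin.read()))
```

       To use the sketch as a standalone program, call main() at the end of
       the script.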

       You may run indexer regularly from cron(8) to keep the metadata up to
       date.

       indexer is also used to manipulate the database. It can clear some
       data from the database, output some statistics and calculate
       popularity ranking.

OPTIONS

       Indexing

       -a     Reindex all documents even if not expired.

              By default indexer reindexes only those documents that are
              "expired", i.e. the time since their last reindexing is greater
              than "Period" from the indexer.conf(5) file. This option
              disables that check, so all documents are reindexed regardless
              of their state. To achieve this, indexer first marks all URLs
              as "expired". This has a side effect: if you start indexer -a,
              terminate it (for example, by pressing Ctrl-C) and start it
              again, all URLs will still be considered "expired" and will be
              reindexed again.
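
              The expiry rule above can be sketched as follows. This is an
              illustration only: the function names and the in-memory
              dictionary are assumptions for the example, whereas indexer
              itself keeps this state in its database.

```python
from datetime import datetime, timedelta


def is_expired(last_indexed: datetime, period: timedelta, now: datetime) -> bool:
    """True when more than `period` has passed since the last reindexing."""
    return now - last_indexed > period


def select_for_reindex(docs, period, now, reindex_all=False):
    """Return the URLs to reindex; reindex_all models the effect of -a."""
    return [url for url, last in docs.items()
            if reindex_all or is_expired(last, period, now)]
```

              With a period of seven days, a document last indexed nine days
              ago is selected and one indexed yesterday is not, unless
              reindex_all is set.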

       -m     Force indexer to reindex documents even if their content has
              not been modified. This is achieved by disabling the
              If-Modified-Since HTTP header and the MD5 hash check. It is
              useful if you have changed Allow, Disallow, MaxHops or other
              directives in your indexer.conf(5) file: the new rules for
              storing document URLs yield a different set of URLs, and to
              find those URLs even unchanged documents have to be reindexed.

       -n number
              Reindex only the given number of URLs and exit.

       -c seconds
              Limit indexing time to the given number of seconds.

       -e     Reindex most expired documents first. This option forces the
              list of documents to reindex to be sorted by last reindexing
              time, so that the most "expired" documents are reindexed first.
              The delay may not be noticeable in practice, but at least in
              theory the sorting should slow down indexer a bit.

              The combination of -e and -n number is of some value: for
              example, indexer -e -n 100 reindexes just the 100 most expired
              documents.
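
              The effect of combining -e with -n can be sketched as a sort
              followed by a cut. The function name and the mapping from URL
              to last reindexing time (here a Unix timestamp) are assumptions
              for this example and do not reflect how indexer stores or
              queries its data.

```python
def most_expired(last_indexed, limit):
    """Return up to `limit` URLs, oldest last-reindexing time first."""
    # Sorting by last reindexing time puts the most expired URLs first.
    ordered = sorted(last_indexed, key=last_indexed.get)
    return ordered[:limit]
```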

       -q     Quick startup. This mode is useful if you haven’t added or
              modified Server commands. indexer will not insert the URLs
              given in Server commands into the database, which speeds up
              startup somewhat.

       -k     Skip locking (this option affects MySQL and PostgreSQL only).

       -i     Insert new URLs. The new URLs must be specified using the -u or
              -f options.

       -p seconds
              Pause for the given number of seconds after each URL.

       -w     Turns off warnings before clearing database.

       -o     Index documents with less depth (hops value) first.

       -r     Do not randomise the URL fetch list before indexing. By default
              the list is randomised to reduce the load on remote servers;
              this option is recommended for a very large number of URLs.

       -b     Block starting more than one indexer instance.

       -N number
              Run number threads, if a multithreaded version of mnoGoSearch
              was compiled.

       -R     Calculate popularity rank before program exit.

       Subsection control

       -t tag
       -u pattern
       -s status
       -g category
       -y content-type

              Set URL filters on tag, pattern, status, category and
              content-type respectively.

              tag is a server tag that you can set arbitrarily in the config
              file indexer.conf(5).

              pattern is an SQL LIKE wildcard for the URL. In short, an
              underscore ( _ ) matches any single character, a per cent
              sign ( % ) matches any sequence of characters, and the
              comparison is case insensitive. For example, indexer -u
              %izhcom.ru% will reindex all documents whose URLs contain the
              string "izhcom.ru".

              status is a filter on the document’s HTTP status obtained
              during the last reindexing. For example, -s 0 selects all
              documents that have not been indexed before, -s 200 selects all
              documents that were retrieved with "HTTP 200 Ok" status, and
              -s 301 selects all documents that were retrieved with "HTTP 301
              Redirect" status. See the HTTP protocol specifications for
              details on HTTP status codes and their meanings.

              category is a filter for documents that match a specific
              category. Categories are almost like tags, but nested.

              content-type is a filter on MIME type: it selects documents
              with that Content-Type.

              You can freely combine any number of -t, -u, -s, -g and -y
              options. Filters of the same class (tag, pattern, status) are
              combined using logical OR, and filters of different classes are
              combined using logical AND. That means that if you type
              indexer -u %izhcom.ru% -u %udm.net% -t 1 -s 200, the documents
              to index will be those with tag 1 and HTTP status 200 whose
              URLs contain the string "izhcom.ru" or "udm.net".
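
              The pattern matching and the OR/AND combination rule described
              above can be sketched like this. It is a model of the semantics
              only: in indexer the selection is performed against the
              database, and the names Doc, like_match and selected are
              assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    tag: str
    status: int


def like_match(pattern: str, text: str) -> bool:
    """Case-insensitive SQL LIKE match: % is any run of characters, _ is one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    # LIKE compares the whole string, hence fullmatch rather than search.
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None


def selected(doc: Doc, patterns=(), tags=(), statuses=()) -> bool:
    # Within one class any value may match (OR); an empty class is no filter.
    url_ok = not patterns or any(like_match(p, doc.url) for p in patterns)
    tag_ok = not tags or doc.tag in tags
    status_ok = not statuses or doc.status in statuses
    # Across classes every condition must hold (AND).
    return url_ok and tag_ok and status_ok
```

              Calling selected(doc, patterns=["%izhcom.ru%", "%udm.net%"],
              tags=["1"], statuses=[200]) models the command line shown in
              the paragraph above.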

       -f filename
              Read the URLs to be indexed, inserted or cleared from a file.
              With the -a or -C option it supports the SQL LIKE wildcard % ;
              it has no effect when combined with the -m option.

       -f -   Use STDIN instead of a file to read the URL list.

       Logging options

       -l     Do not log to stdout/stderr.

       -v level
              Verbose level, can be set to 0-5.

       Misc.

       -C     Clear databases.

              This will erase data previously collected by indexer from the
              mnoGoSearch databases. You can use the options -t, -u and -s
              described above to select what you want to delete.

              WARNING: Use this option with extreme caution!

       -S     Show statistics.

              This option outputs brief statistics on how many documents
              there are in the database, their HTTP status, and how many
              documents are expired. You can use the options -t, -u and -s
              described above to select which documents you want statistics
              on.

       -I     Show referrers.

              This option shows you the referrers of URLs. Or, in other words,
              all hyperlinks from the document. You can use the options -t,
              -u and -s described above to select which documents you want to
              show referrers for.

       -h
       -?     Show a help screen with a brief description of indexer options.

BUGS

       If you think you’ve found a bug in indexer, please report it to the
       mnoGoSearch bug reporting system at http://www.mnogosearch.org/bugs/
       (please post in English only).

COPYRIGHT

       Copyright © 1998 - 2004 Lavtech.Com Corp.
       (http://www.mnogosearch.org/).

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published  by  the
       Free  Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it  will  be  useful,  but
       WITHOUT   ANY   WARRANTY;   without   even   the  implied  warranty  of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
