NAME
ra-index - index files for use with remembrance agent software
SYNOPSIS
ra-index [--version] [-v] [-d] [-s] <base-dir> <source1> [<source2>]
[...] [-e <excludee1> [<excludee2>] [...]]
DESCRIPTION
ra-index and ra-retrieve make up the Savant search engine, an
information retrieval engine designed as a back-end for the Remembrance
Agent (RA). Given a collection of the user’s accumulated email, usenet
news articles, papers, saved HTML files and other text notes, the RA
attempts to find those documents which are most relevant to the user’s
current context. That is, it searches this collection of text for the
documents which bear the highest word-for-word similarity to the text
the user is currently editing, in the hope that they will also bear
high conceptual similarity and thus be useful to the user’s current
work. With the Emacs front-end, these suggestions are continuously
displayed in a small buffer at the bottom of the user’s window. If a
suggestion looks useful, the full text can be retrieved with a single
command.
The Remembrance Agent works in two stages. First, the user’s
collection of text documents is indexed into a database saved in a
vector format. After the database is created, the second stage of the
Remembrance Agent is run from Emacs, where it periodically takes a
sample of text from the working buffer and finds those documents from
the collection that are most similar. It summarizes the top documents
in a small Emacs window and allows you to retrieve the entire text of
any one with a keystroke. See the README file for information on using
the Emacs front-end.
At its core Savant is a text-retrieval search engine that uses a
standard TF-IDF algorithm, but it also uses a template system to
recognize different kinds of documents and extract various field
information. For example, ra-index can recognize subject lines and
address information from email files and file this information
separately. It can also pull apart file archives into separate
documents, e.g. RMAIL files are indexed as separate email documents.
Finally, there are filters defined for many document types to remove
extraneous information like HTML tags that might otherwise cause
problems in retrieval. These are all precompiled in a template
structure. It is not currently well documented, but anyone who wants
to experiment with it will find it all defined in the source file
templates/conftemplates.c.
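The TF-IDF ranking mentioned above can be illustrated with a short
Python sketch. This is not Savant's code: the tokenizer, the
tf * log(N / df) weighting and the function names (tokenize,
tfidf_vectors, cosine, rank) are assumptions chosen for illustration,
and Savant's actual weighting and field handling may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return ({term: weight} per document, idf table) using tf * log(N / df)."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs, most similar document first."""
    vectors, idf = tfidf_vectors(docs)
    # Terms the collection has never seen get weight 0 in the query vector.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

Here the query plays the role of the text sampled from the working
buffer, and the highest-scoring documents are the ones that would be
suggested.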
The RA is primarily designed as a proactive information provider that
continually gives you information that might be relevant to your
current environment, but Savant can also be used as a standard text and
information retrieval search engine.
USAGE
To index, you must have a set of source text-files, and a directory
Savant can put database files into. The <source> arguments may be
files or directories. If a directory is in the list, Savant will use
all its contents, recursing into all subdirectories. Non-text files
and backup files (those with names ending in ~ or starting with #) are
ignored. Savant also ignores dot-files (those starting with .) and,
unless the -s flag is given, symbolic links. Any files or directories
listed after the optional -e flag are excluded. Savant uses the files
it finds to create a database in the specified base directory, which
must already exist.
The optional -v argument (verbose) will direct Savant to keep you
updated on its progress. So for example,
ra-index -v ~/RA-indexes/mail ~/RMAIL ~/Rmail-files -e ~/Rmail-files/Old-files
will build a database in the ~/RA-indexes/mail directory, made up of
emails from my RMAIL file plus all files and subdirectories of
~/Rmail-files, excluding files and directories in
~/Rmail-files/Old-files.
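The file-selection rules described above can be approximated in a few
lines of Python. This is a sketch of the documented behaviour, not the
code ra-index runs: the names is_skipped_name and collect_sources are
hypothetical, the check for non-text files is left out, and it assumes
that the name-based rules apply only to entries found while recursing,
not to paths given explicitly on the command line.

```python
import os


def is_skipped_name(name):
    """Backup files (trailing ~ or leading #) and dot-files are ignored."""
    return name.endswith("~") or name.startswith("#") or name.startswith(".")


def collect_sources(sources, excludes=(), follow_symlinks=False):
    """Return the files that would be indexed under the documented rules.

    follow_symlinks mirrors the -s flag; excludes mirrors the -e list.
    No guard against symlink cycles is included in this sketch.
    """
    excluded = {os.path.abspath(path) for path in excludes}
    found = []

    def visit(path):
        if os.path.abspath(path) in excluded:
            return  # listed after -e
        if os.path.islink(path) and not follow_symlinks:
            return  # symbolic links are ignored unless -s is given
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                if not is_skipped_name(name):
                    visit(os.path.join(path, name))
        elif os.path.isfile(path):
            found.append(path)

    for source in sources:
        visit(source)
    return found
```

Calling collect_sources(["RMAIL", "Rmail-files"],
excludes=["Rmail-files/Old-files"]) would list roughly the files that
the ra-index example above draws on.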
ra-index can build databases in any directory you like, but the Emacs
interface for the Remembrance Agent expects a particular structure.
For each database you want to make, you should create a directory, and
all these directories should live in the same parent directory. For
example, for my own use I have a directory ~/RA-indexes/, and within
that are the directories ~/RA-indexes/mail/, ~/RA-indexes/papers/, etc.
which actually contain the database files.
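As a rough illustration of this layout, the following Python sketch
assembles the ra-index argument list for one database under a common
parent directory. The helper build_ra_index_command is hypothetical
and uses only the flags documented on this page (-v and -e).

```python
import posixpath


def build_ra_index_command(parent_dir, db_name, sources, excludes=(), verbose=False):
    """Return the argv list for indexing `sources` into parent_dir/db_name."""
    argv = ["ra-index"]
    if verbose:
        argv.append("-v")
    # The base directory comes first and must already exist.
    argv.append(posixpath.join(parent_dir, db_name))
    argv.extend(sources)
    if excludes:
        argv.append("-e")
        argv.extend(excludes)
    return argv
```

The returned list could be passed to subprocess.run. Paths beginning
with ~ would first need os.path.expanduser, since no shell is involved
to expand them, and the base directory could be created beforehand
with os.makedirs.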
OPTIONS
-v Verbose mode. Print useful information.
-d Debug mode. Print not-so-useful information.
-e Exclude all filenames and directories which follow.
-s Follow symbolic links when indexing.
--version
Print version information.
SEE ALSO
ra-retrieve(1)
AUTHOR
Bradley Rhodes, MIT Media Lab. Please send comments and questions to
ra-bugs@media.mit.edu. New versions and updates can be found at
http://www.media.mit.edu/~rhodes/RA/
COPYRIGHT
All code included in versions up to and including 2.09:
Copyright (C) 2001 Massachusetts Institute of Technology.
All modifications subsequent to version 2.09 are copyright Bradley
Rhodes or their respective authors.
Developed by Bradley Rhodes at the Media Laboratory, MIT, Cambridge,
Massachusetts, with support from British Telecom and Merrill Lynch.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version. For commercial licensing under other terms,
please consult the MIT Technology Licensing Office.
This program may be subject to the following US and/or foreign patents
(pending): "Method and Apparatus for Automated, Context-Dependent
Retrieval of Information," MIT Case No. 7870TS. If any of these patents
are granted, royalty-free license to use this and derivative programs
under the GNU General Public License are hereby granted.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
BUGS
Dates are not currently indexed, so anything trying to do a date query
gets no suggestion back.
Requires GNU make to compile.
The template structure isn’t documented.