QDBM - quick database manager

NAME

       QDBM - quick database manager

OVERVIEW

       QDBM is a library of routines for managing a database.  The database is
       a simple data file containing records, each is a pair of a  key  and  a
       value.  Every key and value is serial bytes with variable length.  Both
       binary data and character string can be used as  a  key  and  a  value.
       There  is  neither  concept of data tables nor data types.  Records are
       organized in hash table or B+ tree.

       As for database of hash  table,  each  key  must  be  unique  within  a
       database,  so  it is impossible to store two or more records with a key
       overlaps.  The following access methods are provided to  the  database:
       storing  a  record  with a key and a value, deleting a record by a key,
       retrieving a record by a key.  Moreover, traversal access to every  key
       are  provided,  although  the order is arbitrary.  These access methods
       are similar to ones of DBM (or its followers: NDBM  and  GDBM)  library
       defined  in  the UNIX standard.  QDBM is an alternative for DBM because
       of its higher performance.

       As for database of B+ tree, records whose keys are  duplicated  can  be
       stored.   Access  methods  of  storing,  deleting,  and  retrieving are
       provided as with the database of hash table.   Records  are  stored  in
       order  by  a  comparing function assigned by a user.  It is possible to
       access each record with the cursor in ascending  or  descending  order.
       According  to  this  mechanism, forward matching search for strings and
       range search for  integers  are  realized.   Moreover,  transaction  is
       available in database of B+ tree.

EFFECTIVE IMPLEMENTATION OF HASH DATABASE

       QDBM  is  developed  referring to GDBM for the purpose of the following
       three points: higher processing speed, smaller size of a database file,
       and  simpler  API.   They  have been achieved.  Moreover, the following
       three restrictions of traditional DBM: a process can  handle  only  one
       database,  the size of a key and a value is bounded, a database file is
       sparse, are cleared.

       QDBM uses hash algorithm to retrieve records.  If a  bucket  array  has
       sufficient  number  of  elements,  the  time complexity of retrieval is
       ‘O(1)’.  That is, time required for retrieving a  record  is  constant,
       regardless  of  the  scale  of  a  database.  It is also the same about
       storing and deleting.  Collision of hash values is managed by  separate
       chaining.  Data structure of the chains is binary search tree.  Even if
       a bucket array has unusually scarce elements, the  time  complexity  of
       retrieval is ‘O(log n)’.

       QDBM  attains improvement in retrieval by loading RAM with the whole of
       a bucket array.  If a bucket array is on RAM, it is possible to  access
       a  region  of  a target record by about one path of file operations.  A
       bucket array saved in a file is not read into RAM with the ‘read’  call
       but   directly   mapped  to  RAM  with  the  ‘mmap’  call.   Therefore,
       preparation time on connecting to a database is very short, and two  or
       more processes can share the same memory map.

       If  the  number  of elements of a bucket array is about half of records
       stored within a database, although it depends on characteristic of  the
       input,  the  probability  of  collision  of  hash values is about 56.7%
       (36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if  eight
       times).   In  such  case, it is possible to retrieve a record by two or
       less paths of file operations.  If it is made into a performance index,
       in  order  to  handle  a  database containing one million of records, a
       bucket array with half a million of elements is needed.   The  size  of
       each  element  is 4 bytes.  That is, if 2M bytes of RAM is available, a
       database containing one million records can be handled.

       QDBM provides  two  modes  to  connect  to  a  database:  ‘reader’  and
       ‘writer’.   A  reader  can  perform  retrieving but neither storing nor
       deleting.  A writer can perform all access methods.  Exclusion  control
       between  processes  is  performed when connecting to a database by file
       locking.  While a writer is connected to a  database,  neither  readers
       nor  writers  can  be  connected.   While  a  reader  is connected to a
       database, other readers can be connect, but writers can not.  According
       to  this  mechanism,  data  consistency is guaranteed with simultaneous
       connections in multitasking environment.

       Traditional DBM provides two modes of the storing operations:  ‘insert’
       and  ‘replace’.   In  the  case  a key overlaps an existing record, the
       insert mode keeps the existing value, while the replace mode transposes
       it to the specified value.  In addition to the two modes, QDBM provides
       ‘concatenate’ mode.  In the mode, the specified value  is  concatenated
       at  the  end  of the existing value and stored.  This feature is useful
       when adding a element to a value as an array.  Moreover,  although  DBM
       has  a  method to fetch out a value from a database only by reading the
       whole of a region of a record, QDBM has a method to fetch out a part of
       a region of a value.  When a value is treated as an array, this feature
       is also useful.

       Generally speaking, while  succession  of  updating,  fragmentation  of
       available  regions  occurs,  and  the size of a database grows rapidly.
       QDBM deal with this problem by coalescence of dispensable  regions  and
       reuse  of  them,  and  featuring  of  optimization of a database.  When
       overwriting a record with a  value  whose  size  is  greater  than  the
       existing  one, it is necessary to remove the region to another position
       of the file.  Because the time complexity of the operation  depends  on
       the  size  of  the region of a record, extending values successively is
       inefficient.  However, QDBM deal with this problem  by  alignment.   If
       increment  can  be  put  in  padding, it is not necessary to remove the
       region.

       As for many file systems, it is impossible to handle a file whose  size
       is more than 2GB.  To deal with this problem, QDBM provides a directory
       database containing multiple database files.  Due to this  feature,  it
       is  possible  to  handle  a  database  whose total size is up to 1TB in
       theory.  Moreover, because database files can be deployed  on  multiple
       disks,  the speed of updating operations can be improved as with RAID-0
       (striping).  It is also possible for the database files  to  deploy  on
       multiple file servers using NFS and so on.

USEFUL IMPLEMENTATION OF B+ TREE DATABASE

       Although  B+  tree  database  is slower than hash database, it features
       ordering access to each record.  The order can be  assigned  by  users.
       Records  of  B+  tree are sorted and arranged in logical pages.  Sparse
       index organized in B tree that is multiway balanced tree are maintained
       for  each  page.   Thus,  the time complexity of retrieval and so on is
       ‘O(log n)’.  Cursor is provided to access each record  in  order.   The
       cursor  can  jump to a position specified by a key and can step forward
       or backward from the current position.  Because each page  is  arranged
       as  double  linked  list,  the  time  complexity  of stepping cursor is
       ‘O(1)’.

       B+ tree database is implemented, based on above hash database.  Because
       each page of B+ tree is stored as each record of hash database, B+ tree
       database inherits efficiency of storage management  of  hash  database.
       Because the header of each record is smaller and alignment of each page
       is calculated statistically, in most cases, the size of  database  file
       is cut by half compared to one of hash database.  Although operation of
       many pages are required to update B+ tree, QDBM expedites  the  process
       by  caching pages and reducing file operations.  In most cases, because
       whole of the sparse index is  cached  on  memory,  it  is  possible  to
       retrieve a record by one or less path of file operations.

       B+  tree  database  features  transaction mechanism.  It is possible to
       commit a series of operations between the beginning and the end of  the
       transaction in a lump, or to abort the transaction and perform rollback
       to the state before  the  transaction.   Even  if  the  process  of  an
       application  is crushed while the transaction, the database file is not
       broken.

       In case that QDBM is built with ZLIB, LZO, or BZIP2 enabled, a lossless
       data-compression  library,  the  content  of  each  page  of B+ tree is
       compressed and stored in a file.  Because each record  in  a  page  has
       similar patterns, high efficiency of compression is expected due to the
       Lempel-Ziv algorithm and the like.  In case  handling  text  data,  the
       size of a database is reduced to about 25%.  If the scale of a database
       is large and disk I/O is the bottleneck,  featuring  compression  makes
       the processing speed improved to a large extent.

SIMPLE BUT VARIOUS INTERFACES

       QDBM  provides very simple APIs.  You can perform database I/O as usual
       file I/O with ‘FILE’ pointer defined in ANSI C.  In the  basic  API  of
       QDBM,  entity  of  a database is recorded as one file.  In the extended
       API, entity  of  a  database  is  recorded  as  several  files  in  one
       directory.   Because  the  two  APIs  are very similar with each other,
       porting an application from one to the other is easy.

       APIs which are compatible with NDBM and GDBM  are  also  provided.   As
       there  are a lot of applications using NDBM or GDBM, it is easy to port
       them onto QDBM.  In most cases, it is completed only by replacement  of
       header  including  (#include)  and re-compiling.  However, QDBM can not
       handle database files made by the original NDBM or GDBM.

       In order to handle  records  on  memory  easily,  the  utility  API  is
       provided.    It   implements   memory   allocating  functions,  sorting
       functions, extensible datum, array list, hash map, and  so  on.   Using
       them,  you  can  handle records in C language cheaply as in such script
       languages as Perl or Ruby.

       B+ tree database is used with the advanced API.  The  advanced  API  is
       implemented  using  the  basic  API  and  the utility API.  Because the
       advanced API is also similar to the basic API and the extended API,  it
       is easy to learn how to use it.

       In  order to handle an inverted index which is used by full-text search
       systems, the inverted API is provided.  If it  is  easy  to  handle  an
       inverted   index  of  documents,  an  application  can  focus  on  text
       processing and natural language processing.  Because this API does  not
       depend  on character codes nor languages, it is possible to implement a
       full-text search system which can  respond  to  various  requests  from
       users.

       Along  with  APIs  for  C,  QDBM provides APIs for C++, Java, Perl, and
       Ruby.  APIs for C are composed of  seven  kinds:  the  basic  API,  the
       extended  API,  the  NDBM-compatible  API, the GDBM-compatible API, the
       utility API, the advanced API, and  the  inverted  API.   Command  line
       interfaces  corresponding  to  each  API  are  also provided.  They are
       useful for prototyping, testing, debugging, and so  on.   The  C++  API
       encapsulates database handling functions of the basic API, the extended
       API, and the advanced API with class mechanism of C++.   The  Java  API
       has  native  methods  calling  the basic API, the extended API, and the
       advanced API with Java Native Interface.   The  Perl  API  has  methods
       calling  the  basic API, the extended API, and the advanced API with XS
       language.  The Ruby API has method calling the basic API, the  extended
       API,  and  the  advanced API as modules of Ruby.  Moreover, CGI scripts
       for administration of databases and full-text search are provided.

WIDE PORTABILITY

       QDBM is implemented being based on syntax of ANSI  C  (C89)  and  using
       only  APIs  defined  in ANSI C or POSIX.  Thus, QDBM works on most UNIX
       and its compatible OSs.  As for C API, checking  operations  have  been
       done  at least on Linux 2.2, Linux 2.4, FreeBSD 4.8, FreeBSD 5.0, SunOS
       5.7, SunOS 5.8, SunOS 5.9, HP-UX 11.00, Cygwin 1.3.10, Mac OS  X  10.2,
       and  RISC OS 5.03.  Although a database file created by QDBM depends on
       byte order of the processor, to do with it, utilities to dump  data  in
       format which is independent to byte orders are provided.

BUILDING

       For  building a program using QDBM, the program should be linked with a
       library file ‘libqdbm.a’ or ‘libqdbm.so’.  For example,  the  following
       command is executed to build ‘sample’ from ‘sample.c’.

              gcc  -I/usr/local/include  -o  sample  sample.c -L/usr/local/lib
              -lqdbm

AUTHOR

       QDBM is written by Mikio Hirabayashi.  You can contact  the  author  by
       e-mail  to <mikio@users.sourceforge.net>.  Any suggestion or bug report
       is welcome to the author.

COPYRIGHT

       Copyright(c) 2000-2003 Mikio Hirabayashi

       QDBM is free software; you can redistribute it and/or modify  it  under
       the  terms of the GNU Lesser General Public License as published by the
       Free Software Foundation; either version 2 of the License, or any later
       version.

       QDBM is distributed in the hope that it will be useful, but WITHOUT ANY
       WARRANTY; without even  the  implied  warranty  of  MERCHANTABILITY  or
       FITNESS  FOR  A  PARTICULAR PURPOSE.  See the GNU Lesser General Public
       License for more details.

       You should have received a  copy  of  the  GNU  Lesser  General  Public
       License along with QDBM; if not, write to the Free Software Foundation,
       Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

NAME

OVERVIEW

EFFECTIVE IMPLEMENTATION OF HASH DATABASE

USEFUL IMPLEMENTATION OF B+ TREE DATABASE

SIMPLE BUT VARIOUS INTERFACES

WIDE PORTABILITY

BUILDING

AUTHOR

COPYRIGHT

SEE ALSO