NAME
tokyocabinet - a modern implementation of DBM
INTRODUCTION
Tokyo Cabinet is a library of routines for managing a database. The
database is a simple data file containing records, each is a pair of a
key and a value. Every key and value is serial bytes with variable
length. Both binary data and character string can be used as a key and
a value. There is neither concept of data tables nor data types.
Records are organized in hash table, B+ tree, or fixed-length array.
As for database of hash table, each key must be unique within a
database, so it is impossible to store two or more records with a key
overlaps. The following access methods are provided to the database:
storing a record with a key and a value, deleting a record by a key,
retrieving a record by a key. Moreover, traversal access to every key
are provided, although the order is arbitrary. These access methods
are similar to ones of DBM (or its followers: NDBM and GDBM) library
defined in the UNIX standard. Tokyo Cabinet is an alternative for DBM
because of its higher performance.
As for database of B+ tree, records whose keys are duplicated can be
stored. Access methods of storing, deleting, and retrieving are
provided as with the database of hash table. Records are stored in
order by a comparison function assigned by a user. It is possible to
access each record with the cursor in ascending or descending order.
According to this mechanism, forward matching search for strings and
range search for integers are realized.
As for database of fixed-length array, records are stored with unique
natural numbers. It is impossible to store two or more records with a
key overlaps. Moreover, the length of each record is limited by the
specified length. Provided operations are the same as ones of hash
database.
Table database is also provided as a variant of hash database. Each
record is identified by the primary key and has a set of named columns.
Although there is no concept of data schema, it is possible to search
for records with complex conditions efficiently by using indices of
arbitrary columns.
Tokyo Cabinet is written in the C language, and provided as API of C,
Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms
which have API conforming to C99 and POSIX. Tokyo Cabinet is a free
software licensed under the GNU Lesser General Public License.
THE DINOSAUR WING OF THE DBM FORKS
Tokyo Cabinet is developed as the successor of GDBM and QDBM on the
following purposes. They are achieved and Tokyo Cabinet replaces
conventional DBM products.
improves space efficiency : smaller size of database file.
improves time efficiency : faster processing speed.
improves parallelism : higher performance in multi-thread
environment.
improves usability : simplified API.
improves robustness : database file is not corrupted even under
catastrophic situation.
supports 64-bit architecture : enormous memory space and
database file are available.
As with QDBM, the following three restrictions of traditional DBM: a
process can handle only one database, the size of a key and a value is
bounded, a database file is sparse, are cleared. Moreover, the
following three restrictions of QDBM: the size of a database file is
limited to 2GB, environments with different byte orders can not share a
database file, only one thread can search a database at the same time,
are cleared.
Tokyo Cabinet runs very fast. For example, elapsed time to store 1
million records is 0.7 seconds for hash database, and 1.6 seconds for
B+ tree database. Moreover, the size of database of Tokyo Cabinet is
very small. For example, overhead for a record is 16 bytes for hash
database, and 5 bytes for B+ tree database. Furthermore, scalability
of Tokyo Cabinet is great. The database size can be up to 8EB (9.22e18
bytes).
EFFECTIVE IMPLEMENTATION OF HASH DATABASE
Tokyo Cabinet uses hash algorithm to retrieve records. If a bucket
array has sufficient number of elements, the time complexity of
retrieval is "O(1)". That is, time required for retrieving a record is
constant, regardless of the scale of a database. It is also the same
about storing and deleting. Collision of hash values is managed by
separate chaining. Data structure of the chains is binary search tree.
Even if a bucket array has unusually scarce elements, the time
complexity of retrieval is "O(log n)".
Tokyo Cabinet attains improvement in retrieval by loading RAM with the
whole of a bucket array. If a bucket array is on RAM, it is possible
to access a region of a target record by about one path of file
operations. A bucket array saved in a file is not read into RAM with
the ‘read’ call but directly mapped to RAM with the ‘mmap’ call.
Therefore, preparation time on connecting to a database is very short,
and two or more processes can share the same memory map.
If the number of elements of a bucket array is about half of records
stored within a database, although it depends on characteristic of the
input, the probability of collision of hash values is about 56.7%
(36.8% if the same, 21.3% if twice, 11.5% if four times, 6.0% if eight
times). In such case, it is possible to retrieve a record by two or
less paths of file operations. If it is made into a performance index,
in order to handle a database containing one million of records, a
bucket array with half a million of elements is needed. The size of
each element is 4 bytes. That is, if 2M bytes of RAM is available, a
database containing one million records can be handled.
Traditional DBM provides two modes of the storing operations: "insert"
and "replace". In the case a key overlaps an existing record, the
insert mode keeps the existing value, while the replace mode transposes
it to the specified value. In addition to the two modes, Tokyo Cabinet
provides "concatenate" mode. In the mode, the specified value is
concatenated at the end of the existing value and stored. This feature
is useful when adding an element to a value as an array.
Generally speaking, while succession of updating, fragmentation of
available regions occurs, and the size of a database grows rapidly.
Tokyo Cabinet deal with this problem by coalescence of dispensable
regions and reuse of them. When overwriting a record with a value
whose size is greater than the existing one, it is necessary to remove
the region to another position of the file. Because the time
complexity of the operation depends on the size of the region of a
record, extending values successively is inefficient. However, Tokyo
Cabinet deal with this problem by alignment. If increment can be put
in padding, it is not necessary to remove the region.
The "free block pool" to reuse dispensable regions efficiently is also
implemented. It keeps a list of dispensable regions and reuse the
"best fit" region, that is the smallest region in the list, when a new
block is requested. Because fragmentation is inevitable even then, two
kinds of optimization (defragmentation) mechanisms are implemented.
The first is called static optimization which deploys all records into
another file and then writes them back to the original file at once.
The second is called dynamic optimization which gathers up dispensable
regions by replacing the locations of records and dispensable regions
gradually.
USEFUL IMPLEMENTATION OF B+ TREE DATABASE
Although B+ tree database is slower than hash database, it features
ordering access to each record. The order can be assigned by users.
Records of B+ tree are sorted and arranged in logical pages. Sparse
index organized in B tree that is multiway balanced tree are maintained
for each page. Thus, the time complexity of retrieval and so on is
"O(log n)". Cursor is provided to access each record in order. The
cursor can jump to a position specified by a key and can step forward
or backward from the current position. Because each page is arranged
as double linked list, the time complexity of stepping cursor is
"O(1)".
B+ tree database is implemented, based on above hash database. Because
each page of B+ tree is stored as each record of hash database, B+ tree
database inherits efficiency of storage management of hash database.
Because the header of each record is smaller and alignment of each page
is adjusted according to the page size, in most cases, the size of
database file is cut by half compared to one of hash database.
Although operation of many pages are required to update B+ tree, QDBM
expedites the process by caching pages and reducing file operations.
In most cases, because whole of the sparse index is cached on memory,
it is possible to retrieve a record by one or less path of file
operations.
Each pages of B+ tree can be stored with compressed. Two compression
method; Deflate of ZLIB and Block Sorting of BZIP2, are supported.
Because each record in a page has similar patterns, high efficiency of
compression is expected due to the Lempel-Ziv or the BWT algorithms.
In case handling text data, the size of a database is reduced to about
25%. If the scale of a database is large and disk I/O is the
bottleneck, featuring compression makes the processing speed improved
to a large extent.
NAIVE IMPLEMENTATION OF FIXED-LENGTH DATABASE
Fixed-length database has restrictions that each key should be a
natural number and that the length of each value is limited. However,
time efficiency and space efficiency are higher than the other data
structures as long as the use case is within the restriction.
Because the whole region of the database is mapped on memory by the
‘mmap’ call and referred as a multidimensional array, the overhead
related to the file I/O is minimized. Due to this simple structure,
fixed-length database works faster than hash database, and its
concurrency in multi-thread environment is prominent.
The size of the database is proportional to the range of keys and the
limit size of each value. That is, the smaller the range of keys is or
the smaller the length of each value is, the higher the space
efficiency is. For example, if the maximum key is 1000000 and the
limit size of the value is 100 bytes, the size of the database will be
about 100MB. Because regions around referred records are only loaded
on the RAM, you can increase the size of the database to the size of
the virtual memory.
FLEXIBLE IMPLEMENTATION OF TABLE DATABASE
Table database does not express simple key/value structure but
expresses a structure like a table of relational database. Each record
is identified by the primary key and has a set of multiple columns
named with arbitrary strings. For example, a stuff in your company can
be expressed by a record identified by the primary key of the employee
ID number and structured by columns of his name, division, salary, and
so on. Unlike relational database, table database does not need to
define any data schema and can contain records of various structures
different from each other.
Table database supports query functions with not only the primary key
but also with conditions about arbitrary columns. Each column
condition is composed of the name of a column and a condition
expression. Operators of full matching, forward matching, regular
expression matching, and so on are provided for the string type.
Operators of full matching, range matching and so on are provided for
the number type. Operators for tag search and full-text search are
also provided. A query can contain multiple conditions for logical
intersection. Search by multiple queries for logical union is also
available. The order of the result set can be specified as the
ascending or descending order of strings or numbers.
You can create indices for arbitrary columns to improve performance of
search and sorting. Although columns do not have data types, indices
have types for strings or numbers. Inverted indices for space
separated tokens and character N-gram tokens are also supported. The
query optimizer uses indices in suitable way according to each query.
Indices are implemented as different files of B+ tree database.
PRACTICAL FUNCTIONALITY
Databases on the filesystem feature transaction mechanisms. It is
possible to commit a series of operations between the beginning and the
end of the transaction in a lump, or to abort the transaction and
perform rollback to the state before the transaction. Two isolation
levels are supported; serializable and read uncommitted. Durability is
secured by write ahead logging and shadow paging.
Tokyo Cabinet provides two modes to connect to a database: "reader" and
"writer". A reader can perform retrieving but neither storing nor
deleting. A writer can perform all access methods. Exclusion control
between processes is performed when connecting to a database by file
locking. While a writer is connected to a database, neither readers
nor writers can be connected. While a reader is connected to a
database, other readers can be connect, but writers can not. According
to this mechanism, data consistency is guaranteed with simultaneous
connections in multitasking environment.
Functions of API of Tokyo cabinet are reentrant and available in
multi-thread environment. Discrete database object can be operated in
parallel entirely. For simultaneous operations of the same database
object, read-write lock is used for exclusion control. That is, while
a writing thread is operating the database, other reading threads and
writing threads are blocked. However, while a reading thread is
operating the database, reading threads are not blocked. The locking
granularity of hash database and fixed-length database is per record,
and that of the other databases is per file.
SIMPLE BUT VARIOUS INTERFACES
Tokyo Cabinet provides simple API based on the object oriented design.
Every operation for database is encapsulated and published as lucid
methods as ‘open’ (connect), ‘close’ (disconnect), ‘put’ (insert),
‘out’ (remove), ‘get’ (retrieve), and so on. Because the three of
hash, B+ tree, and fixed-length array database APIs are very similar
with each other, porting an application from one to the other is easy.
Moreover, the abstract API is provided to handle these databases with
the same interface. Applications of the abstract API can determine the
type of the database in runtime.
The utility API is also provided. Such fundamental data structure as
list and map are included. And, some useful features; memory pool,
string processing, encoding, are also included.
Six kinds of API; the utility API, the hash database API, the B+ tree
database API, the fixed-length database API, the table database API,
and the abstract database API, are provided for the C language.
Command line interfaces are also provided corresponding to each API.
They are useful for prototyping, test, and debugging. Except for C,
Tokyo Cabinet provides APIs for Perl, Ruby, Java, and Lua. APIs for
other languages will hopefully be provided by third party.
In cases that multiple processes access a database at the same time or
some processes access a database on a remote host, the remote service
is useful. The remote service is composed of a database server and its
access library. Applications can access the database server by using
the remote database API. The server implements HTTP and the memcached
protocol partly so that client programs on almost all platforms can
access the server easily.
HOW TO USE THE LIBRARY
Tokyo Cabinet provides API of the C language and it is available by
programs conforming to the C89 (ANSI C) standard or the C99 standard.
As the header files of Tokyo Cabinet are provided as ‘tcutil.h’,
‘tchdb.h’, and ‘tcbdb.h’, applications should include one or more of
them accordingly to use the API. As the library is provided as
‘libtokyocabinet.a’ and ‘libtokyocabinet.so’ and they depends
‘libz.so’, ‘librt.so’, ‘libpthread.so’, ‘libm.so’, and ‘libc.so’,
linker options ‘-ltokyocabinet’, ‘-lz’, ‘-lbz2’, ‘-lrt’, ‘-lpthread’,
‘-lm’, and ‘-lc’ are required for build command. A typical build
command is the following.
gcc -I/usr/local/include tc_example.c -o tc_example \
-L/usr/local/lib -ltokyocabinet -lz -lbz2 -lrt -lpthread -lm
-lc
You can also use Tokyo Cabinet in programs written in C++. Because
each header is wrapped in C linkage (‘extern "C"’ block), you can
simply include them into your C++ programs.
LICENSE
Tokyo Cabinet is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as
published by the Free Software Foundation; either version 2.1 of the
License or any later version.
Tokyo Cabinet is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with Tokyo Cabinet (See the file ‘COPYING’); if not,
write to the Free Software Foundation, Inc., 59 Temple Place, Suite
330, Boston, MA 02111-1307 USA.
Tokyo Cabinet was written by Mikio Hirabayashi. You can contact the
author by e-mail to ‘hirarin@gmail.com’.
SEE ALSO
tcutil(3), tchdb(3), tcbdb(3), tcfdb(3), tctdb(3), tcadb(3)
Please see http://1978th.net/tokyocabinet/ for detail.