NAME
bulk_loader - PgQ consumer that loads urlencoded records to slow
databases
SYNOPSIS
bulk_loader.py [switches] config.ini
DESCRIPTION
bulk_loader is a PgQ consumer that reads urlencoded records from a
source queue and writes them into tables according to the configuration
file. It is targeted at slow databases that cannot handle applying each
row as a separate statement. It was originally written for
BizgresMPP/greenplumDB, which have very high per-statement overhead,
but it can also be used to load a regular PostgreSQL database that
cannot manage regular replication.
Behaviour properties:
- reads urlencoded "logutriga" records.
- does not do partitioning, but optionally allows redirecting table
  events.
- does not keep event order.
- always loads data with COPY, either directly to the main table
  (INSERTs) or to temp tables (UPDATE/COPY), then applies from there.
Events are usually produced by pgq.logutriga(). Logutriga adds all the
data of the record into the event (also in the case of updates and
deletes).
QUICK-START
Basic bulk_loader setup and usage can be summarized by the following
steps:
1. pgq and logutriga must be installed in the source databases. See the
pgqadm man page for details. The target database must also have the
pgq_ext schema.
2. edit a bulk_loader configuration file, say bulk_loader_sample.ini
3. create source queue
$ pgqadm.py ticker.ini create <queue>
4. Tune source queue to have big batches:
$ pgqadm.py ticker.ini config <queue> ticker_max_count="10000" ticker_max_lag="10 minutes" ticker_idle_period="10 minutes"
5. create target database and tables in it.
6. launch bulk_loader in daemon mode
$ bulk_loader.py -d bulk_loader_sample.ini
7. start producing events (create logutriga triggers on tables)
CREATE OR REPLACE TRIGGER trig_bulk_replica AFTER INSERT OR UPDATE ON some_table FOR EACH ROW EXECUTE PROCEDURE pgq.logutriga(<queue>)
CONFIG
Common configuration parameters
job_name
Name for the particular job the script does. The script will log under
this name to logdb/logserver. The name is also used as the default PgQ
consumer name. It should be unique.
pidfile
Location for the pid file. If not given, the script is not allowed to
daemonize.
logfile
Location for log file.
loop_delay
If the process runs continuously, how long to sleep after each work
loop, in seconds. Default: 1.
connection_lifetime
Close and reconnect older database connections.
use_skylog
foo.
Common PgQ consumer parameters
pgq_queue_name
Queue name to attach to. No default.
pgq_consumer_id
Consumer ID to use when registering. Default: %(job_name)s
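The %(job_name)s default above is ini-style interpolation. As a rough
illustration only, the following sketch uses the Python 3 standard
library configparser module (not the skytools config loader itself) to
show how such a reference expands. The section name, file paths and
queue name are hypothetical placeholders.

```python
import configparser

# Hypothetical sample configuration. The section name, paths and queue
# name are placeholders chosen for illustration.
SAMPLE_INI = """
[bulk_loader]
job_name = bulk_loader_sample
pidfile = pid/%(job_name)s.pid
logfile = log/%(job_name)s.log
loop_delay = 1
pgq_queue_name = sample_queue
pgq_consumer_id = %(job_name)s
"""

# ConfigParser uses BasicInterpolation by default, which expands
# %(name)s references to other options in the same section.
parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)
section = parser["bulk_loader"]

print(section["pgq_consumer_id"])
print(section["pidfile"])
print(section.getfloat("loop_delay"))
```

Deriving pidfile and logfile from job_name in this way keeps the file
names unique as long as job_name is unique.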
Config options specific to bulk_loader
src_db
Connect string for source database where the queue resides.
dst_db
Connect string for target database where the tables should be
created.
remap_tables
Optional parameter for table redirection. Contains a comma-separated
list of <oldname>:<newname> pairs. E.g.: oldtable1:newtable1,
oldtable2:newtable2.
load_method
Optional parameter for load method selection. Available options:
0   UPDATE as UPDATE from temp table. This is the default.
1   UPDATE as DELETE+COPY from temp table.
2   merge INSERTs with UPDATEs, then do DELETE+COPY from temp table.
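To make the remap_tables value format concrete, here is a minimal
sketch of how such a string could be parsed and applied. It is not
taken from the bulk_loader source. The function names
parse_remap_tables and redirect are hypothetical, and tolerating
whitespace around names is an assumption based on the example value.

```python
def parse_remap_tables(value):
    """Parse 'old1:new1, old2:new2' into a dict {oldname: newname}."""
    mapping = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty value or a trailing comma
        old, sep, new = pair.partition(":")
        old, new = old.strip(), new.strip()
        if not sep or not old or not new:
            raise ValueError("expected <oldname>:<newname>, got %r" % pair)
        mapping[old] = new
    return mapping


def redirect(table, mapping):
    """Return the redirected table name, or the original if not remapped."""
    return mapping.get(table, table)


mapping = parse_remap_tables("oldtable1:newtable1, oldtable2:newtable2")
print(redirect("oldtable1", mapping))
print(redirect("untouched_table", mapping))
```

Tables that do not appear in the list keep their original name, which
matches the parameter being optional.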
LOGUTRIGA EVENT FORMAT
The PgQ trigger function pgq.logutriga() sends table change events into
the queue in the following format:
ev_type
(op || ":" || pkey_fields), where op is "I", "U" or "D",
corresponding to insert, update or delete, and pkey_fields is a
comma-separated list of primary key fields for the table. The
operation type is always present, but the pkey_fields list can be
empty if the table has no primary key. Example: I:col1,col2
ev_data
Urlencoded record of data. It uses db-specific urlencoding where the
existence of = is meaningful: a missing = means NULL, a present =
means a literal value. Example: id=3&name=str&nullvalue&emptyvalue=
ev_extra1
Fully qualified table name.
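The event format above can be decoded along the following lines. This
is a minimal sketch under stated assumptions, not the decoder that
bulk_loader itself uses. The function names are hypothetical, and
decoding "+" as a space (via unquote_plus) is an assumption about the
db-specific encoding.

```python
from urllib.parse import unquote_plus


def parse_ev_type(ev_type):
    """Split 'I:col1,col2' into the operation and the list of pkey fields."""
    op, _, keys = ev_type.partition(":")
    if op not in ("I", "U", "D"):
        raise ValueError("unknown operation in ev_type: %r" % ev_type)
    # The pkey list may be empty when the table has no primary key.
    pkey_fields = [k for k in keys.split(",") if k]
    return op, pkey_fields


def parse_ev_data(ev_data):
    """Decode a urlencoded record; a field without '=' becomes None (NULL)."""
    row = {}
    for item in ev_data.split("&"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A missing '=' means NULL, a present '=' means a literal value,
        # which may be the empty string.
        row[unquote_plus(key)] = unquote_plus(value) if sep else None
    return row


print(parse_ev_type("I:col1,col2"))
print(parse_ev_data("id=3&name=str&nullvalue&emptyvalue="))
```

Note that a generic query-string parser would not preserve the
difference between NULL and the empty string, which is why the sketch
splits the fields by hand.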
COMMAND LINE SWITCHES
The following switches are common to all skytools.DBScript-based Python
programs.
-h, --help
show help message and exit
-q, --quiet
make program silent
-v, --verbose
make program more verbose
-d, --daemon
make program go background
The following switches are used to control an already running process.
The pidfile is read from the config, then the signal is sent to the
process ID specified there.
-r, --reload
reload config (send SIGHUP)
-s, --stop
stop program safely (send SIGINT)
-k, --kill
kill program immediately (send SIGTERM)
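The control switches amount to reading a process ID from the pidfile
and sending it a signal. The sketch below illustrates that mechanism
for Unix systems. It is not the skytools.DBScript implementation. The
function names are hypothetical, and the pidfile is assumed to contain
only the process ID as decimal text.

```python
import os
import signal

# Signal sent by each control switch, as listed above.
SWITCH_SIGNALS = {
    "reload": signal.SIGHUP,   # -r, --reload
    "stop": signal.SIGINT,     # -s, --stop
    "kill": signal.SIGTERM,    # -k, --kill
}


def read_pid(pidfile):
    """Read the process ID from a pidfile (assumed to hold only the PID)."""
    with open(pidfile) as f:
        return int(f.read().strip())


def signal_running_process(pidfile, action):
    """Send the signal for 'reload', 'stop' or 'kill' to the running job."""
    pid = read_pid(pidfile)
    os.kill(pid, SWITCH_SIGNALS[action])
    return pid
```

For example, signal_running_process("pid/bulk_loader_sample.pid",
"reload") would send SIGHUP to the process recorded in that
(hypothetical) pidfile.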
09/22/2008