DjVu - DjVu and DjVuLibre.

NAME

       DjVu - DjVu and DjVuLibre.

INTRODUCTION

       Although  the Internet has given us a worldwide infrastructure on which
       to build the universal library, much of the world  knowledge,  history,
       and  literature  is  still  trapped  on  paper  in the basements of the
       world's traditional libraries. Many libraries and content owners are in
       the  process  of digitizing their collections.  While many such efforts
       involve the  painstaking  process  of  converting  paper  documents  to
       computer-friendly  form,  such  as SGML based formats, the high cost of
       such  conversions  limits  their  extent.   Scanning   documents,   and
       distributing   the   resulting   images   electronically  is  not  only
       considerably cheaper, but also more faithful to the  original  document
       because it preserves its visual aspect.

       Despite   the  quickly  improving  speed  of  network  connections  and
       computers, the number of scanned document images accessible on the  Web
       today is relatively small. There are several reasons for this.

       The  first reason is the relatively high cost of scanning anything else
       but unbound sheets in black and white. This  problem  is  slowly  going
       away with the appearance of fast and low-cost color scanners with sheet
       feeders.

       The second reason is that long-established image compression  standards
       and  file  formats  have  proved  inadequate  for  distributing scanned
       documents at high resolution, particularly color documents.   Not  only
       are  the  file  sizes  and download times impractical, the decoding and
       rendering times are also prohibitive.  A typical magazine page  scanned
       in  color  at 100 dpi in JPEG would typically occupy 100 KB to 200 KB ,
       but the text would be hardly readable: insufficient for screen  viewing
       and  totally  unacceptable for printing. The same page at 300 dpi would
       have sufficient quality for viewing and printing,  but  the  file  size
       would  be  300  KB  to 1000 KB at best, which is impractical for remote
       access. Another major problem is that a fully  decoded  300  dpi  color
       images of a letter-size page occupies 24 MB of memory and easily causes
       disk swapping.

       The third reason is  that  digital  documents  are  more  than  just  a
       collection of individual page images. Pages in a scanned documents have
       a natural serial order. Special provision must be made to  ensure  that
       flipping pages be instantaneous and effortless so as to maintain a good
       user experience. Even more important, most  existing  document  formats
       force  users  to download the entire document first before displaying a
       chosen page.  However, users often want to jump to individual pages  of
       the  document  without  waiting  for  the  entire document to download.
       Efficient  browsing  requires  efficient  random  page   access,   fast
       sequential  page  flipping,  and  quick rendering. This can be achieved
       with a combination of advanced compression, pre-fetching, pre-decoding,
       caching,  and  progressive  rendering.  DjVu  decomposes each page into
       multiple components (text, backgrounds,  images,  libraries  of  common
       shapes...)   that  may  be  shared  by  several pages and downloaded on
       demand.  All these requirements  call  for  a  very  sophisticated  but
       parsimonious  control  mechanism  to handle on-demand downloading, pre-
       fetching, decoding, caching, and  progressive  rendering  of  the  page
       images.   What  is  being  considered here is not just a document image
       compression technique, but a whole platform for document delivery.

       DjVu is an image  compression  technique,  a  document  format,  and  a
       software  platform  for  delivering  documents images over the Internet
       that fulfills the above requirements.

DJVU IMAGE COMPRESSION

       The DjVu image compression is based on three technologies:

   DjVuPhoto
       DjVuPhoto, also known as IW44, is a wavelet-based continuous-tone image
       compression  technique with progressive decoding/rendering.  It is best
       used for encoding photographic images in colors or in shades  of  gray.
       Images are typically half the size as JPEG for the same distortion.

   DjVuBitonal
       DjVuBitonal,  also  known  as  JB2, is a bitonal image compression that
       takes advantage of repetitions of nearly identical shapes on  the  page
       (such  as  characters) to efficiently compress text images.  It is best
       used to compress black and white images representing  text  and  simple
       drawings.  A typical 300 dpi page in DjVuBitonal occupies 5 to 25 KB (3
       to 8 times better than TIFF-G4 or PDF ).

   DjVuDocument
       DjVuDocument is a compression technique specifically designed for color
       digital  documents  images containing both pictures and text, such as a
       page of a magazine.  DjVuDocument  represents  images  into  separately
       compressed  layers.   The  foreground  layer is usually compressed with
       DjVu Bitonal and contains the text and drawings.  The background  layer
       is  usually  compressed  with  DjVuPhoto  and  contains  the background
       texture and the pictures at lower resolution.

DJVU DOCUMENT DELIVERY PLATFORM

       The DjVu technology is designed from  the  ground  up  to  support  the
       efficient delivery of digital documents over the Internet.  It provides
       various ways to deal with multi-page documents,  and  various  ways  to
       enrich the content with hyper-links, meta-data, searchable text, etc.

   MIME types
       The  DjVu  format has an official MIME type of image/vnd.djvu, which is
       the preferred content-type to be given by http servers for DjVu  files.
       Unofficial  mime  types used historically are image/x.djvu and image/x-
       djvu, which may still  be  encountered.   Ideally,  clients  should  be
       configured  to  handle  all three.  (For web server configuration help,
       see  http://www.djvuzone.org/support/tutorial/chapter-authorin.html.)

   Bundled multi-page documents
       Bundled  multi-page  DjVu  document uses a single file to represent the
       entire document.  This single file contains all the pages  as  well  as
       ancillary  information (e.g. the page directory, data shared by several
       pages,  thumbnails,  etc.).   Using  a  single  file  format  is   very
       convenient for storing documents or for sending email attachments.

       When you type the URL of a multi-page document, the DjVu browser plugin
       starts downloading the whole file, but displays the first page as  soon
       as  it is available.  You can immediately navigate to other pages using
       the DjVu toolbar.  Suppose however that the document  is  stored  on  a
       remote  web  server.  You can easily access the first page and see that
       this is not the document you wanted.  Although you will  never  display
       the other pages the browser is transferring data for these pages and is
       wasting the bandwidth of your server (and the bandwidth of the Internet
       too).  You could also see the summary of the document on the first page
       and jump to page 100.  But page 100 cannot be displayed until data  for
       pages  1  to  99  has  been  received.   You  may  have to wait for the
       transmission of  unnecessary  page  data.   This  second  problem  (the
       unnecessary  wait)  can be solved using the ``byte serving'' options of
       the HTTP/1.1 protocol.  This option has to  be  supported  by  the  web
       server,  the proxies, the caches and the browser.  Byte serving however
       does not solve the first problem (the waste of bandwidth).

   Indirect multi-page documents
       Indirect multi-page DjVu documents solve both  problems.   An  indirect
       multi-page  DjVu  document is composed of several files.  The main file
       is named the index file.  You can browse a document using  the  URL  of
       the  index  file,  just like you do with a bundled multi-page document.
       The index file however is very small.  It simply contains the  document
       directory  and  the  URLs  of secondary files containing the page data.
       When you browse an  indirect  multi-page  document,  the  browser  only
       accesses  data  for  the  pages you are viewing.  This can be done at a
       reasonable speed because the browser maintains a  cache  of  pages  and
       sometimes  pre-fetches  a  few  pages  ahead of the current page.  This
       model uses the web serving bandwidth much more  effectively.   It  also
       eliminates  unnecessary  delays  when  jumping  ahead  to pages located
       anywhere in a long document.

   Annotations
       Every DjVu image optionally includes so-called annotation chunks.   The
       annotation  chunk is often used to define hyper-links to other document
       pages or to arbitrary web pages.  Annotation chunks can  also  be  used
       for  other purposes such as setting the initial viewing mode of a page,
       defining highlighted zones, or storing arbitrary  meta-data  about  the
       page or the document.

   Hidden text
       Every   DjVu  image  optionally  includes  a  hidden  text  layer  that
       associated graphical features with the corresponding text.  The  hidden
       text  layer  is  usually  generated  by  running  an  Optical Character
       Recognition software.  This textual information provides  for  indexing
       DjVu documents and copying/pasting text from DjVu page images.

   Thumbnails
       DjVu documents sometimes contain pre-computed page thumbnails.

   Outline
       DjVu  documents  sometimes  contain  a  navigation  chunk containing an
       outline, that is, a hierarchical table of contents with pointers to the
       corresponding document pages.

DJVUZONE AND DJVULIBRE

       The  DjVu technology was initially created by a few researchers in AT&T
       Labs    between    1995    and    1999.      Lizardtech,     Inc.     (
       http://www.lizardtech.com  )  then  obtained  a commercial license from
       AT&T and continued  the  development.   They  have  now  a  variety  of
       solutions  for  producing  and  distributing  documents  using the DjVu
       technology.

       The DjVuZone web site ( http://www.djvuzone.org ) is managed by the few
       AT&T  Labs  researchers  who  created  the DjVu technology in the first
       place.  We promote the DjVu  technology  by  providing  an  independent
       source of information about DjVu.

       Understanding  how  little  room  there  is  for a proprietary document
       format, Lizardtech released the DjVu Reference Library  under  the  GNU
       Public  License  in  December  2000.  This library entirely defines the
       compression  format  and  the  elementary  codecs.   Six  month  later,
       Lizardtech  released  an  updated DjVu Reference Library as well as the
       source code of the Unix viewer.

       These two releases form the basis of our  initial  DjVuLibre  software.
       We  modified  the  build  system to comply with the expectations of the
       open source community.  Various bugs and portability issues  have  been
       fixed.   We  also  tried  to  make it simpler to use and install, while
       preserving the essential structure of the Lizardtech releases.

       The DjVuLibre software contains the following components:

       bzz(1) A  general  purpose  compression  command  line  program.   Many
              internal   DjVu   data  structures  are  compressed  using  this
              technique.

       c44(1) A DjVuPhoto command line encoder. This state-of-the-art  wavelet
              compressor produces DjVuPhoto images from PPM or JPEG images.

       cjb2(1)
              A  DjVuBitonal  command line encoder. This soft-pattern-matching
              compressor produces DjVuBitonal images from PBM images.  It  can
              encode  images without loss, or introduce small changes in order
              to improve the compression ratio.  The lossless encoding mode is
              competitive with that of the Lizardtech commercial encoders.

       cpaldjvu(1)
              A  DjVuDocument command line encoder for images with few colors.
              This encoder is well suited to compressing images with  a  small
              number  of  distinct  colors  (e.g. screen-shots).  The dominant
              color is encoded by the background layer.  The other colors  are
              encoded by the foreground layer.

       csepdjvu(1)
              A  DjVuDocument command line encoder for separated images.  This
              encoder takes a file  containing  pre-segmented  foreground  and
              background images and produces a DjVuDocument image.

       ddjvu(1)
              A command line decoder for DjVu images.  This program produces a
              PNM image representing  any  segment  of  any  page  of  a  DjVu
              document at any resolution.

       djview(1)
              A stand-alone viewer for DjVu images.  This sophisticated viewer
              displays DjVu documents.  It implements document  navigation  as
              well as fast zooming and panning.

       nsdejavu(1)
              A web browser plugin for viewing DjVu images.  This small plugin
              allows  for  viewing  DjVu  documents  from  web  browsers.   It
              internally uses djview to perform the actual work.

       djvups(1)
              A   command   line  tool  for  converting  DjVu  documents  into
              PostScript .

       djvm(1)
              A command line tool for  manipulating  bundled  multi-page  DjVu
              documents.   This  program  is  often used to collect individual
              pages and produce a bundled document.

       djvmcvt(1)
              A command line tool for converting bundled documents to indirect
              documents and conversely.

       djvused(1)
              A   powerful  command  line  tool  for  manipulating  multi-page
              documents, creating or editing annotation  chunks,  creating  or
              editing  hidden text layers, pre-computing thumbnail images, and
              more...

       djvutxt(1)
              A command line  tool  to  extract  the  hidden  text  from  DjVu
              documents.

       djvudump(1)
              A  command  line  tool  for inspecting DjVu files and displaying
              their internal structure.

       djvuextract(1)
              A command line tool for dis-assembling DjVu image files.

       djvumake(1)
              A command line tool for assembling DjVu image files.

       djvuserve(1)
              A CGI program for generating indirect multi-page DjVu  documents
              on the fly.

       djvutoxml(1), djvuxmlparser(1)
              Command line tools to edit DjVu metadata as XML files.

DJVU ENCODERS AND ANY2DJVU

       DjVuLibre  comes  with  a  variety  of specialized encoders, c44(1) for
       photographic images, cjb2(1) for bitonal images,  and  cpaldjvu(1)  for
       images  with few distinct colors.  Although these encoders perform well
       in their specialized domain, they cannot handle complex tasks involving
       segmentation and multipage encoding.

       The         Lizardtech         commercial         products         (see
       http://www.lizardtech.com/solutions/document) can perform these complex
       encoding tasks

       Another   solution   is   provided   by   the   compression  server  at
       (http://any2djvu.djvuzone.org).   This  machine   uses   pre-lizardtech
       prototype  encoders  from  AT&T Labs and performs almost as well as the
       commercial  Lizardtech  encoders.   Please  note  that   the   Any2DjVu
       compression  server  comes  with  no guarantee, that nothing is done to
       ensure that your documents will remain confidential, and that there  is
       only one computer working for the whole planet.

CREDITS

       Numerous  people  have  contributed  to the DjVu source code during the
       last five years.  Please submit a sourceforge bug report to update  the
       following list.

          Yoshua Bengio, Leon Bottou, Chakradhar Chandaluri, Regis M. Chaplin,
          Ming Chen, Parag Deshmukh, Royce Edwards,  Andrew  Erofeev,  Praveen
          Guduru, Patrick Haffner, Paul G. Howard, Orlando Keise, Yann Le Cun,
          Artem Mikheev, Florin Nicsa, Joseph M. Orost,  Steven  Pigeon,  Bill
          Riemers,   Patrice  Simard,  Jeffery  Triggs,  Luc  Vincent,  Pascal
          Vincent.