NAME

       pstotext - extract ASCII text from a PostScript or PDF file

SYNTAX

       pstotext [option|pathname]...

       where option includes:

       -cork
       -landscape
       -landscapeOther
       -portrait
       -
       -output file
       -gs command
       -debug
       -bboxes

DESCRIPTION

       pstotext reads one or more PostScript or PDF files, and writes to
       standard output a representation of the plain text that would be
       displayed if the files were printed.  As described in the DETAILS
       section below, this representation is only an approximation.
       Nevertheless, it is often useful for information retrieval (e.g.,
       running grep(1) or building a full-text index) or for recovering the
       text from a PostScript or PDF file whose source you have lost.

       pstotext  calls  Ghostscript,  and requires Aladdin Ghostscript version
       3.51 or newer.  Ghostscript must be invokable  on  the  current  search
       path  as  gs.  Alternatively, you can use the -gs option to specify the
       command (pathname and options) to run  Ghostscript.   For  example,  on
       Windows you might use -gs "c:\gs\gswin32c.exe -Ic:\gs;c:\gs\fonts".

       pstotext  reads  and  processes  its  command  line from left to right,
       ignoring the case of options.  When it encounters a pathname, it  opens
       the  file  and  expects  to  find  a  PostScript job or PDF document to
       process.  The option - means to read and process a PostScript job  from
       standard  input.   If  no  -  or  pathname  arguments  are encountered,
       pstotext reads a PostScript job from standard input.  (PDF documents
       require random access and therefore cannot be read from standard
       input.)  You can use the -output option to specify an output file; it
       must appear before the input file, because the command line is
       processed from left to right.  Otherwise pstotext writes to standard
       output.
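
       As an illustration of the left-to-right rule, the following Python
       sketch builds a pstotext argument list with every option placed ahead
       of the input pathnames.  The helper names build_pstotext_argv and
       run_pstotext, and their parameters, are hypothetical and not part of
       pstotext; only the option names are taken from this page.  It assumes
       a pstotext executable is on the search path.

```python
import subprocess


def build_pstotext_argv(inputs, output=None, gs=None, cork=False,
                        orientation=None):
    """Return an argv list for pstotext (hypothetical helper).

    Options are emitted before the input pathnames because pstotext
    processes its command line from left to right.
    """
    if orientation not in (None, "portrait", "landscape", "landscapeOther"):
        raise ValueError("unknown orientation: %r" % (orientation,))
    argv = ["pstotext"]
    if cork:
        argv.append("-cork")
    if orientation is not None:
        argv.append("-" + orientation)
    if gs is not None:
        # The Ghostscript command (pathname and options) is one argument.
        argv += ["-gs", gs]
    if output is not None:
        argv += ["-output", output]
    # With no inputs, pstotext reads a PostScript job from standard input.
    argv += list(inputs)
    return argv


def run_pstotext(inputs, **options):
    """Run pstotext and return its standard output as bytes.

    Assumes a pstotext executable is on the search path.
    """
    completed = subprocess.run(
        build_pstotext_argv(inputs, **options),
        check=True,
        stdout=subprocess.PIPE,
    )
    return completed.stdout
```

       For example, build_pstotext_argv(["paper.ps"], output="paper.txt")
       places -output paper.txt ahead of paper.ps.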

       The option -cork is only relevant  for  PostScript  files  produced  by
       dvips  from  TeX  or LaTeX documents; it tells pstotext to use the Cork
       encoding (known as T1 in LaTeX) rather than the old TeX  text  encoding
       (known as OT1 in LaTeX).  Unfortunately, files produced by dvips don’t
       record which font encodings were used, so pstotext must be told.

       The options -landscape and -landscapeOther should be used for documents
       that   must  be  rotated  90  degrees  clockwise  or  counterclockwise,
       respectively, in order to be readable.

       The options -debug and -bboxes are mainly useful to the maintainers of
       pstotext.  -debug shows Ghostscript output and error messages; -bboxes
       outputs one word per line, together with bounding box information.

DETAILS

       pstotext does its work by telling  Ghostscript  to  load  a  PostScript
       library  that  causes  it  to  write to its standard output information
       about each string rendered by a PostScript job or PDF  document.   This
       information   includes   the  characters  of  the  string,  and  enough
       additional information to approximate the string’s bounding  rectangle.
       pstotext  post-processes  this  information  and  outputs a sequence of
       words delimited by space, newline, and formfeed.

       pstotext outputs words in the same sequence as they are rendered by the
       document.  This usually, but not always, matches the order in which a
       human would read the words on a page.  Within this sequence, words are
       separated by either space or newline, depending on whether or not they
       fall on the same line.  Each page is terminated with a formfeed.  If
       you use the wrong option from the set {-portrait, -landscape,
       -landscapeOther}, pstotext is likely to substitute newline for space.
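
       The delimiter scheme described above (space between words, newline
       between lines, formfeed after each page) can be consumed with a few
       lines of Python.  This is a minimal sketch; the function name
       split_pstotext_output is hypothetical, and the input is assumed to be
       text already decoded from the default (non -bboxes) output.

```python
def split_pstotext_output(text):
    """Split pstotext output into pages, lines, and words.

    Returns a list of pages; each page is a list of lines; each line
    is a list of words.
    """
    pages = text.split("\f")
    # Each page is terminated (not separated) by a formfeed, so the
    # piece after the final formfeed is empty and is dropped.
    if pages and pages[-1].strip() == "":
        pages.pop()
    result = []
    for page in pages:
        lines = []
        for line in page.split("\n"):
            words = line.split()
            if words:  # skip blank lines
                lines.append(words)
        result.append(lines)
    return result
```

       A page that contains no text still appears in the result, as an empty
       list, so page numbering is preserved.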

       A PostScript job or PDF document often  renders  one  word  as  several
       strings  in  order  to  get correct spacing between particular pairs of
       characters.  pstotext does its best to assemble these strings back into
       words,  using  a  simple  heuristic: strings separated by a distance of
       less than 0.3 times the minimum of the average character widths in  the
       two strings are considered to be part of the same word.  Note that this
       typically causes leading and  trailing  punctuation  characters  to  be
       included with a word.
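
       The joining rule can be sketched in Python as follows.  This is a
       simplified one-dimensional illustration, not the pstotext source: it
       assumes each rendered string is given as a hypothetical
       (text, x_start, x_end) tuple with non-empty text, and that all
       strings lie on the same line.  The function name assemble_words is
       likewise an assumption.

```python
def assemble_words(fragments, factor=0.3):
    """Join rendered strings into words using the gap heuristic.

    fragments: list of (text, x_start, x_end) tuples in rendering
    order, all assumed to be on one line, with non-empty text.
    """
    words = []
    prev = None  # (x_end, average character width) of previous string
    for text, x_start, x_end in fragments:
        avg_width = (x_end - x_start) / len(text)
        if prev is None:
            words.append(text)
        else:
            prev_end, prev_avg = prev
            gap = x_start - prev_end
            # Same word if the gap is under 0.3 times the smaller of
            # the two strings' average character widths.
            if gap < factor * min(prev_avg, avg_width):
                words[-1] += text
            else:
                words.append(text)
        prev = (x_end, avg_width)
    return words
```

       With the strings ("W", 0.0, 10.0), ("orld", 10.5, 40.5), and
       ("peace", 46.0, 86.0), the first gap (0.5) is below the threshold
       (2.25) and the second (5.5) is not, so the result is two words.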

       The  PostScript  language  provides a flexible encoding scheme by which
       character codes in strings select specific characters (symbols),  so  a
       PostScript  job  is free to use any character code.  On the other hand,
       pstotext always translates to the ISO 8859-1 (Latin-1) character set,
       which is an extension of ASCII covering most Western European
       languages.  When a character isn’t present in ISO 8859-1, pstotext uses
       a  sequence  of  characters,  e.g.,  "---"  for  em dash or "A\226" for
       Abreve.  pstotext can be fooled by a font whose Encoding vector doesn’t
       follow  Adobe’s  conventions, but it contains heuristics allowing it to
       handle a wide variety of misbehaving fonts.

       (pstotext no longer translates hyphen (\255) to minus (\055).)
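
       Because the output is ISO 8859-1, a program that reads it should
       decode the bytes explicitly instead of relying on the locale.  The
       Python sketch below shows one way to do so, and also spells out the
       two octal character codes mentioned above.  The names
       decode_pstotext_output, HYPHEN, and MINUS are hypothetical.

```python
def decode_pstotext_output(raw):
    """Decode pstotext output bytes as ISO 8859-1 (Latin-1)."""
    # Latin-1 maps every byte value 0-255 to a code point, so this
    # decode cannot raise UnicodeDecodeError.
    return raw.decode("latin-1")


# Character codes on this page are written in octal.
HYPHEN = chr(0o255)  # code point U+00AD
MINUS = chr(0o055)   # code point U+002D, the ASCII "-"
```

       The decoded string can then be re-encoded as needed, for example with
       text.encode("utf-8").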

AUTHOR

       Andrew Birrell  (PostScript  libraries),  Paul  McJones  (application),
       Russell  Lang  (Windows  and  OS/2 adaptation), and Hunter Goatley (VMS
       adaptation).

SEE ALSO

       pstotext incorporates technology originally developed for the Virtual
       Paper project at SRC; see
       http://www.research.digital.com/SRC/virtualpaper/.

       As  mentioned  above,  pstotext  invokes  Ghostscript.   See  gs(1)  or
       http://www.cs.wisc.edu/~ghost/.

COPYRIGHT

       Copyright 1995-8 Digital Equipment Corporation.
       Distributed only by permission.
       See file /usr/share/doc/pstotext/copyright for details.

       Last modified on Sat Feb  5 21:00:00 AEST 2000 by rjl
            modified on Fri Jun  5 14:02:37 PDT 1998 by mcjones
            modified on Wed Jun  7 17:47:56 PDT 1995 by birrell

       This  file  was  generated automatically by mtex software; see the mtex
       home page at http://www.research.digital.com/SRC/mtex/.

                                                                   pstotext(1)