Man Linux: Main Page and Category List

NAME

       hadoop - software platform to process vast amounts of data

SYNOPSIS

       Usage: hadoop [--config confdir] COMMAND

DESCRIPTION

       Here is what makes Hadoop especially useful:

       Scalable
              Hadoop can reliably store and process petabytes.

       Economical
              It  distributes  the  data  and  processing  across  clusters of
              commonly available computers. These clusters can number into the
              thousands of nodes.

       Efficient
              By  distributing  the data, Hadoop can process it in parallel on
              the nodes where the data is located.  This  makes  it  extremely
              rapid.

       Reliable
              Hadoop  automatically  maintains  multiple  copies  of  data and
              automatically redeploys computing tasks based on failures.
       Hadoop implements MapReduce, using the Hadoop Distributed  File  System
       (HDFS).  MapReduce divides applications into many small blocks of work,
       and HDFS places replicas of the data blocks on compute nodes around the
       cluster, so that work can run where the data  is  located.   For  more
       information about Hadoop, see http://wiki.apache.org/hadoop/.

OPTIONS

       --config configdir

              Overrides   the  "HADOOP_CONF_DIR"  environment  variable.   See
              "ENVIRONMENT" section below.

COMMANDS

       namenode -format
              format the DFS filesystem

       secondarynamenode
              run the DFS secondary namenode

       namenode
              run the DFS namenode

       datanode
              run a DFS datanode

       dfsadmin
              run a DFS admin client

       fsck   run a DFS filesystem checking utility

       fs     run a generic filesystem user client

       balancer
              run a cluster balancing utility

       jobtracker
              run the MapReduce job Tracker node

       pipes  run a Pipes job

       tasktracker
              run a MapReduce task Tracker node

       job    manipulate MapReduce jobs

       version
              print the version

       jar <jar>
              run a jar file

       distcp <srcurl> <desturl>
              copy file or directories recursively

       archive -archiveName NAME <src>* <dest>
              create a hadoop archive

       daemonlog
              get/set the log level for each daemon

       CLASSNAME
              run the class named CLASSNAME

       Most commands print help when invoked without parameters.
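       For example, the daemonlog command takes a daemon's HTTP address and a
       fully qualified logger name.  The host and port below are placeholders
       for a real datanode address; adjust them to your own cluster:

       ```shell
       # Read, then raise, the log level of the DataNode logger at runtime.
       # datanode1.example.com:50075 is a placeholder daemon HTTP address.
       hadoop daemonlog -getlevel datanode1.example.com:50075 \
           org.apache.hadoop.hdfs.server.datanode.DataNode
       hadoop daemonlog -setlevel datanode1.example.com:50075 \
           org.apache.hadoop.hdfs.server.datanode.DataNode DEBUG
       ```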

FILESYSTEM COMMANDS

       The following commands can be used with the fs command, as in:  hadoop
       fs [filesystem command]

       · -ls <path>

       · -lsr <path>

       · -du <path>

       · -dus <path>

       · -count[-q] <path>

       · -mv <src> <dst>

       · -cp <src> <dst>

       · -rm [-skipTrash] <path>

       · -rmr [-skipTrash] <path>

       · -expunge

       · -put <localsrc> ... <dst>

       · -copyFromLocal <localsrc> ... <dst>

       · -moveFromLocal <localsrc> ... <dst>

       · -get [-ignoreCrc] [-crc] <src> <localdst>

       · -getmerge <src> <localdst> [addnl]

       · -cat <src>

       · -text <src>

       · -copyToLocal [-ignoreCrc] [-crc] <src> <localdst>

       · -moveToLocal [-crc] <src> <localdst>

       · -mkdir <path>

       · -setrep [-R] [-w] <rep> <path/file>

       · -touchz <path>

       · -test -[ezd] <path>

       · -stat [format] <path>

       · -tail [-f] <file>

       · -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...

       · -chown [-R] [OWNER][:[GROUP]] PATH...

       · -chgrp [-R] GROUP PATH...

       · -help [cmd]
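       A typical session with these subcommands might create a directory  in
       HDFS, upload a local file, and read it back.  The paths and file name
       below are illustrative:

       ```shell
       # Create a directory in HDFS and upload a local file into it.
       hadoop fs -mkdir /user/alice
       hadoop fs -put notes.txt /user/alice/
       # List the directory and print the uploaded file back out.
       hadoop fs -ls /user/alice
       hadoop fs -cat /user/alice/notes.txt
       ```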

       The generic options supported are:

       -conf <configuration file>
              specify an application configuration file

       -D <property=value>
              use value for given property

       -fs <local|namenode:port>
              specify a namenode

       -jt <local|jobtracker:port>
              specify a job tracker

       -files <comma separated list of files>
              specify  comma  separated  files  to be copied to the map reduce
              cluster

       -libjars <comma separated list of jars>
              specify comma separated jar files to include in the classpath.

       -archives <comma separated list of archives>
              specify comma separated archives to be unarchived on the compute
              machines.
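       These generic options can be combined on a single command  line.   As
       a sketch, the following points the fs client at a specific  namenode
       and overrides one property for just that invocation;  the  host  name
       and port are placeholders:

       ```shell
       # Query a particular namenode, overriding the replication factor
       # property only for this command.  namenode.example.com:8020 is a
       # placeholder address.
       hadoop fs -fs namenode.example.com:8020 -D dfs.replication=2 -ls /
       ```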

FILES

       /etc/hadoop/conf

              This  symbolic  link  points  to  the  currently  active  Hadoop
              configuration directory.

       Note to Hadoop System Admins

       The "/etc/hadoop/conf" link is managed by the alternatives(8)  command,
       so you should not change this symlink directly.

       To see which alternatives(8) Hadoop configurations you currently have,
       run the following command:

       # alternatives --display hadoop
       hadoop - status is auto.
        link currently points to /etc/hadoop/conf.pseudo
       /etc/hadoop/conf.empty - priority 10
       /etc/hadoop/conf.pseudo - priority 30
       Current 'best' version is /etc/hadoop/conf.pseudo.

       This shows that the link points to "/etc/hadoop/conf.pseudo" (for  the
       Hadoop Pseudo-Distributed configuration).

       To add a new custom configuration, run the following commands as root:

       # cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my

       This will create a new configuration directory,  "/etc/hadoop/conf.my",
       that  serves  as  a  starting  point for a new configuration.  Edit the
       configuration  files  in  "/etc/hadoop/conf.my"  until  you  have   the
       configuration you want.

       To activate your new configuration and see the new configuration list:

       # alternatives --install /etc/hadoop/conf hadoop /etc/hadoop/conf.my 90

       You can  verify  your  new  configuration  is  active  by  running  the
       following:

       # alternatives --display hadoop
       hadoop - status is auto.
        link currently points to /etc/hadoop/conf.my
       /etc/hadoop/conf.empty - priority 10
       /etc/hadoop/conf.pseudo - priority 30
       /etc/hadoop/conf.my - priority 90
       Current 'best' version is /etc/hadoop/conf.my.

       At this point, it might be a good idea to restart  your  services  with
       the new configuration, e.g.,

           # /etc/init.d/hadoop-namenode restart

       /etc/hadoop/conf/hadoop-site.xml

              This  is  the  path  to  the  currently  deployed  Hadoop   site
              configuration.  See "/etc/hadoop/conf" above.

       /usr/bin/hadoop-config.sh
              This  script  searches  for  a  usable  "JAVA_HOME" location if
              "JAVA_HOME" is not already set.  It  also  sets  up  environment
              variables   that   Hadoop   components   need  at  startup  (see
              "ENVIRONMENT" section).

       /etc/init.d/hadoop-namenode
              Service script for starting and stopping the Hadoop NameNode

       /etc/init.d/hadoop-datanode
              Service script for starting and stopping the Hadoop DataNode

       /etc/init.d/hadoop-secondarynamenode
              Service script for starting and stopping  the  Hadoop  Secondary
              NameNode

       /etc/init.d/hadoop-jobtracker
              Service script for starting and stopping the Hadoop JobTracker

       /etc/init.d/hadoop-tasktracker
              Service script for starting and stopping the Hadoop TaskTracker
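       On a pseudo-distributed node that runs all five daemons, the  service
       scripts above can be cycled with a short loop:

       ```shell
       # Restart every Hadoop daemon installed on this node.
       for svc in namenode datanode secondarynamenode jobtracker tasktracker; do
           /etc/init.d/hadoop-$svc restart
       done
       ```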

ENVIRONMENT

       HADOOP_CONF_DIR
              The  location  of  the  Hadoop configuration files.  Defaults to
              "/etc/hadoop/conf".  For more details, see the "FILES"  section.

       HADOOP_LOG_DIR
              All  Hadoop  services  log to "/var/log/hadoop" by default.  You
              can change the location with this environment variable.

       HADOOP_ROOT_LOGGER
              Setting for  log4j.  Defaults  to  ERROR,console.  You  can  try
              INFO,console for more verbose output.
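       For example, to get more verbose output from a single command without
       editing any configuration files, these variables can be overridden
       inline.  The log directory below is an arbitrary scratch location:

       ```shell
       # Run one fsck with verbose console logging, writing any log files to
       # a scratch directory instead of /var/log/hadoop.
       HADOOP_LOG_DIR=/tmp/hadoop-logs HADOOP_ROOT_LOGGER=INFO,console \
           hadoop fsck /
       ```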

EXAMPLES

       $ mkdir input
       $ cp <txt files> input
       $ hadoop jar /usr/lib/hadoop/*example*.jar grep input output 'string'
       $ cat output/*

BUGS

       The Debian package of Hadoop is still in beta state. Use it at your own
       risk!

SEE ALSO

       alternatives(8)

AUTHOR

       Cloudera, Thomas Koch <thomas.koch@ymc.ch>

COPYRIGHT

       2008 The Apache Software Foundation. All rights reserved.