Wednesday, April 22, 2009

More on Hadoop Part 1

Finally, a closer look at the Hadoop DFS ...

To check the file system information for a file (or directory), do this


<$HADOOP_INSTALLATION_DIR>/bin/hadoop fsck <path>



To change the replication factor of a file, do this


<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -setrep [-R] <rep> <path>

The -R flag applies the change recursively to everything under a directory.
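
As a rough sketch, the fsck and setrep commands above could be combined like this. The default install location (/opt/hadoop), the target path (/user/demo/logs) and the replication factor of 2 are made-up placeholders, and the DRY_RUN switch and run helper are my own additions so the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Sketch only: the install location, target path, and replication factor
# below are hypothetical placeholders, not values from this post.
HADOOP="${HADOOP_INSTALLATION_DIR:-/opt/hadoop}/bin/hadoop"
TARGET="${TARGET:-/user/demo/logs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

# Helper (not part of Hadoop): print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Report the files, blocks, and block locations under the target path.
run "$HADOOP" fsck "$TARGET" -files -blocks -locations

# Set the replication factor to 2 for everything under the target path.
run "$HADOOP" fs -setrep -R 2 "$TARGET"
```

Run it with DRY_RUN=0 on a machine that can reach the cluster to execute the commands for real.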



To start a NameNode/JobTracker on a node, do this

<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop jobtracker



To start a DataNode/TaskTracker on a slave node, do this


<$HADOOP_INSTALLATION_DIR>/bin/hadoop datanode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop tasktracker


To rebalance how blocks are spread across the DataNodes in a cluster, do this

<$HADOOP_INSTALLATION_DIR>/bin/hadoop balancer



Hadoop is "rack"-aware, i.e. it assumes that every node belongs to a rack and that a deployment spans multiple racks. This explains the placement policy for dfs.replication = 3: 'one replica on a node in the rack, another replica on a different node in the same rack, and the third on a different node in a different rack'. If not specified, the rackid is 'defaultrack'. Hadoop lets cluster administrators decide which rack a node belongs to through the configuration variable dfs.network.script. When this script is configured, each node runs it to determine its rackid. See Hadoop JIRA HADOOP-692. Some reference material here: Rack_aware_HDFS_proposal.pdf
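
To make the rack script idea concrete, here is a minimal sketch of what such a script might look like. Everything in it is an assumption for illustration: the host-name patterns, the rack names rack1 and rack2, and the convention of printing the rackid on standard output (with an optional host name argument) are mine, so check HADOOP-692 for what Hadoop actually expects. Only the 'defaultrack' fallback comes from the text above.

```shell
#!/bin/sh
# Hypothetical rack-mapping script. The host-name patterns and the rack
# names rack1/rack2 are invented for illustration; only 'defaultrack'
# is taken from the post.
rack_for_host() {
    case "$1" in
        node0[1-4]*) echo "rack1" ;;        # node01..node04
        node0[5-8]*) echo "rack2" ;;        # node05..node08
        *)           echo "defaultrack" ;;  # anything unrecognised
    esac
}

# Print the rackid for the host given as the first argument,
# or for the local host name if no argument is given.
rack_for_host "${1:-$(hostname)}"
```

A lookup table kept in a file would scale better than hard-coded patterns, but a case statement keeps the sketch short.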

That's all for now. Continue to Hadoop ...
