Wednesday, April 22, 2009

More on Hadoop Part 1

Finally getting into more details of Hadoop DFS ...

To check the file system information for a file, do this:


<$HADOOP_INSTALLATION_DIR>/bin/hadoop fsck <path>
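
For example, to check one file in detail (the path is just an illustration):

<$HADOOP_INSTALLATION_DIR>/bin/hadoop fsck /user/hadoop/books/book1.txt -files -blocks -locations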



To change the replication factor of a file, do this:


<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -setrep [-R] <replication> <path>

The -R flag applies the change recursively to everything under a directory.
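
For example, to drop the replication factor to 2 for everything under a directory (the path is just an illustration):

<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -setrep -R 2 /user/hadoop/books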



To start a NameNode/JobTracker on a node, do this:

<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop jobtracker



To start a DataNode/TaskTracker on a slave node, do this:


<$HADOOP_INSTALLATION_DIR>/bin/hadoop datanode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop tasktracker
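
These commands run the daemons in the foreground. The release also ships a hadoop-daemon.sh wrapper that can start them in the background instead; I did not use it in this setup, but it looks something like this:

<$HADOOP_INSTALLATION_DIR>/bin/hadoop-daemon.sh start namenode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop-daemon.sh start datanode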


To rebalance the block distribution across a cluster, do this:

<$HADOOP_INSTALLATION_DIR>/bin/hadoop balancer
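
The balancer also accepts an optional threshold, the percentage by which a node's disk usage may deviate from the cluster average, for example:

<$HADOOP_INSTALLATION_DIR>/bin/hadoop balancer -threshold 10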



Hadoop works in a "rack"-aware context, i.e. it assumes that nodes are grouped into racks and that a deployment will have multiple racks. This explains the policy for dfs.replication = 3: 'one replica on a node in the rack, another replica on a different node in the same rack, and the third on a different node in a different rack'. If not specified, the rack id is 'defaultrack'. Hadoop lets cluster administrators decide which rack a node belongs to through the configuration variable dfs.network.script. When this script is configured, each node runs the script to determine its rack id. See Hadoop JIRA HADOOP-692. Some reference material here: Rack_aware_HDFS_proposal.pdf
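
I have not written one of these scripts myself yet, but from the description above, dfs.network.script points at an executable that prints the node's rack id to stdout. A minimal sketch (the rack names are made up; the hostnames are from my cluster):

#!/bin/sh
# Rack-id script sketch (untested): print this node's rack based on its hostname.
case `hostname` in
  master.testnet.net) echo /rack1 ;;
  slave1.testnet.net) echo /rack1 ;;
  slave2.testnet.net) echo /rack2 ;;
  *) echo /defaultrack ;;
esac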

That's all for now. More Hadoop to come ...

More on HBase Part 1

To view the metadata of the 'tables' stored in HBase, use this command in the <$HBASE_INSTALLATION_DIR>/bin/hbase shell:


# hbase(main): xx>scan '.META.'



Look at how the tables are distributed over the region servers by checking the column=info:server values.

To put a value into a column family without a label, do this:

# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:', 'value'

To put a value into a column family with a label, do this:

# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:label_name', 'value'

To work with HBase, you have to throw away all SQL concepts. It is just not relational; it is distributed and scalable. One can dynamically add labels to a column family as and when required, which means the rows in a 'table' are not all of equal length.
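
For example, with a hypothetical 'users' table that has a single 'profile' column family, two rows can carry completely different labels:

# hbase(main): xx>create 'users', 'profile'
# hbase(main): xx>put 'users', 'row1', 'profile:name', 'Alice'
# hbase(main): xx>put 'users', 'row2', 'profile:name', 'Bob'
# hbase(main): xx>put 'users', 'row2', 'profile:nickname', 'Bobby'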

Tuesday, April 21, 2009

Moving on to HBase

Tried HBase (a Google Bigtable-like distributed database; is it a database ...?) on my 3-node Hadoop cluster.

Downloaded the HBase package (I used version 0.19) for this experiment. I think it only works with Hadoop 0.19 and above.

It is as easy as the Hadoop setup. Only a few configuration files to play with:

HBase Configuration Files


<$HBASE_INSTALLATION_DIR>/conf/hbase-env.sh (add JAVA_HOME and HBASE_HOME environment variables)
<$HBASE_INSTALLATION_DIR>/conf/hbase-site.xml
<$HBASE_INSTALLATION_DIR>/conf/regionservers


This is my hbase-site.xml. The hbase.rootdir property is where HBase stores its data, in this case in my 3-node HDFS. The hbase.master property specifies the host and port where the HBase master runs. Simple!

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.testnet.net:54310/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>master.testnet.net:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
</configuration>

This is my regionservers file:


master.testnet.net
slave1.testnet.net
slave2.testnet.net



HBase requires more file handles, so the default limit of 1024 is not enough. To allow more file handles, edit /etc/security/limits.conf on all nodes and restart your cluster. Please see below:



# Each line describes a limit for a user in the form:
#
# domain type item value
#
hbase - nofile 32768
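
After logging back in as the user that runs HBase (the hbase user above), the new limit can be verified with:

$ ulimit -n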


To start/stop HBase, use the following commands:


<$HBASE_INSTALLATION_DIR>/bin/start-hbase.sh
<$HBASE_INSTALLATION_DIR>/bin/stop-hbase.sh


After HBase is started, you can interact with it using the HBase Shell. Invoke the shell with this command:


<$HBASE_INSTALLATION_DIR>/bin/hbase shell


Some useful HBase shell commands:


create 'blogposts', 'post', 'image'

put 'blogposts', 'post1', 'post:title', 'Hello World'
put 'blogposts', 'post1', 'post:author', 'The Author'
put 'blogposts', 'post1', 'post:body', 'This is a blog post'
put 'blogposts', 'post1', 'image:header', 'image1.jpg'
put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'

get 'blogposts', 'post1'


How do we retrieve the data using Java? See example below.

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.RowResult;

import java.util.HashMap;
import java.util.Map;
import java.io.IOException;

public class HBaseConnector {

    // Fetches one row from the 'blogposts' table and returns its cells as column -> value strings.
    public static Map<String, String> retrievePost(String postId) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "blogposts");
        Map<String, String> post = new HashMap<String, String>();

        // getRow returns every cell of the row, keyed by column name (e.g. 'post:title').
        RowResult result = table.getRow(postId);

        for (byte[] column : result.keySet()) {
            post.put(new String(column), new String(result.get(column).getValue()));
        }

        return post;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> blogpost = HBaseConnector.retrievePost("post1");
        System.out.println(blogpost.get("post:title"));
        System.out.println(blogpost.get("post:author"));
    }
}
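
Writing a row programmatically goes through the BatchUpdate class in this HBase version, as far as I can tell. A minimal sketch (the row key 'post2' and its values are made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

import java.io.IOException;

public class HBaseWriter {

    // Writes one blog post row; the column names mirror the shell example above.
    public static void storePost(String postId, String title, String author, String body)
            throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "blogposts");

        // All edits for a single row are grouped into a BatchUpdate and committed together.
        BatchUpdate update = new BatchUpdate(postId);
        update.put("post:title", title.getBytes());
        update.put("post:author", author.getBytes());
        update.put("post:body", body.getBytes());

        table.commit(update);
    }

    public static void main(String[] args) throws IOException {
        storePost("post2", "Hello Again", "The Author", "Another blog post");
    }
}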

At this point, we are HBased! Amazed, aren't you?

Monday, April 20, 2009

Hadoop Basic Test Result

Tried testing my 3-node (Fedora 9 VMs) Hadoop cluster with the wordcount example.

All done with default settings.
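
The job itself is the wordcount program from the examples jar that ships with the release, launched roughly like this (the jar name depends on the exact release; the input and output paths are just placeholders):

<$HADOOP_INSTALLATION_DIR>/bin/hadoop jar hadoop-0.19.1-examples.jar wordcount /user/hadoop/input /user/hadoop/output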

Results:


7 files (3,078,793 bytes in HDFS) totaling 56,361 lines took 1 min 23 sec (83 sec).

37 files (31,535,813 bytes in HDFS) totaling 582,146 lines took 5 min 52 sec (352 sec).

Sunday, April 19, 2009

Getting Hadoop'ed

This is the official site of Apache Hadoop!

I started off with Hadoop Core's latest available beta version, 0.19.1. The latest stable version was 0.18.3 at the time of this experiment.

I tried setting it up on a Windows Vista box at home first, thinking that it would be easier, but it is not. For Hadoop to run on Windows you need Cygwin, and I got tons of access control issues on Vista. Have not gotten it to run YET!

Concurrently tried it on Fedora 9 in a VM, and it works like a breeze (after a while).

A few configuration files to play with:


/conf/hadoop-env.sh (set JAVA_HOME environment variable)
/conf/hadoop-default.xml
/conf/hadoop-site.xml
/conf/masters
/conf/slaves


The following commands will come in handy.

SSH:

To check whether sshd (the SSH daemon) is running:
$ chkconfig --list sshd

To set sshd to start at run levels 2, 3, 4 and 5:
$ chkconfig --level 2345 sshd on

To set up passwordless SSH, generate a key pair and append the public key to authorized_keys:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

On Fedora 11, you will need to enable host key authentication for sshd in the SELinux Management tool. Otherwise ssh will still prompt for a password even with the above keygen configuration.

Make sure the permissions on authorized_keys are set as follows:
$ chmod 644 ~/.ssh/authorized_keys
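
Passwordless login can then be confirmed by ssh'ing from the master to one of the slaves (using my node names):

$ ssh slave1.testnet.net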




Firewall (iptables):


$ /etc/init.d/iptables save

$ /etc/init.d/iptables stop

$ /etc/init.d/iptables start


Java:


$ /usr/sbin/alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_14/bin/java 2

$ /usr/sbin/alternatives --config java


Hadoop:

$ /bin/start-dfs.sh

$ /bin/start-mapred.sh

$ /bin/start-all.sh

$ /bin/hadoop namenode -format

$ /bin/hadoop fs -put <localsrc> <dst>

$ /bin/hadoop fs -get <src> <localdst>

$ /bin/hadoop fs -cat <src>
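
For example, to copy a local file into HDFS and read it back (the paths are made up):

$ /bin/hadoop fs -put /tmp/book1.txt /user/hadoop/books/book1.txt

$ /bin/hadoop fs -cat /user/hadoop/books/book1.txt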


Hadoop Web Interface


NameNode - http://localhost:50070/

JobTracker - http://localhost:50030/