Wednesday, November 18, 2009
It's time to do something ...
I thought I would have more time in Edinburgh to try out new stuff, but it turns out I am actually losing touch with IT. How am I supposed to find time in the midst of these endless assignments and homework?
Friday, August 21, 2009
About Hadoop Rack Awareness feature
It is stated in the Hadoop documentation that a Hadoop cluster can be made rack-aware. Basically it involves writing a small script that accepts DNS names (or IP addresses) and prints the desired rack ids to stdout. This script should be configured in hadoop-site.xml using the property 'topology.script.file.name'.
Like many open source projects, this feature is not well documented. The following is an example of such a script, written in Python by Vadim Zaliva in one of the mailing list discussions:
#!/usr/bin/env python
'''
This script is used by Hadoop to determine network/rack topology. It
should be specified in hadoop-site.xml via the topology.script.file.name
property:

  <property>
    <name>topology.script.file.name</name>
    <value>/home/hadoop/topology.py</value>
  </property>
'''

import sys
from string import join

DEFAULT_RACK = '/default/rack0'

RACK_MAP = { '10.72.10.1'    : '/datacenter0/rack0',
             '10.112.110.26' : '/datacenter1/rack0',
             '10.112.110.27' : '/datacenter1/rack0',
             '10.112.110.28' : '/datacenter1/rack0',
             '10.2.5.1'      : '/datacenter2/rack0',
             '10.2.10.1'     : '/datacenter2/rack1'
           }

if len(sys.argv) == 1:
    print DEFAULT_RACK
else:
    # Print one rack id per argument, falling back to the default rack
    print join([RACK_MAP.get(i, DEFAULT_RACK) for i in sys.argv[1:]], " ")
Remember to include the following in the hadoop-site.xml:
<property>
  <name>topology.script.file.name</name>
  <value>/home/hadoop/topology.py</value>
</property>
You can find more details here: How to kick-off Hadoop rack awareness
Wednesday, April 22, 2009
More on Hadoop Part 1
Finally getting into more details of Hadoop DFS ...
To check the health and block information of a file, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fsck <path>
To change the replication factor of a file, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -setrep [-R] <rep> <path>
The -R flag applies the change recursively to a directory.
To start a NameNode/JobTracker on a node, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop jobtracker
To start a DataNode/TaskTracker on a slave node, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop datanode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop tasktracker
To rebalance the block replication in a cluster, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop balancer
Hadoop works in a "rack"-aware context, i.e. it assumed that nodes are a subset of a rack and a deployment will have multiple racks. This explained the policy of dfs.replication = 3 stating 'one replica on a node in the rack, another replica on a different node in the same rack, and the third on a different node in a different rack'. If not specify, the rackid is 'defaultrack'. Hadoop lets the cluster administrators decide which rack a node belongs to through configuration variable dfs.network.script. When this script is configured, each node runs the script to determine its rackid. See Hadoop JIRA HADOOP-692. Some reference material here: Rack_aware_HDFS_proposal.pdf
That's all for now. Continue to Hadoop ...
More on HBase Part 1
To view the metadata of the 'tables' stored in HBase, use this command in the <$HBASE_INSTALLATION_DIR>/bin/hbase shell:
# hbase(main): xx>scan '.META.'
Look at how the tables are distributed over the region servers from the column=info:server entries.
To put a value into a column family without a label, do this:
# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:', 'value'
To put a value into a column family with a label, do this:
# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:label_name', 'value'
To work with HBase, you have to throw away all SQL concepts. It is just not relational; it is distributed and scalable. One can dynamically add labels to a column family as and when required. That means the rows in a 'table' are not of equal length.
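The same kind of put can be issued from Java as well. Here is a minimal sketch assuming the HBase 0.19-era client API (HTable plus BatchUpdate); the table, row and column names are the same placeholders used in the shell examples above:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import java.io.IOException;

public class HBasePutExample {
    public static void main(String[] args) throws IOException {
        // 'table_name' and 'row_key' are placeholders, as in the shell examples
        HTable table = new HTable(new HBaseConfiguration(), "table_name");

        // A BatchUpdate collects all the column writes for a single row
        BatchUpdate update = new BatchUpdate("row_key");
        update.put("column_family_name:label_name", "value".getBytes());

        // commit() sends the row to the region server that holds it
        table.commit(update);
    }
}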
Tuesday, April 21, 2009
Moving on to HBase
Tried HBase (a Google Bigtable-like distributed database; is it a database ...?) on my 3-node Hadoop cluster.
Download the HBase package (I used version 0.19 in this experiment). I think it only works with Hadoop 0.19 and above.
It is as easy as the Hadoop setup. Only a few configuration files to play with:
HBase Configuration Files
<$HBASE_INSTALLATION_DIR>/conf/hbase-env.sh (add the JAVA_HOME and HBASE_HOME environment variables)
<$HBASE_INSTALLATION_DIR>/conf/hbase-site.xml
<$HBASE_INSTALLATION_DIR>/conf/regionservers
This is my hbase-site.xml. The hbase.rootdir is where HBase stores the data, in this case in my 3-node HDFS. The hbase.master specifies where the HBase master server runs and the port it uses. Simple!
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.testnet.net:54310/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>master.testnet.net:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
</configuration>
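A quick way to check that HBase is picking up this file is to read the two properties back from Java. A minimal sketch, assuming the HBase conf directory is on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowHBaseSettings {
    public static void main(String[] args) {
        // HBaseConfiguration loads hbase-default.xml and hbase-site.xml from the classpath
        HBaseConfiguration conf = new HBaseConfiguration();
        System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
        System.out.println("hbase.master  = " + conf.get("hbase.master"));
    }
}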
This is my regionservers file:
master.testnet.net
slave1.testnet.net
slave2.testnet.net
HBase requires more file handles, so the default limit of 1024 is not enough. To allow for more file handles, edit /etc/security/limits.conf on all nodes and restart your cluster. Please see below:
# Each line describes a limit for a user in the form:
#
# domain type item value
#
hbase - nofile 32768
To start/stop HBase, use the following commands:
<$HBASE_INSTALLATION_DIR>/bin/start-hbase.sh
<$HBASE_INSTALLATION_DIR>/bin/stop-hbase.sh
After HBase is started, you can interact with it using the HBase shell. Invoke the shell using this command:
<$HBASE_INSTALLATION_DIR>/bin/hbase shell
Some useful HBase shell commands:
create 'blogposts', 'post', 'image'
put 'blogposts', 'post1', 'post:title', 'Hello World'
put 'blogposts', 'post1', 'post:author', 'The Author'
put 'blogposts', 'post1', 'post:body', 'This is a blog post'
put 'blogposts', 'post1', 'image:header', 'image1.jpg'
put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'
get 'blogposts', 'post1'
How do we retrieve the data using Java? See example below.
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.RowResult;
import java.util.HashMap;
import java.util.Map;
import java.io.IOException;

public class HBaseConnector {

    public static Map retrievePost(String postId) throws IOException {
        // Connect to the 'blogposts' table using the configuration on the classpath
        HTable table = new HTable(new HBaseConfiguration(), "blogposts");
        Map post = new HashMap();

        // Fetch the whole row and copy each column/value pair into a plain Map
        RowResult result = table.getRow(postId);
        for (byte[] column : result.keySet()) {
            post.put(new String(column), new String(result.get(column).getValue()));
        }
        return post;
    }

    public static void main(String[] args) throws IOException {
        Map blogpost = HBaseConnector.retrievePost("post1");
        System.out.println(blogpost.get("post:title"));
        System.out.println(blogpost.get("post:author"));
    }
}
To this point, we are HBased! Amazed, aren't you?
Monday, April 20, 2009
Hadoop Basic Test Result
Tried testing my 3-node Fedora 9 VM Hadoop cluster with the wordcount example.
All done with default settings.
Results:
7 files (3,078,793 bytes in HDFS) totaling 56,361 lines took 1 min 23 sec (83 sec).
37 files (31,535,813 bytes in HDFS) totaling 582,146 lines took 5 min 52 sec (352 sec).
Sunday, April 19, 2009
Getting Hadoop'ed
This is the official site of Apache Hadoop!
I started off with the latest available Hadoop Core beta, version 0.19.1. The latest stable version was 0.18.3 at the time of this experiment.
I tried setting it up on a Windows Vista box at home, thinking that it should be easier, but it is not. For Hadoop to run on Windows you need Cygwin, and I ran into tons of access control issues on Vista. Have not gotten it to run YET!
Concurrently tried it on Fedora 9 on a VM, and it works like a breeze (after a while).
A few configuration files to play with:
/conf/hadoop-env.sh (set the JAVA_HOME environment variable)
/conf/hadoop-default.xml
/conf/hadoop-site.xml
/conf/master
/conf/slaves
The following commands will come in handy.
SSH:
To check that sshd (the SSH daemon) is running:
$ chkconfig --list sshd
To set sshd to run at runlevels 2345:
$ chkconfig --level 2345 sshd on
To set up passwordless SSH login:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
On Fedora 11, you will need to allow sshd to use host key authentication in SELinux Management. Otherwise ssh will still prompt for a password even with the above keygen configuration.
Check that the authorized_keys permission is as follows:
$ chmod 644 ~/.ssh/authorized_keys
Firewall (iptables):
$ /etc/init.d/iptables save
$ /etc/init.d/iptables stop
$ /etc/init.d/iptables start
Java:
$ /usr/sbin/alternatives --install /usr/bin/java java /usr/java/jdk1.6.014/bin/java 2
$ /usr/sbin/alternatives --config java
Hadoop:
<$HADOOP_INSTALLATION_DIR>/bin/start-dfs.sh
<$HADOOP_INSTALLATION_DIR>/bin/start-mapred.sh
<$HADOOP_INSTALLATION_DIR>/bin/start-all.sh
<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode -format
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -put <localsrc> <dst>
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -get <src> <localdst>
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -cat <src>
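The fs -put/-get/-cat operations can also be driven from Java through the FileSystem API. A minimal sketch (the file names are made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.IOException;

public class HdfsCopyExample {
    public static void main(String[] args) throws IOException {
        // Reads hadoop-site.xml from the conf directory on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop fs -put input.txt /user/hadoop/input.txt
        fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/hadoop/input.txt"));

        // Equivalent of: bin/hadoop fs -cat /user/hadoop/input.txt
        FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"));
        IOUtils.copyBytes(in, System.out, conf, true); // 'true' closes the stream afterwards
    }
}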
Hadoop Web Interface
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/