Wednesday, November 18, 2009
It's time to do something ...
I thought I would have more time in Edinburgh to try out new stuff, but it turns out I am actually losing touch with IT. How am I supposed to find time in the midst of these endless assignments and homework?
Friday, August 21, 2009
About Hadoop Rack Awareness feature
It is stated in the Hadoop documentation that a Hadoop cluster can be made rack-aware. Basically it involves writing a small script that accepts DNS names (or IP addresses) and prints the desired rack ids to stdout. This script should be configured in hadoop-site.xml using the property 'topology.script.file.name'.
Like many open source projects, this feature is not well documented. The following is an example of such a script, written in Python by Vadim Zaliva in one of the mailing list discussions:
#!/usr/bin/env python
'''
This script is used by Hadoop to determine network/rack topology. It
should be specified in hadoop-site.xml via the topology.script.file.name
property:

  <property>
    <name>topology.script.file.name</name>
    <value>/home/hadoop/topology.py</value>
  </property>
'''

import sys
from string import join

DEFAULT_RACK = '/default/rack0'

RACK_MAP = { '10.72.10.1'    : '/datacenter0/rack0',
             '10.112.110.26' : '/datacenter1/rack0',
             '10.112.110.27' : '/datacenter1/rack0',
             '10.112.110.28' : '/datacenter1/rack0',
             '10.2.5.1'      : '/datacenter2/rack0',
             '10.2.10.1'     : '/datacenter2/rack1'
           }

if len(sys.argv) == 1:
    print DEFAULT_RACK
else:
    # Print one rack id per argument, falling back to the default rack
    print join([RACK_MAP.get(i, DEFAULT_RACK) for i in sys.argv[1:]], " ")
Remember to include the following in the hadoop-site.xml:
<property>
  <name>topology.script.file.name</name>
  <value>/home/hadoop/topology.py</value>
</property>
You can find more details here: How to kick-off Hadoop rack awareness
Wednesday, April 22, 2009
More on Hadoop Part 1
Finally getting into more details of Hadoop DFS ...
To check the health and block information of a file, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fsck <path>
To change the replication factor of a file, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -setrep [-R] <rep> <path>
The -R flag applies the change recursively to a directory.
To start a NameNode/JobTracker on a node, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop jobtracker
To start a DataNode/TaskTracker on a slave node, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop datanode
<$HADOOP_INSTALLATION_DIR>/bin/hadoop tasktracker
To rebalance the block replication in a cluster, do this
<$HADOOP_INSTALLATION_DIR>/bin/hadoop balancer
Hadoop works in a "rack"-aware context, i.e. it assumed that nodes are a subset of a rack and a deployment will have multiple racks. This explained the policy of dfs.replication = 3 stating 'one replica on a node in the rack, another replica on a different node in the same rack, and the third on a different node in a different rack'. If not specify, the rackid is 'defaultrack'. Hadoop lets the cluster administrators decide which rack a node belongs to through configuration variable dfs.network.script. When this script is configured, each node runs the script to determine its rackid. See Hadoop JIRA HADOOP-692. Some reference material here: Rack_aware_HDFS_proposal.pdf
That's all for now. Continue to Hadoop ...
More on HBase Part 1
To view the metadata of the 'tables' stored in HBase, use this command in the <$HBASE_INSTALLATION_DIR>/bin/hbase shell:
# hbase(main): xx>scan '.META.'
Look at how the tables are distributed over the region servers from the column=info:server entries.
To put a value into a column family without a label, do this:
# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:', 'value'
To put a value into a column family with a label, do this:
# hbase(main): xx>put 'table_name', 'row_key', 'column_family_name:label_name', 'value'
To work with HBase, you have to throw away all SQL concepts. It is just not relational; it is distributed and scalable. One can dynamically add labels to a column family as and when required. That means the rows in a 'table' are not of equal length.
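The same kind of put can be issued from Java as well. Here is a minimal sketch assuming the HBase 0.19-era client API (HTable plus BatchUpdate); the table, row and column names are the same placeholders used in the shell examples above:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import java.io.IOException;

public class HBasePutExample {
    public static void main(String[] args) throws IOException {
        // 'table_name' and 'row_key' are placeholders, as in the shell examples
        HTable table = new HTable(new HBaseConfiguration(), "table_name");

        // A BatchUpdate collects all the column writes for a single row
        BatchUpdate update = new BatchUpdate("row_key");
        update.put("column_family_name:label_name", "value".getBytes());

        // commit() sends the row to the region server that holds it
        table.commit(update);
    }
}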
Tuesday, April 21, 2009
Moving on to HBase
Tried HBase (a Google Bigtable-like distributed database; is it a database ...?) on my 3-node Hadoop cluster.
Download the HBase package (I used version 0.19 in this experiment). I think it only works with Hadoop 0.19 and above.
It is as easy as the Hadoop setup. Only a few configuration files to play with:
HBase Configuration Files
<$HBASE_INSTALLATION_DIR>/conf/hbase-env.sh (add the JAVA_HOME and HBASE_HOME environment variables)
<$HBASE_INSTALLATION_DIR>/conf/hbase-site.xml
<$HBASE_INSTALLATION_DIR>/conf/regionservers
This is my hbase-site.xml. The hbase.rootdir is where HBase stores the data, in this case in my 3-node HDFS. The hbase.master specifies where the HBase master server runs and the port it uses. Simple!
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.testnet.net:54310/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>master.testnet.net:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
</configuration>
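A quick way to check that HBase is picking up this file is to read the two properties back from Java. A minimal sketch, assuming the HBase conf directory is on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowHBaseSettings {
    public static void main(String[] args) {
        // HBaseConfiguration loads hbase-default.xml and hbase-site.xml from the classpath
        HBaseConfiguration conf = new HBaseConfiguration();
        System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
        System.out.println("hbase.master  = " + conf.get("hbase.master"));
    }
}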
This is my regionservers file:
master.testnet.net
slave1.testnet.net
slave2.testnet.net
HBase requires more file handles, so the default limit of 1024 is not enough. To allow for more file handles, edit /etc/security/limits.conf on all nodes and restart your cluster. Please see below:
# Each line describes a limit for a user in the form:
#
# domain type item value
#
hbase - nofile 32768
To start/stop HBase, use the following commands:
<$HBASE_INSTALLATION_DIR>/bin/start-hbase.sh
<$HBASE_INSTALLATION_DIR>/bin/stop-hbase.sh
After HBase is started, you can interact with it using the HBase shell. Invoke the shell using this command:
<$HBASE_INSTALLATION_DIR>/bin/hbase shell
Some useful HBase shell commands:
create 'blogposts', 'post', 'image'
put 'blogposts', 'post1', 'post:title', 'Hello World'
put 'blogposts', 'post1', 'post:author', 'The Author'
put 'blogposts', 'post1', 'post:body', 'This is a blog post'
put 'blogposts', 'post1', 'image:header', 'image1.jpg'
put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'
get 'blogposts', 'post1'
How do we retrieve the data using Java? See example below.
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.RowResult;
import java.util.HashMap;
import java.util.Map;
import java.io.IOException;

public class HBaseConnector {

    public static Map retrievePost(String postId) throws IOException {
        // Connect to the 'blogposts' table using the configuration on the classpath
        HTable table = new HTable(new HBaseConfiguration(), "blogposts");
        Map post = new HashMap();

        // Fetch the whole row and copy each column/value pair into a plain Map
        RowResult result = table.getRow(postId);
        for (byte[] column : result.keySet()) {
            post.put(new String(column), new String(result.get(column).getValue()));
        }
        return post;
    }

    public static void main(String[] args) throws IOException {
        Map blogpost = HBaseConnector.retrievePost("post1");
        System.out.println(blogpost.get("post:title"));
        System.out.println(blogpost.get("post:author"));
    }
}
To this point, we are HBased! Amazed, aren't you?
Monday, April 20, 2009
Hadoop Basic Test Result
Tried testing my 3-node Fedora 9 VM Hadoop cluster with the wordcount example.
All done with default settings.
Results:
7 files (3,078,793 bytes in HDFS) totaling 56,361 lines took 1 min 23 sec (83 sec).
37 files (31,535,813 bytes in HDFS) totaling 582,146 lines took 5 min 52 sec (352 sec).
Sunday, April 19, 2009
Getting Hadoop'ed
This is the official site of Apache Hadoop!
I started off with the latest available Hadoop Core beta, version 0.19.1. The latest stable version was 0.18.3 at the time of this experiment.
I tried setting it up on a Windows Vista box at home, thinking that it should be easier, but it is not. For Hadoop to run on Windows you need Cygwin, and I ran into tons of access control issues on Vista. Have not gotten it to run YET!
Concurrently tried it on Fedora 9 on a VM, and it works like a breeze (after a while).
A few configuration files to play with:
/conf/hadoop-env.sh (set the JAVA_HOME environment variable)
/conf/hadoop-default.xml
/conf/hadoop-site.xml
/conf/master
/conf/slaves
The following commands will come in handy.
SSH:
To check that sshd (the SSH daemon) is running:
$ chkconfig --list sshd
To set sshd to run at runlevels 2345:
$ chkconfig --level 2345 sshd on
To set up passwordless SSH login:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
On Fedora 11, you will need to allow sshd to use host key authentication in SELinux Management. Otherwise ssh will still prompt for a password even with the above keygen configuration.
Check that the authorized_keys permission is as follows:
$ chmod 644 ~/.ssh/authorized_keys
Firewall (iptables):
$ /etc/init.d/iptables save
$ /etc/init.d/iptables stop
$ /etc/init.d/iptables start
Java:
$ /usr/sbin/alternatives --install /usr/bin/java java /usr/java/jdk1.6.014/bin/java 2
$ /usr/sbin/alternatives --config java
Hadoop:
<$HADOOP_INSTALLATION_DIR>/bin/start-dfs.sh
<$HADOOP_INSTALLATION_DIR>/bin/start-mapred.sh
<$HADOOP_INSTALLATION_DIR>/bin/start-all.sh
<$HADOOP_INSTALLATION_DIR>/bin/hadoop namenode -format
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -put <localsrc> <dst>
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -get <src> <localdst>
<$HADOOP_INSTALLATION_DIR>/bin/hadoop fs -cat <src>
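The fs -put/-get/-cat operations can also be driven from Java through the FileSystem API. A minimal sketch (the file names are made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.IOException;

public class HdfsCopyExample {
    public static void main(String[] args) throws IOException {
        // Reads hadoop-site.xml from the conf directory on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop fs -put input.txt /user/hadoop/input.txt
        fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/hadoop/input.txt"));

        // Equivalent of: bin/hadoop fs -cat /user/hadoop/input.txt
        FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"));
        IOUtils.copyBytes(in, System.out, conf, true); // 'true' closes the stream afterwards
    }
}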
Hadoop Web Interface
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/