Friday, August 21, 2009

About Hadoop Rack Awareness feature

The Hadoop documentation states that a Hadoop cluster can be made rack-aware. Basically, this involves writing a small script that accepts DNS names (and IPs) as arguments and prints the desired rack_id for each one to stdout. This script should be configured in hadoop-site.xml using the property 'topology.script.file.name'.

Like many features in open source projects, this one is not well documented. The following is an example of such a script, written in Python by Vadim Zaliva in one of the discussions:



#!/usr/bin/env python

'''
This script is used by Hadoop to determine the network/rack topology.
It should be specified in hadoop-site.xml via the
topology.script.file.name property:

<property>
  <name>topology.script.file.name</name>
  <value>/home/hadoop/topology.py</value>
</property>
'''

import sys

# Rack reported for any host that is not listed in RACK_MAP.
DEFAULT_RACK = '/default/rack0'

# Maps each node's IP address to its rack path.
RACK_MAP = {
    '10.72.10.1': '/datacenter0/rack0',

    '10.112.110.26': '/datacenter1/rack0',
    '10.112.110.27': '/datacenter1/rack0',
    '10.112.110.28': '/datacenter1/rack0',

    '10.2.5.1': '/datacenter2/rack0',
    '10.2.10.1': '/datacenter2/rack1',
}

if len(sys.argv) == 1:
    # No hosts were passed in: report the default rack.
    print(DEFAULT_RACK)
else:
    # Print one rack per argument, space-separated, in argument order.
    print(" ".join([RACK_MAP.get(i, DEFAULT_RACK) for i in sys.argv[1:]]))
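
To make the script's contract concrete: Hadoop may pass several addresses in one invocation, and the script answers with one rack per address, in the same order. The sketch below restates the lookup as a function so it can be tried without a cluster. The name `resolve_racks` and the trimmed `RACK_MAP` are assumptions made for illustration, not part of Hadoop.

```python
# Illustrative only: resolve_racks is a hypothetical helper, not a Hadoop API.
DEFAULT_RACK = '/default/rack0'

# Trimmed copy of the mapping, just enough for the example.
RACK_MAP = {
    '10.72.10.1': '/datacenter0/rack0',
    '10.2.10.1': '/datacenter2/rack1',
}


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    if not hosts:
        # Mirrors the script's behaviour when called with no arguments.
        return [DEFAULT_RACK]
    return [RACK_MAP.get(host, DEFAULT_RACK) for host in hosts]


# Two known nodes and one unknown node, as a single invocation would pass them.
print(" ".join(resolve_racks(['10.72.10.1', '10.9.9.9', '10.2.10.1'])))
# prints: /datacenter0/rack0 /default/rack0 /datacenter2/rack1
```

The unknown address falls back to the default rack instead of raising an error, so a node missing from the map still gets an answer.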




Remember to include the following in the hadoop-site.xml:


<property>
 <name>topology.script.file.name</name>
 <value>/home/hadoop/topology.py</value>
</property>
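
A quick way to confirm the property is really set is to parse the file and look it up. This is only a minimal sketch, assuming the usual layout of a `<configuration>` root element wrapping the `<property>` entries. `get_property` is a hypothetical helper, and the inline XML stands in for the real hadoop-site.xml.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of hadoop-site.xml (assumed layout).
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>topology.script.file.name</name>
    <value>/home/hadoop/topology.py</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter('property'):
        if prop.findtext('name', default='').strip() == name:
            value = prop.findtext('value')
            return value.strip() if value is not None else None
    return None


print(get_property(SAMPLE_SITE_XML, 'topology.script.file.name'))
# prints: /home/hadoop/topology.py
```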



You can find more details here: How to kick-off Hadoop rack awareness