Sunday, June 27, 2010

Hadoop I/O - compression matters!

Although I/O is largely encapsulated by the Hadoop MapReduce framework, its impact should not be underestimated, even for the simplest implementation of the MapReduce paradigm. I have come to notice (after many nights of investigation) that one should always use the compression options in Hadoop for better performance.

These lines are really critical in your driver code:



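  // imports needed at the top of the driver class
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;

  Configuration conf = new Configuration();
  // compress the intermediate map output (the data shuffled to the reducers)
  conf.setBoolean("mapred.compress.map.output", true);
  conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
  // compress the final job (reduce) output
  conf.setBoolean("mapred.output.compress", true);
  conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);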


The idea is simple: compress your outputs so that there is less I/O. This matters especially when we are dealing with large amounts of intermediate data.
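
To get a rough feel for how much smaller the output can get, here is a minimal sketch that uses only the plain JDK (no Hadoop involved). It gzips a block of made-up, repetitive key/value records in memory and prints the size before and after. The class name GzipSizeDemo, the helper methods and the record format are all assumptions for illustration. Real map output will compress differently depending on the data, so treat the numbers as indicative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipSizeDemo {
    // Build a hypothetical block of repetitive "key<TAB>1" records, loosely
    // resembling word-count style map output (the format is an assumption).
    static byte[] sampleRecords(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("word").append(i % 100).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Gzip-compress a byte array in memory and return the compressed bytes.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(100_000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

Fewer bytes means less to write to local disk and less to send over the network during the shuffle, at the cost of some CPU time spent compressing and decompressing.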

So, do remember these lines!
