Getting started with Hadoop development

Last modified: 20 November, 2017

In this article I would like to describe the environment I use for developing Hadoop. The target audience is people who are interested in contributing to the Hadoop project but have not yet done so.

I’m sure there are several things that could be improved; please leave a comment with any suggestions you may have.

I am going to describe the steps I had to follow on a Fedora virtual machine. The steps on a Mac or Windows should be similar, though some of the commands will differ.

Set environment variables

Please set the following variables in your .bashrc:

# This environment variable is NOT used by Hadoop. It has been created solely to help set up the development environment.
export HADOOP_SRC_PATH=~/Code/hadoop

# Set up alias commands for building Hadoop in one of several ways
# This is the alias I use most. The '-Pdist' maven profile assembles all projects into a distribution.
# '-Pnative' builds native code which speeds up compression and checksum etc.
alias mvnp='mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests install'

# Same as mvnp, but with a clean first (so all previously built artifacts are blown away).
alias mvnc='mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests clean install'

# This environment variable is used by all Hadoop scripts to locate the binaries that ought to be loaded and run
export HADOOP_HOME=$HADOOP_SRC_PATH/trunk/hadoop-dist/target/hadoop-3.1.0-SNAPSHOT

# Since we built the native libraries, we can add them to LD_LIBRARY_PATH to be loaded
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/

# This is the directory to which logs for Hadoop daemons will be written
export HADOOP_LOG_DIR=$HADOOP_SRC_PATH/logs
# YARN could complain about this not being set in some versions of Hadoop.
export YARN_LOG_DIR=$HADOOP_LOG_DIR
# The directory which should contain core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, log4j.properties etc.
export HADOOP_CONF_DIR=$HADOOP_SRC_PATH/config

# These append JVM options to the various Hadoop daemons so that I can attach a debugger to them if needed.
export HADOOP_NAMENODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1049'
export HADOOP_DATANODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1050'
export HADOOP_SECONDARYNAMENODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1051'
export YARN_RESOURCEMANAGER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044'
export YARN_NODEMANAGER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1045'
export HADOOP_JOB_HISTORYSERVER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1046'

# Sometimes I uncomment this if I am debugging the client.
#export HADOOP_CLIENT_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=1047'

# Enable debugging output on hdfs commands
#export HADOOP_ROOT_LOGGER=DEBUG,console

# Point the JVM at the native libraries we built
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_HOME/lib/native"

# Put the Hadoop commands on your PATH
export PATH=$HADOOP_HOME/bin:$PATH
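
After editing .bashrc, reload it in your current shell and check that the variables resolve to the paths you expect. Note that $HADOOP_HOME points into the built distribution, so it will not exist until the build described below has completed at least once.

# Reload the new environment in the current shell
source ~/.bashrc

# Quick sanity check of the variables we just set
echo $HADOOP_SRC_PATH
echo $HADOOP_HOME
echo $HADOOP_CONF_DIR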


Retrieve the source code

There are a few places you can get the Hadoop code from. I prefer the Apache git server, which is the source of truth. Alternatively, you could clone the GitHub mirror (https://github.com/apache/hadoop), which tracks the Apache repository. If you have a fork, you can add it as yet another remote.

I like to keep several copies of the code (one for trunk, another for branch-2, and yet another for my fork). That way I can load all the versions of the project in my IDE (Eclipse) and compare them side by side. More on that later.

# Create the directory where source code will be cloned
mkdir -p $HADOOP_SRC_PATH
cd $HADOOP_SRC_PATH

# git clone the source.
git clone https://git-wip-us.apache.org/repos/asf/hadoop.git trunk

# No need to clone again; just copy the directory to create a second working copy for branch-2.
cp -R trunk branch-2

# Go into branch-2 and check out branch-2
cd branch-2
git checkout remotes/origin/branch-2 -b branch-2
cd ..

# You could do this if you have your own fork
# cp -R trunk MyForkOfHadoop
# cd MyForkOfHadoop
# git remote add myfork git@github.com:MyForkOfHadoop/hadoop.git
# git fetch myfork


Building the source

Now that we have the source code, let’s build it. Please read BUILDING.txt for more details. I have set up alias commands for the different ways in which I build the source.

Depending on what your operating system already ships with, you may need to install a few additional packages; BUILDING.txt lists the exact prerequisites for each release.
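
As an illustration, the prerequisites I end up installing on a Fedora box look roughly like the ones below; treat this as a sketch and BUILDING.txt as the authoritative, version-specific list.

# Illustrative only -- BUILDING.txt has the authoritative list for your release
sudo dnf install -y java-1.8.0-openjdk-devel maven \
    gcc gcc-c++ make cmake \
    zlib-devel openssl-devel snappy-devel bzip2-devel

# The -Pnative profile needs protoc 2.5.0. Fedora ships a newer protobuf,
# so you may have to build protobuf 2.5.0 from source and put it on your PATH.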

It’s quite likely that the build will fail somewhere along the way. You can resume it from the failed module with -rf :<module-which-failed>
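
For example, if the build died in the YARN common module (the module name here is purely illustrative), the resume command would look like this:

# Resume the build from the module that failed
mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests install -rf :hadoop-yarn-common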

cd $HADOOP_SRC_PATH/trunk
mvnp

# Create the eclipse project files.
mvn eclipse:eclipse
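
Once a full build has succeeded, you rarely need to rebuild everything after a change. A sketch of rebuilding only the module you are working on (the HDFS module here is just an example; -pl selects the module and -am also builds the modules it depends on):

# Rebuild a single module and its dependencies, skipping tests
mvn install -pl hadoop-hdfs-project/hadoop-hdfs -am -DskipTests

Note that this does not refresh the assembled distribution under hadoop-dist, so re-run mvnp whenever you want $HADOOP_HOME to pick up your changes.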


Setting up a Single Node Hadoop Cluster

I almost always do my development and testing with this setup. Please follow the instructions in the Hadoop docs to configure a single-node (pseudo-distributed) cluster.
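
For reference, here is a minimal configuration sketch along the lines of the single-node instructions in the docs. It seeds $HADOOP_CONF_DIR from the defaults shipped with the build, points HDFS at localhost with a replication factor of 1, and formats the NameNode; adjust the values as needed. You will also need passwordless ssh to localhost for the start scripts to work.

# Start from the default configuration shipped with the build (log4j.properties etc.)
mkdir -p $HADOOP_CONF_DIR
cp $HADOOP_HOME/etc/hadoop/* $HADOOP_CONF_DIR/

# Minimal core-site.xml: a local HDFS on port 9000
cat > $HADOOP_CONF_DIR/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# Minimal hdfs-site.xml: single node, so replication factor 1
cat > $HADOOP_CONF_DIR/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Format the NameNode once before the first start
hdfs namenode -format

With the configuration in place, I use the two small scripts below to start and stop all the daemons.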

$HADOOP_SRC_PATH/start.sh

#!/bin/bash

$HADOOP_HOME/sbin/start-dfs.sh 
$HADOOP_HOME/sbin/start-yarn.sh
nohup mapred historyserver > $HADOOP_LOG_DIR/historyserver.log &
# 'yarn --daemon' only exists in Hadoop 3, so fall back to nohup on 2.7.4
if [ "$(hadoop version | head -n 1 | cut -d' ' -f2 | cut -d- -f1)" == "2.7.4" ]; then
 nohup yarn timelineserver > $HADOOP_LOG_DIR/timelineserver.log &
else
 yarn --daemon start timelineserver
fi

echo -n "Number of running hadoop servers: "
jps | egrep 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager|JobHistoryServer|ApplicationHistoryServer' | wc -l
echo ""

$HADOOP_SRC_PATH/stop.sh

#!/bin/bash

# Forcefully kill every Hadoop daemon that jps reports
jps | egrep "SecondaryNameNode|NameNode|JobHistoryServer|DataNode|NodeManager|ResourceManager|ApplicationHistoryServer" | awk '{print $1}' | xargs -r kill -9

sleep 1
echo -n "Number of running hadoop servers: "
jps | egrep 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager|JobHistoryServer|ApplicationHistoryServer' | wc -l
echo ""


Run a YARN Application

If everything goes well, you should be able to run a sleep job:

# Run a sleep job with 2 mappers and 1 reducer. The map tasks sleep for 100 ms each and the reduce tasks for 200 ms.
# The glob avoids hard-coding the version; it should match the jobclient tests jar in the distribution you built.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar sleep -m 2 -r 1 -mt 100 -rt 200
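
If the job completes, you can confirm that it showed up in YARN; the ResourceManager web UI on http://localhost:8088 and the JobHistoryServer on http://localhost:19888 (the default ports) are also handy for poking around.

# List applications that have finished
yarn application -list -appStates FINISHED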
