Writing An Hadoop MapReduce Program In Python
by Michael G. Noll on September 21, 2007 (last updated: October 19, 2011)
In this tutorial, I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.
- Motivation
- What we want to do
- Prerequisites
- Python MapReduce Code
- Map: mapper.py
- Reduce: reducer.py
- Test your code (cat data | map | sort | reduce)
- Running the Python Code on Hadoop
- Download example input data
- Copy local example data to HDFS
- Run the MapReduce job
- Improved Mapper and Reducer code: using Python iterators and generators
- mapper.py
- reducer.py
- Related Links
- Comments (94)
Motivation
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that you must translate your Python code using Jython into a Java jar file. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue of the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop: just have a look at the example in <HADOOP_INSTALL>/src/examples/python/WordCount.py and you will see what I mean. I still recommend having at least a look at the Jython approach and maybe even at the new C++ MapReduce API called Pipes; it’s really interesting.
That said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
What we want to do
We will write a simple MapReduce program (see also Wikipedia) for Hadoop in Python but without using Jython to translate our code to Java jar files.
Our program will mimic the WordCount example, i.e. it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Note: You can also use programming languages other than Python such as Perl or Ruby with the “technique” described in this tutorial. I wrote some words about what happens behind the scenes. Feel free to correct me if I’m wrong.
Prerequisites
You should have a Hadoop cluster up and running because we will get our hands dirty. If you don’t have a cluster yet, my following tutorials might help you to build one. The tutorials are tailored to Ubuntu Linux but the information also applies to other Linux/Unix variants.
- Running Hadoop On Ubuntu Linux (Single-Node Cluster)
How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux
Python MapReduce Code
The “trick” behind the following Python code is that we will use HadoopStreaming (see also the wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python’s sys.stdin to read input data and print our own output to sys.stdout. That’s all we need to do because HadoopStreaming will take care of everything else! Amazing, isn’t it? Well, at least I had a “wow” experience…
Map: mapper.py
Save the following code in the file /home/hduser/mapper.py. It will read data from STDIN (standard input), split it into words and output a list of lines mapping words to their (intermediate) counts to STDOUT (standard output). The Map script will not compute an (intermediate) sum of a word’s occurrences. Instead, it will output “<word> 1” immediately, even though the <word> might occur multiple times in the input, and just let the subsequent Reduce step do the final sum count. Of course, you can change this behavior in your own scripts as you please, but we will keep it like that in this tutorial for didactic reasons :-)
Make sure the file has execution permission (chmod +x /home/hduser/mapper.py should do the trick) or you will run into problems.
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
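The text above mentions that you can change the mapper so that it computes an intermediate sum itself. Here is a minimal sketch of that variant, summing repeated words within a single input line. The names map_line and format_pairs are made up for this illustration and are not part of the tutorial's scripts; the demo at the bottom feeds an in-memory string instead of sys.stdin.

```python
#!/usr/bin/env python
"""Sketch: a mapper that pre-aggregates word counts per input line."""
from collections import Counter


def map_line(line):
    """Return (word, count) pairs for one line, summing repeated words."""
    counts = Counter(line.split())
    # sorted() only makes the output order deterministic for this demo;
    # Hadoop sorts the map output by key anyway
    return sorted(counts.items())


def format_pairs(pairs, separator='\t'):
    """Render pairs in the tab-delimited form that reducer.py expects."""
    return ['%s%s%d' % (word, separator, count) for word, count in pairs]


if __name__ == "__main__":
    # in-memory stand-in for one line read from sys.stdin
    for out in format_pairs(map_line("foo foo quux labs foo bar quux")):
        print(out)
```

In a real streaming mapper you would call map_line for every line of sys.stdin. The reducer does not need to change, because it already sums whatever counts it receives.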
Reduce: reducer.py
Save the following code in the file /home/hduser/reducer.py. It will read the results of mapper.py from STDIN (standard input), and sum the occurrences of each word to a final count, and output its results to STDOUT (standard output).
Make sure the file has execution permission (chmod +x /home/hduser/reducer.py should do the trick) or you will run into problems.
#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking for None avoids printing a bogus line on empty input)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Test your code (cat data | map | sort | reduce)
I recommend testing your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Otherwise your jobs might complete successfully but produce no result data at all, or not the results you expected. If that happens, most likely it was you (or me) who screwed up.
Here are some ideas on how to test the functionality of the Map and Reduce scripts.
# very basic test
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1

hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
bar     1
foo     3
labs    1
quux    2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py
The     1
Project 1
Gutenberg       1
EBook   1
of      1
[...]
(you get the idea)
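If you prefer to check the logic without a shell pipeline, the same “cat data | map | sort | reduce” flow can be imitated in a single Python process. This is only a sketch: map_words, reduce_counts and run_local are hypothetical helper names, and sorted() stands in for both “sort -k1,1” and Hadoop's sort between the Map and Reduce steps.

```python
"""Sketch: imitate `cat data | map | sort | reduce` in one process."""
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit (word, 1) for every word of every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of consecutive pairs with the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def run_local(lines):
    """Run map, sort and reduce in sequence and return the final pairs."""
    # sorted() plays the role of `sort -k1,1`
    return list(reduce_counts(sorted(map_words(lines), key=itemgetter(0))))


if __name__ == "__main__":
    for word, count in run_local(["foo foo quux labs foo bar quux"]):
        print('%s\t%d' % (word, count))
```

For the sample sentence this gives the same four words and counts as the shell pipeline above. It tests the logic only; it does not check the file permissions or the shebang line of the real scripts.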
Running the Python Code on Hadoop
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a temporary directory of choice, for example /tmp/gutenberg.
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
hduser@ubuntu:~$
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r--   3 hduser supergroup     674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r--   3 hduser supergroup    1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r--   3 hduser supergroup    1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$
Run the MapReduce job
Now that everything is prepared, we can finally run our Python MapReduce job on the Hadoop cluster. As I said above, we use HadoopStreaming to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output).
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
    -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the -D option:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -D mapred.reduce.tasks=16 ...
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but can specify mapred.reduce.tasks.
The job will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the results in the HDFS directory /user/hduser/gutenberg-output. In general Hadoop will create one output file per reducer; in our case, however, it will only create a single file because the job runs with a single reduce task unless you set mapred.reduce.tasks to a higher value.
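To see why there is one output file per reducer, it helps to picture how keys are spread over reduce tasks: by default Hadoop hashes each key modulo the number of reduce tasks, so all pairs for one word end up at the same reducer and therefore in the same part file. The sketch below only illustrates the idea. partition_for and part_file_name are hypothetical names, and zlib.crc32 is a stand-in for Hadoop's own key hash, so the reducer numbers it returns will not match a real cluster.

```python
"""Sketch: how keys could be assigned to reducers and output files."""
import zlib


def partition_for(key, num_reducers):
    """Pick a reducer index for a key (illustrative hash, not Hadoop's)."""
    # crc32 is used because it is stable across runs and Python versions;
    # the mask keeps the value non-negative before taking the modulo
    return (zlib.crc32(key.encode('utf-8')) & 0x7fffffff) % num_reducers


def part_file_name(partition):
    """Name of the output file written by the given reducer."""
    return 'part-%05d' % partition


if __name__ == "__main__":
    for word in ["foo", "bar", "quux", "labs"]:
        reducer = partition_for(word, 16)
        print('%s\t-> %s' % (word, part_file_name(reducer)))
```

With a single reduce task every key maps to partition 0, which is why the job in this tutorial writes only part-00000.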
Exemplary output of the previous command in the console:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/app/hadoop/tmp/hadoop-unjar54543/] [] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob:  map 0%  reduce 0%
[...] INFO streaming.StreamJob:  map 43%  reduce 0%
[...] INFO streaming.StreamJob:  map 86%  reduce 0%
[...] INFO streaming.StreamJob:  map 100%  reduce 0%
[...] INFO streaming.StreamJob:  map 100%  reduce 33%
[...] INFO streaming.StreamJob:  map 100%  reduce 70%
[...] INFO streaming.StreamJob:  map 100%  reduce 77%
[...] INFO streaming.StreamJob:  map 100%  reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
[...] INFO streaming.StreamJob: Output: /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$
As you can see in the output above, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, go to http://localhost:50030/ and browse around. Here’s a screenshot of the Hadoop web interface for the job we just ran.

A screenshot of Hadoop's web interface, showing the details of the MapReduce job we just ran.
Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 1 items
/user/hduser/gutenberg-output/part-00000     <r 1>   903193  2007-09-21 13:00
hduser@ubuntu:/usr/local/hadoop$
You can then inspect the contents of the file with the dfs -cat command:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
"(Lo)cra"       1
"1490   1
"1498," 1
"35"    1
"40,"   1
"A      2
"AS-IS".        2
"A_     1
"Absoluti       1
[...]
hduser@ubuntu:/usr/local/hadoop$
Note that in this specific output above the quote signs (“) enclosing the words have not been inserted by Hadoop. They are the result of how our Python code splits words, and in this case it matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.
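If such stray punctuation is unwanted, the mapper can normalize its tokens before emitting them. The following is a minimal sketch that assumes plain ASCII words are good enough for your data, which may not hold for the UTF-8 ebooks. WORD_RE and tokenize are names invented for this illustration; they are not part of the scripts above.

```python
"""Sketch: split a line into words while dropping surrounding punctuation."""
import re

# runs of ASCII letters/digits, optionally joined by inner apostrophes
# (so "don't" stays one word); everything else acts as a separator
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def tokenize(line):
    """Return the words of a line without quotes, commas, dots and so on."""
    return WORD_RE.findall(line)


if __name__ == "__main__":
    print(tokenize('"A quote," he said. "AS-IS".'))
```

Using tokenize(line) in place of line.split() in mapper.py would count “A” and "A as the same word. Note that it also splits hyphenated words such as AS-IS into two tokens, which may or may not be what you want.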
Improved Mapper and Reducer code: using Python iterators and generators
The Mapper and Reducer examples above should have given you an idea of how to create your first MapReduce application. The focus was code simplicity and ease of understanding, particularly for beginners of the Python programming language. In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF) as some readers have pointed out.
Generally speaking, iterators and generators (functions that create iterators, for example with Python’s yield statement) have the advantage that an element of a sequence is not produced until you actually need it. This can help a lot in terms of computational cost or memory consumption, depending on the task at hand.
Note: The following Map and Reduce scripts will only work “correctly” when being run in the Hadoop context, i.e. as Mapper and Reducer in a MapReduce job. This means that running the naive test “cat DATA | ./mapper.py | sort -k1,1 | ./reducer.py” will not work correctly anymore because some functionality is intentionally outsourced to Hadoop.
Precisely, we compute the sum of a word’s occurrences, e.g. (“foo”, 4), only if by chance the same word (“foo”) appears multiple times in succession. In the majority of cases, however, we let Hadoop group the (key, value) pairs between the Map and the Reduce step because Hadoop is more efficient in this regard than our simple Python scripts.
mapper.py
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()
reducer.py
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys


def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass


if __name__ == "__main__":
    main()
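The reducer above relies on the fact that itertools.groupby only merges items that are next to each other. The short sketch below makes this visible by running the same summing logic on unsorted and on sorted pairs. sum_consecutive is a hypothetical helper written for this illustration.

```python
"""Sketch: itertools.groupby only groups *consecutive* equal keys."""
from itertools import groupby
from operator import itemgetter


def sum_consecutive(pairs):
    """Sum the counts for each run of identical words, as groupby sees them."""
    return [(word, sum(int(count) for _, count in group))
            for word, group in groupby(pairs, itemgetter(0))]


if __name__ == "__main__":
    pairs = [["foo", "1"], ["bar", "1"], ["foo", "1"]]
    # unsorted: "foo" shows up twice because its pairs are not adjacent
    print(sum_consecutive(pairs))
    # sorted: each word appears once with its full total
    print(sum_consecutive(sorted(pairs)))
```

This is why the reducer needs its input sorted by key, whether Hadoop or a local sort does the sorting.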
Related Links
From yours truly:
Hi,
Yours is one of the best tutorials on Hadoop streaming on the internet that I have come across!!
Keep up the good work
Hi Michael,
Thanks for the nice intro to python via streaming. Do you happen to know whether one could Initialize and Close the mapper in python akin to the facility available in Java?
hi
i m getting an error:
hadoop@shekhar-virtual-machine:~$ /home/hadoop/bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input /home/hadoop/dfs/data/* -output /home/hadoop/programs/output
Input path does not exist: hdfs://localhost:54310/home/hadoop/dfs/data/Xerxes.txt
Streaming Job Failed!
@Shekhar: See my email. Basically, you seem to have confused HDFS and the local filesystem.
@Zak: I’m not sure. I haven’t had a need yet, so I didn’t bother to find out.
Maybe you could try some tricks with things like __init__()?
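One simple possibility, offered here as a sketch rather than a confirmed Hadoop feature: a streaming script is an ordinary process, so whatever runs before the input loop behaves like an initialization step and whatever runs after it behaves like a close step. run_mapper and its write parameter are names made up for this example.

```python
"""Sketch: setup and cleanup around the record loop of a streaming mapper."""
import sys


def run_mapper(lines, write=sys.stdout.write):
    """Emit (word, 1) pairs; return how many pairs were written."""
    # "initialize": runs once, before the first record is read
    emitted = 0
    try:
        for line in lines:
            for word in line.split():
                write('%s\t%d\n' % (word, 1))
                emitted += 1
    finally:
        # "close": runs once, after the last record (or on an error);
        # stderr is used so the message does not mix with the map output
        sys.stderr.write('mapper emitted %d pairs\n' % emitted)
    return emitted


if __name__ == "__main__":
    # in-memory stand-in for sys.stdin
    run_mapper(["foo foo quux", "labs"])
```

In a real job you would pass sys.stdin as lines. The try/finally block is a design choice so that the cleanup also runs when a malformed record raises an exception.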
thanks, nice tutorial.
It’s pretty amazing this all “just works” and we can limit our code to the more basic algorithms (with the funny note that often people will have a mapper or reducer that does basically nothing)
One thing I wonder though, how will hadoop decide to split up the input files? Or are we supposed to do that ourself? I’m testing with 4 nodes with 4 cores and would like to see them all being used. Now it only ran 3 mappers. Is it my job to split the input files into 16 similarly-sized input files? I also see it only runs 1 reducer, isn’t it hadoops job to create multiple reducer jobs and spread them around on my nodes?
@Dieter_be: Hadoop does the heavy lifting, i.e. it takes care of splitting the data. However you can also influence how the data is (logically) split. See this discussion, for instance. I quote:
Why only 3 out of 4 mappers used. As explained in How Many Maps And Reduces, you basically can’t really “force” Hadoop to use a specific number of mappers (see the link for the subtle details). There is a parameter called mapred.map.tasks which you might try to tune but its name is misleading (as briefly described in MAPREDUCE-1884).
That said, one way to get Hadoop to use 4 mappers is to set the HDFS block size of your input data so that the data is split across (at least) 4 blocks. The HDFS block size configured via the dfs.block.size property in your conf/hdfs-site.xml file (or the default value in hdfs-default.xml) is actually only a default value. Also, it is a per-file setting, i.e. you can have files with different block sizes on the same cluster. Clients can overwrite the default block size e.g. on the command line via hadoop ... -D dfs.block.size=536870912 (here: 512MB).
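As a back-of-the-envelope illustration of the block size argument, the sketch below estimates the number of map tasks as one per HDFS block per input file. This is a simplification that ignores how Hadoop treats small remainders and other input format details. estimate_map_tasks is a hypothetical helper, the file sizes are the three ebooks from the tutorial, and the 64 MB block size is an assumed default.

```python
"""Sketch: rough estimate of map tasks from file sizes and block size."""


def estimate_map_tasks(file_sizes, block_size):
    """Assume one map task per started block of each file (simplified)."""
    # integer ceiling division: a file of block_size + 1 bytes needs 2 blocks
    return sum((size + block_size - 1) // block_size for size in file_sizes)


if __name__ == "__main__":
    ebooks = [674566, 1573112, 1423801]   # bytes, from the HDFS listing
    default_block = 64 * 1024 * 1024      # assumed 64 MB block size
    print(estimate_map_tasks(ebooks, default_block))  # 3: one block per file
    # a much smaller block size spreads the same data over more blocks
    print(estimate_map_tasks(ebooks, 512 * 1024))
```

Under these assumptions three small files give three map tasks regardless of how many cores are available, which matches the “only 3 mappers” observation above.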
> It’s pretty amazing this all “just works” .
Yep, it is. That’s why the Hadoop team has recently won the Innovator of the Year award.
thanks for useful tut.how can i use partitioner and combiner if i use Python in Hadoop ?
@AB: Just add the parameters to the command line. For instance, hadoop ... -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -D map.output.key.field.separator=, -D num.key.fields.for.partition=2. See Hadoop Streaming.
Thanks for a great tutorial.
I have a question:
Is it possible to chain map/reduce operations?
e.g. Could i write n mappers and n reducers and ask hadoop to execute mapper i+1 over the results of reducer i, all the way to the great win?
@Shai: Yes, you can. You can do it yourself (e.g. via Java or via the command line), or you could use tools such as Oozie or Cascading.
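For the do-it-yourself route, a small driver script can start one streaming job after another and feed each job's output directory to the next job as input. The sketch below assumes the command line layout used in this tutorial. streaming_command and run_chain are hypothetical helpers, the stage script names in the usage are placeholders, and the jar path must point to a concrete file because no shell expands the wildcard here.

```python
"""Sketch: chain several Hadoop Streaming jobs from a Python driver."""
import subprocess


def streaming_command(jar, mapper, reducer, input_path, output_path):
    """Build the argument list for one streaming job."""
    return ['bin/hadoop', 'jar', jar,
            '-file', mapper, '-mapper', mapper,
            '-file', reducer, '-reducer', reducer,
            '-input', input_path, '-output', output_path]


def run_chain(jar, stages, first_input, output_prefix,
              runner=subprocess.check_call):
    """Run (mapper, reducer) stages in order; return the last output path."""
    current_input = first_input
    for number, (mapper, reducer) in enumerate(stages, 1):
        # every stage writes its intermediate result to its own HDFS directory
        output_path = '%s-stage%d' % (output_prefix, number)
        # check_call raises an exception if the job fails, stopping the chain
        runner(streaming_command(jar, mapper, reducer,
                                 current_input, output_path))
        current_input = output_path
    return current_input
```

A call such as run_chain(jar, [('map1.py', 'red1.py'), ('map2.py', 'red2.py')], '/user/hduser/gutenberg', '/user/hduser/chain') would run Map1, Red1, Map2, Red2 in that order. The intermediate directories stay in HDFS until you delete them.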
Hi,
There is an issue in your reducer.py, you are unnecessarily storing the word2count dictionary.
Since the output from the mappers is sorted, you are guaranteed by the framework that all words,count pairs will come ordered. Note that you must specify that only words are used as keys.
Then you just need to check for change in key i.e.
if current_word != word:
    print current_word,count
    current_word = word
    count = 1
else:
    count += 1
@Akshay: Yes, that’s true. It is even completely against best practices.
I’ll update it.
Is there a way to update mysql database from the python based reducer code?
@Anil: Yes, you can update a MySQL database from a Python based Reducer code. Just do it like with “normal” Python code. However, you should generally avoid introducing “side effects” into your MapReduce code. For instance, you might end up hammering your MySQL instance with more requests than it can handle. This could become a bottleneck in your MapReduce job, among other things.
Hi,
Thanks for sharing this. I have a question.
Is it possible to run this using seq file as input format ? meaning, would hadoop streaming.jar and python work with sequence file as input format and text as output format ?
Hi Mike,
Thanks a lot for the help provided in the instructions. I was able to reproduce the results in python (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/#run-the-mapreduce-job), but created a Mapper and Reducer in Java since all my code is currently in Java.
I first tried this:
echo “foo foo quux labs foo bar quux” |java -cp ~/dummy.jar WCMapper | sort | java -cp ~/dummy.jar WCReducer
It gave the correct output:
Then, I installed a single-node cluster in hadoop and tried this:
(by tailoring the python command)
This is the error:
Any advice?
resolved it myself.
This sounds stupid, but the mapper part now works fine, if i use -files dummy.jar instead of -file dummy.jar.
The reducer is going into an infinite loop (0%, 22%, 0%, …), but I will figure that out later! This was my code for reducer.java (mimicking your python code). Do you think I was on the right track?
import java.util.HashMap;
import java.util.Scanner;

public class WCReducer {
    /**
     * @param args
     */
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        HashMap<String, Integer> wordCounts = new HashMap<String, Integer>();
        while (sc.hasNextLine()) {
            String[] split = sc.nextLine().trim().split("\\s");
            if (!wordCounts.containsKey(split[0]))
                wordCounts.put(split[0], Integer.parseInt(split[1]));
            else
                wordCounts.put(split[0], wordCounts.get(split[0]) + Integer.parseInt(split[1]));
        }
        for (String string : wordCounts.keySet()) {
            System.out.println(string + "\t" + wordCounts.get(string));
        }
        sc.close();
    }
}
Mike, nice one. Though it’s the same old concept that IBM mainframes used in their s390 allocation of storage or assignment of tasks, understanding the same in the millennium Linux world is still tough. Your example gave a simple representation of how it all fits in and removed the doubts on ‘how hadoop works’. Good one. I will follow your work wherever possible. nice one tx
Hi, Micheal.
Tanks for this great tutorial.
I am struggling with a very issue in hadoop streaming in the “-file” option.
First I tried the very basic example in streaming:
which worked absolutely fine.
Then I copied the IdentityMapper.java source code and compiled it. Then I placed this class file in the /home/hadoop folder and executed the following in the terminal.
The execution failed with the following error in the stderr file:
Then again I tried it by copying the IdentityMapper.class file in the hadoop installation and executed the following:
But unfortunately again I got the same error.
It would be great if you can help me with it as I cannot move any further without overcoming this.
Thanking you in anticipation
Seems like you cannot use the built-in aggregate class as a combiner in streaming jobs. Here’s an example:
However, this works:
The error is:
Any ideas?
Thanks for great tutorial micheal,
but i am facing some issue with cloudera hadoop…. i am able to run hadoop examples with eclipse…. can you help me out ……
thanks in advance
I was trying to run this example in CDH3. Map tasks is running in a loop and the reducer is not running. I am using the following command to run the job
@Jagaran: I do not have actual experience with CDH3. However, it looks as if your input and output data locations are on the local filesystem (you are using /home/hadoop/... paths). Both input and output locations must be on HDFS.
Hi Michael,
Both the input and output data are in HDFS and the mapper shows that it gets the path too.
It shows that “One File Found” and also bytes read.
But the reducer is getting stuck.
PFB the stack trace. I manually killed the job at last
Another awesome Hadoop Tutorial from you!!
Small typo:
“occurences” should be –> “occurRences”
@Sameer: Thanks — I fixed the typo!
Hi Michael
It is a great tutorial. But unfortunately I’m facing some issues. The mappers run to completion but the reducers are not. Showing some error related to pipes. This is the error I obtained after scanning my task tracker logs.
stderr logs:
But there is a directory as /usr/bin/env, I verified.
Sorry if it is really a stupid question. I’m basically a java developer with no knowledge on python, but just want to try out hadoop streaming.
@bejoy: See the thread /usr/bin/env: python: No such file or directory on StackOverflow.
There is a comment in the reducer program that the output of the mapper is sorted by key. Is this really relevant, because isnt the reducer supposed to get all the key value pairs with the same key or am I missing something here?
Why can’t we simply sum up the values of all the key value pairs that come into a single reducer?
>>The job will read all the files in the HDFS directory /user/hduser/gutenberg, process it, and store the results in a single result file in the HDFS directory /user/hduser/gutenberg-output.
Shouldn’t it one file per reducer in the o/p?
@Praveen: Yes, in general it will be one file per reducer. In this example however the input files are so small that it will be just a single file. But I’ll clarify the relevant section.
Hi Michael,
I have a linux executable which I am trying to make work over cloud using hadoop streaming.
I pass a configuration file to the executable and apart from this it reads text data from a textfile.
below is the command I use to run the application,
I could make ./vina as mapper , but how to specify that mapper to take conf.txt as input
Regards,
Praveen
The ‘| sort’ as mentioned in many examples of how to locally test, should be ‘| sort -k1,1′, to better mimic how HadoopStreaming sorts.
@druud: Nice suggestion! I updated the text.
Thanks for sharing.
Praveen – why not send the location of files as input in mapper, and run the program?
Hello guys. Nice solutions. Does anyone know how to combine this wordcount solution with naive Bayes to read the lines of a text and, through a training set of data, make a system that will take for instance harry hates john and will convert it to John hates Harry?
Great tutorial. Helped me a lot in getting started with Hadoop.
Thanks,
Arvind
I want to write hadoop streaming job for map reduce, however I am not aware how to get filename where I am getting input line. How I can do that?
Hi,
I am getting the following Error.. Any suggestion would be highly helpful..
Edited by Michael G. Noll: I have moved your long logging output to https://gist.github.com/1587990.
packageJobJar: [/Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/mapper.py, /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/reducer.py, /tmp/hadoop-anuj.maurice/hadoop-unjar2426556812178658809/] [] /var/folders/Yu/YuXibLtIHOuWcHsjWu8zM-Ccvdo/-Tmp-/streamjob4679204253733026415.jar tmpDir=null
[...snip...]
12/01/04 12:03:04 INFO streaming.StreamJob: map 100% reduce 100%
12/01/04 12:03:04 INFO streaming.StreamJob: To kill this job, run:
12/01/04 12:03:04 INFO streaming.StreamJob: /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201201041122_0004
12/01/04 12:03:04 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201201041122_0004
12/01/04 12:03:04 ERROR streaming.StreamJob: Job not successful. Error: NA
12/01/04 12:03:04 INFO streaming.StreamJob: killJob…
Streaming Command Failed!
@Sachin: Hadoop sets job configuration parameters as environment variables when streaming is used. For instance, os.environ["mapred_job_id"] gives you the mapred.job.id configuration property. Off the top of my head the name of the input file will be in os.environ["map_input_file"]. Note that Hadoop replaces non-alphanumeric characters such as dots “.” with underscores.
See Tom White’s book Hadoop: The Definitive Guide (2nd ed.), page 187 for more information.
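As a minimal sketch of the naming rule described above, the helper below turns a configuration property name into the environment variable name a streaming script would look up. conf_to_env_name and job_conf are hypothetical helpers; which properties are actually set depends on your Hadoop version, so the lookup falls back to a default instead of failing.

```python
"""Sketch: read Hadoop job configuration from the environment."""
import os
import re


def conf_to_env_name(prop):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r'[^0-9A-Za-z]', '_', prop)


def job_conf(prop, default=None, environ=os.environ):
    """Look up a job configuration property; return default if it is unset."""
    return environ.get(conf_to_env_name(prop), default)


if __name__ == "__main__":
    # outside of a streaming job these are usually unset, hence the default
    print(job_conf('map.input.file', default='(not running under Hadoop)'))
```

Inside a mapper you could call job_conf('map.input.file') once before the input loop and prepend the result to each emitted key if you need the file name downstream.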
Dear Michael,
- I am writing my code in Python. So, can you please suggest how can we introduce cascading using Hadoop Streaming without actually using “Cascading” package
- Do I need to save intermediate files in this case
I tried searching this on internet but could not come up with a definite answer
I have the following scenario: Map1->Red1->Map2->Red2