The mapper method receives a key-value pair that has already been parsed out of the input text. mrjob uses Hadoop Streaming: the input is split on newline characters, and each line is then turned into a key-value pair according to the input protocol in use. The framework takes care of this for you, so you don't have to do any heavy lifting; you can just assume you will get a proper key and value.
However, you do need to tell mrjob what kind of input to expect. For example, if the key and/or value are not plain text (as in the original question) but serialized JSON, you use JSONProtocol/JSONValueProtocol, etc., instead of RawValueProtocol, which is the default.
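As a minimal sketch of how that is declared (the class name and the 'kind' field are just illustrative, not from the question):

```python
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol


class MRJsonLines(MRJob):
    # Parse each input line as JSON before it reaches the mapper
    # (the default, RawValueProtocol, passes the raw line through instead).
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, record):
        # record is already a deserialized Python object, e.g. a dict
        yield record.get('kind', 'unknown'), 1

    def reducer(self, kind, counts):
        yield kind, sum(counts)


if __name__ == '__main__':
    MRJsonLines.run()
```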
For the initial mapper, each line is read into value (by RawValueProtocol), which is why you don't receive a key. Using _ is just a Python convention for an unused dummy variable. (That said, _ is actually a valid name for a Python variable; you can do something like a = 3; _ = 2; b = a + _. Blasphemy, isn't it?)
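Concretely, a word-count job along the lines of the one in the question might look like this (a sketch, assuming the default RawValueProtocol):

```python
from mrjob.job import MRJob


class MRWordCount(MRJob):
    def mapper(self, _, line):
        # With RawValueProtocol the key is always None,
        # so by convention it is bound to _ and ignored.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```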
mrjob can take multiple input files. For example:
$ python wordcount.py text1.txt text2.txt
If you want all the text files in a directory as input to an mrjob job, you can do
$ python wordcount.py inputdir/*.txt
or simply
$ python wordcount.py inputdir
and all the files selected are used as input.
What the reducer receives is a key and an iterator over all the values associated with that key. So in your example, the variable values in the reducer method is an iterator. If you want to do something with all the values, you need to actually iterate over them. In the specific example in the question, the built-in function sum can take an iterator as an argument, which is why you can do it in one shot; it is effectively the same as sum([value for value in values]).
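Spelled out, that one-shot sum does the same work as an explicit loop (only the method body matters here):

```python
def reducer(self, word, values):
    # values is an iterator, so it can only be consumed once;
    # this loop is equivalent to: yield word, sum(values)
    total = 0
    for value in values:
        total += value
    yield word, total
```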
I actually don't know how you would unit test mrjob scripts; I have usually just tested on a small chunk of data before a production run.