mrjob: how does the example automatically know how to find lines in text file?

I'm trying to understand the example for mrjob better

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

I run it by

$ python word_count.py my_file.txt

It works as expected, but I don't get how it automatically knows to read a text file and split it into lines. I'm also not sure what the _ does.

From what I understand, mapper() generates the three key/value pairs for each line, correct? What if I want to work with every file in a folder?

And does reducer() automatically know how to add up each key's values?

What if I want to run unit tests via MapReduce? What would the mapper and reducer look like? Is it even necessary?

Seldon answered 21/4, 2014 at 7:33 Comment(0)

The mapper method receives a key-value pair that has already been parsed out of the input text. mrjob uses Hadoop Streaming: the input text is split on the newline character, and each line is then turned into a key-value pair by the input protocol in use. The framework takes care of this for you, so you don't have to do any heavy lifting; you can just assume you will get a proper key and value.

However, you do need to tell mrjob what kind of input your text files contain. For example, if the key and/or value are not plain text (as in the original question) but serialized JSON, then you use JSONProtocol, JSONValueProtocol, etc. instead of RawValueProtocol, which is the default.
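To make the difference between these protocols concrete, here is a minimal sketch in plain Python of what the line formats look like. The helper names (json_write, json_read, json_value_read) are hypothetical stand-ins written for illustration, not mrjob's own code, and the tab delimiter follows the description given in the comments below.

```python
import json


def json_write(key, value):
    """Stand-in for a JSON key/value protocol: JSON key, a tab, JSON value."""
    return json.dumps(key) + "\t" + json.dumps(value)


def json_read(line):
    """Reverse of json_write: split on the first tab and decode both halves."""
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def json_value_read(line):
    """Stand-in for a JSON value-only protocol: no key, the line is one JSON value."""
    return None, json.loads(line)


encoded = json_write("key", {"somelist": [1, 2, 3]})
decoded = json_read(encoded)
value_only = json_value_read('{"L1": 1}')
```

With a value-only protocol the key is simply None, which is the same situation the mapper in the question is in with plain text lines.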

For the initial mapper, each line is read into value (by RawValueProtocol), which is why you don't receive a key. Using _ is just a Python convention for an unused dummy variable. (However, _ is actually a valid name for a Python variable. You can do something like a = 3; _ = 2; b = a + _. Blasphemy, isn't it?)
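As a rough illustration of that flow, the sketch below imitates it in plain Python: split the text on newlines, hand each line to the mapper as the value with no key, and collect what the mapper yields. The function raw_value_read is a hypothetical stand-in for the raw-value behavior described above, not the framework's actual code.

```python
def raw_value_read(raw_line):
    """Stand-in for a raw-value input protocol: no key, the whole line is the value."""
    return None, raw_line


def mapper(_, line):
    # Same logic as the mapper in the question; the key is ignored.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


text = "hello world\nfoo bar baz\n"

pairs = []
for raw_line in text.splitlines():
    key, value = raw_value_read(raw_line)
    pairs.extend(mapper(key, value))
```

Each input line produces three pairs, so the two-line text above yields six pairs in total.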

mrjob can take multiple input files. For example:

$ python wordcount.py text1.txt text2.txt

If you want all the text files in a directory as input to an mrjob job, you can do

$ python wordcount.py inputdir/*.txt

or just simply

$ python wordcount.py inputdir

and all the files selected are used as input.

What the reducer receives is a key and an iterator over all the values associated with that key. So in your example, the variable values in the reducer method is an iterator. If you want to do something with all the values, you need to actually iterate over them. In the specific example in the question, the built-in function sum accepts an iterator as its argument, which is why you can do it in one shot. It is effectively the same as sum([value for value in values]).
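The grouping step can be sketched in plain Python as follows. This only imitates the idea (sort the mapper output by key, group it, and pass each group's values to the reducer as an iterator); it uses itertools.groupby as a stand-in and is not how the framework is implemented.

```python
from itertools import groupby
from operator import itemgetter


def reducer(key, values):
    # Same logic as the reducer in the question.
    yield key, sum(values)


# Hypothetical mapper output for a two-line input file.
mapped = [
    ("chars", 11), ("words", 2), ("lines", 1),
    ("chars", 11), ("words", 3), ("lines", 1),
]

# Sort by key so that equal keys are adjacent, then group them.
mapped.sort(key=itemgetter(0))

results = []
for key, group in groupby(mapped, key=itemgetter(0)):
    values = (value for _, value in group)  # a one-pass iterator, not a list
    results.extend(reducer(key, values))

# An iterator can only be consumed once: the second sum sees nothing.
one_pass = iter([1, 2, 3])
first_total = sum(one_pass)
second_total = sum(one_pass)
```

The last three lines show why a reducer that needs the values twice has to store them (for example in a list) during the first pass.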

I actually don't know how you would unit test mrjob scripts. I have usually just tested on a small chunk of test data before the production run.
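One possible approach, offered as a sketch rather than an established recipe: the mapper and reducer are ordinary generator methods, so their logic can be checked with plain assertions on small inputs. The class WordFrequencyLogic below is a hypothetical stand-in that copies the two methods from the question so the sketch runs without mrjob or Hadoop. It is an assumption that the methods of the real job class can be exercised the same way.

```python
class WordFrequencyLogic:
    """Hypothetical stand-in holding the same methods as the job in the question."""

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


def test_mapper_counts_one_line():
    job = WordFrequencyLogic()
    # The key is unused, so None is passed in its place.
    output = list(job.mapper(None, "a bc"))
    assert output == [("chars", 4), ("words", 2), ("lines", 1)]


def test_reducer_sums_values():
    job = WordFrequencyLogic()
    # The reducer is given an iterator, as in a real run.
    output = list(job.reducer("lines", iter([1, 1, 1])))
    assert output == [("lines", 3)]


test_mapper_counts_one_line()
test_reducer_sums_values()
```

Tests like these cover only the per-record logic. Running the script on a small sample file, as described above, still checks the whole pipeline.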

Otti answered 22/4, 2014 at 6:47 Comment(4)
So if it was parsing a JSON file, should it look like {'key1':'value1'}{'key2':'value2'}, whereas the txt file just had value1\n value2\n value3\n? - Seldon
With the mrjob implementation, the key and value can each be a JSON object. In fact, I think they are passed around internally with JSONProtocol. Key-value pairs, when serialized, are delimited by a tab character. So with JSONProtocol, a line looks like "key"\t{"somelist": [1, 2, 3]}\n, for example. This is actually why the key in your output is double-quoted: it's output as a JSON string. For the initial input, if you use JSONValueProtocol, then you want the input file to look like {"L1": 1}\n{"L2": 2}\n{"L3": 3} and so on. Each line contains only a value for that protocol. - Otti
I actually suggest that you read the mrjob documentation on input protocols. It's essential and useful, but it can be confusing if you are not familiar with how input, output, and key-value pairs are processed internally. - Otti
Thank you, this is very helpful. I was a bit confused at first while reading the bit on JSONProtocol. - Seldon

I don't know much about mrjob, so I'm going to make a few assumptions. First, the _ means to disregard the key (verified after a Google search). Second, I would assume it would work on a comma-separated list of files or a directory. Next, this code is devoid of setup, probably because these are default method names. I'm sure if you named your mapper or reducer something different, mrjob couldn't pick it up automatically.

I found some examples here.

Tajuanatak answered 22/4, 2014 at 4:19 Comment(0)
from mrjob.job import MRJob


class MRRatingCounter(MRJob):

    def mapper(self, key, line):
        # Each input line holds four tab-separated fields.
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)


if __name__ == '__main__':
    MRRatingCounter.run()

Based on the above discussion, please correct me if I am wrong. In this case, is key taking the value of the input file? Can I read it in my mind like this:

def mapper(self {the object/instance}, key {the input text file, in this case the name of the file, ml-100k/u.data, and correct me if not}, line {the line we are trying to pass to mapper() each time from the data file})

Another piece of code, from the Udemy course where I am trying to learn, uses _ like the code in the original question:

from mrjob.job import MRJob


class MRFriendsByAge(MRJob):

    def mapper(self, _, line):
        (ID, name, age, numFriends) = line.split(',')
        yield age, float(numFriends)

    def reducer(self, age, numFriends):
        total = 0
        numElements = 0
        for x in numFriends:
            total += x
            numElements += 1

        yield age, total / numElements


if __name__ == '__main__':
    MRFriendsByAge.run()

I found an MR job book and am trying to see if it makes sense, but I am still struggling.

Chilton answered 11/1, 2018 at 4:33 Comment(0)
