Mahout: how to read a custom input file
A

3

6

I was playing with Mahout and found that FileDataModel accepts data in the format

     userId,itemId,pref (long,long,double)

I have some data in the format

     String,long,double

What is the best/easiest way to work with this dataset in Mahout?

Affinitive answered 26/8, 2011 at 19:28 Comment(0)
H
3

One way to do this is to extend FileDataModel. You'll need to override the readUserIDFromString(String value) method so that it uses some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.

For example, assuming you have an initialized MemoryIDMigrator, you could do this:

@Override
protected long readUserIDFromString(String stringID) {
    long result = memoryIDMigrator.toLongID(stringID); 
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}

This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
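To make the hashing idea concrete outside of Mahout, here is a minimal Python sketch. It assumes the hash is the first 8 bytes of the string's MD5 digest read as a signed 64-bit integer, which is how I recall AbstractIDMigrator doing it; check that against the Mahout source before relying on matching values. InMemoryIdMapping and its method names are hypothetical, not part of Mahout.

```python
import hashlib
import struct


def to_long_id(string_id: str) -> int:
    """Hash a string to a signed 64-bit integer (first 8 bytes of its MD5).

    Assumption: this mirrors the general approach of Mahout's ID migrators;
    it is not guaranteed to produce identical values.
    """
    digest = hashlib.md5(string_id.encode("utf-8")).digest()
    # ">q" reads 8 bytes as a big-endian signed 64-bit value (fits a Java long).
    return struct.unpack(">q", digest[:8])[0]


class InMemoryIdMapping:
    """Hypothetical stand-in for an in-memory ID migrator."""

    def __init__(self):
        self._long_to_string = {}

    def store(self, string_id: str) -> int:
        """Hash the string, remember the reverse mapping, return the long ID."""
        long_id = to_long_id(string_id)
        self._long_to_string[long_id] = string_id
        return long_id

    def to_string_id(self, long_id: int):
        """Return the original string, or None if the ID was never stored."""
        return self._long_to_string.get(long_id)
```

The stored dictionary is what makes the reverse lookup possible; without it the hash is one-way.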

Hock answered 15/3, 2012 at 11:51 Comment(0)
M
3

userId and itemId can be strings. This is the CustomFileDataModel, which converts each string into an integer ID and keeps the (String, ID) map in memory; after computing recommendations you can look up the original string from its ID.

Multicolored answered 17/4, 2015 at 2:57 Comment(0)
W
1

Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.

In Python:

import sys

next_id = 0
str_to_id = {}  # maps each string user ID to its assigned integer ID
for line in sys.stdin:
    fields = line.strip().split(',')
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        # First time this string is seen: assign the next integer ID.
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)

    print(','.join(fields))
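For the case where the input does not fit in memory, the sort-then-group idea could look roughly like the sketch below. It assumes the lines have already been sorted by the first field (for example with an external sort), so equal strings are adjacent and no dictionary is needed. The function name assign_ids is made up for illustration.

```python
from itertools import groupby


def assign_ids(sorted_lines):
    """Yield lines with the first field replaced by a sequential integer ID.

    Assumes the input is sorted by the first field, so all rows for one
    string arrive together and only one group is held at a time.
    """
    rows = (line.rstrip("\n").split(",") for line in sorted_lines if line.strip())
    groups = groupby(rows, key=lambda fields: fields[0])
    for next_id, (_, group) in enumerate(groups, start=1):
        for fields in group:
            yield ",".join([str(next_id)] + fields[1:])
```

Passing sys.stdin to assign_ids and printing each result would process a pre-sorted file as a stream. Note that this version does not keep the string-to-ID mapping; if you need to translate recommendations back, write each (string, ID) pair to a side file as the groups go by.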
Weis answered 26/8, 2011 at 19:40 Comment(1)
There is a component in Mahout which does this sort of automagically, called IDMigrator, but I also would recommend translating to numeric IDs externally. - Landbert
