How can I use Mahout's sequencefile API code?

Mahout has a command for creating sequence files: bin/mahout seqdirectory -c UTF-8 -i <input address> -o <output address>. I want to call this functionality from code, through the API.

Fireplace answered 25/7, 2012 at 8:6 Comment(0)

You can do something like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;


Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Path outputPath = new Path("c:\\temp");

Text key = new Text(); // Example, this can be another type of class
Text value = new Text(); // Example, this can be another type of class

SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath, key.getClass(), value.getClass());

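while (condition) { // placeholder: loop over your own records

    key.set("some key");     // placeholder key text
    value.set("some value"); // placeholder value text

    writer.append(key, value);
}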

writer.close();

You can find more information here and here

Additionally, you could call the exact same functionality you described from Mahout by using the org.apache.mahout.text.SequenceFilesFromDirectory class.

Then the call looks something like this:

ToolRunner.run(new SequenceFilesFromDirectory(), args); // args is a String[] holding your parameters

ToolRunner here is org.apache.hadoop.util.ToolRunner.
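As a rough sketch of how the parameters for that call might be assembled, the helper below builds the same flags the question passes to bin/mahout seqdirectory. The class name SeqDirectoryArgs and the method build are hypothetical names made up for this illustration, and the paths are placeholders. The ToolRunner call itself is left as a comment because it needs Hadoop and Mahout on the classpath.

```java
import java.util.Arrays;

public class SeqDirectoryArgs {

    // Hypothetical helper: returns the flags used by `bin/mahout seqdirectory`
    // (-c charset, -i input directory, -o output directory) as a String[].
    static String[] build(String charset, String inputDir, String outputDir) {
        return new String[] {"-c", charset, "-i", inputDir, "-o", outputDir};
    }

    public static void main(String[] unused) {
        // Placeholder paths; replace them with your own input and output locations.
        String[] args = build("UTF-8", "c:\\inputPath", "c:\\outputPath");
        System.out.println(Arrays.toString(args));

        // With Hadoop and Mahout available, the array would be handed over like this:
        // int exitCode = ToolRunner.run(new SequenceFilesFromDirectory(), args);
    }
}
```

Building the array in one place keeps the flag names next to each other, so a missing -i or -o is easy to spot before the job is launched.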

Hope this was of help.

Dekker answered 25/7, 2012 at 8:16 Comment(9)
You might also want to look here, where the code uses both the SequenceFile Writer and Reader. - Dekker
What is the path "appledata/apples" in Path path = new Path("appledata/apples"); there? Is it a directory address? - Fireplace
It might be relative to the Hadoop file system (HDFS). - Dekker
So how can I set the address for this? I don't have any more information about it. - Fireplace
Then you don't need the HDFS, just specify the local path for where you want the output to be written. - Dekker
Set the output! I want to give it the address of the input text file and save the sequence file to an output address. I want to set both the input and output addresses. - Fireplace
I have already stated how to do so. You would do something like this: ToolRunner.run(new SequenceFilesFromDirectory(), new String[] {"-c", "UTF-8", "-i", "c:\\inputPath", "-o", "c:\\outputPath"}); - Dekker
I used your code, but it does not append new input to the sequence file. Every run creates a new sequence file. - Fireplace
Of course it creates a new sequence file. The code I presented creates a new SequenceFile.Writer every time you run it, so it will indeed overwrite anything that is present (if the output path is the same). If what you want to do is append your new data to the existing sequence file, you need to write your own code. - Dekker
