Mahout has a command for creating sequence files: bin/mahout seqdirectory -c UTF-8 -i <input address> -o <output address>
I want to use this command from code, through the API.
How can I use Mahout's sequence file API from code?
You can do something like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outputPath = new Path("c:\\temp"); // where the sequence file is written
Text key = new Text();   // example; this can be another class
Text value = new Text(); // example; this can be another class
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath, key.getClass(), value.getClass());
while (condition) { // placeholder: loop while you have records to write
    key.set("Some text");
    value.set("Some text");
    writer.append(key, value);
}
writer.close();
You can find more information here and here
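The keys and values for the loop above have to come from somewhere. Here is a minimal sketch, in plain Java, of one way to gather them: it walks an input directory and returns each file's relative path together with its decoded contents. The class name DirectoryRecords and the choice of the relative path as the key are assumptions made for illustration, not Mahout's implementation. Note that Path here is java.nio.file.Path from the JDK, not Hadoop's Path.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects (relative path -> file contents) pairs.
final class DirectoryRecords {
    private DirectoryRecords() {}

    static Map<String, String> collect(Path inputDir, Charset charset) throws IOException {
        List<Path> files;
        // Files.walk holds open directory handles, so close the stream when done.
        try (Stream<Path> walk = Files.walk(inputDir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // TreeMap keeps the records sorted by key for a predictable order.
        Map<String, String> records = new TreeMap<>();
        for (Path file : files) {
            // Use forward slashes in keys regardless of the platform separator.
            String key = inputDir.relativize(file).toString().replace('\\', '/');
            records.put(key, new String(Files.readAllBytes(file), charset));
        }
        return records;
    }
}
```

Each entry could then be written with writer.append(new Text(entry.getKey()), new Text(entry.getValue())) inside the loop.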
Additionally, you could call the exact same functionality you described from Mahout by using the org.apache.mahout.text.SequenceFilesFromDirectory class.
The call then looks something like this:
ToolRunner.run(new SequenceFilesFromDirectory(), args); // args is a String[] holding your parameters
ToolRunner comes from org.apache.hadoop.util.ToolRunner.
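The parameters are the same flags the command line takes, passed as separate array elements. A small sketch of building that array is below; the helper name SeqDirectoryArgs is hypothetical, and it only covers the three flags from the question (-c, -i, -o).

```java
import java.util.Objects;

// Hypothetical helper: builds the argument array for the seqdirectory tool.
final class SeqDirectoryArgs {
    private SeqDirectoryArgs() {}

    static String[] build(String charset, String inputDir, String outputDir) {
        Objects.requireNonNull(charset, "charset");
        Objects.requireNonNull(inputDir, "inputDir");
        Objects.requireNonNull(outputDir, "outputDir");
        // Each flag and each value is its own element, as on the command line.
        return new String[] {"-c", charset, "-i", inputDir, "-o", outputDir};
    }
}
```

With Hadoop and Mahout on the classpath, the result could be passed as ToolRunner.run(new SequenceFilesFromDirectory(), SeqDirectoryArgs.build("UTF-8", "c:\\inputPath", "c:\\outputPath")). ToolRunner.run returns the tool's exit code and is declared to throw Exception, so the caller has to handle both.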
Hope this was of help.
You might also want to look here, where the code uses both the SequenceFile Writer and Reader. –
Dekker
What is the path "appledata/apples" in Path path = new Path("appledata/apples"); there? Is it a directory address? –
Fireplace
It might be relative to the Hadoop file system (HDFS). –
Dekker
So, how can I set the address for this? I don't have any more information about it. –
Fireplace
Then you don't need HDFS; just specify the local path where you want the output to be written. –
Dekker
Set the output! I want to give it the input address of a text file and save the sequence file to the output address. I want to set both the input and the output address. –
Fireplace
I have already stated how to do so. You would do something like this:
ToolRunner.run(new SequenceFilesFromDirectory(), new String[] {"-c", "UTF-8", "-i", "c:\\inputPath", "-o", "c:\\outputPath"});
–
Dekker
I used your code, but it does not append new input to the sequence file. Every run creates a new sequence file. –
Fireplace
Of course it creates a new sequence file. The code I presented creates a new SequenceFile.Writer every time you run it, so it will indeed overwrite anything that is present (if the output path is the same). If what you want to do is append your new data to the existing sequence file, you need to write your own code. –
Dekker