How can I inspect a Hadoop SequenceFile for which I lack full schema information?
I have a compressed Hadoop SequenceFile from a customer which I'd like to inspect. I do not have full schema information at this time (which I'm working on separately).

But in the interim (and in the hopes of a generic solution), what are my options for inspecting the file?

I found a tool, forqlift: http://www.exmachinatech.net/01/forqlift/

I have tried 'forqlift list' on the file. It complains that it can't load the classes for the custom Writable subclasses the file contains, so I will need to track down those implementations.

But is there any other option available in the meantime? I understand that I most likely can't extract the data, but is there some tool for scanning how many key/value pairs the file contains and of what types they are?

Derisive answered 26/9, 2011 at 19:50 Comment(0)
Check the SequenceFileReadDemo class in the 'Hadoop: The Definitive Guide' sample code. Sequence files have the key/value types embedded in them. Use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.
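A minimal sketch of that approach (my own illustration, not the book's demo verbatim; the class name is made up and the file path is passed as an argument). getKeyClassName()/getValueClassName() return the class names recorded in the file header, so this should work even when the custom Writable classes are not on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class PeekSeqFileTypes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // The class names are stored as plain strings in the file header.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
      System.out.println("compressed:  " + reader.isCompressed());
    } finally {
      reader.close();
    }
  }
}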

Felafel answered 27/9, 2011 at 3:11 Comment(1)
If you have a copy of the definitive guide, this example is on page 133 (of the third edition). There are also examples to write and display a sequence file.Afflict
From shell:

$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0

or directly from Hive (which is much faster for small files, because it runs in an already-started JVM):

hive> dfs -text /user/hive/warehouse/table_seq/000000_0

Both work for sequence files.
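Note that -text deserializes the records, so if the keys or values are custom Writable subclasses (as in the question), those classes must be on the classpath. Adding their jar via HADOOP_CLASSPATH should work; the jar path below is hypothetical:

$ export HADOOP_CLASSPATH=/path/to/custom-writables.jar
$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0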

Graviton answered 3/9, 2014 at 13:13 Comment(0)
My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable classes the file uses, you can guess and check the error messages (there may be a better way that I don't know).

For example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

private void readSeqFile(Path pathToFile) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);

  Text key = new Text(); // this could be the wrong type
  Text val = new Text(); // also could be wrong

  // next() fills key/val in place and returns false at end of file;
  // it throws if the file's actual key/value types don't match Text.
  while (reader.next(key, val)) {
    System.out.println(key + ":" + val);
  }
}

This program would crash if those are the wrong types, but the exception should say which Writable types the key and value actually are.
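If the custom Writable classes can be put on the classpath, a variant in the spirit of the book's SequenceFileReadDemo avoids the guessing entirely by asking the reader for the types. This sketch would replace the two Text lines and the loop above, and additionally needs imports of org.apache.hadoop.io.Writable and org.apache.hadoop.util.ReflectionUtils:

// Instantiate whatever key/value classes the file declares, instead of guessing Text.
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

// Each type's toString() decides how it prints.
while (reader.next(key, val)) {
  System.out.println(key + "\t" + val);
}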

Edit: Actually, if you do `less file.seq`, you can usually read some of the header and see what the Writable types are (at least for the first key/value). On one file, for example, I see:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
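A quick shell-only variant of the same check (assuming the usual head/strings tools; file.seq stands in for your file): the key class, value class, and compression codec names are stored as plain strings near the start of the file, so something like this should surface them without any Java:

$ head -c 1000 file.seq | strings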

Paid answered 26/9, 2011 at 21:14 Comment(0)
I'm not a Java or Hadoop programmer, so my way of solving the problem may not be the best one, but anyway.

I spent two days solving the problem of reading a SequenceFile locally (Linux Debian amd64) without installing Hadoop.

The provided sample

while (reader.next(key, val)) {
    System.out.println(key + ":" + val);
  }

works well for Text, but didn't work for my BytesWritable compressed input data.

What did I do? I downloaded this utility for creating (writing) Hadoop SequenceFile data, github.com/shsdev/sequencefile-utility/archive/master.zip, got it working, and then modified it for reading input Hadoop SequenceFiles.

The instructions for running this utility from scratch on Debian:

sudo apt-get install maven2
sudo mvn install
sudo apt-get install openjdk-7-jdk

edit "sudo vi /usr/bin/mvn",
change `which java` to `which /usr/lib/jvm/java-7-openjdk-amd64/bin/java`

I've also added the following to ~/.bashrc (probably not required):

PATH="/home/mine/perl5/bin${PATH+:}${PATH}:/usr/lib/jvm/java-7-openjdk-amd64/"; export PATH;

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export JAVA_VERSION=1.7


Then usage:

sudo mvn install
~/hadoop_tools/sequencefile-utility/sequencefile-utility-master$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.0-jar-with-dependencies.jar

This doesn't break the default Java 1.6 installation that is required for Firefox etc.

For resolving an error with SequenceFile compatibility (e.g. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"), I used the libs from the Hadoop master server as-is (a kind of hack):

scp [email protected]:/usr/lib/libhadoop.so.1.0.0 ~/
sudo cp ~/libhadoop.so.1.0.0 /usr/lib/
scp [email protected]:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server/libjvm.so ~/
sudo cp ~/libjvm.so /usr/lib/
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so.1
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so

After one night drinking coffee, I wrote this code for reading Hadoop SequenceFile input files (run with: /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.3-jar-with-dependencies.jar -d test/ -c NONE):

// This is a fragment pasted into the existing SequenceFileWriter.java (see below),
// so fs, conf and logger come from the surrounding class.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.ValueBytes;

import java.io.DataOutputStream;
import java.io.FileOutputStream;

Path file = new Path("/home/mine/mycompany/task13/data/2015-08-30");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
long pos = reader.getPosition();
logger.info("GO from pos " + pos);

// Read the raw (still serialized) key and value bytes, so the original
// Writable classes are not needed on the classpath.
ValueBytes rawValue = reader.createValueBytes();

int DEFAULT_BUFFER_SIZE = 1024 * 1024;
DataOutputBuffer kobuf = new DataOutputBuffer(DEFAULT_BUFFER_SIZE);
kobuf.reset();

int rl;
do {
  // nextRaw() returns the record length, or -1 at end of file.
  rl = reader.nextRaw(kobuf, rawValue);
  logger.info("read len for current record: " + rl);
  if (rl >= 0) {
    logger.info("read key " + new String(kobuf.getData(), 0, kobuf.getLength())
        + " (keylen " + kobuf.getLength() + ") and data " + rawValue.getSize());
    // Dump the decompressed value bytes; note this overwrites the same
    // output file for every record.
    FileOutputStream fos = new FileOutputStream("/home/mine/outb");
    DataOutputStream dos = new DataOutputStream(fos);
    rawValue.writeUncompressedBytes(dos);
    dos.close();
    kobuf.reset();
  }
} while (rl > 0);
I added this chunk of code to the file src/main/java/eu/scape_project/tb/lsdr/seqfileutility/SequenceFileWriter.java just after the line

    writer = SequenceFile.createWriter(fs, conf, path, keyClass, valueClass, CompressionType.get(pc.getCompressionType()));

Thanks to these sources of info:

If using hadoop-core instead of mahout, you will have to download asm-3.1.jar manually: search.maven.org/remotecontent?filepath=org/ow2/util/asm/asm/3.1/asm-3.1.jar (search.maven.org/#search|ga|1|asm-3.1)

The list of available Mahout repos: repo1.maven.org/maven2/org/apache/mahout/. Intro to Mahout: mahout.apache.org/

A good resource for learning the interfaces and sources of Hadoop Java classes (I used it for writing my own code for reading SequenceFiles): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/io/BytesWritable.java

Sources of the project tb-lsdr-seqfilecreator that I used as the base for my own SequenceFile reader: www.javased.com/?source_dir=scape/tb-lsdr-seqfilecreator/src/main/java/eu/scape_project/tb/lsdr/seqfileutility/ProcessParameters.java

stackoverflow.com/questions/5096128/sequence-files-in-hadoop - the same example (the read key/value approach that doesn't work)

https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java - this one helped me (I used reader.nextRaw the same way it is used in nextKeyValue() and the other methods)

Also, I changed ./pom.xml to depend on native org.apache.hadoop instead of org.apache.mahout.hadoop, but this is probably not required, because the bug with reader.next(key, value) is the same for both, so I had to use reader.nextRaw(keyRaw, valueRaw) instead either way:

diff ../../sequencefile-utility/sequencefile-utility-master/pom.xml ./pom.xml 
9c9
<     <version>1.0</version>
---
>     <version>1.3</version>
63c63
<             <version>2.0.1</version>
---
>             <version>2.4</version>
85c85
<             <groupId>org.apache.mahout.hadoop</groupId>
---
>             <groupId>org.apache.hadoop</groupId>
87c87
<             <version>0.20.1</version>
---
>             <version>1.1.2</version>
93c93
<             <version>1.1</version>
---
>             <version>1.1.3</version>
Osterman answered 9/9, 2015 at 10:42 Comment(1)
Yes, it is an answer.Moncear
I was just playing with Dumbo. When you run a Dumbo job on a Hadoop cluster, the output is a sequence file. I used the following to dump out an entire Dumbo-generated sequence file as plain text:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -input totals/part-00000 \
    -output unseq \
    -inputformat SequenceFileAsTextInputFormat
$ bin/hadoop fs -cat unseq/part-00000

I got the idea from here.

Incidentally, Dumbo can also output plain text.
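If the sequence file holds custom Writable subclasses (as in the original question), the streaming job also needs those classes on its classpath; passing their jar through the generic -libjars option should work (the jar path below is hypothetical):

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -libjars /path/to/custom-writables.jar \
    -input totals/part-00000 \
    -output unseq \
    -inputformat SequenceFileAsTextInputFormat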

Footslog answered 15/2, 2013 at 18:41 Comment(1)
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text. It's not working when I give it a simple sequence file as input.Masonmasonic
Following the answer by Praveen Sripati, here is a small example of SequenceFileReadDemo.java from 'Hadoop: The Definitive Guide' by Tom White.

The data are in HDFS at user/hduser/output-hashsort/ and the file is part-r-00001. In Eclipse, in the program arguments I've written that path (screenshot not reproduced).

and this is part of the output, seen with the debugger (screenshot not reproduced).

Terenceterencio answered 30/11, 2019 at 17:29 Comment(0)
