I have a bunch of Hadoop SequenceFiles that were written with a Writable implementation of mine. Let's call it FishWritable.
This Writable served me well for a while, until I decided a package rename was in order for clarity. So the fully qualified name of FishWritable is now com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. It was a reasonable change given how the scope of the package in question had evolved.
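For concreteness, FishWritable itself is nothing exotic. It is roughly the following shape (the fields here are invented purely for illustration; only the fully qualified class name matters for this question):

    package com.vertebrates.fishes;  // previously com.mammals.fishes

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Roughly what FishWritable looks like; the real fields are irrelevant here.
    public class FishWritable implements Writable {

        private String species;
        private int count;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(species);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            species = in.readUTF();
            count = in.readInt();
        }
    }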
Then I discovered that none of my MapReduce jobs would run, because they crash when attempting to initialize the SequenceFileRecordReader:
    java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
        at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
        ...
A couple of options for dealing with this are immediately apparent. I can simply rerun all my previous jobs to regenerate the output with the up-to-date key class name, running any dependent jobs in sequence. This can obviously be quite time-consuming, and sometimes it isn't even possible.
Another possibility is to write a simple job that reads each SequenceFile as raw text and replaces every occurrence of the old class name with the new one (a rough sketch of what I mean follows below). This is basically option #1 with a tweak that makes it less complicated to do, but with a lot of big files it is still quite impractical.
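To make that option concrete, here is the kind of brute-force patch I have in mind. It is only a sketch: it operates on local copies of the files (pulled down with hadoop fs -get, say), it assumes the header stores the key class name as a single length byte followed by the UTF-8 name (which, as far as I can tell, is what recent Hadoop versions write for names this short), and it assumes the old name appears nowhere else in the file. I have not verified it against every SequenceFile version, so treat it as an illustration rather than a tool.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch: patch the key class name recorded in a SequenceFile header by
    // rewriting the raw bytes. Back up the files before trying anything like this.
    public class PatchSequenceFileHeader {

        private static final String OLD_NAME = "com.mammals.fishes.FishWritable";
        private static final String NEW_NAME = "com.vertebrates.fishes.FishWritable";

        public static void main(String[] args) throws IOException {
            for (String arg : args) {
                Path file = Paths.get(arg);
                byte[] data = Files.readAllBytes(file);

                byte[] oldPattern = withLengthPrefix(OLD_NAME);
                byte[] newPattern = withLengthPrefix(NEW_NAME);

                int pos = indexOf(data, oldPattern);
                if (pos < 0) {
                    System.out.println("Old class name not found in " + file);
                    continue;
                }

                // Splice the new length-prefixed name in place of the old one.
                byte[] patched = new byte[data.length - oldPattern.length + newPattern.length];
                System.arraycopy(data, 0, patched, 0, pos);
                System.arraycopy(newPattern, 0, patched, pos, newPattern.length);
                System.arraycopy(data, pos + oldPattern.length,
                                 patched, pos + newPattern.length,
                                 data.length - pos - oldPattern.length);

                Files.write(file, patched);
                System.out.println("Patched " + file);
            }
        }

        // Class names in the header are written as a vint length followed by
        // UTF-8 bytes; for names shorter than 128 bytes the vint is one byte
        // equal to the length itself.
        private static byte[] withLengthPrefix(String name) {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            byte[] out = new byte[utf8.length + 1];
            out[0] = (byte) utf8.length;
            System.arraycopy(utf8, 0, out, 1, utf8.length);
            return out;
        }

        private static int indexOf(byte[] haystack, byte[] needle) {
            outer:
            for (int i = 0; i <= haystack.length - needle.length; i++) {
                for (int j = 0; j < needle.length; j++) {
                    if (haystack[i + j] != needle[j]) {
                        continue outer;
                    }
                }
                return i;
            }
            return -1;
        }
    }

Even if that works, it still has to be run and verified against every file, which is why I would much rather have a read-time fallback.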
Is there a better way to deal with refactorings of fully qualified class names used in SequenceFiles? Ideally, I'm looking for a way to specify a fallback class name to be used when the one recorded in the file cannot be found, so the same code can run against both the old and the updated versions of these SequenceFiles.
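To illustrate the kind of hook I'm after, this is roughly the behaviour I would like to be able to register somewhere before the reader resolves the class name from the header. To be clear, this is not an existing Hadoop API as far as I know; it is just a self-contained sketch of the fallback semantics I want:

    import java.util.HashMap;
    import java.util.Map;

    // Not a Hadoop API -- just an illustration of the desired fallback behaviour:
    // when the class named in a SequenceFile header cannot be loaded, resolve it
    // through a registered alias instead of failing.
    public class WritableFallbacks {

        private static final Map<String, Class<?>> ALIASES = new HashMap<>();

        /** Register: "if oldName cannot be loaded, use newClass instead". */
        public static void addAlias(String oldName, Class<?> newClass) {
            ALIASES.put(oldName, newClass);
        }

        /** Resolve a header class name, falling back to any registered alias. */
        public static Class<?> resolve(String className) throws ClassNotFoundException {
            try {
                return Class.forName(className);
            } catch (ClassNotFoundException e) {
                Class<?> alias = ALIASES.get(className);
                if (alias != null) {
                    return alias;
                }
                throw e;
            }
        }
    }

In other words, during job setup I would register something like addAlias("com.mammals.fishes.FishWritable", FishWritable.class), and the record reader would consult that mapping instead of throwing. Does anything along these lines exist, or is there an accepted workaround?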