I have a bunch of Hadoop SequenceFiles that were written with a Writable implementation of mine. Let's call it FishWritable.
This Writable served me well for a while, until I decided a package rename was in order for clarity. So the fully qualified name of FishWritable is now com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. It was a reasonable change given how the scope of the package in question had evolved.
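For concreteness, FishWritable itself is nothing exotic. It is roughly the following shape (the fields here are invented purely for illustration; only the fully qualified class name matters for this question):

    package com.vertebrates.fishes;  // previously com.mammals.fishes

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Roughly what FishWritable looks like; the real fields are irrelevant here.
    public class FishWritable implements Writable {

        private String species;
        private int count;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(species);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            species = in.readUTF();
            count = in.readInt();
        }
    }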
Then I discovered that none of my MapReduce jobs would run, because they crash when attempting to initialize the SequenceFileRecordReader:
    java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
        at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
        ...
A couple of options for dealing with this are immediately apparent. I can simply rerun all my previous jobs to regenerate the output with the up-to-date key class name, running any dependent jobs in sequence. This can obviously be quite time-consuming, and sometimes it isn't even possible.
Another possibility is to write a simple job that reads each SequenceFile as raw text and replaces every occurrence of the old class name with the new one (a rough sketch of what I mean follows below). This is basically option #1 with a tweak that makes it less complicated to do, but with a lot of big files it is still quite impractical.
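To make that option concrete, here is the kind of brute-force patch I have in mind. It is only a sketch: it operates on local copies of the files (pulled down with hadoop fs -get, say), it assumes the header stores the key class name as a single length byte followed by the UTF-8 name (which, as far as I can tell, is what recent Hadoop versions write for names this short), and it assumes the old name appears nowhere else in the file. I have not verified it against every SequenceFile version, so treat it as an illustration rather than a tool.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch: patch the key class name recorded in a SequenceFile header by
    // rewriting the raw bytes. Back up the files before trying anything like this.
    public class PatchSequenceFileHeader {

        private static final String OLD_NAME = "com.mammals.fishes.FishWritable";
        private static final String NEW_NAME = "com.vertebrates.fishes.FishWritable";

        public static void main(String[] args) throws IOException {
            for (String arg : args) {
                Path file = Paths.get(arg);
                byte[] data = Files.readAllBytes(file);

                byte[] oldPattern = withLengthPrefix(OLD_NAME);
                byte[] newPattern = withLengthPrefix(NEW_NAME);

                int pos = indexOf(data, oldPattern);
                if (pos < 0) {
                    System.out.println("Old class name not found in " + file);
                    continue;
                }

                // Splice the new length-prefixed name in place of the old one.
                byte[] patched = new byte[data.length - oldPattern.length + newPattern.length];
                System.arraycopy(data, 0, patched, 0, pos);
                System.arraycopy(newPattern, 0, patched, pos, newPattern.length);
                System.arraycopy(data, pos + oldPattern.length,
                                 patched, pos + newPattern.length,
                                 data.length - pos - oldPattern.length);

                Files.write(file, patched);
                System.out.println("Patched " + file);
            }
        }

        // Class names in the header are written as a vint length followed by
        // UTF-8 bytes; for names shorter than 128 bytes the vint is one byte
        // equal to the length itself.
        private static byte[] withLengthPrefix(String name) {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            byte[] out = new byte[utf8.length + 1];
            out[0] = (byte) utf8.length;
            System.arraycopy(utf8, 0, out, 1, utf8.length);
            return out;
        }

        private static int indexOf(byte[] haystack, byte[] needle) {
            outer:
            for (int i = 0; i <= haystack.length - needle.length; i++) {
                for (int j = 0; j < needle.length; j++) {
                    if (haystack[i + j] != needle[j]) {
                        continue outer;
                    }
                }
                return i;
            }
            return -1;
        }
    }

Even if that works, it still has to be run and verified against every file, which is why I would much rather have a read-time fallback.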
Is there a better way to deal with refactorings of fully qualified class names used in SequenceFiles? Ideally, I'm looking for a way to specify a fallback class name to be used when the one recorded in the file cannot be found, so the same code can run against both the old and the updated versions of these SequenceFiles.
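To illustrate the kind of hook I'm after, this is roughly the behaviour I would like to be able to register somewhere before the reader resolves the class name from the header. To be clear, this is not an existing Hadoop API as far as I know; it is just a self-contained sketch of the fallback semantics I want:

    import java.util.HashMap;
    import java.util.Map;

    // Not a Hadoop API -- just an illustration of the desired fallback behaviour:
    // when the class named in a SequenceFile header cannot be loaded, resolve it
    // through a registered alias instead of failing.
    public class WritableFallbacks {

        private static final Map<String, Class<?>> ALIASES = new HashMap<>();

        /** Register: "if oldName cannot be loaded, use newClass instead". */
        public static void addAlias(String oldName, Class<?> newClass) {
            ALIASES.put(oldName, newClass);
        }

        /** Resolve a header class name, falling back to any registered alias. */
        public static Class<?> resolve(String className) throws ClassNotFoundException {
            try {
                return Class.forName(className);
            } catch (ClassNotFoundException e) {
                Class<?> alias = ALIASES.get(className);
                if (alias != null) {
                    return alias;
                }
                throw e;
            }
        }
    }

In other words, during job setup I would register something like addAlias("com.mammals.fishes.FishWritable", FishWritable.class), and the record reader would consult that mapping instead of throwing. Does anything along these lines exist, or is there an accepted workaround?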