For a data recovery program I need to be able to extract the values+types from files written by NSArchiver, without having access to Apple's CF / NS frameworks.
The OS X file command reports such files as:
NeXT/Apple typedstream data, little endian, version 4, system 1000
Is there any documentation on how these files are encoded, or has anyone come up with code that can parse them?
Here's an example of such data (also: downloadable):
04 0B 73 74 72 65 61 6D 74 79 70 65 64 81 E8 03 ..streamtyped...
84 01 40 84 84 84 12 4E 53 41 74 74 72 69 62 75 ..@....NSAttribu
74 65 64 53 74 72 69 6E 67 00 84 84 08 4E 53 4F tedString....NSO
62 6A 65 63 74 00 85 92 84 84 84 08 4E 53 53 74 bject.......NSSt
72 69 6E 67 01 94 84 01 2B 06 46 65 73 6B 65 72 ring....+.Fesker
86 84 02 69 49 01 06 92 84 84 84 0C 4E 53 44 69 ...iI.......NSDi
63 74 69 6F 6E 61 72 79 00 94 84 01 69 01 92 84 ctionary....i...
96 96 1D 5F 5F 6B 49 4D 4D 65 73 73 61 67 65 50 ...__kIMMessageP
61 72 74 41 74 74 72 69 62 75 74 65 4E 61 6D 65 artAttributeName
86 92 84 84 84 08 4E 53 4E 75 6D 62 65 72 00 84 ......NSNumber..
84 07 4E 53 56 61 6C 75 65 00 94 84 01 2A 84 99 ..NSValue....*..
99 00 86 86 86 .....
This contains an NSAttributedString. I have similar examples that contain NSMutableAttributedStrings, etc., but all eventually resolve to NSAttributedStrings, whose text I would like to extract. I do not care about the rest, but I need to know whether the archive is valid.
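From the dump above it looks as if class names and the text itself are stored as a one-byte length followed by that many bytes (0x12 before the 18-character "NSAttributedString", 0x06 before "Fesker"). A minimal Python sketch that lists such candidates is shown here. It is only a heuristic built on that observation, not a parser for the actual format, and the function name `length_prefixed_ascii` is made up for illustration:

```python
def length_prefixed_ascii(data, min_len=4):
    """Yield (offset, text) for runs shaped like <length byte><printable ASCII>.

    Assumption (from the sample only): strings carry a single-byte length
    prefix. Longer strings or non-ASCII text may be encoded differently.
    """
    i = 0
    while i < len(data):
        n = data[i]
        chunk = data[i + 1:i + 1 + n]
        # Accept only complete runs of printable ASCII of the announced length.
        if n >= min_len and len(chunk) == n and all(0x20 <= b < 0x7F for b in chunk):
            yield i, chunk.decode("ascii")
            i += 1 + n
        else:
            i += 1


# First 80 bytes of the sample dump.
sample = bytes.fromhex(
    "04 0B 73 74 72 65 61 6D 74 79 70 65 64 81 E8 03 "
    "84 01 40 84 84 84 12 4E 53 41 74 74 72 69 62 75 "
    "74 65 64 53 74 72 69 6E 67 00 84 84 08 4E 53 4F "
    "62 6A 65 63 74 00 85 92 84 84 84 08 4E 53 53 74 "
    "72 69 6E 67 01 94 84 01 2B 06 46 65 73 6B 65 72"
)
for offset, text in length_prefixed_ascii(sample):
    print(offset, text)
```

For these 80 bytes it lists streamtyped, NSAttributedString, NSObject, NSString and Fesker. It cannot tell a class name from payload text, and it says nothing about whether the archive as a whole is valid.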
My current solution is to use NSUnarchiver and, assuming I should always find an NSAttributedString in there, decode the first object and read its text, then recreate an archive from it and check whether it matches the original data. If I get an exception or a different archive back, I assume that the archive is damaged or invalid:
NSData *data = [[NSData alloc] initWithBytesNoCopy:dataPtr length:dataLen freeWhenDone:NO];
NSUnarchiver *a = nil;
// The algorithm simply assumes that the data contains a NSAttributedString, retrieves it,
// and then recreates the NSArchived version from it in order to tell its size.
@try {
    a = [[NSUnarchiver alloc] initForReadingWithData:data];
    NSAttributedString *s = [a decodeObject];
    // re-encode the string item so we can tell its length
    NSData *d = [NSArchiver archivedDataWithRootObject:s];
    // guard the range so a longer re-encoding is treated as a mismatch, not an exception
    if (d.length <= data.length &&
        [d isEqualToData:[data subdataWithRange:NSMakeRange(0, d.length)]]) {
        lenOut = (int) d.length;
        okay = true; // -> lenOut is valid, though textOut might still fail, see @catch below
        textOut = [s.string cStringUsingEncoding:NSUTF8StringEncoding];
    } else {
        // oops, we don't get back what we had as input, so let's better not consider this valid
    }
} @catch (NSException *e) {
    // data is invalid
}
However, there are several issues with the above code:
- It's not cross-platform. I need this to work on Windows, too.
- Some examples of damaged data cause an unwanted error message to be written to stderr or syslog (not sure which), such as:
*** mmap(size=18446744071608111104) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug
(I filed a bug report about this, which was sadly closed as "won't fix".)
- Nothing guarantees that the NSUnarchiver code is 100% crashproof. The malloc error is an example of this. I might just as well get a bus error in some situations, and that would be fatal. With custom parsing code, I could take care of that myself (and fix any crashes I encounter). (Update: I just found some invalid data that does indeed crash NSUnarchiver with a SIGSEGV.)
Therefore, I need custom code to decode these kinds of archives. I've looked at a few of them, but can't make sense of the codes they use. There appear to be length fields and type fields, with the type codes in the range of roughly 0x81 to 0x86. Also, the first 16 bytes are the header, including the system code (0x03E8 = 1000, stored little endian as E8 03) at offsets 14-15.
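To make the header guess concrete, here is a small Python sketch that checks the 16 bytes seen in the sample. The layout is inferred from that one sample and from what the file command prints, not from any documentation. In particular, treating 0x0B as the signature length and 0x81 as a marker for "a 16-bit integer follows" are assumptions, and `parse_header` is a hypothetical helper name:

```python
import struct


def parse_header(data):
    """Parse a typedstream header laid out like the sample.

    Assumed layout (inferred, not documented):
      byte 0       stream version (4 in the sample)
      byte 1       signature length (0x0B = 11)
      next bytes   "streamtyped" (little endian) or "typedstream" (big endian)
      next byte    0x81, assumed to announce a 16-bit integer
      next 2 bytes system version (E8 03 = 1000 in the sample)
    """
    if len(data) < 2:
        raise ValueError("truncated header")
    version, sig_len = data[0], data[1]
    end = 2 + sig_len + 3  # signature + marker byte + 16-bit value
    if len(data) < end:
        raise ValueError("truncated header")
    signature = bytes(data[2:2 + sig_len])
    if signature == b"streamtyped":
        order = "<"
    elif signature == b"typedstream":
        order = ">"
    else:
        raise ValueError("not a typedstream signature: %r" % signature)
    if data[2 + sig_len] != 0x81:
        raise ValueError("unexpected marker before system version")
    (system,) = struct.unpack_from(order + "H", data, 3 + sig_len)
    return {
        "version": version,
        "little_endian": order == "<",
        "system": system,
        "size": end,
    }


header = bytes.fromhex("04 0B 73 74 72 65 61 6D 74 79 70 65 64 81 E8 03")
print(parse_header(header))
```

Every read is bounds-checked and failures raise ValueError, which is the kind of control over damaged input that NSUnarchiver does not give me. What follows the header (the 0x84, 0x85, 0x86, 0x92 ... codes) is exactly the part I still cannot interpret.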
I also wonder whether the source code is available in some old NeXT sources or in the Windows version that once existed, but where would I find that? (Note: I was directed to the GNUstep source ("core.20131003.tar.bz2"), in which I found its NSUnarchiver source, but that code, apparently from 1998, uses its own encoding and does not understand this "streamtyped" encoding.)