How to parse a tar file in C++

Asked 24/3, 2010 at 2:54 Answered 13/1, 2016 at 15:29

What I want to do is download a .tar file with multiple directories with 2 files each. The problem is I can't find a way to read the tar file without actually extracting the files (using tar).

The perfect solution would be something like:

#include <easytar>

Tarfile tar("somefile.tar");
std::string currentFile, currentFileName;
for(int i=0; i<tar.size(); i++){
  currentFile = tar.getFileText(i);
  currentFileName = tar.getFileName(i);
  // do stuff with it
}

I'm probably going to have to write this myself, but any ideas would be appreciated..

Bollen answered 24/3, 2010 at 2:54 Comment(2)

man tar tells me -t List archive contents to stdout. Is that what you want? – Merissameristem 24/3, 2010 at 7:51

What I'm actually wanting is the opposite: reading a tar file from stdin. – Bollen 24/3, 2010 at 21:26

I figured this out myself after a bit of work. The tar file spec actually tells you everything you need to know.

First off, every file starts with a 512 byte header, so you can represent it with a char[512] or a char* pointing at somewhere in your larger char array (if you have the entire file loaded into one array for example).

The header looks like this:

location  size  field
0         100   File name
100       8     File mode
108       8     Owner's numeric user ID
116       8     Group's numeric user ID
124       12    File size in bytes
136       12    Last modification time in numeric Unix time format
148       8     Checksum for header block
156       1     Link indicator (file type)
157       100   Name of linked file

So if you want the file name, you can grab it with string filename(&buffer[0], 100);. The file name is NUL-padded, so that constructor also copies the padding bytes. If you check that the field contains at least one NUL, you can leave off the size and use string filename(&buffer[0]); to get just the name.
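As a minimal sketch of that name extraction (header_name is a hypothetical helper, not part of any library), the length can be capped at the first NUL, falling back to the full 100 bytes when the name fills the whole field:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns the entry name stored in the first 100 bytes
// of a 512-byte tar header. The field is NUL-padded, but a name that is
// exactly 100 characters long has no terminator, so the length is capped.
std::string header_name(const char *header) {
    assert(header != nullptr);
    const std::size_t field_size = 100;
    // memchr finds the first NUL inside the field, if there is one.
    const void *nul = std::memchr(header, '\0', field_size);
    std::size_t len = field_size;
    if (nul != nullptr) {
        len = static_cast<std::size_t>(static_cast<const char *>(nul) - header);
    }
    return std::string(header, len);
}
```

This only looks at the 100-byte name field. Longer paths (the ustar prefix field or GNU long-name entries) are outside what this sketch handles.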

Now we want to know if it's a file or a folder. The "link indicator" field has this information, so:

// Note that we're comparing to ascii numbers, not ints
switch(buffer[156]){
    case '0': // intentionally dropping through
    case '\0':
        // normal file
        break;
    case '1':
        // hard link
        break;
    case '2':
        // symbolic link
        break;
    case '3':
        // character device (character special file)
        break;
    case '4':
        // block device
        break;
    case '5':
        // directory
        break;
    case '6':
        // named pipe
        break;
}

At this point, we already have all of the information we need about directories, but we need one more thing from normal files: the actual file contents.

The length of the file can be stored in two different ways, either as a 0-or-space-padded null-terminated octal string, or "a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field".

Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.

Here's how you would read the octal format, but I haven't written code for the base-256 version:

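// in one function
int size_of_file = octal_string_to_int(&buffer[124], 11);

// elsewhere
int octal_string_to_int(const char *current_char, unsigned int size){
    unsigned int output = 0;
    // Pre-POSIX archives pad with leading spaces instead of zeroes
    while(size > 0 && *current_char == ' '){
        current_char++;
        size--;
    }
    // Stop at the first byte that isn't an octal digit (the NUL or space terminator)
    while(size > 0 && *current_char >= '0' && *current_char <= '7'){
        output = output * 8 + (*current_char - '0');
        current_char++;
        size--;
    }
    return output;
}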
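For the base-256 variant, here is a rough sketch based on my reading of the description quoted above: the high bit of the first byte is the marker, and the remaining bits are treated as a big-endian number. The function names are made up for illustration. It assumes the value is non-negative and fits in 64 bits, so treat it as a starting point and not a reference implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the numeric field uses the base-256 extension,
// i.e. the high-order bit of its leftmost byte is set.
bool is_base256(const char *field) {
    return (static_cast<unsigned char>(field[0]) & 0x80) != 0;
}

// Hypothetical helper: decodes a non-negative base-256 field as a
// big-endian integer. Assumption: the value fits in 64 bits; higher
// bits are silently shifted out, and negative encodings are not handled.
std::uint64_t base256_to_uint(const char *field, std::size_t size) {
    assert(size > 0 && is_base256(field));
    // Drop the marker bit; the low 7 bits are the top of the value.
    std::uint64_t value = static_cast<unsigned char>(field[0]) & 0x7F;
    for (std::size_t i = 1; i < size; ++i) {
        value = (value << 8) | static_cast<unsigned char>(field[i]);
    }
    return value;
}
```

A caller would check is_base256(&buffer[124]) first and fall back to the octal parser when it returns false.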

Ok, so now we have everything except the actual file contents. All we have to do is grab the next size bytes of data from the tar file and we'll have our file contents:

// Get to the next block after the header ends
location += 512;
file_contents = new char[size];
memcpy(file_contents, &buffer[location], size);
// Go to the next header by rounding the size up to a multiple of 512
// This isn't necessarily the most efficient way to do this,
// but it's the most obvious.
location += (int)ceil(size / 512.0) * 512;
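Putting the steps above together, a loop over an archive that is already loaded into memory might look roughly like this. TarEntry, parse_octal and read_tar are names invented for this sketch. It only understands octal sizes and the 100-byte name field, stops at the first all-zero header, and ignores checksums, base-256 sizes and pax/GNU extension headers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record type for this sketch.
struct TarEntry {
    std::string name;      // from the 100-byte name field
    char type;             // link indicator byte at offset 156
    std::string contents;  // raw file bytes (empty for directories)
};

// Parses an octal field that may have leading spaces and ends at a
// NUL, a space, or the end of the field.
static std::size_t parse_octal(const char *p, std::size_t n) {
    std::size_t value = 0;
    std::size_t i = 0;
    while (i < n && p[i] == ' ') ++i;
    for (; i < n && p[i] >= '0' && p[i] <= '7'; ++i) {
        value = value * 8 + static_cast<std::size_t>(p[i] - '0');
    }
    return value;
}

std::vector<TarEntry> read_tar(const char *buffer, std::size_t length) {
    assert(buffer != nullptr);
    const std::size_t block = 512;
    std::vector<TarEntry> entries;
    std::size_t location = 0;
    while (location + block <= length) {
        const char *header = buffer + location;
        if (header[0] == '\0') break;  // a zeroed block marks the end of the archive
        std::size_t size = parse_octal(header + 124, 12);
        location += block;                    // data starts right after the header
        if (size > length - location) break;  // truncated archive, stop here
        TarEntry entry;
        const void *nul = std::memchr(header, '\0', 100);
        std::size_t name_len = 100;
        if (nul != nullptr) {
            name_len = static_cast<std::size_t>(static_cast<const char *>(nul) - header);
        }
        entry.name.assign(header, name_len);
        entry.type = header[156];
        entry.contents.assign(buffer + location, size);
        entries.push_back(entry);
        // Round the data size up to a whole number of 512-byte blocks.
        location += (size + block - 1) / block * block;
    }
    return entries;
}
```

The integer rounding `(size + 511) / 512 * 512` gives 0 for empty entries and exactly 512 for a 512-byte file, which covers the two edge cases that usually trip up the block arithmetic.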

Bollen answered 24/3, 2010 at 21:25 Comment(6)

I am currently using your code, and for tar files created with Gnome File Roller, the "sizeOfFile = octalStringToInt(..., 11)" seems to be wrong "in some rare cases". Could you point out what was the "magic" omitted in the 12th byte ? – Jessiejessika 17/6, 2013 at 17:59

@Jessiejessika I really don't know. If you find out, let me know. – Bollen 28/6, 2013 at 22:8

Note, if file size is exactly 512 bytes, then location = location + ((size / 512) + 1) * 512 will miss next header – Poacher 3/12, 2013 at 20:9

if the size is 0 (such as for a directory), then I think that line will advance an extra 512 when it shouldn't. Maybe have a condition: if (size > 0) { location = location + (((size - 1) / 512) + 1) * 512; } – Ohara 17/12, 2013 at 22:35

@Ohara I did it a simpler way so I can stop thinking about edge cases. I assume people who are using this can figure out a more efficient way to round if they want. – Bollen 18/12, 2013 at 1:52

@Jessiejessika This may be a bit late for you, but apparently there's some base-256 format, which is distinguished by the first bit of the size field. I plan to look into this question later and actually write some code for parsing it. – Bollen 30/4, 2014 at 19:11

Have you looked at libtar?

From the fink package info:

libtar-1.2-1: Tar file manipulation API libtar is a C library for manipulating POSIX tar files. It handles adding and extracting files to/from a tar archive. libtar offers the following features:
* Flexible API - you can manipulate individual files or just extract a whole archive at once.
* Allows user-specified read() and write() functions, such as zlib's gzread() and gzwrite().
* Supports both POSIX 1003.1-1990 and GNU tar file formats.

Not c++ per se, but you can link to c pretty easily...

Diphenyl answered 24/3, 2010 at 2:55 Comment(1)

@BrendanLong King of sucks is an overstatement. – Swoon 2/7, 2016 at 22:57

libarchive is an open-source library that can parse tarballs. It can read each file from an archive without extracting it, and it can also write data to form a new archive.

Calamine answered 13/1, 2016 at 15:29 Comment(0)
