Code and Life

Programming, electronics and other cool tech stuff

Unzip Library for C

Christmas holidays are a wonderful time to invent new projects. I decided I’d do some desktop coding for a change, and try to code an optimized image viewer for my old zipped pocket camera photos. The first task, of course, was to read a zip file.

To my surprise, there wasn’t a “GNU standard library” available for this task like there is zlib for general compression, or libjpeg and libpng for images. The best match for my simple needs seemed to be Minizip, but at 7378 lines of code, with 2125 for unzip.c alone (which relies on zlib, so it is basically just file handling), I was not convinced, especially because I knew I had some very specific requirements to cater for (namely uncompressing all JPEGs to memory for fast rendering and thumbnail generation).

Zip File Structure – The Essentials

The ZIP file format turned out to be surprisingly simple, especially since I decided to stick to the bare essentials and skip zip64 support, encryption, multifile zips, and all compression methods other than “store” (no compression) and “deflate” (easily decompressed with zlib, see below). Even with this barebones setup, my zip routines would handle about 99.9 % of zips out there just fine.

Drawing on the excellent ZIP format documentation in InfoZip’s latest appnote, the file I needed to parse has the following structure:

  • Local file header 1
  • File data 1
  • Data descriptor 1
  • Local file header N
  • File data N
  • Data descriptor N
  • … optional decryption / extra stuff …
  • Central directory
  • … zip64-specific extra stuff …
  • End of central directory record

Locating The End Record

Like many archive formats, zip stores the compressed file data at the beginning and the table of contents at the end. This means that the “starting point” for reading a zip is the “end of central directory record” (I’ll refer to it as the “end record” from now on), which contains a pointer to the central directory. Because this end record ends with a variable-size comment field, its exact offset cannot be known in advance. Instead, we search backwards from the end of the file for the bytes 50 4B 05 06 (0x06054B50 when read as a little-endian 32-bit integer), which mark the beginning of the record. Below is all the code needed to locate and read the end record:


#include <stdint.h> // fixed-width types: "unsigned long" is 8 bytes on many 64-bit systems

typedef struct __attribute__ ((__packed__)) {
    uint32_t signature; // 0x06054b50
    uint16_t diskNumber; // unsupported
    uint16_t centralDirectoryDiskNumber; // unsupported
    uint16_t numEntriesThisDisk; // unsupported
    uint16_t numEntries;
    uint32_t centralDirectorySize;
    uint32_t centralDirectoryOffset;
    uint16_t zipCommentLength;
    // Followed by .ZIP file comment (variable size)
} JZEndRecord;

#define JZ_BUFFER_SIZE 65536

unsigned char jzBuffer[JZ_BUFFER_SIZE]; // limits maximum zip descriptor size

int jzReadEndRecord(FILE *zip, JZEndRecord *endRecord) {
    long fileSize, readBytes, i;
    JZEndRecord *er;

    fseek(zip, 0, SEEK_END); // go to end of file
    fileSize = ftell(zip); // current position equals file size

    // Fill the buffer, but at most the whole file
    readBytes = (fileSize < sizeof(jzBuffer)) ? fileSize : sizeof(jzBuffer);
    fseek(zip, fileSize - readBytes, SEEK_SET);
    fread(jzBuffer, 1, readBytes, zip);

    // Naively assume signature can only be found in one place...
    for(i = readBytes - sizeof(JZEndRecord); i >= 0; i--) {
        er = (JZEndRecord *)(jzBuffer + i);
        if(er->signature == 0x06054B50)
            break;
    }

    if(i < 0) {
        fprintf(stderr, "End record signature not found in zip!\n");
        return Z_ERRNO;
    }

    memcpy(endRecord, er, sizeof(JZEndRecord));

    return Z_OK;
}
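
The “naive” signature search above can be tightened a little. The following is a minimal sketch, not part of the library: it works on a buffer already in memory, reads the fields byte by byte so it does not depend on packed structs or host byte order, and accepts a candidate only if its comment length makes the record end exactly at the end of the data. The names findEndRecord, rd16 and rd32 are hypothetical, and the stricter check assumes there is no trailing garbage after the zip.

```c
#include <stddef.h>
#include <stdint.h>

#define EOCD_MIN_SIZE 22            /* fixed part of the end record */
#define EOCD_SIGNATURE 0x06054B50u

/* Read little-endian integers byte by byte, independent of host layout. */
static uint16_t rd16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns the offset of the end record inside
 * buf[0..size), or -1 if no plausible record is found. */
long findEndRecord(const unsigned char *buf, size_t size) {
    long i;

    if (size < EOCD_MIN_SIZE)
        return -1;

    /* Scan backwards, like the file-based version above */
    for (i = (long)(size - EOCD_MIN_SIZE); i >= 0; i--) {
        if (rd32(buf + i) != EOCD_SIGNATURE)
            continue;

        /* Comment length lives at offset 20; record must end at size */
        if ((size_t)i + EOCD_MIN_SIZE + rd16(buf + i + 20) == size)
            return i;
    }

    return -1;
}
```

With the offset in hand, the number of entries is rd16(buf + offset + 10) and the central directory offset is rd32(buf + offset + 16), following the field order of the JZEndRecord struct.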

Reading The Central Directory

The “central directory offset” field in the end record now tells us where the table of contents of the zip is found. Reading the central directory is a trivial for loop repeated endRecord->numEntries times; the only extra effort needed is to read in or skip the variable-length fields, like the filename and extra field. Here again is the code:


int jzReadCentralDirectory(FILE *zip, JZEndRecord *endRecord,
        JZRecordCallback callback) {
    JZGlobalFileHeader fileHeader;
    JZFileHeader header;
    long totalSize = 0;
    int i;

    // Go to the beginning of central directory
    fseek(zip, endRecord->centralDirectoryOffset, SEEK_SET);

    for(i=0; i<endRecord->numEntries; i++) {
        fread(&fileHeader, 1, sizeof(JZGlobalFileHeader), zip);

        if(fileHeader.signature != 0x02014B50) // check signature
            return Z_ERRNO;

        fread(jzBuffer, 1, fileHeader.fileNameLength, zip); // read filename
        jzBuffer[fileHeader.fileNameLength] = '\0'; // NULL terminate

        fseek(zip, fileHeader.extraFieldLength, SEEK_CUR); // skip
        fseek(zip, fileHeader.fileCommentLength, SEEK_CUR); // skip

        // Construct JZFileHeader from global file header
        memcpy(&header, &fileHeader.compressionMethod, sizeof(header));
        header.offset = fileHeader.relativeOffsetOflocalHeader;

        // Invoke callback to do something with the file header and filename
        callback(zip, i, &header, (char *)jzBuffer);
    }

    return Z_OK;
}

Note that I omitted the structure typedefs this time; they can be found in InfoZip’s app note, as well as in the project zip at the end of this post. The file header contains all the information necessary to go and uncompress that file, most importantly the offset (location) of the local file header record within the zip.
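
For reference, here is a sketch of what the central directory header typedef could look like. The field order and sizes follow the ZIP app note (46 bytes before the variable-length filename, extra field and comment). The names signature, compressionMethod, fileNameLength, extraFieldLength, fileCommentLength and relativeOffsetOflocalHeader come from the code above; the remaining names are my own guesses and may differ from the actual header file.

```c
#include <stddef.h>
#include <stdint.h>

/* Central directory file header, fixed part only (assumed field names).
 * Packed so that sizeof() matches the on-disk size of 46 bytes. */
typedef struct __attribute__ ((__packed__)) {
    uint32_t signature;               // 0x02014B50
    uint16_t versionMadeBy;           // unsupported
    uint16_t versionNeededToExtract;  // unsupported
    uint16_t generalPurposeBitFlag;   // unsupported
    uint16_t compressionMethod;
    uint16_t lastModFileTime;
    uint16_t lastModFileDate;
    uint32_t crc32;
    uint32_t compressedSize;
    uint32_t uncompressedSize;
    uint16_t fileNameLength;
    uint16_t extraFieldLength;
    uint16_t fileCommentLength;
    uint16_t diskNumberStart;         // unsupported
    uint16_t internalFileAttributes;  // unsupported
    uint32_t externalFileAttributes;  // unsupported
    uint32_t relativeOffsetOflocalHeader;
    // Followed by filename, extra field and file comment (variable size)
} JZGlobalFileHeader;
```

Reading the struct straight from the file like this assumes a little-endian host and a compiler that understands the packed attribute (GCC or Clang).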

At the end of the for loop, you can see that I copy part of the header data to a simplified JZFileHeader structure that gets passed to the callback. I basically strip the fields that are not shared by the local and global file headers or otherwise useful, so the code using this library has a simpler, unified data structure to work with.
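
The callback itself is something the user of the library writes. Below is a minimal sketch of one that just lists the entries. The JZFileHeader layout and the JZRecordCallback type are assumptions inferred from the memcpy and the callback(zip, i, &header, filename) call above, and listCallback and supportedEntries are hypothetical names for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: the fields shared by local and central headers, starting
 * at compressionMethod, plus the offset filled in by the library. */
typedef struct __attribute__ ((__packed__)) {
    uint16_t compressionMethod;
    uint16_t lastModFileTime;
    uint16_t lastModFileDate;
    uint32_t crc32;
    uint32_t compressedSize;
    uint32_t uncompressedSize;
    uint32_t offset;
} JZFileHeader;

/* Assumed callback type, matching the call in jzReadCentralDirectory */
typedef int (*JZRecordCallback)(FILE *zip, int index,
        JZFileHeader *header, char *filename);

/* Number of entries seen so far that use store (0) or deflate (8) */
static int supportedEntries = 0;

/* Hypothetical callback: print one line per entry */
int listCallback(FILE *zip, int index, JZFileHeader *header, char *filename) {
    (void)zip; // not needed for a plain listing

    printf("%d: %s (%lu -> %lu bytes, method %u)\n", index, filename,
            (unsigned long)header->compressedSize,
            (unsigned long)header->uncompressedSize,
            (unsigned)header->compressionMethod);

    if (header->compressionMethod == 0 || header->compressionMethod == 8)
        supportedEntries++;

    return 1; // the loop shown above ignores the return value
}
```

It would be passed as the last argument, for example jzReadCentralDirectory(zip, &endRecord, listCallback). A callback that wants the file data can remember header->offset and later seek there to read the local file header.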

Reading a Local File Header

The local file header is very similar to the one in the central directory. The main difference is the absence of some fields that are only available in the central directory (like the offset, which is unnecessary because the local header is immediately followed by the file data).


int jzReadLocalFileHeader(FILE *zip, JZFileHeader *header,
        char *filename, int len) {
    JZLocalFileHeader localHeader;

    fread(&localHeader, 1, sizeof(JZLocalFileHeader), zip);

    if(localHeader.signature != 0x04034B50)
        return Z_ERRNO;

    if(localHeader.fileNameLength >= len)
        return Z_ERRNO; // filename cannot fit

    fread(filename, 1, localHeader.fileNameLength, zip);
    filename[localHeader.fileNameLength] = '\0'; // NULL terminate

    if(localHeader.extraFieldLength) // skip extra field if there is one
        fseek(zip, localHeader.extraFieldLength, SEEK_CUR);

    if(localHeader.generalPurposeBitFlag)
        return Z_ERRNO; // Flags not supported

    memcpy(header, &localHeader.compressionMethod, sizeof(JZFileHeader));
    header->offset = 0; // not used in local context

    return Z_OK;
}

Note that because a zip file actually starts with the first local file header, which contains all the information necessary to uncompress or skip the file data (with fseek), reading the central directory is not necessary if you just want to process the files one by one as they come up (and don’t need any of the additional fields exclusive to the global file header structure).
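
As a rough illustration of that sequential approach, the sketch below hops from one local file header to the next in a zip that is already in memory and counts the entries. It is not part of the library: countLocalEntries, rd16 and rd32 are hypothetical names, and the field offsets (flags at 6, compressed size at 18, filename length at 26, extra field length at 28, 30 bytes in total) are taken from the app note. It gives up on entries that use a data descriptor (flag bit 3), because the compressed size is then not stored in the local header.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCAL_HEADER_SIZE 30
#define LOCAL_SIGNATURE 0x04034B50u

/* Little-endian readers that do not depend on struct packing */
static uint16_t rd16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: counts entries by skipping over header, filename,
 * extra field and file data. Stops at the first record that is not a
 * local file header (normally the central directory). Returns -1 if an
 * entry is truncated or uses a data descriptor. */
int countLocalEntries(const unsigned char *buf, size_t size) {
    size_t pos = 0;
    int count = 0;

    while (size - pos >= LOCAL_HEADER_SIZE &&
            rd32(buf + pos) == LOCAL_SIGNATURE) {
        const unsigned char *h = buf + pos;
        uint64_t skip;

        if (rd16(h + 6) & 0x0008) // sizes are stored after the data
            return -1;

        skip = (uint64_t)LOCAL_HEADER_SIZE + rd16(h + 26) + rd16(h + 28) +
            rd32(h + 18);

        if (skip > (uint64_t)(size - pos)) // entry runs past the buffer
            return -1;

        pos += (size_t)skip;
        count++;
    }

    return count;
}
```

The same walk works on a FILE * by replacing the pointer arithmetic with fseek(zip, ..., SEEK_CUR), which is what the paragraph above describes.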

Uncompressing a file

There are several compression methods supported by the .ZIP file format, but because 99 % of archives use only the “store” (method 0) and “deflate” (method 8) methods, these are the ones I support in this library. For those who want the remaining 1 %, extending the library or using a more comprehensive one is recommended.

The code below either reads (stored) or inflates (deflated) the data into an already allocated buffer, depending on the compression method. The inflating code is just an adaptation of the public domain zpipe.c example from zlib. The only gotcha here is that you cannot call inflateInit, because the data stream does not contain the kind of header information zlib assumes. Instead, you have to call inflateInit2 with a negative window size. After that, it’s mainly just keeping count of how many bytes have been read and uncompressed, and checking for any errors.


int jzReadData(FILE *zip, JZFileHeader *header, void *buffer) {
    unsigned char *bytes = (unsigned char *)buffer; // cast
    long compressedLeft, uncompressedLeft;
    z_stream strm;
    int ret;

    if(header->compressionMethod == 0) { // Store - just read it
        if(fread(buffer, 1, header->uncompressedSize, zip) <
                header->uncompressedSize || ferror(zip))
            return Z_ERRNO;
    } else if(header->compressionMethod == 8) { // Deflate - using zlib
        strm.zalloc = Z_NULL;
        strm.zfree = Z_NULL;
        strm.opaque = Z_NULL;

        strm.avail_in = 0;
        strm.next_in = Z_NULL;

        // Use inflateInit2 with negative window bits to indicate raw data
        if((ret = inflateInit2(&strm, -MAX_WBITS)) != Z_OK)
            return ret; // Zlib errors are negative

        // Inflate compressed data
        for(compressedLeft = header->compressedSize,
                uncompressedLeft = header->uncompressedSize;
                compressedLeft && uncompressedLeft && ret != Z_STREAM_END;
                compressedLeft -= strm.avail_in) {
            // Read next chunk
            strm.avail_in = fread(jzBuffer, 1,
                    (sizeof(jzBuffer) < compressedLeft) ?
                    sizeof(jzBuffer) : compressedLeft, zip);

            if(strm.avail_in == 0 || ferror(zip)) {
                inflateEnd(&strm);
                return Z_ERRNO;
            }

            strm.next_in = jzBuffer;
            strm.avail_out = uncompressedLeft;
            strm.next_out = bytes;

            compressedLeft -= strm.avail_in; // inflate will change avail_in

            ret = inflate(&strm, Z_NO_FLUSH);

            if(ret == Z_STREAM_ERROR) return ret; // shouldn't happen

            switch (ret) {
                case Z_NEED_DICT:
                    ret = Z_DATA_ERROR;     /* and fall through */
                case Z_DATA_ERROR: case Z_MEM_ERROR:
                    (void)inflateEnd(&strm);
                    return ret;
            }

            bytes += uncompressedLeft - strm.avail_out; // bytes uncompressed
            uncompressedLeft = strm.avail_out;
        }

        inflateEnd(&strm);
    } else {
        return Z_ERRNO;
    }

    return Z_OK;
}

Using the Library

After a day’s work, I got a nice library with an 88-line header file (most taken up by four structure typedefs), 230 lines of actual library code (including comments and generous whitespace :), and a 100-line example that can unzip a zip file. Not bad at all!

The library and example code are released into the public domain. Grab the full source from GitHub:

https://github.com/jokkebk/JUnzip

And throw me a line if you found this useful!

10 comments

Mike:

Thank you very much for coding this library, it does work very nicely on files which libz discards as inconsistent!

James:

Thanks for sharing this code with the world. I’m currently trying to see how to use this code, without actually writing the contents of the data buffer to disk.

I have zipped binary files which are in a one to one basis. I need to be able to unzip the file into memory, and then walk through the data buffer reading the binary data into a specified structure.

I need to call the processfile function from the main routine but I keep getting an error. It complains about “Couldn’t read local file header!”, regardless of whether or not I comment out the processfile in the recordcallback.

Help!

Regards

Joonas Pihlajamaa:

Thanks! I have sometimes encountered a similar problem myself, and I did some patching to JUnzip on GitHub based on a similar error in the past. I have made a program called JZipView, which is also available on GitHub and has example code on how to read the central directory (function processZip):

https://github.com/jokkebk/JZipVIew/blob/master/main.c

You might want to check that out and see if you can first read the central dir from ZIP file, then you can look at the individual file loading with a debugger or debug statements, and see what goes wrong.

In the main.c I linked, recordCallback is used in processZip to save jpeg file header info, and then later loadImageFromZip reads the JPEG into memory, exactly what you are trying to accomplish. Good luck with your project!

N. S.:

Was libzip inappropriate for this?

https://nih.at/libzip/index.html

Joonas Pihlajamaa:

No, it actually looks very decent. Still a slightly larger code footprint than the one I made, but not by a large margin. I didn’t find this one when researching the alternatives, might have saved me a dozen hours of coding. :)

Dypsok:

Hello Joonas,
I am a newbie C developer and I have enjoyed using your lib: thank you, it works like a charm. I am using your demo code and I wonder how I could use multithreading to process each file separately: in your demo code the recordCallback needs a serial call…
How could I parallelize the work on each compressed file?
Thank you for any comment.

Joonas Pihlajamaa:

Thanks! Sorry, I don’t have much experience with multithreaded apps. You could probably run several threads of the library, uncompressing one file in each, for example. That is not the most efficient approach, and if you do several file reads at the same time, mechanical hard drives may have problems with random access, actually slowing the decompression. If you read everything into memory it might be more efficient. You could build a queue of files to be decompressed, and have several threads processing that queue.

Again, saving uncompressed data to disk might have performance impact on multithread, hard to say without more experience.

Dypsok:

Hi Joonas,
Thanks for your response: I have done some tests with multithreading but cannot get it to work, as the code needs to work on the files in the archive one after another… So I’m still working on improving speed: I think my problem is with the IO access. I was wondering how one could modify the Junzipper code to work on a zip buffer already in memory, as the following code may be intended to do: https://github.com/FIX94/Nintendont/blob/master/loader/source/unzip/ioapi_mem.c ?
Could you help me figure out how to adapt your code to work from memory, in order to benchmark the two techniques?
Writing that, I am just realizing that I’m on holiday a few days before Xmas: hope that’s a sign :=)
Anyway, thank you for any help.

Aastha Shrivastva:

Where is the callback function defined? I can’t find it.

Joonas Pihlajamaa:

It’s been a few years since I wrote the library, but I think you write the callback function yourself to do whatever you want to do with the resulting data, and pass it as a parameter to the function needing it. There should be an example on GitHub, IIRC.