Christmas holidays are a wonderful time to invent new projects. I decided I’d do some desktop coding for a change, and try to write an optimized image viewer for my old pocket camera photos, which I keep in zip files. The first task, of course, was to read a zip file.

To my surprise, there wasn’t a “GNU standard library” available for this task like there is zlib for general compression, or libjpeg and libpng for images. The best match for my simple needs seemed to be Minizip, but at 7378 lines of code, 2125 of them for unzip.c alone (which uses zlib, so it is basically just file handling), I was not convinced, especially because I knew I had some very specific requirements to cater for (namely uncompressing all JPEGs to memory for fast rendering and thumbnail generation).

Zip File Structure – The Essentials

The ZIP file format turned out to be surprisingly simple, especially since I decided to stick to the bare essentials and skip zip64 support, encryption, multi-file zips, and all compression methods other than “store” (no compression) and “deflate” (easily decompressed with zlib, see below). Even with this barebones setup, my zip routines should, by my estimate, handle about 99.9 % of zips out there just fine.

Drawing on the excellent ZIP format documentation in InfoZip’s latest appnote, the files I needed to parse turned out to have the following structure:

  • Local file header 1
  • File data 1
  • Data descriptor 1
  • Local file header N
  • File data N
  • Data descriptor N
  • … optional decryption / extra stuff …
  • Central directory
  • … zip64-specific extra stuff …
  • End of central directory record
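Most of the records in the list above start with a four-byte signature beginning with the letters “PK”, which makes them easy to tell apart. Here is a minimal sketch of that idea. The macro names and the zip_record_name helper are my own illustrations and not part of the library; the signature values are the ones given in the appnote.

```c
#include <stdint.h>

/* Record signatures, as they appear when read as little-endian integers */
#define ZIP_SIG_LOCAL_HEADER    0x04034B50u /* "PK\3\4" */
#define ZIP_SIG_DATA_DESCRIPTOR 0x08074B50u /* "PK\7\8", signature is optional */
#define ZIP_SIG_CENTRAL_HEADER  0x02014B50u /* "PK\1\2" */
#define ZIP_SIG_END_RECORD      0x06054B50u /* "PK\5\6" */

/* Hypothetical helper: name the record that starts at p (needs 4 readable bytes) */
const char *zip_record_name(const unsigned char *p) {
    /* Assemble the value byte by byte so host endianness does not matter */
    uint32_t sig = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    switch (sig) {
        case ZIP_SIG_LOCAL_HEADER:    return "local file header";
        case ZIP_SIG_DATA_DESCRIPTOR: return "data descriptor";
        case ZIP_SIG_CENTRAL_HEADER:  return "central directory file header";
        case ZIP_SIG_END_RECORD:      return "end of central directory record";
        default:                      return "unknown";
    }
}
```

File data itself carries no signature, so a helper like this is only meaningful at offsets where a record is expected to start.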

Locating The End Record

Like many archive formats, zip stores the compressed file data at the beginning and the table of contents at the end. This means that the “starting point” for reading a zip is the “end of central directory record” (I’ll refer to it as the “end record” from now on), which contains a pointer to the central directory. Because the end record ends with a variable-size comment field, its exact offset cannot be known in advance. Instead, we search backwards from the end of the file for the signature bytes 0x50 0x4b 0x05 0x06 (the value 0x06054b50 when read as a little-endian 32-bit integer), which mark the beginning of the record. Below is all the code needed to locate and read the end record:


#include <stdint.h> // fixed-width types; "unsigned long" is 8 bytes on many 64-bit systems

typedef struct __attribute__ ((__packed__)) {
    uint32_t signature; // 0x06054b50
    uint16_t diskNumber; // unsupported
    uint16_t centralDirectoryDiskNumber; // unsupported
    uint16_t numEntriesThisDisk; // unsupported
    uint16_t numEntries;
    uint32_t centralDirectorySize;
    uint32_t centralDirectoryOffset;
    uint16_t zipCommentLength;
    // Followed by .ZIP file comment (variable size)
} JZEndRecord;

#define JZ_BUFFER_SIZE 65536

unsigned char jzBuffer[JZ_BUFFER_SIZE]; // shared scratch buffer; also limits the end record search window

int jzReadEndRecord(FILE *zip, JZEndRecord *endRecord) {
    long fileSize, readBytes, i;
    JZEndRecord *er;

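    if(fseek(zip, 0, SEEK_END)) // go to end of file
        return Z_ERRNO;

    // Current position equals file size (ftell returns -1 on error)
    if((fileSize = ftell(zip)) < (long)sizeof(JZEndRecord))
        return Z_ERRNO; // too small to hold an end record

    // Fill the buffer, but at most the whole file
    readBytes = (fileSize < (long)sizeof(jzBuffer)) ?
        fileSize : (long)sizeof(jzBuffer);

    if(fseek(zip, fileSize - readBytes, SEEK_SET) ||
            fread(jzBuffer, 1, readBytes, zip) < (size_t)readBytes)
        return Z_ERRNO;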

    // Naively assume signature can only be found in one place...
    for(i = readBytes - sizeof(JZEndRecord); i >= 0; i--) {
        er = (JZEndRecord *)(jzBuffer + i);
        if(er->signature == 0x06054B50)
            break;
    }

    if(i < 0) {
        fprintf(stderr, "End record signature not found in zip!\n");
        return Z_ERRNO;
    }

    memcpy(endRecord, er, sizeof(JZEndRecord));

    return Z_OK;
}
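The listing above reads the record through a packed struct, which relies on a little-endian host and a compiler-specific attribute. As an alternative, here is a minimal sketch that does the same backwards search on a memory buffer while reading the fields byte by byte. The helper names (read_u16_le, read_u32_le, find_end_record) are hypothetical and not part of the library, and the extra check that the comment ends exactly at the end of the buffer is my own stricter assumption (it rejects files with trailing bytes after the archive).

```c
#include <stddef.h>
#include <stdint.h>

/* Read little-endian integers byte by byte, independent of host endianness,
   struct packing and alignment */
static uint16_t read_u16_le(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32_le(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

#define END_RECORD_MIN_SIZE 22 /* fixed part of the end record */

/* Hypothetical helper: scan buf (the tail of the file) backwards for the end
   record. Returns the offset of the record within buf, or -1 if not found. */
static long find_end_record(const unsigned char *buf, size_t len) {
    size_t i;

    if (len < END_RECORD_MIN_SIZE)
        return -1;

    /* i runs from len - 22 down to 0 */
    for (i = len - END_RECORD_MIN_SIZE + 1; i-- > 0; ) {
        if (read_u32_le(buf + i) != 0x06054B50u)
            continue;

        /* Assumption: the comment (length stored at offset 20) must end
           exactly at the end of the buffer, which filters out signature
           bytes that merely happen to occur inside a comment */
        if (i + END_RECORD_MIN_SIZE + read_u16_le(buf + i + 20) == len)
            return (long)i;
    }

    return -1;
}
```

The comment length check is one way to get rid of the “naive” assumption in the original loop; dropping it gives the same behavior as the struct-based version.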

Reading The Central Directory

The “central directory offset” field in the end record now tells us where the table of contents of the zip is found. Reading the central directory is a trivial for loop repeated endRecord->numEntries times. The only extra effort needed is to read in or skip the variable-length fields, such as the filename and the extra field. Here is the code again:


int jzReadCentralDirectory(FILE *zip, JZEndRecord *endRecord,
        JZRecordCallback callback) {
    JZGlobalFileHeader fileHeader;
    JZFileHeader header;
    int i;

    // Go to the beginning of central directory
    fseek(zip, endRecord->centralDirectoryOffset, SEEK_SET);

    for(i=0; i<endRecord->numEntries; i++) {
        if(fread(&fileHeader, 1, sizeof(JZGlobalFileHeader), zip) <
                sizeof(JZGlobalFileHeader))
            return Z_ERRNO;

        if(fileHeader.signature != 0x02014B50) // check signature
            return Z_ERRNO;

        if(fread(jzBuffer, 1, fileHeader.fileNameLength, zip) <
                fileHeader.fileNameLength) // read filename
            return Z_ERRNO;

        jzBuffer[fileHeader.fileNameLength] = '\0'; // NUL terminate

        fseek(zip, fileHeader.extraFieldLength, SEEK_CUR); // skip
        fseek(zip, fileHeader.fileCommentLength, SEEK_CUR); // skip

        // Construct JZFileHeader from global file header
        memcpy(&header, &fileHeader.compressionMethod, sizeof(header));
        header.offset = fileHeader.relativeOffsetOflocalHeader;

        // Invoke callback to do something with the file header and filename
        callback(zip, i, &header, (char *)jzBuffer);
    }

    return Z_OK;
}

Note that I omitted the structure typedefs this time. They can be found in InfoZip’s appnote, as well as in the project source linked at the end of this post. The file header contains all the information needed to go and uncompress that file, most importantly the offset (location) of the local file header record within the zip.

At the end of the for loop, you can see that I copy part of the header data to a simplified JZFileHeader structure that gets passed to a callback. I basically strip the fields that are either not shared by the local and global file headers or not otherwise useful, so that code using this library has a simpler, unified data structure to work with.

Reading a Local File Header

The local file header is very similar to the one in the central directory. The main difference is that it lacks some fields that are only available in the central directory (like the offset, which is not needed because the local header is immediately followed by the file data).


int jzReadLocalFileHeader(FILE *zip, JZFileHeader *header,
        char *filename, int len) {
    JZLocalFileHeader localHeader;

    fread(&localHeader, 1, sizeof(JZLocalFileHeader), zip);

    if(localHeader.signature != 0x04034B50)
        return Z_ERRNO;

    if(localHeader.fileNameLength >= len)
        return Z_ERRNO; // filename cannot fit

    if(fread(filename, 1, localHeader.fileNameLength, zip) <
            localHeader.fileNameLength)
        return Z_ERRNO;

    filename[localHeader.fileNameLength] = '\0'; // NUL terminate

    if(localHeader.extraFieldLength) // skip extra field if there is one
        fseek(zip, localHeader.extraFieldLength, SEEK_CUR);

    if(localHeader.generalPurposeBitFlag)
        return Z_ERRNO; // Flags not supported

    memcpy(header, &localHeader.compressionMethod, sizeof(JZFileHeader));
    header->offset = 0; // not used in local context

    return Z_OK;
}

Note that a zip file actually starts with the first local file header, which contains all the information needed to uncompress the file data or to skip it (with fseek). Reading the central directory is therefore not necessary if you just want to process the files one by one as they come up, and don’t need any of the additional fields exclusive to the global file header structure.
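To illustrate this sequential approach, here is a minimal sketch that hops from one local file header to the next in a memory buffer and counts the entries, without touching the central directory. The function name count_local_entries and the rd16/rd32 helpers are hypothetical and not part of the library. The sketch assumes that sizes are stored in the local header, so it gives up on entries that set bit 3 of the general purpose flags (sizes stored in a trailing data descriptor instead).

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t rd16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

#define LOCAL_HEADER_SIZE 30 /* fixed part of a local file header */

/* Hypothetical helper: walk the local file headers at the start of buf.
   Returns the number of entries walked, or -1 on an unsupported or
   truncated entry. Stops at the first non-local-header signature. */
int count_local_entries(const unsigned char *buf, size_t len) {
    size_t pos = 0;
    int count = 0;

    while (len - pos >= LOCAL_HEADER_SIZE &&
           rd32(buf + pos) == 0x04034B50u) {
        const unsigned char *h = buf + pos;
        uint16_t flags = rd16(h + 6);
        uint32_t compressedSize = rd32(h + 18);
        uint16_t nameLen = rd16(h + 26);
        uint16_t extraLen = rd16(h + 28);
        uint64_t skip;

        if (flags & 0x0008) /* bit 3: sizes are in a data descriptor */
            return -1;

        /* Header, filename, extra field and file data form one entry */
        skip = (uint64_t)LOCAL_HEADER_SIZE + nameLen + extraLen + compressedSize;
        if (skip > (uint64_t)(len - pos))
            return -1; /* entry runs past the end of the buffer */

        pos += (size_t)skip;
        count++;
    }

    return count;
}
```

The same loop works on a FILE pointer by replacing the pointer arithmetic with fread and fseek calls, which is essentially what jzReadLocalFileHeader plus a seek over compressedSize bytes would do.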

Uncompressing a File

The .ZIP file format supports several compression methods, but because about 99 % of zips use only “store” (method 0) and “deflate” (method 8), these are the ones I support in this library. For those who need the remaining 1 %, I recommend extending the library or using a more comprehensive one.

The code below either reads (stored) or inflates (deflated) the data into an already allocated buffer, depending on the compression method. The inflating code is just an adaptation of the public domain zpipe.c example from zlib. The only gotcha is that you cannot call inflateInit, because the data stream does not contain the zlib header that inflateInit expects. Instead, you have to call inflateInit2 with a negative window size to request raw deflate data. After that, it’s mainly a matter of keeping count of how many bytes have been read and uncompressed, and checking for errors.


int jzReadData(FILE *zip, JZFileHeader *header, void *buffer) {
    unsigned char *bytes = (unsigned char *)buffer; // cast
    long compressedLeft, uncompressedLeft;
    z_stream strm;
    int ret;

    if(header->compressionMethod == 0) { // Store - just read it
        if(fread(buffer, 1, header->uncompressedSize, zip) <
                header->uncompressedSize || ferror(zip))
            return Z_ERRNO;
    } else if(header->compressionMethod == 8) { // Deflate - using zlib
        strm.zalloc = Z_NULL;
        strm.zfree = Z_NULL;
        strm.opaque = Z_NULL;

        strm.avail_in = 0;
        strm.next_in = Z_NULL;

        // Use inflateInit2 with negative window bits to indicate raw data
        if((ret = inflateInit2(&strm, -MAX_WBITS)) != Z_OK)
            return ret; // Zlib errors are negative

        // Inflate compressed data
        // Inflate compressed data (compressedLeft is updated inside the loop)
        for(compressedLeft = header->compressedSize,
                uncompressedLeft = header->uncompressedSize;
                compressedLeft && uncompressedLeft && ret != Z_STREAM_END; ) {
            // Read next chunk
            strm.avail_in = fread(jzBuffer, 1,
                    (sizeof(jzBuffer) < compressedLeft) ?
                    sizeof(jzBuffer) : compressedLeft, zip);

            if(strm.avail_in == 0 || ferror(zip)) {
                inflateEnd(&strm);
                return Z_ERRNO;
            }

            strm.next_in = jzBuffer;
            strm.avail_out = uncompressedLeft;
            strm.next_out = bytes;

            compressedLeft -= strm.avail_in; // inflate will change avail_in

            ret = inflate(&strm, Z_NO_FLUSH);

            switch (ret) {
                case Z_NEED_DICT:
                    ret = Z_DATA_ERROR;     /* and fall through */
                case Z_STREAM_ERROR: // shouldn't happen
                case Z_DATA_ERROR: case Z_MEM_ERROR:
                    (void)inflateEnd(&strm); // release zlib state on every error
                    return ret;
            }

            bytes += uncompressedLeft - strm.avail_out; // bytes uncompressed
            uncompressedLeft = strm.avail_out;
        }

        inflateEnd(&strm);

        if(uncompressedLeft) // input ended before all data was produced
            return Z_DATA_ERROR;
    } else {
        return Z_ERRNO;
    }

    return Z_OK;
}

Using the Library

After a day’s work, I had a nice library with an 88-line header file (mostly taken up by four structure typedefs), 230 lines of actual library code (including comments and generous whitespace :), and a 100-line example that can unzip a zip file. Not bad at all!

The library and example code are released into the public domain. Grab the full source from GitHub:

https://github.com/jokkebk/JUnzip

And throw me a line if you found this useful!