Learning Zig with Bespoke File Formats

Recently, my partner and I were discussing a file format known as PDB (Protein Data Bank). It holds information about the 3D structures of molecules. The format piqued my interest because it has a few unique features:

The file mostly uses fixed-column, 80-byte lines
There is a well-defined spec that breaks down its structure
There are multiple variants used by different kinds of computational scientists

These are not the kinds of files I typically interact with, which makes this an interesting one to work on! But why Zig? Zig provides some great built-in tools for manual memory management as well as for parsing and tokenizing.

So where do we start?

I decided to start by parsing ATOM records. These are the meat of most PDB files, as each one describes a single atom in the structure. My first crack at this was to define a data model that represents what we find on a given line. Below are both a byte-level breakdown of a line and a translated representation that is more usable in Zig. The byte-level breakdown maps directly to the spec. Interestingly, some columns in the spec hold nothing and must be parsed as empty space (more on this later).

const std = @import("std");

const string = []const u8;
const char = u8;

const Line = extern struct {
    record: [6]u8,
    serial: [5]u8,
    _space: [1]u8,
    name: [4]u8,
    altLoc: [1]u8,
    resName: [3]u8,
    _space2: [1]u8,
    chainID: [1]u8,
    resSeq: [4]u8,
    iCode: [1]u8,
    _space3: [3]u8,
    x: [8]u8,
    y: [8]u8,
    z: [8]u8,
    occupancy: [6]u8,
    tempFactor: [6]u8,
    _space4: [10]u8,
    element: [2]u8,
    charge: [2]u8,
};

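const AtomRecord = struct {
    // Defaults let the conversion function start from `AtomRecord{}`.
    record: string = "",
    serial: u32 = 0,
    name: string = "",
    // Optional fields are null when the column is blank or missing.
    altLoc: ?char = null,
    resName: string = "",
    chainID: char = ' ',
    resSeq: u16 = 0,
    iCode: ?char = null,
    x: f32 = 0,
    y: f32 = 0,
    z: f32 = 0,
    occupancy: f32 = 0,
    tempFactor: f32 = 0,
    element: ?string = null,
    entry: ?string = null,
    charge: ?string = null,
};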

We use a Zig built-in, @ptrCast, to map a pointer to the line's bytes directly onto a Line. The Line is only a temporary view; the function below copies its contents into a more permanent AtomRecord and strips the “empty” parts of the line along the way. Note that it uses an allocator to allocate the string fields of the atom.

// `strings.removeSpaces` is a helper from this project (not shown here).
fn convertToAtomRecord(self: *const Line, serialIndex: u32, len: usize, allocator: std.mem.Allocator) !AtomRecord {
    var atom: AtomRecord = AtomRecord{};
    // Fall back to the running index when the serial column does not parse.
    atom.serial = std.fmt.parseInt(u32, strings.removeSpaces(&self.serial), 10) catch serialIndex + 1;
    atom.name = try allocator.dupe(u8, strings.removeSpaces(&self.name));
    // A blank column means the optional value is absent.
    atom.altLoc = if (self.altLoc[0] == ' ') null else self.altLoc[0];
    atom.resName = try allocator.dupe(u8, strings.removeSpaces(&self.resName));
    atom.chainID = self.chainID[0];
    atom.resSeq = try std.fmt.parseInt(u16, strings.removeSpaces(&self.resSeq), 10);
    atom.iCode = if (self.iCode[0] == ' ') null else self.iCode[0];
    atom.x = try std.fmt.parseFloat(f32, strings.removeSpaces(&self.x));
    atom.y = try std.fmt.parseFloat(f32, strings.removeSpaces(&self.y));
    atom.z = try std.fmt.parseFloat(f32, strings.removeSpaces(&self.z));
    atom.occupancy = try std.fmt.parseFloat(f32, strings.removeSpaces(&self.occupancy));
    atom.tempFactor = try std.fmt.parseFloat(f32, strings.removeSpaces(&self.tempFactor));
    const entry = strings.removeSpaces(&self._space4);
    if (entry.len != 0) {
        atom.entry = try allocator.dupe(u8, entry);
    }
    // Lines can be shorter than 80 bytes, so only read the trailing columns when they exist.
    if (len > 76) {
        const element = strings.removeSpaces(&self.element);
        if (element.len != 0) {
            atom.element = try allocator.dupe(u8, element);
        }
        if (len == 80) {
            const charge = strings.removeSpaces(&self.charge);
            if (charge.len != 0) {
                atom.charge = try allocator.dupe(u8, charge);
            }
        }
    }
    return atom;
}
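To make the cast described above concrete, here is a minimal sketch, assuming Zig 0.11 or later (where @ptrCast infers its result type from the destination). The cut-down Line, the asLine and trimField helpers, and the use of std.mem.trim in place of the project's strings.removeSpaces are all assumptions for illustration, not the project's actual code.

```zig
const std = @import("std");

// Cut-down stand-in for the article's `Line`: only the leading columns are
// named, and the rest of the 80-byte line is lumped into `rest`.
const Line = extern struct {
    record: [6]u8,
    serial: [5]u8,
    _space: [1]u8,
    name: [4]u8,
    rest: [64]u8,
};

comptime {
    // Every field is a byte array, so the struct has no padding and is
    // exactly one line wide.
    std.debug.assert(@sizeOf(Line) == 80);
}

/// Views the first 80 bytes of `bytes` as a `Line` without copying.
/// Returns null when the line is too short to cast safely.
fn asLine(bytes: []const u8) ?*const Line {
    if (bytes.len < @sizeOf(Line)) return null;
    const line: *const Line = @ptrCast(bytes[0..@sizeOf(Line)]);
    return line;
}

/// Strips the padding spaces from a fixed-width column (hypothetical helper).
fn trimField(field: []const u8) []const u8 {
    return std.mem.trim(u8, field, " ");
}

/// Parses the right-justified serial column into an integer.
fn parseSerial(line: *const Line) !u32 {
    return std.fmt.parseInt(u32, trimField(&line.serial), 10);
}
```

Unlike the real parser, which accepts shorter lines and checks len before reading the trailing columns, this sketch simply rejects anything under 80 bytes. The returned pointer is only a view into the caller's buffer, which is why the real function copies the strings it wants to keep.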

Because we allocate the string fields ourselves, we also have to write a function that frees them.

/// Frees all the strings in the struct
pub fn free(self: *AtomRecord, allocator: std.mem.Allocator) void {
    // Optional strings are only freed when they were actually allocated.
    if (self.charge) |charge| allocator.free(charge);
    if (self.element) |element| allocator.free(element);
    if (self.entry) |entry| allocator.free(entry);
    allocator.free(self.name);
    allocator.free(self.resName);
}
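As a sketch of how a free function like this is usually paired with allocation, the example below uses a cut-down record. The MiniRecord type and makeRecord helper are hypothetical and not part of the project; they only exist to show the allocate, defer, free pattern.

```zig
const std = @import("std");

// Hypothetical cut-down record with one required and one optional owned string.
const MiniRecord = struct {
    name: []const u8,
    element: ?[]const u8 = null,

    /// Frees every string owned by the record.
    pub fn free(self: *const MiniRecord, allocator: std.mem.Allocator) void {
        if (self.element) |element| allocator.free(element);
        allocator.free(self.name);
    }
};

/// Copies the trimmed columns so the record outlives the line buffer.
fn makeRecord(
    allocator: std.mem.Allocator,
    name_col: []const u8,
    element_col: []const u8,
) !MiniRecord {
    const name = try allocator.dupe(u8, std.mem.trim(u8, name_col, " "));
    // If the second allocation fails, release the first one before returning.
    errdefer allocator.free(name);

    var record = MiniRecord{ .name = name };
    const element = std.mem.trim(u8, element_col, " ");
    if (element.len != 0) record.element = try allocator.dupe(u8, element);
    return record;
}
```

A caller would typically write `const record = try makeRecord(allocator, " CA ", " C");` followed by `defer record.free(allocator);`, so the strings are released when the scope ends. Running this inside a Zig test with std.testing.allocator reports any allocation that was never freed.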

But why should I manage my own memory?

Languages like Rust and Go don’t have this complexity because they manage memory for you. In Rust’s case, ownership rules enforced by the compiler (borrow checker included) decide when heap memory is freed; Go does this with garbage collection. So why use Zig at all? Why regress into manually managing your memory? When working with a file like this, the performance benefit might not be apparent, and managing memory by hand might actually make life MORE difficult. The value comes when resources are limited or constrained; that is where the ability to manage your own memory can be extremely helpful. One such example is WASM, where a garbage collector (as in Go’s case) has to be bundled into your WASM code. Zig gives you, the developer, the tools to decide how all of your memory is handled.
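As one illustration of that control, Zig code takes its allocator as a parameter, so the caller decides where memory comes from. The sketch below is an assumption about how one might exploit that in a constrained environment, not something this project does: it hands a fixed buffer to std.heap.FixedBufferAllocator, so copying strings never touches the heap and running out of space surfaces as an ordinary error. The copyField and copyIntoScratch names are hypothetical.

```zig
const std = @import("std");

/// Copies a trimmed column using whatever allocator the caller supplies.
fn copyField(allocator: std.mem.Allocator, column: []const u8) ![]u8 {
    return allocator.dupe(u8, std.mem.trim(u8, column, " "));
}

/// Copies both columns into `scratch` and returns the number of bytes stored.
/// Fails with error.OutOfMemory when the copies do not fit in the buffer.
fn copyIntoScratch(scratch: []u8, name_col: []const u8, res_col: []const u8) !usize {
    var fba = std.heap.FixedBufferAllocator.init(scratch);
    const allocator = fba.allocator();

    const name = try copyField(allocator, name_col);
    const res = try copyField(allocator, res_col);
    return name.len + res.len;
}
```

The same copyField would work unchanged with a general-purpose or arena allocator; only the call site changes.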

I found that having to manage my own memory brought back something I love about computing that I don’t do much anymore. As I get further into web development with modern languages, I find myself having to think about these things less. That makes me more productive overall, but it also makes it easier to forget about memory usage and performance. While this project is mainly a parser to play around with, I also want to make sure it stays performant. I have already started to explore how we can use this parser to handle conversion to things like FASTA files (more on that in a later blog post).

Summary

So with all of the above, should you learn Zig? If you want to dig into what the computer is doing under the hood, I think Zig is an excellent place to learn. However, Zig does have rough edges. The standard library can change underneath you, which can render examples useless because the functions they call have changed. In many of these cases, the Zig community has proven to be very helpful. I intend to continue writing a few of my projects in Zig in the future because I love how it feels to write. If you are interested in seeing more about this project, please feel free to check out sonic-pdb-parser or come watch me build it live on Twitch.
