Performance Engineering Lessons From GF(2) Linear Algebra And CTMC

June 1, 2026 • Interview Review

The useful bridge between my research code and systems interview prep is not "I used the same library." It is the engineering loop: choose a representation, reason about memory traffic, measure a real bottleneck, and keep correctness tests close enough to catch subtle damage.

The GF(2) Story

For binary linear algebra, scalar arithmetic is the wrong mental model. A binary matrix can be bit-packed so one machine word represents many field elements. Addition over GF(2) becomes XOR, and row operations become word-level operations with much less memory traffic.

Project answer. My performance project optimized MATLAB MEX functions for GF(2) linear algebra. The core idea was representation: bit-pack binary matrices so operations become word-level XOR and vectorized kernels rather than scalar arithmetic. That led naturally to cache locality, SIMD width, OpenMP parallelism, benchmark design, and correctness checks against MATLAB/GF(2) reference behavior.

M4RI Matrix Multiplication

M4RI-style multiplication groups `b` columns of `A` and `b` rows of `B`. For each strip, build a table of all `2^b` XOR combinations of selected rows of `B`. Then each row of `A` supplies a small index into that table.

A in F_2^(I x K)
B in F_2^(K x J)
W = machine word size
b = M4RI block size

per b-wide strip:
  build table:       O(2^b * J/W)
  construct indices: O(I * b)
  apply table rows:  O(I * J/W)

across K/b strips:
  O((K/b) * (2^b * J/W + I*b + I*J/W))

The correction that matters: constructing a table index from `A` is scalar work per output row. It should not be multiplied by `J/W` unless the implementation repeats index construction for every output word. The table-row XOR is the part that scales with `J/W`.

Rank And Nullspace

Bit-packed Gaussian elimination is roughly `O(r*m*n/W)`, where each row XOR costs `n/W`. M4RI groups pivots and uses lookup tables to reduce the constant in the dense elimination term.

For nullspace, the ideal path is: row-reduce, identify pivot and free columns, then construct basis vectors. After an RREF-like form `[I_r | A]`, the nullspace basis is essentially `[A; I_f]` over GF(2). But an implementation that still does standard-style elimination below pivots may not get the full M4RI reduction benefit. The honest complexity has to describe the actual code, not the algorithm I wish I had implemented.

Layout Experiments

A full row-major copy of `A` sounded plausible because it could reduce strided gathers when building M4RI indices. In measurements, the full transpose/copy was too expensive. The better abstraction was narrower: precompute exactly the M4RI table index needed by the hot loop.

full bit-packed A:
  correct, but slower

row-major byte A:
  tiled copy fixed the worst slowdown, but still lost

aidx:
  precompute one 8-bit M4RI index per row per strip
  hot loop becomes one byte load
  same asymptotic complexity, better constant

The lesson is not "transpose is bad." The lesson is: optimize the actual repeated work. A broad data layout transformation can lose to a small precomputed artifact that matches the hot path.

CTMC And Fenwick Trees

The CTMC sampler uses a rejection-free Gillespie step. Each possible spin flip has a rate; sample an event proportional to rate; update the state and the affected local rates; advance time by an exponential waiting time.

for each event:
  Z = total rate
  u = random number in (0, Z]
  idx = weighted prefix search
  flip selected spin
  update affected rates
  t += exponential_sample / Z

A Fenwick tree is the right data structure for event selection when rates change locally:

point update: `O(log N)`
prefix sum / total: `O(log N)`
weighted prefix search: `O(log N)`

Correctness Before Speed

The Fenwick tree can be internally consistent with the stored rate vector while the rate vector is wrong. That is the dangerous failure mode. If the local update set misses sites whose rates changed, the CTMC event distribution becomes stale after some flips.

The guardrail test is simple and powerful:

for many random lattices and random flips:
  compute all rates
  flip one spin
  update rates locally
  recompute all rates from scratch
  assert local rates == full recomputation

Only after this test passes should performance cleanup matter: `O(N)` Fenwick build, maintaining the total rate as a scalar, exposing runnable step counts, and making parallel worker counts configurable.

Reusable Engineering Loop

Change representation when it changes the cost model.
Measure the hot path, not the prettiest abstraction.
State complexity for the implementation, not only the ideal algorithm.
Add correctness tests that compare optimized local updates with slow full recomputation.
Only then tune constants, parallelism, and resource use.