Performance Engineering Lessons From GF(2) Linear Algebra And CTMC
The useful bridge between my research code and systems interview prep is not "I used the same library." It is the engineering loop: choose a representation, reason about memory traffic, measure a real bottleneck, and keep correctness tests close enough to catch subtle damage.
The GF(2) Story
For binary linear algebra, scalar arithmetic is the wrong mental model. A binary matrix can be bit-packed so one machine word represents many field elements. Addition over GF(2) becomes XOR, and row operations become word-level operations with much less memory traffic.
Project answer. My performance project optimized MATLAB MEX functions for GF(2) linear algebra. The core idea was representation: bit-pack binary matrices so operations become word-level XOR and vectorized kernels rather than scalar arithmetic. That led naturally to cache locality, SIMD width, OpenMP parallelism, benchmark design, and correctness checks against MATLAB/GF(2) reference behavior.
M4RI Matrix Multiplication
M4RI-style multiplication groups `b` columns of `A` and `b` rows of `B`. For each strip, build a table of all `2^b` XOR combinations of selected rows of `B`. Then each row of `A` supplies a small index into that table.
A in F_2^(I x K)
B in F_2^(K x J)
W = machine word size
b = M4RI block size
per b-wide strip:
build table: O(2^b * J/W)
construct indices: O(I * b)
apply table rows: O(I * J/W)
across K/b strips:
O((K/b) * (2^b * J/W + I*b + I*J/W))
The correction that matters: constructing a table index from `A` is scalar work per output row. It should not be multiplied by `J/W` unless the implementation repeats index construction for every output word. The table-row XOR is the part that scales with `J/W`.
Rank And Nullspace
Bit-packed Gaussian elimination is roughly `O(r*m*n/W)`, where each row XOR costs `n/W`. M4RI groups pivots and uses lookup tables to reduce the constant in the dense elimination term.
For nullspace, the ideal path is: row-reduce, identify pivot and free columns, then construct basis vectors. After an RREF-like form `[I_r | A]`, the nullspace basis is essentially `[A; I_f]` over GF(2). But an implementation that still does standard-style elimination below pivots may not get the full M4RI reduction benefit. The honest complexity has to describe the actual code, not the algorithm I wish I had implemented.
Layout Experiments
A full row-major copy of `A` sounded plausible because it could reduce strided gathers when building M4RI indices. In measurements, the full transpose/copy was too expensive. The better abstraction was narrower: precompute exactly the M4RI table index needed by the hot loop.
full bit-packed A:
correct, but slower
row-major byte A:
tiled copy fixed the worst slowdown, but still lost
aidx:
precompute one 8-bit M4RI index per row per strip
hot loop becomes one byte load
same asymptotic complexity, better constant
The lesson is not "transpose is bad." The lesson is: optimize the actual repeated work. A broad data layout transformation can lose to a small precomputed artifact that matches the hot path.
CTMC And Fenwick Trees
The CTMC sampler uses a rejection-free Gillespie step. Each possible spin flip has a rate; sample an event proportional to rate; update the state and the affected local rates; advance time by an exponential waiting time.
for each event:
Z = total rate
u = random number in (0, Z]
idx = weighted prefix search
flip selected spin
update affected rates
t += exponential_sample / Z
A Fenwick tree is the right data structure for event selection when rates change locally:
- point update: `O(log N)`
- prefix sum / total: `O(log N)`
- weighted prefix search: `O(log N)`
Correctness Before Speed
The Fenwick tree can be internally consistent with the stored rate vector while the rate vector is wrong. That is the dangerous failure mode. If the local update set misses sites whose rates changed, the CTMC event distribution becomes stale after some flips.
The guardrail test is simple and powerful:
for many random lattices and random flips:
compute all rates
flip one spin
update rates locally
recompute all rates from scratch
assert local rates == full recomputation
Only after this test passes should performance cleanup matter: `O(N)` Fenwick build, maintaining the total rate as a scalar, exposing runnable step counts, and making parallel worker counts configurable.
Reusable Engineering Loop
- Change representation when it changes the cost model.
- Measure the hot path, not the prettiest abstraction.
- State complexity for the implementation, not only the ideal algorithm.
- Add correctness tests that compare optimized local updates with slow full recomputation.
- Only then tune constants, parallelism, and resource use.