For most of last year, every pull request in our monorepo triggered a full twelve-minute CI pipeline. Engineers learned to open a new tab, make coffee, and check Slack before coming back to look at the results. Twelve minutes doesn't sound catastrophic, but multiply it across forty engineers pushing two or three branches a day and, even at a single pipeline run per branch, it comes to somewhere between sixteen and twenty-four hours of waiting every day. That starts to feel like institutional drag.
The obvious answer would have been to migrate to a faster build system, swap out our runners, or throw more hardware at the problem. We seriously evaluated all three. But each option carried a meaningful risk: migration breaks things in non-obvious ways, and our reliability targets left us little margin for surprises. So we took a different route: we decided to understand what we actually had before deciding what to change.
Start by measuring, not guessing
The first thing we did was instrument the pipeline itself. We added timing spans to every stage, captured artifact sizes and cache-hit rates, and wrote a small dashboard that let us see the distribution of build durations over time rather than just averages. Averages, it turns out, had been hiding a bimodal story: most builds finished in under seven minutes, but a tail of around eighteen percent were taking eighteen minutes or more, and they were skewing the mean badly.
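To make the point about averages concrete, here is a minimal sketch of the kind of summary a duration dashboard might compute. It is not our actual dashboard code: the function name `summarize_durations`, the `slow_threshold_min` parameter, and the sample numbers are all made up for illustration.

```python
import statistics

def summarize_durations(durations_min, slow_threshold_min=18.0):
    """Summarize build durations (in minutes) beyond a single average."""
    ordered = sorted(durations_min)
    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10)[-1]
    slow = sum(1 for d in ordered if d >= slow_threshold_min)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": p90,
        "tail_share": slow / len(ordered),
    }

# Invented sample, not real measurements: many fast builds plus a slow tail.
sample = [6.0] * 82 + [20.0] * 18
summary = summarize_durations(sample)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this invented sample the mean sits well above the median, which is the signature of a slow tail that a single average would hide.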
"The bottleneck was never where we assumed it was. We had spent months optimising the test runner while the real latency was sitting in the dependency-resolution step, completely invisible to us."
— Dana Okoro, Platform Engineering
Once we had real data, the culprit became clear almost immediately. Dependency resolution was being re-run from scratch on every build because our cache key was too coarse: it hashed the entire lockfile, which changed whenever any package anywhere in the monorepo was updated. We scoped the cache key to each workspace so that it hashed only the lockfile section relevant to that workspace, which meant a workspace's cache was invalidated only when its own dependencies had actually changed. That brought our median resolution time from ninety seconds down to under four seconds.
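The difference between a coarse key and a workspace-scoped key can be sketched in a few lines. This is an illustration under assumptions, not our pipeline code: it assumes the lockfile has already been parsed into a dictionary with one section per workspace, and the names `workspace_cache_key`, `lock_before`, and `lock_after` are hypothetical.

```python
import hashlib
import json

def workspace_cache_key(workspace, lockfile_sections):
    """Build a cache key from only the lockfile entries of one workspace."""
    section = lockfile_sections.get(workspace, {})
    # Canonical JSON so the key does not depend on dictionary ordering.
    payload = json.dumps(section, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"deps-{workspace}-{digest[:16]}"

# Hypothetical parsed lockfile: one section of pinned versions per workspace.
lock_before = {
    "web": {"react": "18.2.0", "lodash": "4.17.21"},
    "billing": {"stripe": "12.0.0"},
}
# Only the "billing" workspace bumps a dependency.
lock_after = {
    "web": {"react": "18.2.0", "lodash": "4.17.21"},
    "billing": {"stripe": "12.1.0"},
}

web_unchanged = (
    workspace_cache_key("web", lock_before)
    == workspace_cache_key("web", lock_after)
)
billing_changed = (
    workspace_cache_key("billing", lock_before)
    != workspace_cache_key("billing", lock_after)
)
print(web_unchanged, billing_changed)
```

With a whole-lockfile hash, the bump in "billing" would have invalidated the "web" cache too. With the scoped key, only the workspace whose section changed has to resolve again.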
The second win came from build-graph pruning. We already had a tool that could tell us which packages a given commit touched, but we weren't feeding that information into the pipeline's decision about what to build. Adding a step that computed the affected subgraph and skipped unaffected packages entirely cut the number of compilation units by sixty percent on average for routine feature branches.
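Computing an affected subgraph is a reverse-dependency traversal: start from the packages a commit touched and walk outward to everything that depends on them, directly or transitively. The sketch below assumes the dependency graph is available as a plain dictionary; `affected_packages` and the package names are hypothetical, not the tool we used.

```python
from collections import defaultdict, deque

def affected_packages(deps, changed):
    """Return the changed packages plus everything that depends on them.

    deps maps each package to the packages it depends on directly.
    """
    # Invert the graph: for each package, who depends on it?
    dependents = defaultdict(set)
    for pkg, requires in deps.items():
        for dep in requires:
            dependents[dep].add(pkg)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

# Hypothetical monorepo graph.
deps = {
    "app": ["ui", "api"],
    "ui": ["core"],
    "api": ["core"],
    "core": [],
    "docs": [],
}

to_build = affected_packages(deps, {"ui"})
skipped = set(deps) - to_build
print(sorted(to_build), sorted(skipped))
```

A change to a leaf-level package such as "ui" rebuilds only it and its dependents, while a change to "core" would still rebuild almost everything. That is why the savings show up mainly on routine feature branches.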
Taken together, these two changes brought our median pipeline time to just under five and a half minutes, a reduction of more than fifty percent, without touching the underlying build toolchain, the test framework, or the runner infrastructure. The lesson we came away with is a familiar one, but easy to forget under deadline pressure: in complex systems, the limiting factor is almost never the thing you suspect before you look at the data. Measure first, then move.