CI/CD pipeline that doesn’t suck

The CI/CD pipeline is the single highest-leverage system in your engineering setup — it dictates how fast you can ship, how confidently, and how often someone has to drop their tea to fix it on a Friday afternoon. Most pipelines we inherit from previous teams are either too slow (engineers batch their changes) or too lenient (broken code ships).

Here’s the shape of a pipeline we’d defend in a code review.

The five-stage pipeline

Stage 1: Install + cache (~30s)

Cache aggressively. Node modules, pip packages, Docker layers, build artifacts. Anything that doesn’t change between commits should be retrieved from cache, not rebuilt. GitHub Actions, GitLab CI and CircleCI all have first-class cache primitives — use them.
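One way to make "doesn’t change between commits" concrete is to key the cache on the contents of your lockfiles. The sketch below is illustrative only: the `cache_key` helper and the `deps` prefix are assumptions, not part of any CI provider's API, and most providers offer an equivalent built-in hashing expression.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, lockfiles: list[Path]) -> str:
    """Derive a cache key that changes only when a lockfile's bytes change.

    `prefix` and the choice of lockfiles are hypothetical; pick whatever
    identifies the dependency set in your project.
    """
    digest = hashlib.sha256()
    for path in sorted(lockfiles):
        # Include the file name so that swapping contents between two
        # lockfiles still yields a different key.
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    # A truncated hex digest keeps the key readable in CI logs.
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

If the key matches a stored entry, restore it and skip the install. If not, install and save under the new key.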

Stage 2: Build (~60s)

Type-check, compile, bundle. Whatever produces the artifact your tests run against and your deploy ships. Build once; reuse the artifact across all downstream stages. Don’t build twice (once for test, once for deploy) — that’s how “works on my CI” happens.
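A minimal sketch of how "build once" can be enforced rather than assumed: record a digest of the artifact when it is built, then check that digest before any later stage uses it. The function names here are hypothetical, and how the digest travels between stages (job output, metadata file) is left to your CI system.

```python
import hashlib
from pathlib import Path


def artifact_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a build artifact, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_artifact(path: Path, expected_digest: str) -> None:
    """Raise if the artifact differs from the one the build stage produced."""
    actual = artifact_digest(path)
    if actual != expected_digest:
        raise RuntimeError(
            f"artifact {path.name} has digest {actual}, expected {expected_digest}"
        )
```

Calling `verify_artifact` at the start of the test and deploy stages turns an accidental rebuild into a loud failure instead of a silent difference.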

Stage 3: Test, in parallel (~90s)

Shard your test suite across N workers. With evenly balanced shards, a 6-minute serial suite finishes in about 90 seconds on 4 workers. Vitest, Jest, pytest-xdist and Go’s parallel tests all support splitting or parallelizing a suite without custom tooling. The marginal cost of more workers is dwarfed by the cost of engineers waiting.
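For runners without built-in sharding, the idea can be sketched in a few lines. This is an illustrative round-robin split, not how any particular test runner implements it. The function names are assumptions, and real shards are rarely perfectly balanced, so treat the wall-clock figure as a best case.

```python
def shard(test_files: list[str], shard_index: int, shard_count: int) -> list[str]:
    """Return the files one worker should run, using a round-robin split.

    Sorting first makes the split deterministic, so every worker computes
    the same partition independently.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return sorted(test_files)[shard_index::shard_count]


def ideal_wall_clock_seconds(serial_seconds: float, workers: int) -> float:
    """Best-case duration, assuming perfectly balanced shards and no startup cost."""
    return serial_seconds / workers


# A 6-minute (360 s) serial suite on 4 workers: 360 / 4 = 90 s in the ideal case.
```

Splitting by file count is the simplest option. If a few files dominate the runtime, splitting by recorded duration gets closer to the ideal.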

Inside this stage, run in parallel: unit tests, integration tests, type check, lint, a11y tests, visual regression. They don’t depend on each other; let them race.

Stage 4: Preview deploy (~60s)

Every PR gets a real URL with the change deployed. Real env, real DB (a clone), real traffic-shape. Reviewers click the link; PMs click the link; designers click the link. The PR conversation is grounded in something tangible.

Stage 5: Production deploy (gated)

On merge to main, the same artifact rolls out to production. The rollout is gated either automatically (canary at 5% → wait for health → ramp) or manually (Slack approval). The gate is about confidence, not about meetings.
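The automatic gate can be sketched as a small loop. Everything here is an assumption for illustration: `set_traffic_percent` and `is_healthy` stand in for whatever your deploy platform and monitoring expose, and the ramp steps after 5% and the soak time are placeholder values.

```python
import time
from typing import Callable, Sequence


def canary_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (5, 25, 50, 100),
    soak_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Ramp traffic to the new version step by step, checking health after each soak.

    Returns True if the rollout reached the final step, False if it was halted.
    """
    for percent in steps:
        set_traffic_percent(percent)
        # Let the new version take real traffic before judging it.
        sleep(soak_seconds)
        if not is_healthy():
            # Halt the ramp and stop routing traffic to the new version.
            set_traffic_percent(0)
            return False
    return True
```

Passing `sleep` as a parameter keeps the function testable without waiting out real soak periods.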

Roll forward, not back

When prod breaks, your instinct will be to roll back to the previous version. Resist. Rolling forward is more honest, and it forces you to fix the underlying problem instead of papering over it. Most modern deploy systems (Vercel, Fly, Render, AWS with proper canaries) make roll-forward as cheap as roll-back, without the “we lost the migration” risks.

Reserve rollback for the genuinely catastrophic: data corruption, security incident, complete outage. For everything else — fix forward.

Pipeline-as-code, version-controlled

Your pipeline lives in .github/workflows/ (or equivalent), not in a web UI. Reviewable. Diffable. Reverts cleanly.
Secrets in a real secrets manager (GitHub Secrets, AWS Secrets Manager). Never inline.
Environment-specific values driven by environment variables, not by hand-editing steps.
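As a rough sketch of the last point, a deploy script can read everything environment-specific from variables and fail fast when one is missing. The variable names (`DEPLOY_ENV`, `API_BASE_URL`, `CANARY_PERCENT`) and the `DeployConfig` shape are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DeployConfig:
    environment: str
    api_base_url: str
    canary_percent: int


def load_config(env: Mapping[str, str] = os.environ) -> DeployConfig:
    """Build the deploy config from environment variables only."""
    required = ("DEPLOY_ENV", "API_BASE_URL")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail before any deploy step runs, naming exactly what is absent.
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return DeployConfig(
        environment=env["DEPLOY_ENV"],
        api_base_url=env["API_BASE_URL"],
        # Optional value with an explicit default.
        canary_percent=int(env.get("CANARY_PERCENT", "5")),
    )
```

The same script then runs unchanged for preview and production. Only the variables injected by the pipeline differ.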

The metrics that matter

Watch these four numbers like a hawk:

P50 pipeline duration. Target: <5 min. Once CI takes long enough that engineers switch to something else while they wait, they start batching changes.
Flakiness rate. A test that fails 1% of the time is a bug. Quarantine it the first time it’s flaky; fix or delete within a week. Flaky tests poison trust in the whole pipeline.
Mean time to revert. When prod breaks, how fast can you ship a revert? Target: <10 min.
Deploys per day. Health metric. If it’s zero on most days, deploying is painful enough to shape engineering behavior.
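Three of these numbers fall out of data most CI systems already export. The sketch below assumes you have pipeline durations, per-test pass/fail history on unchanged code, and a count of deploys. The function names and input shapes are illustrative, and the flakiness definition used here (failing runs divided by total runs) is one reasonable choice among several.

```python
from statistics import median


def p50_duration_seconds(durations: list[float]) -> float:
    """Median pipeline duration; compare against the 5-minute (300 s) target."""
    return median(durations)


def flakiness_rate(outcomes: list[bool]) -> float:
    """Fraction of failing runs for one test on unchanged code (True = pass)."""
    if not outcomes:
        raise ValueError("need at least one run")
    failures = sum(1 for passed in outcomes if not passed)
    return failures / len(outcomes)


def deploys_per_day(deploy_count: int, window_days: int) -> float:
    """Average production deploys per day over a reporting window."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    return deploy_count / window_days
```

Mean time to revert needs incident timestamps (detection and revert shipped), so it is usually computed from incident records rather than CI logs.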

How we approach this

Every project we ship via Ongoing Maintenance and SaaS Product Development ships with this pipeline shape pre-built — cached, parallel, preview-per-PR, canary-gated to prod. We treat the pipeline as a product, not a one-time setup.

Takeaways

Five stages: install · build · test (parallel) · preview · prod (gated).
Cache everything reusable. Build once. Test in parallel.
Every PR gets a real preview URL.
Roll forward, not back, for most incidents.
Watch P50 duration, flakiness, mean time to revert, deploys/day.
