Audit template
performance
Find where the system is slow, where it falls over under load, and where it wastes resources.
Maps to: SRE (Site Reliability Engineering: running systems as an engineering practice, using SLOs and error budgets) · DORA (DevOps Research and Assessment: the four key software-delivery performance metrics) · SLOs (Service Level Objectives: targets for a reliability metric, such as 99.9% availability)
Eleven specialists, in parallel
Each finding is evidence-bound and survives ≥2-of-3 adversarial skeptics.
How this audit works
Eleven specialists run in parallel over server-side latency, throughput, and scaling behavior: algorithmic hot paths, database query efficiency and N+1 patterns (one extra query per row instead of one for the whole set), caching, concurrency, leaks, network I/O, resilience, and cost. Every finding cites a concrete artifact (a hot-path file:line, a query plan, a measured latency, or a profiler frame), labels itself measured or reasoned, and survives adversarial verification before it lands. Each confirmed fix ships with an estimated metric improvement and the load level at which the path breaks today.
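As a rough illustration of what "evidence-bound" and "≥2-of-3 skeptics" could mean in practice, here is a minimal sketch of a finding record. The `Finding` class, its field names, and the `confirmed` rule are hypothetical and not the audit's actual schema; they only mirror the requirements stated above (a cited artifact, a measured-or-reasoned label, and a skeptic vote).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Finding:
    """Hypothetical record for one audit finding (illustrative only)."""
    title: str
    artifact: str                          # e.g. "orders/views.py:88" or a query plan excerpt
    basis: Literal["measured", "reasoned"]  # how the claim was established
    # One entry per skeptic: True means the skeptic could not refute the finding.
    skeptic_votes: list[bool] = field(default_factory=list)

    def confirmed(self, required: int = 2) -> bool:
        # A finding lands only if it cites an artifact and enough skeptics upheld it.
        has_evidence = bool(self.artifact.strip())
        return has_evidence and sum(self.skeptic_votes) >= required


if __name__ == "__main__":
    f = Finding("N+1 on order list", "orders/views.py:88", "measured", [True, True, False])
    print(f.confirmed())
```

A finding with no artifact is rejected regardless of the votes, which matches the rule that every finding must point at something concrete.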
Use it when
An endpoint got slow after a release
A request path that used to be fast now drags, and the trace points at the database. The audit hunts N+1 patterns per endpoint, reads the query plan for full scans and missing indexes, and quantifies each: queries per request, p95 before versus after, and the index or batched query that fixes it.
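To make the N+1 pattern and its fix concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders`/`items` schema, the `CountingDB` wrapper, and the index name are all made up for illustration; the point is only the contrast in queries per request and the change in the query plan once an index exists.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory schema: 50 orders with 2 items each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
        """
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i, f"customer-{i}") for i in range(1, 51)])
    conn.executemany("INSERT INTO items (order_id, sku) VALUES (?, ?)",
                     [(i, f"sku-{i}-{j}") for i in range(1, 51) for j in range(2)])
    return conn


class CountingDB:
    """Hypothetical wrapper that counts queries issued per request."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def items_n_plus_one(db: CountingDB) -> dict:
    # One query for the orders, then one more query per order row.
    orders = db.query("SELECT id FROM orders ORDER BY id")
    return {
        oid: [sku for (sku,) in db.query(
            "SELECT sku FROM items WHERE order_id = ? ORDER BY id", (oid,))]
        for (oid,) in orders
    }


def items_batched(db: CountingDB) -> dict:
    # One query for the orders, one query for all of their items.
    ids = [oid for (oid,) in db.query("SELECT id FROM orders ORDER BY id")]
    grouped = {oid: [] for oid in ids}
    marks = ",".join("?" * len(ids))
    rows = db.query(
        f"SELECT order_id, sku FROM items WHERE order_id IN ({marks}) ORDER BY id", ids)
    for oid, sku in rows:
        grouped[oid].append(sku)
    return grouped


def plan(conn: sqlite3.Connection, sql: str, params=()) -> list:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


if __name__ == "__main__":
    conn = make_db()
    slow, fast = CountingDB(conn), CountingDB(conn)
    items_n_plus_one(slow)
    items_batched(fast)
    print("queries per request:", slow.queries, "vs", fast.queries)

    lookup = "SELECT sku FROM items WHERE order_id = ?"
    print("before index:", plan(conn, lookup, (1,)))
    conn.execute("CREATE INDEX idx_items_order_id ON items(order_id)")
    print("after index: ", plan(conn, lookup, (1,)))
```

The exact wording of the plan output varies by SQLite version, but a full scan should change to an index search after the index is created. The query count is the number to report per endpoint, alongside measured p95.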
Sizing for a 10x traffic increase
Before a launch or campaign you need to know whether the system holds. The audit reasons each critical path through at 2x and 10x current load, names the first bottleneck to saturate — a hot row, a global lock, an undersized connection pool — and states the load level at which it breaks and the realistic ceiling after remediation.
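The kind of reasoning described above can be sketched with Little's law for one candidate bottleneck, an undersized connection pool. All numbers below (100 requests per second, a pool of 20, a 50 ms connection hold time) are assumed for illustration, and the function names are hypothetical. The model is a steady-state average; bursty traffic will usually hit the limit earlier than this estimate suggests.

```python
def pool_ceiling_rps(pool_size: int, hold_ms: float) -> float:
    """Little's law: connections in use = arrival rate * hold time.

    The pool saturates when connections in use equals pool_size, so the
    ceiling is pool_size / hold time (hold time given in milliseconds).
    """
    return pool_size * 1000.0 / hold_ms


def breaking_multiplier(current_rps: float, pool_size: int, hold_ms: float) -> float:
    """Multiple of current load at which the pool is fully used."""
    return pool_ceiling_rps(pool_size, hold_ms) / current_rps


def survives(current_rps: float, pool_size: int, hold_ms: float,
             multipliers=(2, 10)) -> dict:
    """For each load multiplier, report whether the pool still has capacity."""
    ceiling = pool_ceiling_rps(pool_size, hold_ms)
    return {m: current_rps * m <= ceiling for m in multipliers}


if __name__ == "__main__":
    # Assumed figures for illustration only.
    print(pool_ceiling_rps(20, 50))          # 400.0 requests per second
    print(breaking_multiplier(100, 20, 50))  # 4.0, so it holds at 2x and breaks before 10x
    print(survives(100, 20, 50))             # {2: True, 10: False}
    # If remediation (say, an index) cut the hold time to 10 ms:
    print(survives(100, 20, 10))             # {2: True, 10: True}
```

The same arithmetic applies to any resource held for a bounded time per request, such as worker threads. A hot row or global lock behaves like a pool of size one.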
A dependency slowdown caused a cascade
One slow downstream call backed up threads and took the service with it. The audit checks timeout discipline on every external call, hunts retries without backoff (waiting progressively longer between retries so a struggling dependency can recover) and jitter (random variation added to retry timing so many clients don't retry in the same instant), and flags missing circuit breakers (guards that stop calling a failing dependency for a while, so failures don't pile up), backpressure (a signal that slows a fast producer so a slower consumer is not overwhelmed), and load shedding (deliberately dropping or rejecting some requests under overload to keep the rest healthy). The result pinpoints the failure mode when a dependency is slow, not just down.
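For reference, the two patterns the audit most often finds missing can be sketched in a few lines. This is a minimal illustration, not a production library: `call_with_retries`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names, the thresholds are arbitrary, and the wrapped function `fn` is assumed to enforce its own timeout.

```python
import random
import time


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry fn with exponential backoff and full jitter.

    The delay before retry n is uniform in [0, min(cap, base * 2**n)], so
    waits grow over time and clients do not retry in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap, base * 2 ** n)))


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the breaker is open is what keeps a slow dependency from tying up threads. Without it, every caller waits out the full timeout before giving up.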
What you get
A scorecard graded per dimension plus a priority-sorted backlog of GitHub issues, each with the evidence, the quantified cost, and a before/after fix with its estimated metric gain.