thebuilder.company
technical

designing systems that fade into silence

read time: 7 min

most systems scream when they are failing. logs explode, dashboards turn red, and alerts page the people who were trying to sleep.

we prefer systems that fade into silence — where the lack of noise is a signal that things are working as intended.

that means conservative defaults, explicit backpressure, and architectures that treat failure as an expected state, not an exception.

the noise problem

failure noise is the familiar half of the problem. the deeper half is that many systems also scream when they are working. info logs on every request. heartbeat events every 30 seconds. metrics pipelines emitting thousands of data points per minute that nobody reads.

when everything is loud, nothing is a signal. the alert for the actual outage gets lost in the daily noise of the system just existing.

silence as signal

we design aura's infrastructure around a different model: silence is the expected state. the absence of output means things are running as intended.

this is not a new idea. it is the unix rule of silence applied to observability: programs that have nothing to say should say nothing. a successful request should emit no log. a healthy service should produce no alert.

the practical consequence is that when something does emit output, someone pays attention to it.
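a minimal sketch of what that can look like in a request handler, written in python with the standard logging module. the service name, payload shape, and function name are all hypothetical; the only point is that the success path emits nothing and the failure path emits exactly one line.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical service name


def handle_request(payload: dict) -> dict:
    """Process one order request; stay silent on success, warn on failure."""
    try:
        total = sum(item["price"] * item["qty"] for item in payload["items"])
    except (KeyError, TypeError) as exc:
        # The only output this handler ever produces: one line, on failure.
        logger.warning("rejected malformed order: %r", exc)
        return {"ok": False, "error": "malformed order"}
    # Success path: no log line, no event. Silence means it worked.
    return {"ok": True, "total": total}
```

with a handler like this, a single line in the log is already worth reading, because nothing else is competing with it.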

conservative defaults

silence-first design starts at the defaults. every new service we write logs at warn level in production. info and debug are development-only. we err on the side of logging less, and add more only when something turns out to need it.
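one way to encode that default, sketched with python's standard logging module. the APP_ENV variable name and the "development" value are assumptions made for illustration, not something the services above are known to use. anything that is not explicitly development gets the quiet production level.

```python
import logging
import os
from typing import Optional


def configure_logging(env: Optional[str] = None) -> int:
    """Set the root log level from the environment name and return it."""
    # APP_ENV is a hypothetical variable name. A missing or unknown value
    # falls back to the quiet production default rather than the loud one.
    if env is None:
        env = os.environ.get("APP_ENV", "production")
    level = logging.DEBUG if env == "development" else logging.WARNING
    # force=True replaces any handlers configured earlier in the process.
    logging.basicConfig(level=level, force=True)
    return level
```

the design choice worth noting is the direction of the fallback: a typo in the environment name makes the service quieter, never louder.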

the same applies to retries, timeouts, and circuit breakers. we set them once, document the reasoning, and treat any change as requiring a small design discussion. you should know why your timeout is 3 seconds, not 30.
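a sketch of "set once, document the reasoning" as code, using a frozen python dataclass. every name and number here is illustrative apart from the 3 second timeout mentioned above, and the reasoning in the comments is a placeholder for whatever your own measurements say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallPolicy:
    """Limits for calls to one downstream dependency (illustrative values)."""

    # Placeholder reasoning: long enough for the dependency's slow tail,
    # short enough that the caller fails fast instead of hanging.
    timeout_seconds: float = 3.0
    # Placeholder reasoning: enough to ride out a brief blip without
    # multiplying load on a dependency that is already struggling.
    max_retries: int = 2
    # Consecutive failures before the breaker opens (hypothetical value).
    breaker_failure_threshold: int = 5
    # How long the breaker stays open before a trial call (hypothetical value).
    breaker_reset_seconds: float = 30.0

    def worst_case_seconds(self) -> float:
        """Upper bound on time spent in attempts, ignoring any backoff."""
        return self.timeout_seconds * (1 + self.max_retries)


PAYMENTS_POLICY = CallPolicy()  # hypothetical dependency name
```

because the dataclass is frozen, the values cannot be mutated at runtime, so changing one means editing this definition, which is a natural place for the design discussion to happen. worst_case_seconds makes the combined cost of a timeout and its retries visible: with these defaults, one logical call can take up to 9 seconds before it gives up.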

failure as expected state

the other half of calm systems is treating failure as a normal case, not an exceptional one. every network call can fail. every disk write can fail. every downstream service will eventually be unavailable.

when failure is expected, your code handles it explicitly rather than letting it propagate up the call stack until something crashes. the system degrades gracefully. the user sees a useful message. the operator sees a quiet, informative log entry.

we have a rule: if a failure path in your code ends in a panic or an unhandled exception, you have not finished designing that path yet.
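a minimal python sketch of a failure path that has been designed all the way through. the Outcome type, the load_profile function, and the fetch callable are hypothetical names introduced for this example. the downstream call is allowed to fail, and failure comes back as an ordinary return value carrying a message for the user, with one warn-level line for the operator.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("profile")  # hypothetical service name


@dataclass(frozen=True)
class Outcome:
    """Result of a call that is allowed to fail."""

    value: Optional[dict]
    user_message: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.value is not None


def load_profile(fetch: Callable[[str], dict], user_id: str) -> Outcome:
    """Call a downstream service, turning failure into a normal return value."""
    try:
        return Outcome(value=fetch(user_id))
    except (ConnectionError, TimeoutError) as exc:
        # One quiet, informative line for the operator.
        logger.warning("profile fetch failed for user %s: %s", user_id, exc)
        # A useful message for the user, and no crash.
        return Outcome(
            value=None,
            user_message="your profile is temporarily unavailable. please try again shortly.",
        )
```

the caller checks outcome.ok and renders either the profile or the message. only the failures this path was designed for are caught; anything else still surfaces, which is the cue that another path needs designing.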
