Designing Systems: Core Elements

This is Part 2 of the Designing Systems series.

Part 1: How to Approach a System Design: Framing, requirements vs constraints vs tradeoffs, the hidden assumptions, and the design as a communication artifact.
Part 2: Core Elements: Data, state, control flow, failure, and observability - the five things every system design has to decide.
Part 3: Structuring for Change: Boundaries, contracts, evolution, and ownership. How a system survives the changes that come after the first version ships.
Part 4: Anti-Patterns and What Experience Teaches: The failure modes that show up repeatedly and what a good design actually feels like.

Part 1 was about how to slow down before drawing. Part 2 is about what to draw once you start. Every non-trivial system design makes decisions, explicitly or implicitly, about the same five things: data, state, control flow, failure, and observability. Most designs that go wrong went wrong at one of these five hinge points - usually because the decision was implicit rather than explicit.

Data
State
Control Flow
Failure
Observability
Conclusion

Data

Data is the part of the system that outlives everything else. Services get rewritten. Frameworks get replaced. The data and the access patterns built on top of it tend to stay.

A useful framing for the data decisions is to answer four questions, in order:

What is the shape? Entities, relationships, cardinality. The shape determines what storage technologies are even reasonable. A graph with deep traversal needs is different from a flat event stream, which is different again from a typical OLTP row store.
What is the access pattern? Reads vs writes, point lookups vs range scans vs aggregations, latency-sensitive vs throughput-sensitive. The access pattern determines indexing, partitioning, and whether you need read replicas, caches, or denormalised views.
What is the volume and velocity? How much data sits in the system, how much arrives per second, how much you have to keep around. The answers determine partitioning strategy and operational overhead.
What is the retention? Forever, seven years for compliance, ninety days for tracing, a sliding window for caches. Retention is one of the most commonly under-specified properties and one of the most expensive to get wrong.

Principles:

Treat the data model as the most expensive part of the system to change. Spend more time here than feels comfortable.
Separate the storage decision from the schema decision. Postgres can serve several access patterns; that does not mean you should design as though it can serve all of them well.
Assume the volume and velocity will grow at least one order of magnitude before the system is replaced. Designs that anticipate that growth tend to absorb it gracefully; designs that do not get expensive to operate or to rework.
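A rough way to see how velocity and retention combine into volume is to multiply them out. The sketch below is illustrative only: the `DataProfile` class, its field names, and the trace-data figures are all assumptions made up for the example, not measurements from any real system.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical record of the volume, velocity, and retention answers."""
    name: str
    avg_record_bytes: int
    writes_per_second: float
    retention_days: int

    def steady_state_bytes(self, growth_factor: float = 1.0) -> float:
        """Bytes held once the retention window is full, at a given growth factor."""
        retained_seconds = self.retention_days * SECONDS_PER_DAY
        bytes_per_second = self.avg_record_bytes * self.writes_per_second * growth_factor
        return bytes_per_second * retained_seconds


# Made-up figures: 2 KB trace records, 500 writes/s, kept for ninety days.
traces = DataProfile("traces", avg_record_bytes=2_000, writes_per_second=500, retention_days=90)

today_tb = traces.steady_state_bytes() / 1e12
grown_tb = traces.steady_state_bytes(growth_factor=10) / 1e12
print(f"{traces.name}: {today_tb:.2f} TB today, {grown_tb:.2f} TB at 10x")
```

Even this crude arithmetic makes the retention question concrete: doubling the retention window doubles the steady-state footprint, and an order-of-magnitude growth in write rate does the same ten times over.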

Ownership matters too. Every piece of data should have an owner - a team or service responsible for its correctness. Shared ownership is a polite way of saying nobody owns it, and unowned data is the most reliable source of long-term inconsistency in a system.

State

State is whatever the system remembers between requests. If you remove state, each response becomes a pure function of the request's inputs. If you keep state, you have to decide where it lives, who is allowed to modify it, and how the rest of the system finds out when it changes.

A useful first cut is to separate primary state from derived state.

Primary state is the source of truth. There should be exactly one source of truth for each fact in the system. The price of an order, the status of a payment, the membership of a user in a group - each of these has one place that owns the answer, and everywhere else either reads from it or holds a derived copy.

Derived state is a projection of the primary state, kept somewhere else for performance or convenience. Caches are derived state. Search indexes are derived state. Read replicas, denormalised views, materialised aggregates, and most analytics pipelines are derived state.

Principles:

Be explicit about which state is primary and which is derived. Conflating them is how systems lose data.
Derived state has to be rebuildable from primary state. If a cache or index gets corrupted or lost, there should be a clear path to regenerating it.
Treat the boundary between primary and derived as a design decision, not an implementation detail. The boundary determines what fails when one side gets ahead of the other.
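The rebuildability principle can be sketched in a few lines. This is a minimal in-memory illustration, assuming a hypothetical `GroupMembership` store as the primary state and a user-to-groups index as the derived state; a real system would have a database and a cache or search index in these roles.

```python
class GroupMembership:
    """Primary state: the single source of truth for who is in which group."""

    def __init__(self) -> None:
        self._members: dict[str, set[str]] = {}  # group -> users

    def add(self, group: str, user: str) -> None:
        self._members.setdefault(group, set()).add(user)

    def snapshot(self) -> dict[str, set[str]]:
        # Return copies so callers cannot mutate the source of truth.
        return {group: set(users) for group, users in self._members.items()}


def build_user_index(primary: GroupMembership) -> dict[str, set[str]]:
    """Derived state: user -> groups, computed only from primary state."""
    index: dict[str, set[str]] = {}
    for group, users in primary.snapshot().items():
        for user in users:
            index.setdefault(user, set()).add(group)
    return index


primary = GroupMembership()
primary.add("admins", "ana")
primary.add("oncall", "ana")
primary.add("oncall", "ben")

index = build_user_index(primary)
index.clear()                      # simulate losing the derived copy
index = build_user_index(primary)  # the recovery path: regenerate from primary
```

The point of the shape is that nothing ever writes to the index directly. If a fact only exists in the derived copy, it is no longer derived, and losing it means losing data.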

The second decision is mutability. Mutable state is what most systems start with - update the row, change the value. Immutable state, where new facts are appended and old facts are kept, is the design choice behind event sourcing, append-only logs, and most modern analytics systems. Neither is right or wrong. The choice has consequences for auditability, debuggability, and how easy it is to reason about what the system was doing at a particular moment in time.
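To make the immutable option concrete, here is a small sketch of an append-only log that can answer "what was the status at time T?". The `PaymentEvent` and `PaymentLog` names, the integer timestamps, and the status values are hypothetical, and the sketch assumes events are appended in timestamp order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PaymentEvent:
    at: int           # logical timestamp (assumed to be non-decreasing)
    payment_id: str
    status: str


@dataclass
class PaymentLog:
    """Append-only: facts are added, never updated or deleted."""
    events: list[PaymentEvent] = field(default_factory=list)

    def append(self, event: PaymentEvent) -> None:
        self.events.append(event)

    def status_as_of(self, payment_id: str, at: int) -> Optional[str]:
        """Replay the log up to `at` and return the last status seen, if any."""
        status = None
        for event in self.events:
            if event.payment_id == payment_id and event.at <= at:
                status = event.status
        return status


log = PaymentLog()
log.append(PaymentEvent(1, "p1", "pending"))
log.append(PaymentEvent(5, "p1", "authorised"))
log.append(PaymentEvent(9, "p1", "captured"))

print(log.status_as_of("p1", 6))    # status as it stood at time 6
print(log.status_as_of("p1", 100))  # latest status
```

With a mutable row, the question about time 6 cannot be answered after the fact; with the log, it is a replay. The cost is that every read of current state is either a replay or a derived view that has to be maintained.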

Control Flow

Control flow is how work moves through the system. Three pairs of choices show up in almost every design.

Synchronous vs asynchronous. A synchronous call blocks the caller until the work is done. An asynchronous call hands off the work and returns immediately. Sync is easier to reason about; async scales better and isolates failure. Most non-trivial systems end up with a mix, and the interesting design question is which calls are sync and why.

Push vs pull. A pushed event is sent by the producer to the consumer. A pulled event is fetched by the consumer from a queue or stream. Push is faster when the consumer is ready; pull is more robust when the consumer is slow, busy, or temporarily down. Most production systems use pull or push-with-acknowledgement for anything that has to be reliable.
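The reliability argument for pull with acknowledgement can be sketched with an in-memory stand-in for a broker. `AckQueue` and its method names are hypothetical and do not correspond to any particular messaging product; real brokers add leases, timeouts, and persistence on top of this idea.

```python
from collections import deque


class AckQueue:
    """In-memory stand-in for a broker offering pull plus explicit acknowledgement."""

    def __init__(self) -> None:
        self._ready = deque()   # (msg_id, payload) waiting to be pulled
        self._in_flight = {}    # msg_id -> payload, pulled but not yet acked
        self._next_id = 0

    def publish(self, payload) -> None:
        self._ready.append((self._next_id, payload))
        self._next_id += 1

    def pull(self):
        """Hand out the next message, or None if nothing is ready."""
        if not self._ready:
            return None
        msg_id, payload = self._ready.popleft()
        self._in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id) -> None:
        """The consumer confirms the work is done; the message is forgotten."""
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self) -> None:
        """Make unacknowledged messages available again, e.g. after a consumer crash."""
        for msg_id in sorted(self._in_flight):
            self._ready.append((msg_id, self._in_flight[msg_id]))
        self._in_flight.clear()


queue = AckQueue()
queue.publish("a")
queue.publish("b")

first = queue.pull()
queue.ack(first[0])             # "a" processed successfully

second = queue.pull()           # "b" pulled, then the consumer dies before acking
queue.requeue_unacked()
redelivered = queue.pull()      # "b" comes back instead of being lost
```

Because a message can be delivered more than once under this scheme, consumers generally need to be idempotent. That is the price of not losing work when the consumer is slow or down.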

Fan-out vs fan-in. Fan-out is one event producing many consequences (a webhook firing five downstream actions). Fan-in is many events producing one consequence (an aggregator collecting per-user clicks into a daily summary). Both patterns are useful; both are common sources of unintended coupling when the fan ratio gets large.
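As a small illustration of fan-in, the sketch below collapses many click events into one summary per day. The `daily_summary` function, the `(day, user)` event shape, and the sample data are assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def daily_summary(clicks):
    """Fan-in: many (day, user) click events collapse into one summary per day."""
    summary = defaultdict(Counter)
    for day, user in clicks:
        summary[day][user] += 1
    # Convert to plain dicts so the result is easy to serialise or compare.
    return {day: dict(counts) for day, counts in summary.items()}


clicks = [
    ("2024-05-01", "ana"),
    ("2024-05-01", "ben"),
    ("2024-05-01", "ana"),
    ("2024-05-02", "ben"),
]
summary = daily_summary(clicks)
print(summary)
```

The coupling risk shows up in the inputs: every producer of click events now shares a contract with this one aggregator, and a change in any of them can break the summary.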

Choice	Easy mode	Hard mode
Sync vs async	Sync for simple flows	Async for resilience, batching, and isolation
Push vs pull	Push for low fan-out	Pull for reliable consumers, back-pressure, replay
Fan-out vs fan-in	Fan-out for events	Fan-in for aggregation; both for cross-cutting concerns

Principles:

Default to async for anything that crosses a service boundary and does not need an immediate answer.
Default to pull for anything that has to be reliable, ordered, or replayable.
Be conservative about fan-out ratios. Each downstream consumer is a future operational concern.

The most common control-flow failure mode in a real system is implicit coupling: services that look independent but are actually serialised behind a shared queue, or asynchronous flows that quietly turn into synchronous chains because every step is waiting on the next. The work of designing control flow is mostly the work of seeing those couplings before they ship.

Failure

Most designs are built around the happy path. Real systems live in the failure modes. Designing for failure is not pessimism; it is realism about what the system is going to spend most of its operational life doing.

A useful starting point is to ask, for each component: what is the failure mode, how often does it happen, and what does recovery look like?

Component class	Common failure	Typical recovery
Single node / pod	Crash, restart, kernel issue	Auto-restart, health check, replica failover
Single region	Provider outage, network partition	Multi-region replication, regional failover
Storage layer	Disk loss, corruption, latency spike	Replication, backups, point-in-time recovery
Network path	DNS, routing, intermittent loss	Retries with backoff, circuit breakers
Dependent service	Slow, down, or returning bad data	Timeouts, fallbacks, degraded mode
Cascading failure	One slow component dragging the whole system	Bulkheads, rate limiting, load shedding
Data corruption	Bug writes bad data that persists	Validation, audit trails, repair tooling
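One of the recoveries in the table, retries with backoff, is short enough to sketch. This is a minimal version assuming that `ConnectionError` is the only transient failure worth retrying; the function name, defaults, and the "full jitter" delay choice are illustrative rather than a recommendation for any specific client library.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call `operation`, retrying transient ConnectionErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential cap with full jitter, so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
```

Retries are only safe for operations that can be repeated, and unbounded retries are themselves a source of cascading failure, which is why the sketch caps both the number of attempts and the delay.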

Principles:

Distinguish between failures that the system handles automatically and failures that require human intervention. The first set should grow over time. The second set should be documented, paged, and small.
Design for the failure mode you have not seen yet. Most production incidents are not catastrophic - they are slow, partial, ambiguous. The recovery story for "slow and partial" is usually harder than the recovery story for "down."
Practice the failure modes. A failure mode that has never been exercised in a non-prod environment is a failure mode you do not actually understand.

The deeper lesson is that failure is part of the design, not something that happens to a finished design. A system that pretends failure is exceptional will eventually be embarrassed by reality.

[Figure: Where Significant Incidents Trace Back To (Illustrative Distribution)]

The shape of this distribution will vary by system, but the broad message is consistent: incidents are spread across the five hinge points, with no single dominant cause. Designs that treat one element well and the others as afterthoughts inherit incidents from whichever element was neglected.

Observability

The final element is the one that most often gets bolted on at the end of a project, and it is the one that determines whether the system is operable. Observability is not monitoring. Monitoring tells you that something is wrong. Observability lets you find out why.

The three pillars are familiar: metrics, logs, traces. The interesting design questions are about how they fit together.

Metrics are aggregates: counts, distributions, rates. They are cheap, dense, and good for alerts. They are not useful for explaining a specific request.
Logs are events: lines of text emitted by code. They are flexible, expensive at high volume, and indispensable for explaining what one process was doing at one moment.
Traces are connected sequences: the spans of work across services that made up a single logical operation. They are how you actually find out where the latency went.

Principles:

Design the instrumentation in parallel with the system, not after. The metric you wish you had during an incident is the one you forgot to emit six months earlier.
Identifiers that travel with the request (trace IDs, correlation IDs, user IDs) are the connective tissue of observability. Treat them as a first-class concern in the design.
Apply the on-call test: if a stranger gets paged at 3am, can they find the cause from what the system emits? If not, the observability is not done.
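An identifier that travels with the request can be sketched using only the Python standard library. The logger name, the `cid` field, and the `handle_request` function are hypothetical; the output is written to an in-memory buffer purely to keep the example self-contained.

```python
import contextvars
import io
import logging
import uuid

# Holds the current request's correlation ID; "-" means "no request in progress".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the correlation ID of the current context."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(incoming_id=None):
    # Reuse the caller's ID if one arrived with the request, otherwise mint one.
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
        logger.info("payment authorised")
    finally:
        correlation_id.reset(token)


handle_request("req-123")
print(stream.getvalue(), end="")
```

The design choice worth noticing is that the code emitting the log lines never mentions the ID. It is set once at the edge and attached everywhere, which is what makes it cheap enough to be applied consistently.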

A useful frame is to imagine the future engineer who has to debug a problem you did not anticipate. Everything you make it easy for them to see is investment. Everything they have to add themselves under pressure is a debt you left them.

Conclusion

Every non-trivial system design makes decisions about the same five things. Data is the part that outlives everything else and is the most expensive to change. State has to be divided cleanly between primary and derived, with the boundary explicit and the derived side rebuildable. Control flow is mostly a conversation about sync vs async, push vs pull, and fan ratios, and the design failures are usually the implicit couplings. Failure is part of the design from the start, not a postscript. Observability is what determines whether the system is operable, and it has to be built in.

These five are not a checklist that finishes the design. They are the hinge points. Part 3 picks up the next question: once a system makes good decisions about these five, how does it survive the changes that will come after the first version ships - new requirements, new scale, new teams? That is the work of structuring for change.
