
Chapter 1: Thinking About Data Systems

Data systems are not just databases. Message queues and databases both store data for some time, but they have very different access patterns, implementations, and use cases. Yet many modern systems blur the line: Redis is a data store that can also act as a message queue, and Apache Kafka is a message queue with database-like durability guarantees.

Applications are becoming more demanding, and no single tool can handle everything. The work is split into tasks, and the tool best suited to each task is chosen. These tools are then stitched together by application code. For example, an application may need both a cache and a database, and the application code must orchestrate how they work together, keep them in sync, and so on. If we encapsulate this composite behind an API with the right abstraction, it becomes a full-fledged data system in its own right, with guarantees such as caching under a defined policy, reliability, and robustness. Designing such composite systems is what a data system engineer does.
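The cache-plus-database orchestration described above can be sketched as a minimal cache-aside wrapper. This is a hedged illustration, not a real library API: the class name, the dict-backed "database", and the TTL policy are all assumptions made for the sketch.

```python
import time

class CacheAsideStore:
    """Minimal cache-aside orchestration: the application code, not the
    database, is responsible for keeping the cache and the source of
    truth consistent with each other."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db          # source of truth (here: a plain dict, hypothetical)
        self.ttl = ttl_seconds
        self.cache = {}       # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]                   # cache hit, still fresh
        value = self.db[key]                  # cache miss: read the database
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def set(self, key, value):
        self.db[key] = value                  # write the source of truth first
        self.cache.pop(key, None)             # invalidate; next read refills
```

Hiding this logic behind `get`/`set` is exactly the "right abstraction" point: callers see one data system, not two components they must coordinate themselves.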

While building this kind of system, many questions arise. The major concerns can be grouped into three needs:

Reliability

Continuing to work correctly, even when things go wrong: functioning as the user expects, tolerating user mistakes and unexpected usage, performing well enough under the expected load and data volume, and preventing unauthorized access and abuse.

Things that go wrong are called faults, and systems that anticipate them and can cope with them are called resilient or fault-tolerant. A fault is not the same as a failure: a fault is one component deviating from its spec, while a failure is the system as a whole stopping to provide the required service. Faults sometimes lead to failures and sometimes just to unreliability.

You can test your system by creating deliberate chaos: terminating processes or machines at random and checking that errors are handled correctly. Netflix's Chaos Monkey is an example of this in action. Many critical bugs come from poor error handling that doesn't break the system outright but causes unexpected behavior under certain conditions.
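The same idea works at a much smaller scale as deliberate fault injection: wrap a dependency so it fails randomly, then verify the calling code survives. A minimal sketch, with entirely hypothetical names (`FlakyDisk`, `read_with_retry`) standing in for a real storage client and its caller:

```python
import random

class FlakyDisk:
    """Injects random read faults, a tiny stand-in for chaos testing:
    the point is to exercise the caller's error handling, not the
    happy path. Hypothetical API, not a real library."""

    def __init__(self, data, fault_rate=0.3, seed=None):
        self.data = data
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)    # seeded for reproducible tests

    def read(self, key):
        if self.rng.random() < self.fault_rate:
            raise IOError("injected fault: read failed")
        return self.data[key]

def read_with_retry(disk, key, attempts=5):
    """The code under test: does it tolerate injected faults?"""
    for _ in range(attempts):
        try:
            return disk.read(key)
        except IOError:
            continue                      # transient fault: try again
    raise IOError(f"gave up after {attempts} attempts")
```

Seeding the fault injector makes the chaos reproducible, so a failing test run can be replayed exactly.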

⚙️ Hardware Faults

Hard disks have a Mean Time To Failure (MTTF) of about 10 to 50 years. On a storage cluster of 10,000 disks, an average of 1 disk failure can be expected each day.

Failure Rate of 1 disk   = 1 / MTTF
Failure Rate of system   = N / MTTF  (N = Number of disks)

- In 3650 to 18250 days → 1 hard disk fails
- Disk failure rate of 1 disk = 1/3650 to 1/18250
- Disk failure rate of system with 10,000 disks = 10000/3650 to 10000/18250

Worst case:
10000 / 3650 = 2.74 disk failures per day  (10 year MTTF)

Best case:
10000 / 18250 = 0.55 disk failures per day  (50 year MTTF)

Average:
(2.74 + 0.55) / 2 = 1.65 disk failures per day
in a system of 10,000 hard disks with MTTF 10-50 years
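The arithmetic above can be reproduced in a few lines, under the same simplifying assumption as the text: a constant per-disk failure rate of 1/MTTF (and a 365-day year).

```python
# Expected daily disk failures in a cluster, given a per-disk MTTF.
# Figures match the text: 10,000 disks, MTTF between 10 and 50 years.
DAYS_PER_YEAR = 365
N_DISKS = 10_000

def daily_failures(n_disks, mttf_years):
    """Expected failures per day = N / MTTF, with MTTF in days."""
    return n_disks / (mttf_years * DAYS_PER_YEAR)

worst = daily_failures(N_DISKS, 10)   # 10000 / 3650  ≈ 2.74 failures/day
best = daily_failures(N_DISKS, 50)    # 10000 / 18250 ≈ 0.55 failures/day
print(f"worst: {worst:.2f}/day, best: {best:.2f}/day")
```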

If faults are not anticipated and tolerated from the start, something somewhere in the system will be failing every single day.

Traditional mitigations for hardware faults are hardware-level: RAID for disks, dual power supplies for servers, hot-swappable CPUs, and battery or generator backup power. But as an application scales across many machines, mitigating hardware faults this way becomes impractical. Increasingly, redundancy at the machine level is the preferred approach: if one machine fails and its work can be restored on another machine fast enough, the user never notices.

Even AWS prioritizes flexibility and elasticity over single-machine reliability: EC2 instances can fail without warning and are not designed to be perfectly fault-tolerant. Instead, a redundant setup is used and work is restored on another instance within a reasonable time. This whole-machine redundancy has an extra benefit: nodes can be patched one at a time, so the application never needs planned downtime for maintenance or releases. This is a rolling upgrade. Hardware faults are also handled at the software level, but it is mainly redundancy that does the heavy lifting.

🧩 Software Errors

— notes in progress —

Scalability

— notes in progress —

Maintainability

— notes in progress —

