Data systems are not just databases. Message queues and databases both store data for some time, but they have very different implementations and access patterns. Yet many modern systems blur the line: Redis is a datastore that can also act as a message queue, and Apache Kafka is a message queue with database-like durability guarantees.
Applications are becoming more demanding, and no single tool can handle everything. Instead, the work is broken into tasks, each handled by the tool best suited to it, and those tools are stitched together by application code. For example, an application may need both a cache and a database, and the application code must orchestrate how they work together and keep each other up to date. If we encapsulate all of this behind an API with the right abstraction, it becomes a full-fledged data system of its own, with guarantees around caching policy, reliability, and robustness. This is the job of a data systems engineer.
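As a concrete illustration of the cache-plus-database orchestration described above, here is a minimal cache-aside sketch. The class and key names are hypothetical, and plain dicts stand in for a real cache (e.g. Redis) and database:

```python
class CacheAsideStore:
    """Reads go to the cache first; misses fall through to the database.
    Writes update the database and invalidate the cached copy."""

    def __init__(self, db, cache):
        self.db = db        # any dict-like primary store (the source of truth)
        self.cache = cache  # any dict-like cache (Redis in a real deployment)

    def read(self, key):
        value = self.cache.get(key)
        if value is None:                # cache miss
            value = self.db.get(key)     # fall through to the database
            if value is not None:
                self.cache[key] = value  # populate the cache for next time
        return value

    def write(self, key, value):
        self.db[key] = value             # update the source of truth first
        self.cache.pop(key, None)        # invalidate the stale cache entry

store = CacheAsideStore(db={}, cache={})
store.write("user:1", "Alice")
print(store.read("user:1"))  # first read misses, fills the cache
print(store.read("user:1"))  # second read is served from the cache
```

Hiding this read/write choreography behind `read` and `write` is exactly the kind of abstraction that turns a pile of tools into a data system.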
While building this kind of system, many questions arise:
- How do you ensure that the data remains correct and complete, even when things go wrong internally?
- How do you provide consistently good performance to clients, even when parts of your system are degraded?
- How do you scale to handle an increase in load?
- What does a good API for the service look like?
The major concerns while building this system can be grouped into three needs:
- 🛡 Reliability — The system should continue to work correctly even in the face of adversity (hardware or software faults, and even human error)
- 📈 Scalability — As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth
- 🧹 Maintainability — Maintaining current behavior and adapting the system to new use cases
Reliability
Continuing to work correctly, even when things go wrong: performing the function the user expects, tolerating user mistakes and unexpected usage, delivering good enough performance under the expected load, and preventing unauthorized access and abuse.
Things that can go wrong are called faults, and systems that anticipate faults and provide capabilities to handle them are called resilient or fault-tolerant. A fault is not the same as a failure: a fault is one component deviating from its spec, while a failure is the system as a whole ceasing to provide the required service. Faults sometimes lead to failures and sometimes only to degraded reliability.
You can test your system by creating deliberate chaos — terminating a system suddenly and checking error handling. Netflix Chaos Monkey is an example of this in action. Most bugs come from poor error handling that doesn't break the system outright but causes unexpected behavior under certain conditions.
⚙️ Hardware Faults
- Hard disk crash
- Faulty RAM
- Power grid blackout
- Unplugging the wrong cable
Hard disks have a Mean Time To Failure (MTTF) of about 10 to 50 years. On a storage cluster of 10,000 disks, an average of 1 disk failure can be expected each day.
Failure Rate of 1 disk = 1 / MTTF
Failure Rate of system = N / MTTF (N = Number of disks)
- In 3650 to 18250 days → 1 hard disk fails
- Disk failure rate of 1 disk = 1/3650 to 1/18250
- Disk failure rate of system with 10,000 disks = 10000/3650 to 10000/18250
Worst case:
10000 / 3650 = 2.74 disk failures per day (10 year MTTF)
Best case:
10000 / 18250 = 0.55 disk failures per day (50 year MTTF)
Average:
(2.74 + 0.55) / 2 = 1.65 disk failures per day
in a system of 10,000 hard disks with MTTF 10-50 years
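The arithmetic above can be reproduced in a few lines (the function name is my own; the formula is just N / MTTF with the MTTF converted to days):

```python
DAYS_PER_YEAR = 365
N_DISKS = 10_000

def failures_per_day(n_disks, mttf_years):
    """Expected daily failures = N / MTTF, with MTTF expressed in days."""
    return n_disks / (mttf_years * DAYS_PER_YEAR)

worst = failures_per_day(N_DISKS, 10)  # 10000 / 3650  ~= 2.74
best = failures_per_day(N_DISKS, 50)   # 10000 / 18250 ~= 0.55
average = (worst + best) / 2           # roughly 1.6 failures per day

print(round(worst, 2), round(best, 2), round(average, 2))
```

Even the best case means a disk dying every other day, which is why fault tolerance has to be designed in rather than bolted on.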
If hardware faults are not anticipated and tolerated from the start, something in a system of this size will fail every single day.
To mitigate hardware faults: RAID for disks, dual power supplies for servers, hot-swappable CPUs, and battery or generator backup power. But as the application scales, mitigating hardware faults this way becomes impractical. More recently, redundancy at the machine level is the preferred approach: if one machine fails and another can take over fast enough (for example by restoring from backup), the user never notices.
Even AWS prioritizes flexibility and elasticity over single-machine reliability: EC2 instances can fail without warning and are not designed to be perfectly fault-tolerant. Instead, redundant instances are set up and restored from backup within a reasonable time. This machine-level redundancy has an extra operational benefit: each node can be patched and restarted independently, so the application never needs planned downtime for maintenance or releases. This is a rolling upgrade. Hardware fault tolerance is also handled at the software level, but redundancy does most of the work.
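The redundancy idea above can be sketched as a client that retries across replica nodes, so a single node going down (a crash, or a rolling-upgrade patch window) is invisible to the caller. The node functions and exception name here are hypothetical stand-ins:

```python
class NodeDown(Exception):
    """Raised when a replica cannot serve the request."""

def serve(request, nodes):
    """Try each redundant node in turn; fail only if every replica is down."""
    for node in nodes:
        try:
            return node(request)
        except NodeDown:
            continue  # this replica is down (or being patched); try the next
    raise NodeDown("all replicas unavailable")

def healthy(request):
    return f"handled {request}"

def crashed(request):
    raise NodeDown

# During a rolling upgrade, one node at a time is taken offline for patching;
# requests still succeed because a redundant replica picks them up.
print(serve("req-42", [crashed, healthy]))  # -> "handled req-42"
```

Real systems layer health checks, load balancing, and backoff on top of this, but the core contract is the same: the caller sees one logical service, not individual machines.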
🧩 Software Errors
— notes in progress —
Scalability
— notes in progress —
Maintainability
— notes in progress —