In the 8 fallacies of distributed computing a set of problems common to distributed systems are discussed:
- The network is reliable
- Packets get dropped, handshakes don’t complete, etc
- Latency is zero
- Requests take time, and the amount of time depends on geography
- Bandwidth is infinite
- Information sent over the wire has a cost
- The network is secure
- Anything on the network is an attack surface
- Topology doesn’t change
- The arrangement of services changes constantly
- There is one administrator
- Many people may make adminstrative decisions independently
- Transport cost is zero
- There are costs associated with the transport layer
- The network is homogeneous
- Architectures differ, and integration is always challenging
Distributed Systems and Partition Tolerance
CAP Theorem (CAP) addresses some of these problems by splitting architectural responsibilities into three categories and presenting a trilemma. There is an important caveat that the choice between availability/consistency isn’t absolute, but we do need to pick one to prioritise. These concepts apply to any stateful system, but particularly ones that are replicated according to some policy:
- Consistency
- That all operations are treated as if they were atomic (e.g. monotonic reads)
- Availability
- The system’s uptime is guaranteed by some strategy (e.g. replication)
- Partition-tolerance
- The system continues to operate in the event of network isolation (e.g. from another service)
The convention is that partition tolerance is always desirable. The fallacies of distributed systems help justify this decision. It accounts for the fact that the network is not reliable, or that things may change and cause unforeseen issues.
Latency and Consistency
PACELC stands for Partition-tolerance, Availability, Consistency, Else Latency or Consistency. It states that CAP has been misinterpreted to mean that the choice between consistency and availability is absolute. In reality, it is a question of which one to sacrifice when a partition actually occurs. So long as one hasn’t occurred, a system can be both consistent and available. The possibility of a network partition does not in itself enforce this tradeoff.
Eventual consistency provides a way for systems to be both consistent and available. The sacrifice we need to make in the event of a partition is between consistency and latency. An available system is typically one that is replicated across different nodes. Ensuring those nodes are consistent entails latency, as it means that nodes must share information to ensure consistent state.
As an example, in a leaderless replicated database writes are broadcasted, and decisions on those writes may happen by quorum. Broadcasting and decisions by quorum both involve network latency. It would be faster to use a single leader based system with failover, but this comes with its own problems. The second we replicate data, we sacrifice latency for consistency.
PACELC is therefore read as PA/EC (partition tolerance and availability, else choose consistency), PC/EL, and so on. The use of the system determines these architectural decisions.