The introduction of the introduction
The tale
Let’s imagine you are a general of a huge army, and you, along with your colleague, who is the general of another massive army, are besieging a big city. The city has invincible defenses, so it can be captured if and only if it has been besieged long enough that one of the generals (either you or your colleague) decides, based on their own observations, that it’s about the right time to attack. And the only way that attack can succeed is if it is coordinated properly, so both armies must attack at almost exactly the same time.
Now let’s assume that you and your army are located on the East side of the city and your colleague is located on the West side. After a few days you decide that it looks like about the right time to attack. How would you let the general on the West side know that you want to kick off the coordinated attack? You could send a messenger, right? Great thought! Except for the fact that you are at war: the messenger could be killed or captured, so the means of transmission between you and the other general is completely unreliable.
Imagine that you send the messenger telling the West general to prepare for the attack next morning, and the messenger is killed or captured en route. Poor you, living in a fantasy land, assuming your messenger is probably drinking the finest Latte macchiato there is with the West general, who is fully prepared for the coordinated attack; you launch your attack next morning, and a terrible fate awaits you! So you think: “I must devise a communication protocol with the West general.” Good for you! You decide to be smart and not assume a reliable network, so you tell the West general:
It’s the right time to attack as far as I can see from my side. Please prepare for the attack; I’ll attack next morning if and only if you send me a confirmation that you received this message.
That solves the issue for you! Great! But now, imagine you are the one who receives the messenger first, because your colleague observed from his side that it’s about the right time to attack.
The one million dollar question (not really):
Will you attack?
So, following the protocol, you send a confirmation from your side that you successfully received the command to initiate the coordinated attack.
How would you know that your confirmation reached your colleague?
Because according to the protocol he is going to attack if and only if he receives the confirmation from you! But what if the messenger you sent is killed or captured? How would you know that your colleague received your confirmation of the attack?
Perhaps you need a confirmation of the confirmation. And that’s it! Now the protocol is solid, right?
You can see where that’s going right?
We will end up with the confirmation of the confirmation of the Nth confirmation, out to infinity. It’s an impossible problem, so I am not going to torture you, dear careful reader, any further by making you think about an impossible problem.
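To make the futility concrete, here is a tiny toy simulation (my own sketch, not part of the tale; the 30% loss probability and the trial counts are completely made up) in which we only count a run as a success when every messenger in the chain, the attack order plus all the confirmations, actually arrives. Notice that adding more rounds of confirmation never gets you to certainty; it only adds more messengers that can be lost.

```python
import random

# Toy Monte Carlo model of the Two Generals Problem (invented numbers, purely
# for illustration): every messenger crossing enemy lines is lost with
# probability LOSS_PROBABILITY, and a run counts as a "success" only if every
# message in the chain (the order plus all confirmations) arrived.
LOSS_PROBABILITY = 0.3

def chain_survives(num_confirmations: int) -> bool:
    """One attempt: the attack order followed by N rounds of confirmations.

    Returns True only if every messenger got through; a single loss leaves
    one general unsure whether the other will really attack.
    """
    messages_to_deliver = 1 + num_confirmations
    return all(random.random() >= LOSS_PROBABILITY
               for _ in range(messages_to_deliver))

if __name__ == "__main__":
    random.seed(42)
    trials = 10_000
    for confirmations in (0, 1, 2, 5, 10):
        ok = sum(chain_survives(confirmations) for _ in range(trials))
        print(f"{confirmations:>2} confirmation round(s): "
              f"full chain delivered in {ok / trials:.1%} of {trials} runs")
```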
The Two Generals Problem is a perfect depiction of the challenges that distributed systems bring!
So, when working with distributed systems you have to be really careful about what kind of assumptions you make and what kind of things you take for granted. Here are examples of assumptions that people normally make (often called the fallacies of distributed computing) which mostly don’t hold in a distributed architecture:
The Network is reliable
Latency is zero
Bandwidth is infinite
The Network is secure
The Topology doesn’t change
The Network is homogeneous
Instead, the reality is that in a distributed system the network is not reliable; network failures and packet loss can occur. Latency is never zero; communication between nodes takes time. Bandwidth is very much finite; there's a limit to the data that can be transmitted. The network is not secure; data can be intercepted or tampered with. The topology does change and will keep changing; nodes can join or leave the network dynamically. Finally, the network is not homogeneous; nodes may have different capabilities and speeds.
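As a small taste of what “the network is not reliable and latency is never zero” means in practice, here is a minimal sketch (my own illustration; the URL, timeout, and retry counts are made-up values) of a remote call wrapped with a timeout and exponential-backoff retries instead of being naively assumed to succeed instantly:

```python
import time
import urllib.error
import urllib.request

# A hypothetical internal endpoint; the URL is invented, the pattern is the point.
PROFILE_URL = "http://user-service.internal/profile/42"

def fetch_profile(retries: int = 3, timeout_s: float = 2.0) -> bytes:
    """Call a remote service without assuming the network is reliable or fast."""
    delay = 0.5
    for attempt in range(1, retries + 1):
        try:
            # Bound the wait: the response may be slow, or may never come at all.
            with urllib.request.urlopen(PROFILE_URL, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries:
                raise  # give up; let the caller decide what to do next
            time.sleep(delay)  # back off so we don't hammer a struggling network
            delay *= 2
    raise AssertionError("unreachable")

if __name__ == "__main__":
    try:
        print(len(fetch_profile()), "bytes received")
    except Exception as exc:
        print(f"profile service unavailable: {exc}")
```

Even this tiny pattern already reflects several of the realities above: a timeout because latency is never zero, retries because the network is not reliable, and backoff because bandwidth is finite.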
What is a distributed system
A definition
There are many definitions of distributed systems out there. From my personal perspective, a distributed system is any system that consists of multiple components that are inter-connected and decentralized (that is, there is no single authority that constantly tells the individual components how to work and what to do and manages their actions); instead, the individual components of a distributed system work in a kind of cooperative way to achieve a common goal or solve a certain problem in some domain.
With that said, the degree of decentralization varies between systems. There are fully decentralized distributed systems, like a blockchain, and semi-centralized or somewhat centralized distributed systems, where the protocol requires a single coordinator or master node that dictates how the protocol should proceed, or where the coordinator or master node even has exclusive access to data modifications and acts as the central authority in the protocol.
Examples of Distributed Systems
As we mentioned earlier, examples of distributed systems come in many different forms, showcasing their versatile applications and architectural designs.
A blockchain network is a fully decentralized system. It consists of interconnected ledgers distributed across nodes, which ensures transparency, security, and immutability of transactions without relying on a central authority, all of that operating within an inherently unreliable and open network.
A database management system in and of itself operates as a distributed system: nodes or replicas collaborate within the protocol to achieve a common goal. The level of decentralization depends on the specific protocol used, allowing for data redundancy, fault tolerance, and scalable data processing.
Another example is found in microservices software systems. In this setup, individual microservices interconnect while maintaining a significant degree of autonomy. This architecture fosters agility, scalability, and fault isolation, making it a popular choice for building complex applications.
Have you ever played an online multiplayer game like an MMORPG or a Battle Royale title? Perhaps you have played Call of Duty or PUBG.
Recollect those moments when you were certain you had landed the first shot, only to find yourself defeated? Those moments are a profound example of a distributed system in action. The interaction between players across diverse geographical locations introduces network latency, affecting the synchronization between actions and outcomes. So, next time frustration ensues, remember: it's not the player's fault, but rather the intricate mechanics of the distributed system at play (don't hate the player, hate the game! See what I did there? 😉)
Challenges of Distributed Systems
What is so hard and challenging about distributed systems
I don’t know about you, but the thought of `99.999%`, the five-nines availability, is enticing to me as an engineer. Imagining that I, a Software Engineer who knows about Thermodynamics and Entropy, could build a distributed system that is unavailable for only about 5 minutes a year is absolutely fascinating 🤯. Anything that can improve the fault tolerance, availability, and scalability of the software we build is very interesting to us, which is exactly what distributed systems bring. But that does not come for free, as you might imagine. Once the data is spread over multiple nodes/replicas/containers/instances etc., once you have many moving parts that need to be kept in sync, secured, fault tolerant, monitored properly, and made to utilize resources optimally, distributed systems introduce a whole host of issues and challenges, for example:
Unreliable Network: Traditional systems assume that messages sent between components will reliably reach their destination. In a distributed system, however, network failures, delays, and message losses are to be expected, so protocols for acknowledgment, retransmission, and fault tolerance must be implemented.
Non-Zero Latency: In a single-machine setting, communication is virtually instantaneous. In distributed systems, data transmission between nodes introduces latency due to physical distance and network traffic; that delay is always present, so minimizing and managing it becomes vital, and design choices need to account for it, particularly in real-time applications.
Finite Bandwidth: While local systems may enjoy high data transfer rates, distributed systems encounter bandwidth limitations. Network resources are limited, so sharing them demands efficient data transfer techniques such as data compression, streaming, prioritization, and smart distribution strategies to optimize bandwidth usage.
Insecure Network: Centralized systems can be more tightly controlled and secured. Distributed systems spread data across nodes and the network between them, exposing it to potential security breaches; robust encryption, authentication and authorization mechanisms, and intrusion detection are essential to safeguard confidentiality and integrity.
Changing Topology: Unlike static setups, distributed systems often feature dynamic and evolving network topologies; nodes joining and leaving the network dynamically is the norm. Adaptability becomes imperative to maintain seamless functionality, and distributed algorithms such as consensus protocols and failure detectors are employed to maintain coherence in such situations (a minimal failure-detector sketch follows this list).
Heterogeneous Network: In many cases, individual nodes in a distributed system differ in processing power, memory, or capabilities. This heterogeneity requires load balancing and resource allocation strategies, along with adaptive algorithms that account for differing performance levels.
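To give just one of those challenges a concrete shape, here is a minimal, single-process sketch of a heartbeat-based failure detector, one common way to notice that the topology has changed. This is my own toy illustration; the node names, the timeout, and the sleep are invented.

```python
import time

# Toy heartbeat-based failure detector: each node periodically reports a
# heartbeat, and any node that has been silent for longer than a timeout is
# suspected to have left the topology. Names and timings are invented.
SUSPECT_AFTER_S = 3.0

class FailureDetector:
    def __init__(self) -> None:
        self.last_heartbeat: dict = {}  # node name -> last heartbeat time

    def heartbeat(self, node: str) -> None:
        # A node joining, re-joining, or simply still alive refreshes its timestamp.
        self.last_heartbeat[node] = time.monotonic()

    def alive_nodes(self) -> set:
        now = time.monotonic()
        return {n for n, t in self.last_heartbeat.items()
                if now - t <= SUSPECT_AFTER_S}

    def suspected_nodes(self) -> set:
        return set(self.last_heartbeat) - self.alive_nodes()

if __name__ == "__main__":
    detector = FailureDetector()
    detector.heartbeat("node-1")
    detector.heartbeat("node-2")
    time.sleep(4)                 # node-2 goes quiet...
    detector.heartbeat("node-1")  # ...while node-1 keeps reporting
    print("alive:", detector.alive_nodes())          # {'node-1'}
    print("suspected:", detector.suspected_nodes())  # {'node-2'}
```

In a real system each node would run such a detector (or report to one) over the network, and the suspicion threshold would have to account for the very latency and unreliability described above.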
And the more decentralized the system is and the more moving components you have, the more challenging it is to manage and maintain the distributed system. It’s like becoming a parent of only one kid VS many, many kids. Instead of monitoring one kid, imagine having to monitor 5 of them. It’s an unbelievably challenging task, isn’t it? Terrified enough? No? OK, now imagine those kids can somehow replicate themselves randomly, so out of nowhere, instead of one cute little Mikey, all of a sudden you have 15 instances of Mikey, 5 of them at 100% CPU utilization, screaming, crying, pooping, suddenly getting sick, etc.
Basically, multiply the challenges of securing, maintaining, deploying, monitoring, scaling, testing and verifying one monolithic system by the number of interconnected components you have, by a random factor, by the number of constraints the protocol has and how often things change, and so on.
CAP and PACELC theorems
Where it all started
Distributed systems have inherent limitations: the more constraints you have in a distributed system, the more limitations you end up with. That sounds like a contradictory statement. The CAP theorem is a prime example of that fact.
The CAP theorem was proposed by the computer scientist Eric Brewer in 2000. It shows the challenges faced when building such systems while aiming for Consistency, Availability, and Partition tolerance, which are the three pillars that distributed systems are built upon. The funny part is that you can only have 2 pillars, not the third: when you try to combine all 3 of them, one of them will fail you, and on top of that there is absolutely nothing you can do about it! In other words, given a distributed system you are building, you can only guarantee 2 out of the 3 (CA, CP, AP): consistency and availability, consistency and partition tolerance, or availability and partition tolerance, which urges architects to carefully weigh their priorities based on the system’s requirements and operational context.
Let’s take an example:
So we have this simple distributed system that consists of 3 nodes acting as a basic storage medium. A user connects to Node 1 and sends a write request, `Write X = 1`, which Node 1 properly propagates to the other nodes it is connected to. If another user afterwards sends a read request, for example `Read X`, to Node 3, that user will receive `1` as the value, which means the system satisfies the consistency requirement, and the availability requirement too, since the system is available.
Now let’s imagine that the link between Node 1 and Node 3 is broken for some reason, for example a network failure (a network partition), and another user wants to update the value of X to 2.
Node 1 will receive the request, update its local store to set `X = 2`, and then propagate the update to Node 2. At some point during that link breakage between Node 3 and the rest of the system, another user connects, and unfortunately for us they connect to Node 3, and on top of that they send a `Read X` request to Node 3. How unlucky, right?
So, the million dollar question (not really):
What should the system do in such a case?
I’ll give you 5 minutes, dear careful reader, a coffee break to think about it.
OK. Time’s up!
A careful reader: Well, we could return `X = 1` to the user, since that’s what Node 3 saw last.
Good thought! Except for the fact that by doing that you sacrificed the C pillar, Consistency. The system is not consistent anymore.
A careful reader: We could block the request until Node 3’s connection to the rest of the system is back online again?
Seems like a decent decision, but in this case we are sacrificing the A pillar, Availability.
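To make those two answers concrete, here is a deliberately simplified toy model of the three-node scenario above (my own sketch, nothing like how a real replicated database works): the same read against the partitioned Node 3 can either return the stale value or refuse to answer, and each choice gives up a different pillar.

```python
class Node:
    """A toy replica storing a single value, X."""
    def __init__(self, name):
        self.name = name
        self.x = None
        self.partitioned = False  # True = cut off from the other replicas

class ToyCluster:
    def __init__(self):
        self.nodes = [Node("Node 1"), Node("Node 2"), Node("Node 3")]

    def write_x(self, value):
        # The write arrives at Node 1, which propagates it to every replica it
        # can still reach; a partitioned replica silently misses the update.
        for node in self.nodes:
            if not node.partitioned:
                node.x = value

    def read_x(self, node_index, mode):
        node = self.nodes[node_index]
        if node.partitioned and mode == "consistent":
            # CP-style choice: refuse to answer rather than risk a stale value.
            raise RuntimeError(f"{node.name} cannot confirm the latest X")
        # AP-style choice: answer with whatever this replica has, stale or not.
        return node.x

if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.write_x(1)                   # Write X = 1 reaches everyone
    cluster.nodes[2].partitioned = True  # the link to Node 3 breaks
    cluster.write_x(2)                   # Write X = 2 never reaches Node 3

    print(cluster.read_x(2, mode="available"))  # -> 1, a stale read: Consistency is lost
    try:
        cluster.read_x(2, mode="consistent")
    except RuntimeError as err:
        print(err)                              # no answer at all: Availability is lost
```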
That’s where the CAP theorem comes into play.
We cannot guarantee all 3 of the pillars, and not just that: the CAP theorem states that no matter what kind of advanced algorithms you use or what optimization techniques you utilize, you just cannot guarantee all 3 of them.
A careful reader: Well, that’s just sad. And cruel 😓
But no worries, we can always “relax” the constraints or weaken the guarantees that our system provides. Instead of hating the game, we could play it, right? For example, we could adopt an “Eventual Consistency” model in our system, we could implement retries, we could implement strategies for achieving consensus to recover from a network partition… etc.
There is also an “extension” to the CAP theorem, the PACELC theorem, which looks deeper into the distributed systems world and adds one more trade-off to consider: Latency versus Consistency. It basically reads “if Partition, then Availability or Consistency; Else, Latency or Consistency”. So what the PACELC theorem is saying is: even when there is no network partition (the “Else” part), you still have to consider a trade-off between Latency and Consistency; you cannot max out on both! (The interpretation is more detailed than that, but that’s good enough for the sake of simplicity.)
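Here is a contrived little sketch of that “Else” part (my own illustration; the replica count and the simulated network delay are invented numbers): a write that waits for every replica to acknowledge pays the full replication latency, while a write that returns immediately and replicates in the background is fast but can briefly serve a stale read.

```python
import threading
import time

REPLICATION_DELAY_S = 0.05  # pretend network round trip; an invented number

class Replica:
    def __init__(self):
        self.x = 0

    def apply(self, value):
        time.sleep(REPLICATION_DELAY_S)  # simulate the network hop
        self.x = value

def write_consistent(replicas, value):
    # Wait for every replica to acknowledge: no stale reads, higher latency.
    for replica in replicas:
        replica.apply(value)

def write_low_latency(replicas, value):
    # Return immediately and replicate in the background: low latency,
    # but a read on a replica may briefly see the old value.
    for replica in replicas:
        threading.Thread(target=replica.apply, args=(value,)).start()

if __name__ == "__main__":
    replicas = [Replica(), Replica(), Replica()]

    start = time.perf_counter()
    write_consistent(replicas, 1)
    print(f"consistent write took   {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    write_low_latency(replicas, 2)
    print(f"low-latency write took  {time.perf_counter() - start:.3f}s")
    print("read right after the fast write:", replicas[0].x)  # likely still 1
```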
But again, don’t hate the player; rather than hating the game, play it well after learning its rules and intricacies!
The CAP and PACELC theorems, originating as conceptual frameworks, have evolved into guiding principles helping engineers navigate the intricate maze of distributed systems. These theorems remind us that achieving perfection across all dimensions is unfeasible and that distributed system design is fundamentally about making strategic compromises to align with the system's intended goals and operational realities. As we delve deeper into the intricacies of distributed systems these theorems remain at the core, serving as beacons of wisdom to illuminate the path toward robust and efficient distributed system architecture.