Order
This article mainly studies SWIM Protocol
SWIM Protocol
The full name of SWIM is scalable, vulnerable-consistent, infection-style, processes group membership protocol.
heartbeats
In traditional membership protocols such as heartbeats, each node periodically sends heartbeat to all other nodes in the network to indicate that it is alive. If the peer exceeds the specified interval and does not receive a node’s HeartBeat, the node is considered dead. This method is suitable for small networks. The number of heartbeart it sends is O (n 2), which will cause huge network burden when there are thousands of node in the network. SWIM uses infection-style disablement to solve this problem.
tasks
Compared with traditional heartbeats, SWIM divides the whole process into two task: Failure Detection and membership update disablement.
Completeness and Accuracy
There are several measures for failure detection:
- Completeness
Is every failed node finally detected?
- Speed of failure detection
The average time taken by a node from failed to detected failed
- Accuracy
False positive rate is the probability that a node is misjudged as failed.
- Message Load
What is the network load of each node in the test and whether it is evenly distributed
Unreliable Failure Detectors for Reliable Distributed SystemsAn article pointed out that for asynchronous networks, 100% of Completeness and Accuracy cannot be guaranteed at the same time, so SWIM chose Completeness under trade-off and reduced false positive rate as much as possible to improve Accuracy.
Failure Detection
SWIM’s failure detection process is divided into two parts, one is direct ping and the other is indirect ping.
- direct ping
Local node randomly selects n node from alienodes to detect;; If some node in direct ping do not return ack within timeout time, indirectping will be performed.
- indirect ping
Local node randomly selects k nodes from alive nodes to perform indetect ping on the direct ping target node. The k nodes will give the results forwards to the LocaNode. Finally, LocaNode checks if none of the k nodes returns ack. The target node is marked as FAILED, and then the failed information of the node is propagated to other nodes in the network through membership update disablement.
Membership update Dissemination
Messages can be divided into two categories: JOINED and FAILED:
- JOINED
When a node joins the network, it needs to notify other nodes to update the local membership to add the node.
- FAILED
When a node is detected as failed, other nodes need to be notified to update the local membership to remove the node.
This process can be implemented using multicast
Improvement
- Infection-Style Dissemination
Multicast’s DISCUSSION is unreliable and inefficient. A more robust version of SWIM uses the Infection-Style method to conduct DISCUSSION, that is, using the ping mechanism of Failure Detection, piggyback of the message requiring DISCUSSION is used on the ping/ack to implement gossip-like message propagation, thus reducing the additional overhead of separate information transmission.
- Suspicion Mechanism
In order to better reduce false positive rate to improve Accuracy, Suspicion Mechanism can be introduced, i.e. when local node detects the node failed, it is marked as suspected;; Node marked suspected is considered alive; before it is finally confirmed as failed; If other nodes detect that the node is alive, they cancel suspected for the node and resume Alive; If the node is not restored to alive at the specified time, it is marked as failed
- Time bound failure detection
Random selection of node for ping may cause certain delay. round robin can be used instead of random selection. when all nodes have been selected, shuffle the node list again
Summary
- The full name of SWIM is scalable, vulnerable-consistent, infection-style, processsgroup membership protocol; Compared with traditional heartbeats, SWIM divides the whole process into two task: Failure Detection and membership update disablement.
- SWIM’s failure detection process is divided into two parts, one is direct ping and the other is indirect ping.
- Infection-Style is used to carry out the disablement, i.e. the piggyback of the message requiring disablement is used on the ping/ACK by using the ping mechanism of Failure Detection, so as to realize the message propagation similar to gossip, thus reducing the additional overhead of separate information transmission.