A Look at the SWIM Protocol


Preface

This article takes a look at the SWIM protocol.

SWIM Protocol

SWIM stands for Scalable Weakly-consistent Infection-style process group Membership protocol.

Heartbeats

In traditional membership protocols based on heartbeats, each node periodically sends a heartbeat to every other node in the network to indicate that it is alive. If a peer does not receive a node's heartbeat within the specified interval, it considers that node dead. This approach is suitable for small networks, but the number of heartbeats sent per round is O(n^2), which puts a huge burden on the network once it contains thousands of nodes. SWIM uses infection-style dissemination to solve this problem.
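To make the O(n^2) growth concrete, here is a small sketch that compares message counts per round. The SWIM-side formula is an assumption made for illustration only: it supposes that each node probes one target per round, that a direct ping costs two messages, and that each of k hypothetical helpers costs at most four.

```python
def heartbeat_messages_per_round(n: int) -> int:
    """All-to-all heartbeating: each of n nodes sends to the other n - 1."""
    return n * (n - 1)


def swim_probe_messages_per_round(n: int, k: int = 3) -> int:
    """Rough upper bound if every node probes one target per round.

    Assumptions (not from the article): a direct ping costs 2 messages
    (ping + ack), and an indirect probe costs up to 4 messages per helper
    (request, ping, ack, relayed ack).
    """
    return n * (2 + 4 * k)


for n in (10, 100, 1000):
    print(n, heartbeat_messages_per_round(n), swim_probe_messages_per_round(n))
```

Under these assumptions the probing cost grows linearly with n, while all-to-all heartbeating grows quadratically.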

Tasks

Compared with traditional heartbeats, SWIM splits the whole process into two tasks: failure detection and membership update dissemination.

Completeness and Accuracy

There are several measures for failure detection:

  • Completeness

Is every failed node eventually detected?

  • Speed of failure detection

The average time between a node failing and the failure being detected.

  • Accuracy

The false positive rate, i.e. the probability that a node is misjudged as failed.

  • Message Load

How much network load each node generates for detection, and whether that load is evenly distributed.

The paper Unreliable Failure Detectors for Reliable Distributed Systems points out that, in an asynchronous network, 100% completeness and 100% accuracy cannot both be guaranteed. SWIM therefore chooses to guarantee completeness and reduces the false positive rate as much as possible to improve accuracy.

Failure Detection


SWIM’s failure detection process is divided into two parts: direct ping and indirect ping.

  • direct ping

The local node randomly selects n nodes from the alive nodes to probe. If a probed node does not return an ack within the timeout, an indirect ping is performed for it.

  • indirect ping

The local node randomly selects k nodes from the alive nodes and asks them to ping the direct-ping target on its behalf. The k nodes forward their results to the local node. If none of the k nodes returns an ack, the local node marks the target as FAILED, and the failure is then propagated to the other nodes in the network through membership update dissemination.
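The two-step probe described above can be sketched roughly as follows. This is a minimal single-process simulation, not a real network implementation: the `send` callback, the `probe` function and the node names are all hypothetical, and timeouts are reduced to a boolean "acked in time" result.

```python
import random
from typing import Callable, Iterable, Optional

# Hypothetical transport hook: send(src, dst) is True if dst acked in time.
Send = Callable[[str, str], bool]


def probe(local: str, target: str, alive: Iterable[str], send: Send,
          k: int = 3, rng: Optional[random.Random] = None) -> str:
    """Return "ALIVE" or "FAILED" for target, as seen from local."""
    rng = rng or random.Random()
    # Step 1: direct ping.
    if send(local, target):
        return "ALIVE"
    # Step 2: ask up to k other alive nodes to ping the target for us.
    candidates = [m for m in alive if m not in (local, target)]
    helpers = rng.sample(candidates, min(k, len(candidates)))
    for helper in helpers:
        # The helper must be reachable from us and must itself reach the target.
        if send(local, helper) and send(helper, target):
            return "ALIVE"
    # No ack arrived, directly or indirectly.
    return "FAILED"


nodes = ["a", "b", "c", "d"]
broken_links = {("a", "d")}  # simulated fault: only the a -> d link is down


def send(src: str, dst: str) -> bool:
    return (src, dst) not in broken_links


print(probe("a", "d", nodes, send, k=2))
```

In this simulated run the direct ping from a to d fails, but the helpers can still reach d, so d is not declared failed. This is how the indirect ping guards against a single bad link.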

Membership Update Dissemination

Messages can be divided into two categories: JOINED and FAILED:

  • JOINED

When a node joins the network, the other nodes need to be notified so that they add it to their local membership lists.

  • FAILED

When a node is detected as failed, the other nodes need to be notified so that they remove it from their local membership lists.

This process can be implemented using multicast.
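As a rough illustration of how a receiver might apply these two message types, consider the sketch below. The `Update` type and `apply_update` function are hypothetical names chosen for this example, and the transport (multicast or otherwise) is left out entirely.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Update:
    kind: str  # "JOINED" or "FAILED"
    node: str  # identifier of the node the update is about


def apply_update(members: Set[str], update: Update) -> None:
    """Apply one membership update to the local membership set in place."""
    if update.kind == "JOINED":
        members.add(update.node)
    elif update.kind == "FAILED":
        # discard() is a no-op if the node was already removed.
        members.discard(update.node)
    else:
        raise ValueError(f"unknown update kind: {update.kind!r}")


members = {"a", "b"}
apply_update(members, Update("JOINED", "c"))
apply_update(members, Update("FAILED", "a"))
print(sorted(members))
```

Using a set makes both operations idempotent, so receiving the same update twice leaves the membership unchanged.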

Improvement

  • Infection-Style Dissemination

Multicast-based dissemination is unreliable and inefficient. A more robust version of SWIM uses infection-style dissemination instead: it reuses the ping mechanism of failure detection and piggybacks the messages to be disseminated on the ping/ack messages. This gives gossip-like message propagation and avoids the extra overhead of sending the updates separately.
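A minimal sketch of piggybacking might look like the following. The `Node` class, the message dictionary layout, and the `retransmits` and `max_piggyback` limits are all assumptions made for illustration; they are not taken from the article or from any particular implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Update = Tuple[str, str]  # (kind, node), e.g. ("FAILED", "c")


@dataclass
class Node:
    name: str
    members: Set[str] = field(default_factory=set)
    # update -> how many more times this node will piggyback it
    pending: Dict[Update, int] = field(default_factory=dict)
    seen: Set[Update] = field(default_factory=set)

    def learn(self, update: Update, retransmits: int = 3) -> None:
        """Apply an update once and queue it for further gossip."""
        if update in self.seen:
            return
        self.seen.add(update)
        kind, node = update
        if kind == "JOINED":
            self.members.add(node)
        elif kind == "FAILED":
            self.members.discard(node)
        self.pending[update] = retransmits

    def make_message(self, msg_type: str, max_piggyback: int = 2) -> dict:
        """Build a ping or ack that carries a few pending updates."""
        updates = list(self.pending)[:max_piggyback]
        for update in updates:
            self.pending[update] -= 1
            if self.pending[update] == 0:
                del self.pending[update]
        return {"type": msg_type, "from": self.name, "updates": updates}

    def receive(self, message: dict) -> None:
        for update in message["updates"]:
            self.learn(update)


a = Node("a", {"a", "b", "c"})
b = Node("b", {"a", "b", "c"})
a.learn(("FAILED", "c"))
b.receive(a.make_message("ping"))  # the update rides along on a normal ping
print(sorted(b.members))
```

After the ping, b has removed c from its membership and will itself piggyback the same update on its next few messages, which is what gives the gossip-like spread.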

  • Suspicion Mechanism

To further reduce the false positive rate and improve accuracy, a suspicion mechanism can be introduced. When the local node detects that a node has failed, it first marks that node as suspected. A suspected node is still treated as alive until it is finally confirmed as failed. If other nodes detect that the node is alive, the suspicion is cancelled and the node returns to alive. If the node has not returned to alive within the specified time, it is marked as failed.
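The state transitions described here can be sketched as a small table keyed by node. The class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and details from the SWIM paper such as incarnation numbers are deliberately left out.

```python
from typing import Dict

ALIVE, SUSPECTED, FAILED = "ALIVE", "SUSPECTED", "FAILED"


class SuspicionTable:
    def __init__(self, suspicion_timeout: float) -> None:
        self.timeout = suspicion_timeout
        self.state: Dict[str, str] = {}
        self.suspected_at: Dict[str, float] = {}

    def probe_failed(self, node: str, now: float) -> None:
        """A failed probe only raises suspicion; it does not remove the node."""
        if self.state.get(node, ALIVE) == ALIVE:
            self.state[node] = SUSPECTED
            self.suspected_at[node] = now

    def heard_alive(self, node: str) -> None:
        """Evidence that the node is alive cancels a pending suspicion."""
        if self.state.get(node) != FAILED:
            self.state[node] = ALIVE
            self.suspected_at.pop(node, None)

    def tick(self, now: float) -> None:
        """Confirm as failed every suspicion that has outlived the timeout."""
        for node, since in list(self.suspected_at.items()):
            if now - since >= self.timeout:
                self.state[node] = FAILED
                del self.suspected_at[node]

    def is_member(self, node: str) -> bool:
        # Suspected nodes still count as alive members.
        return self.state.get(node, ALIVE) != FAILED


table = SuspicionTable(suspicion_timeout=5.0)
table.probe_failed("n1", now=0.0)
table.heard_alive("n1")            # suspicion cancelled
table.probe_failed("n2", now=0.0)
table.tick(now=5.0)                # timeout reached, n2 confirmed failed
print(table.state)
```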

  • Time-Bounded Failure Detection

Selecting ping targets purely at random can delay detection, because a given node may go unselected for a long time. Round-robin selection can be used instead of random selection; once all nodes have been selected, the node list is shuffled again.
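One way to sketch this round-robin-with-reshuffle selection is shown below. `RoundRobinSelector` is a hypothetical helper written for this article; it assumes a fixed member list and ignores nodes joining or leaving mid-pass.

```python
import random
from typing import Iterable, Optional


class RoundRobinSelector:
    """Visit every member once per pass, reshuffling between passes."""

    def __init__(self, members: Iterable[str],
                 rng: Optional[random.Random] = None) -> None:
        self.rng = rng or random.Random()
        self.members = list(members)
        self.rng.shuffle(self.members)
        self.index = 0

    def next_target(self) -> str:
        if not self.members:
            raise ValueError("no members to probe")
        if self.index >= len(self.members):
            # Every node has been selected once: shuffle and start over.
            self.rng.shuffle(self.members)
            self.index = 0
        target = self.members[self.index]
        self.index += 1
        return target


selector = RoundRobinSelector(["a", "b", "c", "d"], random.Random(1))
first_pass = [selector.next_target() for _ in range(4)]
second_pass = [selector.next_target() for _ in range(4)]
print(first_pass, second_pass)
```

With n members, any given node is probed again after at most 2n - 1 selections, whereas purely random selection gives no such bound.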

Summary

  • SWIM stands for Scalable Weakly-consistent Infection-style process group Membership protocol. Compared with traditional heartbeats, SWIM splits the whole process into two tasks: failure detection and membership update dissemination.
  • SWIM’s failure detection process is divided into two parts: direct ping and indirect ping.
  • Dissemination is infection-style: the messages to be disseminated are piggybacked on the ping/ack messages of failure detection, which gives gossip-like message propagation and avoids the extra overhead of sending the updates separately.
