How to Talk about partition

  architecture

Order

This article mainly talks about the partition method of open source mainstream products.

partition

Generally speaking, the busyness of the database is reflected in the fact that different users need to access different parts of the data set. In this case, we store each part of the data in different servers/nodes, and each server/node is responsible for reading and writing its own data, so as to realize lateral expansion. This technology is called sharding.

Ideally, different nodes serve different users, each user only needs to communicate with one node, and the server response can be obtained quickly. Of course, the ideal situation is relatively rare. In order to obtain an almost ideal effect, it is necessary to ensure that all the data that need to be accessed at the same time are stored on the same node, and the nodes must arrange these data blocks so as to optimize the access speed.

Fragmentation can greatly improve reading performance, but it is not helpful for applications that need frequent writing. In addition, fragmentation does not help to improve the ability of fault recovery, but it reduces the scope of fault. Only those users who access this node will be affected, and the rest users can access it normally. Although some of the data are missing, the rest can still function normally.

Problem point

1. How to fragment/route

How to store data can ensure that users basically only need to obtain it from one node. If you are using an aggregation-oriented database instead of a tuple-oriented database, it is very easy to solve. Aggregation is designed to store data that often need to be accessed at the same time. Therefore, aggregation can be taken as a unit of distributed data.

Another thing to consider is how to keep the load balanced. That is, how to distribute the aggregated data evenly in each node so that the load they need to process is equal. The load distribution may change with time, so some domain-specific rules are required. For example, some need to be in dictionary order, others need to be in reverse domain name order, etc.
Many NoSQL provide auto-sharding function, which allows the database to distribute the data to each slice and direct the data access request to the appropriate slice.

2. How to rebalance

In the case of dynamic increase or decrease of machines or partition, if rebalance is performed again, the data will be distributed evenly or hot spot access will be avoided.

Fragmentation method

There are two main categories here, one is hash fragmentation (hash based partitionning), one is range slicing (range based partitioning)

1. Hash fragmentation (hash based partitionning)

Data fragmentation is carried out through hash functions, mainly including Round Robbin, Virtual Bucket and Consistent Hash.

A、Round Robbin
Commonly known as hash modulization algorithm, H(key) = hash(key) mode K (where the physical machine is numbered from 0 to K-1, key is the primary key of a record, and H(key) is the physical machine number storing the data). The advantage is simple, but the disadvantage is that adding or subtracting machines needs to be hash again and lacks flexibility. In fact, it combines the two function points of physical machine and data slicing into one, so it lacks flexibility.
B. virtual bucket
Membase introduces a virtual bucket between the record to be stored and the physical machine to form a two-level mapping. The key-partition mapping is implemented by hash function and partition-machine is implemented by table management. When a new machine is added, only some original virtual buckets need to be divided into new machines, and only the partition-machine mapping needs to be modified, thus having flexibility.
C. consistent hash
Consistent hash is an implementation algorithm of distributed hash table. Hash value space is formed into an end-to-end annular sequence according to size. For each machine, it can be mapped into hash value space through hash function according to IP and port number. Through directed ring sequential lookup or Finger Table. Virtual nodes can be used to solve the load imbalance of each node caused by consistent hashing. A physical machine node is virtualized into a plurality of virtual nodes and mapped to different positions of a ring structure.

Hash fragmentation has the advantage of evenly distributing data, but it may cause disorder of data and inconvenience of range.

Mongo2.4 version 2.4+supports hash partition

2. Range fragmentation (range based partitioning)

This is distributed according to the key order, for example, the dictionary is divided by 24 initials. this has the advantage of being convenient for range, but it is easy to cause uneven data distribution and hot spot access problems (For example, individual nodes have a large amount of data visits/queries/calculations, resulting in a particularly high load.)。

Bigtable, hbase and mongo before version 2.4 all used this method.

Index fragmentation policy (secondary indexes)

In addition to the fragmentation of the data itself, the index also needs fragmentation. Two well-known reverse index partitioning strategies are document-based partitioning and term-based partitioning. Then, based on these two basic strategies, hybrid’s plan is derived.

1.local index(document-based partitioning)

Also known as document-based partitioning. A reverse index of local data is maintained locally on each partition. In this way, the scatter/gather mode is mainly used, that is, each query needs to send a request to all partitions, and then each partition is retrieved according to the local index, and then the results are summarized.

  • Benefits
    Simple and easy to maintain
  • Disadvantages
    It is rather difficult to query, for example, there are n partitions. to check top k, each partition must check top k, which requires a total of n*k documents to be summarized.

Mongo,cassandra,es,solr ES and Solr adopt this scheme.

2.global index(term-based partitioning)

Also known as term-based partitioning, in this way, the index created is not based on partial data of partition, but on all data. However, these global indexes are redistributed to various nodes using range-based partitioning.

  • Benefits
    The reading efficiency is high, because the indexes are orderly. Based on range parititioning, the indexes can be found very quickly. Moreover, these indexes are global and can be located to the position of the document immediately.
  • Disadvantages
    The writing cost is relatively high, and the writing of each document needs to maintain/update the global index. Another disadvantage is the disadvantage of range-partitioning itself, which is easy to cause uneven data distribution, hot spots and affect throughput.

Dynamodb,riak supports this scheme

rebalance

When the node where the partition is located is broken or a new machine is added, the rebalance of the partition is involved at this time. Originally, those who should have requested this node now need to transfer the request to another node. The process is called rebalancing.

Rebalancing’s goals

  • Divide data storage and read-write requests equally to avoid hot spots.
  • During rebalancing, normal reading and writing will not be affected.
  • It should be done as fast as possible and with as little network and IO load as possible.

Rebalance strategy

Direct hash (Modulus fixed)

That is, the direct mapping of key-machine. The disadvantage of this is that adding or subtracting machines requires hash again and lacks flexibility.

Two-level mapping (Partition hashing, fixed number of partition)

In order to improve the flexibility of direct hashing, two levels of mapping are introduced, namely, key-partition and partition-matchine/node, and the association between key and machine/node is decoupled through partition. For newly added machines, only some partitions of the original node need to be divided into new machines, and only the partition-machine mapping needs to be modified, thus having flexibility.

  • Benefits
    Compared with direct hashing, when nodes increase or decrease, only the mapping relationship between partiton-matchine needs to be adjusted, and the client does not need to reroute.
  • Disadvantages
    Fixed partition requires a reasonable number, and the size of each partition needs to be reasonably determined. The equivalent of these fixed numbers of partition is to divide the entire dataset equally. If the data set continues to grow, if the original number of partitions is too small, the size of each partition will continue to increase, resulting in relatively time-consuming node recovery /rebalance. If the original number of partitions is too large and the subsequent growth of the data set is not too large, it may result in too little data for some partitions to achieve the equalization effect.

Elasticsearch adopts this scheme. When creating indexes, the number of shard/partition and the number of replication must be specified.
Couchbase introduced the concept of vBucket, which can be understood as a virtual paritition here.

Dynamic partition

The number of partitions varies dynamically, and dynamic splitting or merging is carried out according to the set partition size threshold.

This is how hbase uses it.

Consistent hash (hash ring)

Consistent hashing is somewhat different from the first two because the algorithm hashes machine/node together and then performs interval matching with the hash value of key to determine which machine/node the key falls on. Distributed cache uses more, for example, redis uses consistent hashing.

The details are as follows:

  • Divide the annular space into a total of 2 32 zones
  • Key and machine are converted into a 32-bit binary number by some hashing algorithm and then fall into the corresponding interval range
  • The clockwise nearest node of each key is the storage node to which the key belongs.
  • Benefits
    When nodes increase or decrease, the mapping of the entire annular space will still keep the clockwise rule of consistent hash, so the attribution of a small part of key will be affected. When a node hangs up, it is equivalent to a cache miss, and it will be cached again the next time it is accessed.
  • Disadvantages
    Using the general hash function, the mapping distribution of machine is very uneven, which may cause hot spots. In this case, virtual nodes are introduced to solve this problem. In other words, the two-level mapping mode, key-vnode, vnode-machine, is used for reference. A machine is mapped into a plurality of vnode, and then distributed on a ring structure, so that vnode can be evenly distributed, and finally the storage of each machine is relatively uniform.

However, the introduction of virtual nodes also has problems. When adding machine, it is equivalent to adding multiple vnode distributed on the ring, but this will cause more range of key to require rebalance.

1. Improve monotonicity (reduce cache migration cost when adding or subtracting nodes through ring algorithm) 2. Improve balance (reduce the problem of uneven cache distribution caused by adding or subtracting nodes as much as possible through virtual nodes)

Summary

Products Partition mode Index fragmentation strategy
redis consistent hash
elasticsearch Fixed partition two-level mapping local index
mongo Prior to version 2.4, range fragmentation was used, and 2.4+ supports hash fragmentation. local index
kafka Fixed partition

The high version of kafka’s key-to-partition mapping supports custom policies. if cluster increases or decreases node, it will not be effective for previously created topic. therefore, it is necessary to call reassign-partitions to redistribute to avoid hot spots.

doc