Tree structure storage of a large number of file name records

  file store, nas, tree structure

Over the past ten years, there have been as many as 1 billion directories and files in NAS, and many challenges have been encountered in the process of designing and developing backup systems. This article will share a large number of tree structure storage practices of file name records.

I introduction

Since it is a regular backup, there will definitely be more than one backup. For a specific directory, each backup should be compared with the last backup in order to find out which files have been deleted and which files have been added. This requires that all file names under the directory should be saved during each backup. Our first thought is to splice all file names with specific characters and save them. Because we use MySQL to store this information, when there are many files in the directory, this splicing method is likely to exceed MySQL’s Blob length limit. According to experience, when there are a large number of files in a directory, the names of these files are often generated by programs with certain regularity, and the beginning is usually repeated, so we thought of using a tree structure for storage.

For example, a tree corresponding to a directory with abc, abc1, ad and CDE files is shown in Figure 1.

Fig. 1 is an example of a tree structure

In fig. 1, r represents the root node, cyan node is called the end node, and the path from r to each end node represents a file name. You can search the tree for a file name, traverse all file names in the tree, save the tree serialization, and deserialize the serialization result to regenerate the tree.

Second, the data structure involved

Note: We use java to write, and the knowledge points related to language features in this article refer to java.

2.1 structure of node

Each Node, including the root node, is represented by a node class. The code is as follows:

    class Node {
        private char value;
        private Node[]children = new Node[0];
        private byte end = 0;
    }

Field description:

  • Value: The character represented by this Node. When node represents the root node, value has no value.
  • Children: All child nodes of this node are initialized to an array of length 0.
  • End: Marks whether the node is an end node. 0 is not; One is. The leaf node must be the ending node. Default non-ending node.

Operation of 2.2 Node

    public Node(char v);
    public Node findChild(char v);
    public Node addChild(char v);

Operating instructions:

  • Node: construction method. Value.
  • FindChild: Finds whether children contain children with a value of V in children. If yes, child nodes are returned; if no, null is returned.
  • AddChild: First, find out whether children already contain child nodes with value V in children, and if so, directly return the found child nodes; Otherwise, create a node with value v, extend the length of children by 1, take the newly created node as the last element of children, and return the newly created node.

2.3 Tree structure

    class Tree {
        public Node root = new Node();
    }

Field description: Tree contains only root Node. As mentioned earlier, root has no value and end is 0. The initial children length is 0.

2.4 Tree operations

    public void addName(String name) ;
    public boolean contain(String name);
    public Found next(Found found);
    public void writeTo(OutputStream out);
    public static Tree readFrom(InputStream in);

Operating instructions:

  • AddName: adds a new file name to the tree, namely parameter name. Starting from root, each character in name is used as a parameter to call addChild, and the return value is used as a new starting point until all characters in name are added, and the return value of the last call to addChild is marked as the end node.
  • Contact: whether the query tree contain a file name.
  • Next: To traverse all the file names contained in the tree, we have introduced a new class Found to enable the traversal to proceed smoothly. Details will be described later.
  • WriteTo: writes the tree to an output stream for persistence.
  • ReadFrom: This method is static. Rebuild the tree from an input stream.

III. Tree Construction

Call the addName method on the newly-built Tree, add all file names to the tree, and the tree construction is completed. Still taking the directory containing abc, abc1, ad and cde as an example, the tree construction is illustrated.

Figure 2 Tree Construction Process

In fig. 2, the orange node indicates that the addChild method needs to be called on the node to addChild nodes, and the return value of addchild is used as the new orange node. Until no child nodes need to be added, the last orange node is marked as the ending node.

Four, the tree query

Find out whether the Tree contains a file name corresponding to the tree’s contain method. Look up ef, ab and abc files on the results in Figure 2 to demonstrate the search process. As shown in fig. 3.

Fig. 3 tree query diagram

In fig. 3, the orange node indicates that the findChild method needs to be called on the node to find the child node.

Five, the traversal of the tree

The traversal here is different from the traversal of general trees. The general traversal is to traverse the nodes in the tree, while the traversal here is to traverse the path from the root node to all end nodes.

We traverse from left to right, from shallow to deep. We introduce the Found class and traverse it as a parameter of the next method.

5.1 structure of found

    class Found {    
        private String name;
        private int[] idx ;
    }

In order to explain the problem more easily, a small modification was made on the basis of fig. 1. subscripts were added to the lower right corner of each node, as shown in fig. 4.

Fig. 4 Tree with lower mark

For the name abc, the name value in Found is “abc” and idx is {0,0,0}.

For the file name abc1, the name value in Found is “abc1” and idx is {0,0,0,0}.

For the file name ad, the name value in Found is “ad” and idx is {0,1}.

For the file name cde, the name value in Found is “cde” and idx is {1,0,0}.

5.2 How to Traverse

For fig. 4, the first call to the next method should pass null, and the first result, Found;, represented by abc, will be returned. Continue to use this Found as a parameter for the second next call, and the second result, found represented by abc1, will be returned. Then continue to use this Found as a parameter for the third next call, and the third result, found represented by ad, will be returned. Continue to use this Found as a parameter for the fourth next call, and the fourth result, found represented by cde, will be returned. Then continue to use this Found as a parameter for the fifth call, and return null, ending the traversal.

Six, serialization and deserialization

6.1 serialization

First, it should be clear that each node should contain 3 pieces of information after serialization: the value of the node, the number of children of the node, and whether the node is the ending node.

6.1.1 value of Node

Although the value of the nodes in the previous examples are all English characters, in fact the file name may contain Chinese characters or characters in other languages. For convenience of processing, we did not use variable length coding. Instead, unicode codes are used directly. Byte sequence adopts big-end coding.

6.1.2 Number of children in Nodes

Since the value of the node uses unicode codes, the number of children cannot exceed the number of characters unicode can represent, that is 65536. The number of children uses 2 bytes. Byte sequence is also coded with large end.

End of node 6.1.3

0 or 1 can be represented by 1bit (1 bit), but the smallest unit in java is bytes. If one byte is used to represent end, it is a waste of space. in fact, it is extremely unlikely that the number of children in any node reaches 65536/2. therefore, we consider using the highest bit of children number to represent end.

To sum up, a node takes up 4 bytes after serialization, taking the root node in fig. 4, the node with value b and the node with value e as examples:

Table 1 example of node serialization

Unicode of value Number of children end Number of children /(end<<15) Final result
root node 0x0000 2 0 0x0002 0x00020000
Node b 0x0062 1 0 0x0001 0x00010062
Node e 0x0065 0 1 0x8000 0x80000065

6.1.4 Serialization of Trees

To traverse the tree in breadth, queues are needed in the traversal process. take the serialization of fig. 4 as an example to illustrate:

Figure 5 Serializes Figure 4

6.2 Deserialization

Deserialization is an inverse process of serialization and will not be described for reasons of space. It is worth mentioning that the deserialization process also requires the assistance of queues.

VII. Discussions

7.1 About Space Saving

For convenience of discussion, assume that the file name in the directory is a full arrangement of 10 Arabic numerals. When the number of digits is 1, the directory contains 10 files, namely 0, 1, 2 … 8, 9. When the number of digits is 2, the directory contains 100 files, namely 00, 01, 02 … 97, 98, 99, and so on.

Compare the two methods, one is separated by “/”,the other is the method introduced in this paper.

Table 2 Comparison of Storage Space between Two Methods (Unit: Bytes)

Number of digits method 1 2 3 4 5 6
“/”separation 19 299 3999 49999 599999 6999999
Tree 44 444 4444 44444 444444 4444444

As can be seen from Table 2, when the number of digits is 4, the Tree method is used to start saving space. The more digits, the higher the saving ratio. This is exactly what we need.

In the table, when “/”is used to separate, the number of bytes occupied is calculated according to utf8 code. If unicode is directly used for storage, the occupied space will be doubled, and the space will be saved when the number of bits is 2. Also use “/”to separate, it seems utf8 saves more space than unicode, but in fact, the file name sometimes contains Chinese characters, and the utf8 encoding of Chinese characters takes up 3 bytes.

7.2 Regarding Time

In the process of tree construction, serialization and deserialization, additional operations are introduced. According to our practice, user CPU has not changed significantly.

7.3 On Idealized Assumptions

At first, we used “/”separation method to store the file name, and the corresponding field type of the database is BLOB (the maximum value Blob(Blob is 65K). At the testing stage, it was found that exceeding 65K was a very common thing. Under the circumstance that it is impossible to count the size of all file names spliced in the largest directory in advance, we adopt two methods, one is to use LongBlob type, the other is to minimize the size of the spliced result, which is the method introduced in this article.

Even if a tree structure is used to store file names, there is no guarantee that the final result will not exceed 4G (the maximum value of Longlob type). At least there is no problem in our practice process. If this happens, only special treatment can be done.

7.4 Regarding other compression methods

After the file names are spliced with “/”,compression algorithms such as gzip are used to compress the spliced results and then store them, which will achieve better results in saving storage space. However, before compression, the splicing results exist in memory, which has higher requirements on JVM’s heap memory. In addition, when using “/”splicing, searching will be more troublesome.

Author: Niu Ningchang

Source: Yixin Institute of Technology