Louvain Algorithm Principle and Design Implementation

  algorithm, Machine learning

In the massive information flow, the strange technology guide has become an extremely important way for products to attract users and obtain profits by recommending the contents of interest to users through precise algorithms.
This is a series of articles on algorithms, which will share 360 algorithm knowledge and experience accumulated by the algorithm team in practice. Welcome to exchange and share ~

Louvain algorithm is an algorithm based on multi-level sub-optimization Modularity. It has the advantages of fast and accurate performance, good efficiency and effect, and can discover hierarchical community structure. It is considered as one of the best community discovery algorithms.

Modality

Louvain algorithm is a community discovery algorithm based on graph data.
The original paper is:
《Fast unfolding of communities in large networks》。

The optimization goal of Louvain algorithm is to maximize the modularity of the whole data.

The modularity is calculated as follows:
图片描述
Where m is the total number of edges in the graph, k_i represents the sum of the weights of all connected edges pointing to node I, and k_j is the same. A_{i,j} represents the edge weight between nodes i, j.

One thing to be clear is that the concept of modularity was not invented by Louvain algorithm, but Louvain algorithm is only a realization of the goal of modularity in optimization diagram.

Two-step Iterative Design of Louvain Algorithm

At the beginning, each original node is regarded as an independent community, and the weight of connected edges in the community is 0
Step 1
The algorithm scans all nodes in the data, traverses all neighbor nodes of the node for each node, and measures the benefits of modularity brought by adding the node to the community where its neighbor nodes are located. And select the neighbor node that corresponds to the maximum benefit to join its community. This process is repeated to guide each node’s community affiliation is not changing.
Step 2
Fold the communities formed in step 1, fold each community into a single point, and calculate the joint weight between these newly generated “community points” and the sum of the joint weight between all points in the community respectively. Step 1 for the next round.

The biggest advantage of this algorithm is its fast speed. The time complexity of each iteration of step 1 is O(N), n is the number of edges in the input data. The time complexity of step 2 is O(M+N), and m is the number of midpoints in this iteration.

Algorithm implementation

Data structure design

The design of algorithm data structure has two main considerations:

  1. How to efficiently store nodes and their relationships in a graph
  2. How to efficiently scan data and carry out algorithm iteration in the designed data structure.

At present, some open source algorithms mainly store the relationship between nodes through hash table or set structure. There are two main disadvantages:

  1. Maintaining hash or set structure itself requires a lot of memory overhead
  2. During the traversal process, corresponding Hash or Set structures need to be continuously created, destroyed and emptied, especially when traversing neighbor nodes and communities of different nodes.

Moreover, in the process of traversal, the structure’s access to elements is not strictly O(1).

In view of the above considerations, we design a more efficient data structure to store nodes and edges in the graph, avoid using complex data structures, and do not apply for redundant space and space destruction operations during algorithm iteration, as follows:
图片描述

Description of node fields:

  • Count: Number of Nodes in Community
  • Clsid: The representative node ID of the node’s home community
  • Next: the next node belonging to the same temporary community in step 1 iteration
  • Prev: Last Node Belonging to the Same Temporary Community in Step 1 Iteration
  • First: the first node belonging to the same community except the representative node, which is generated when the community is collapsed in step 2.
  • Kin: The sum of interconnection weights between nodes within the stable community
  • Kout: stabilize the outside of the community and point to the sum of the weights of your own community.
  • Clskin: Sum of Interconnection Weights between Nodes in Temporary Community
  • Clstot: to stabilize the sum of all the internal and external connection weights of the community pointing to itself.
  • Eindex: The first pointer of a node’s neighbor list. All left under this list are the node itself

The field on the edge data structure is just as the name implies.

Based on the above structural design, given a given m nodes, the space required for n-trimmed graph is: 60M + 24N.
For example, given 10 million points and 20 million edges of data, the space required is about 10 million60 + 2000000024 = 1080M and the memory environment remains unchanged throughout the iteration.

iterative process

1. Suppose we have 5 points at the beginning, and there is a certain relationship between them (as for any relationship, regardless), as follows:
图片描述
2. Assuming that Node 2 is found after sufficient iteration of Step 1, it should be added to the community in which Node 1 is located (at the beginning, each point is a community and itself is the representative of this community). The new community is represented by Node 1 as follows:
图片描述
At this time, there is no attribution relationship between nodes 3, 4, 5 and with nodes 1, 2.

3. At this time, step 2 should be carried out to fold the new community formed by nodes 1 and 2. The folded community is regarded as a single point and represented by node 1 as follows:
图片描述
At this time, there are 4 nodes (or 4 communities) in the data, one of which contains 2 nodes, while communities 3, 4 and 5 contain only one node, namely themselves.

4. Repeat Step 1 to scan the communities 1, 3, 4, and 5. Assuming that nodes 5, 4, and 3 respectively join the communities where node 1 is located after full iteration, the following is true:

图片描述
5. Step 2 is carried out to fold the newly generated community. The newly folded community is regarded as a single point and is represented by node 1. The structure is as follows:
图片描述
At this time, there is only one community in the whole data, that is, the community represented by node 1.

When step 1 is carried out again, the community affiliation of any node will not change, and step 2 need not be carried out again at this time, thus the iteration ends.

Code Implementation and Testing

A code implementation based on the above structure design is shown in:
https://github.com/liuzhiqian …

The test is carried out on an actual graph (700,000 points and 2 million sides), and the time required for iteration to complete convergence is 1.77 seconds.
In practice, it is often not necessary to iterate until every point does not change, or how many proportion of nodes in the whole graph exit without changing.

This is the third article in the algorithm series, sharing the principle and design of Louvain algorithm.
This article is from the 360 video information flow algorithm team. We will push an algorithm-related article for everyone every week. Welcome everyone to exchange and learn together.

Related recommendations

One Wave and Two Folds of Depth Residual Network
On Gradient Descent Method /Gradient descent

This article is the original 360 technical content, please indicate the source and keep the two-dimensional code at the end of the article, thank you ~
图片描述