Java version of HyperLogLog


Preface

For massive data sets, keeping exact counts requires a large amount of memory. Probabilistic data structures sacrifice a small amount of accuracy for much lower memory usage. For example, a HyperLogLog data structure can estimate the cardinality of up to 2^64 distinct elements with an error rate of about 1.625% while requiring only 12KB of memory.
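
That error figure follows from the standard HyperLogLog error bound of roughly 1.04/sqrt(m), where m is the number of registers: with m = 2^12 = 4096 registers this gives 1.04/64 ≈ 1.625%. A minimal standalone sketch (not part of the library) that evaluates the bound for a few register counts:

    public class HllErrorBound {
        public static void main(String[] args) {
            // standard error of HyperLogLog is approximately 1.04 / sqrt(m),
            // where m = 2^log2m is the number of registers
            for (int log2m = 10; log2m <= 14; log2m++) {
                long m = 1L << log2m;
                double error = 1.04 / Math.sqrt(m);
                System.out.printf("log2m=%d (m=%d): expected error ~ %.3f%%%n",
                        log2m, m, error * 100);
            }
        }
    }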

Use case

A common use case for HyperLogLog is counting the unique visitors (UV) of a website.

Cardinality

In a nutshell, cardinality refers to the number of distinct elements in a collection (which may contain duplicates). For example, consider the following collection:
{1,2,3,4,5,2,3,9,7}
It has 9 elements, but 2 and 3 each appear twice, so the distinct elements are 1, 2, 3, 4, 5, 9, 7, and the cardinality of the collection is 7.
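
For data that fits in memory, the cardinality can be computed exactly with a HashSet; HyperLogLog matters when the data is far too large for that. A minimal sketch for comparison:

    import java.util.Arrays;
    import java.util.HashSet;

    public class ExactCardinality {
        public static void main(String[] args) {
            Integer[] data = {1, 2, 3, 4, 5, 2, 3, 9, 7};
            // a HashSet stores each distinct element once,
            // so its size is the exact cardinality
            HashSet<Integer> distinct = new HashSet<>(Arrays.asList(data));
            System.out.println("Cardinality = " + distinct.size()); // prints 7
        }
    }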

Maven

        <dependency>
            <groupId>net.agkn</groupId>
            <artifactId>hll</artifactId>
            <version>1.6.0</version>
        </dependency>
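
The usage example below also hashes elements with Guava's Murmur3 implementation, so a Guava dependency is needed as well (the version here is an assumption; use whatever recent version your project already has):

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <!-- assumed version; any recent Guava release provides Hashing.murmur3_128 -->
            <version>19.0</version>
        </dependency>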

Use

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;
    import net.agkn.hll.HLL;
    import org.junit.Test;

    @Test
    public void testSimpleUse(){
        final int seed = 123456;
        final HashFunction hash = Hashing.murmur3_128(seed);
        // data on which to calculate the distinct count
        final Integer[] data = new Integer[]{1, 1, 2, 3, 4, 5, 6, 6,
                6, 7, 7, 7, 7, 8, 10};
        // log2m = 13 (2^13 = 8192 registers), regwidth = 5 bits per register
        final HLL hll = new HLL(13, 5);
        for (int item : data) {
            // hash each element first; HLL expects raw 64-bit hash values
            final long value = hash.newHasher().putInt(item).hash().asLong();
            hll.addRaw(value);
        }
        System.out.println("Distinct count = " + hll.cardinality());
    }
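
A useful property of HLL sketches is that they can be merged: the union of two sketches estimates the cardinality of the combined data sets, which is what makes HyperLogLog practical for distributed counting. A minimal sketch using the library's union method (the two HLLs must be built with the same parameters; imports as above):

    @Test
    public void testUnion(){
        final HashFunction hash = Hashing.murmur3_128(123456);
        final HLL first = new HLL(13, 5);
        final HLL second = new HLL(13, 5);
        for (int i = 0; i < 1000; i++) {    // elements 0..999
            first.addRaw(hash.newHasher().putInt(i).hash().asLong());
        }
        for (int i = 500; i < 1500; i++) {  // elements 500..1499, overlapping
            second.addRaw(hash.newHasher().putInt(i).hash().asLong());
        }
        // union folds 'second' into 'first'; the merged sketch estimates
        // the cardinality of the combined set (about 1500 here)
        first.union(second);
        System.out.println("Union cardinality = " + first.cardinality());
    }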

Principle

Imagine a process in which a coin is flipped repeatedly until it first lands heads (each side has probability 0.5). The probability that a single process takes more than k flips is 0.5^k (the first k flips are all tails), so the probability that it takes at most k flips is 1 - 0.5^k.
Now let x be the maximum number of flips over n independent processes. The probability that every process takes at most k flips, and its complement, are:

P(x<=k) = (1-0.5^k)^n
P(x>k) = 1-(1-0.5^k)^n

From these formulas it can be seen that when n >> 2^k, P(x<=k) approaches 0, i.e. some process almost certainly takes more than k flips; and when n << 2^k, P(x>k) approaches 0, i.e. almost certainly no process takes more than k flips.
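
These probabilities are easy to check numerically. The sketch below evaluates P(x<=k) for k = 10 (so 2^k = 1024) and several values of n, showing how sharply the probability drops once n grows past 2^k:

    public class FlipProbability {
        public static void main(String[] args) {
            // P(x <= k) = (1 - 0.5^k)^n : probability that none of the
            // n processes needs more than k flips
            int k = 10; // 2^k = 1024
            for (int n : new int[]{64, 256, 1024, 4096, 16384}) {
                double p = Math.pow(1 - Math.pow(0.5, k), n);
                System.out.printf("k=%d, n=%d: P(x <= k) = %.4f%n", k, n, p);
            }
        }
    }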

Each process can be encoded as a bit string, with tails written as 0 and heads as 1, so the number of flips corresponds to the position of the first 1. Suppose that after observing enough bit strings, the largest position of the first 1 is j. By the formulas above, if n >> 2^j then P(x<=j) approaches 0, and if n << 2^j then P(x>j) approaches 0. In other words, given that the largest observed position is j, we can estimate n ≈ 2^j.

More generally, if we hash each element of a data set to an 8-bit string, the probability of obtaining a value such as 00000111 is very low; that is, long runs of leading zeros are rare. The probability of a hash starting with 5 consecutive zeros is 1/32, so when we observe such a string we can estimate that the cardinality of the data set is about 2^5 = 32.
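
The sketch below demonstrates this single-estimator idea: it hashes n distinct integers with Guava's Murmur3 (the same hash used earlier, chosen here for illustration), tracks the largest position of the first 1 bit, and estimates the cardinality as 2 raised to that position. The estimate is very noisy, which is exactly why HyperLogLog splits hashes across many registers and averages them:

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    public class NaiveEstimator {
        public static void main(String[] args) {
            final HashFunction hash = Hashing.murmur3_128(123456);
            final int n = 100_000; // true cardinality
            int maxFirstOne = 0;
            for (int i = 0; i < n; i++) {
                long value = hash.newHasher().putInt(i).hash().asLong();
                // position of the first 1 bit = number of leading zeros + 1
                int firstOne = Long.numberOfLeadingZeros(value) + 1;
                maxFirstOne = Math.max(maxFirstOne, firstOne);
            }
            // estimate the cardinality as 2^j, where j is the largest observed position
            System.out.printf("actual n = %d, estimate = 2^%d = %d%n",
                    n, maxFirstOne, 1L << maxFirstOne);
        }
    }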
