Million Data Aggregated GroupSum Statistics Super Time-consuming Problem, Seek Solution

  mongodb, question

1. Example of Document Structure

{
 _id: xxxx,
 user: 'xiaoming',
 level: 5,
 from: 'iPhone',
 info: 'something wrong'
 }

2. Scene: There are 67 million documents with user as’ minor’

3. Question: How to Increase the Speed of aggregate+group+sum

aggregate([
 {$match:{user: 'xiaoming', info:{$regex:'wrong'}}},
 {$group:{_id:null, count:{$sum:1}}}
 ])

Use the above to count the number of xiaoming documents with wrong. As a result

{"_id": null, "count": 2299999 }

It takes 30s-40s. User, info, user+info three indexes have been tried, and the speed has not improved.
Baidu and google Find’ Conditional Counting Slow Without Solution’
How to Improve Efficiency, Can It Be Achieved within 10s

The first problem to be explained is that the expectation for OLAP-type operations should not be too high. After all, for the operation of a large amount of data, IO alone has far exceeded the normal OLTP operation, so it is unrealistic and meaningless to require the speed and concurrency of OLTP operation. However, this does not mean that there is no room for optimization.
Let’s start with the index. On the premise of no index, find 6 million{user: "xiaoming"}How long does it take? Full table scanCOLLSCANFinding 600w from 700w data and finding 600w from 100 million data are obviously two concepts. Hit indexIXSCAN, the difference will be much smaller, almost negligible. So you said{user: 1}It is wrong that this index has no effect, perhaps because the amount of aggregate data is too small to see any difference. Incidentally, it should be mentioned that whether there is any difference in efficiency depends on the implementation plan, not the execution time, which is inaccurate.
There areuserOn the premise of index, the result still has 600w, the rest isregex,regexThe index cannot be hit, so whether it is right or notinfoThe index of is meaningless. After finding 600w pieces of data, there is another one for 600w datafilterOperation. The only thing that may be helpful to this operation isFull-text indexHowever, full-text indexing cannot completely replace regularization, and specific problems need to be read.Document. Considering the feasibility of full-text indexing, composite indexes can be established:

db.coll.createIndex({
 user: 1,
 info: "text"
 });

The corresponding query should read:

db.coll.aggregate([
 {$match:{user: 'xiaoming', $text: { $search: "wrong" }}},
 {$group:{_id:null, count:{$sum:1}}}
 ])

Introduction and Reference of Composite Full-text IndexHereThere are still some restrictions that need attention. After such optimization, it is expected that it can be reduced to less than 20 seconds under the same hardware, which is still a long way from what you want within 10 seconds. The reason is stated at the beginning, OLAP cannot be expected to be so high. If you really need this, you should start from the source and consider:

  1. Every timeinfoCount fields when they are updated or inserted.
    Or ..

  2. Do complete statistics every once in a while, cache the statistical results, and show them directly to users when querying.