How to do aggregate statistics on a large amount of data?

  mongodb, question

I need to implement statistics for a "wrong answer book" feature. My current approach is to store one record per answered question in a MongoDB collection named exercise and to produce the statistics with the aggregation framework (aggregate). The exercise collection already has more than 20 million records and is growing quickly. One record looks like this:

{
    "_id" : ObjectId("58856c54c10da3925edc9a4b"),
    "eID" : NumberLong(1),        // question id
    "right" : NumberLong(1),      // whether the answer was correct
    "scene" : NumberLong(1),      // scene
    "uid" : NumberLong(663148),   // user id
    "total" : NumberLong(1),      // total score
    "score" : NumberLong(1),      // user's score
    "rate" : NumberLong(100),     // correct rate
    "date" : ISODate("2017-01-23T12:17:00.000Z"), // time answered
    "subject" : NumberLong(1),    // subject
    "class" : NumberLong(1),      // class
    "grade" : NumberLong(1),      // grade
    "school" : NumberLong(1),     // school
    "tags" : [
        {
            "tagID" : NumberLong(39544), // chapter id
            "level" : NumberLong(1)      // chapter level
        },
        {
            "tagID" : NumberLong(39621),
            "level" : NumberLong(1)
        }
    ]
}

The requirements are as follows:
(1) By grade, subject, date, and school, find the list of wrong questions and count how many times each one was answered wrongly.
(2) By grade, subject, and school, count the number of wrong questions in each scene.
(3) By grade, subject, and textbook, count the total number of wrong answers for all wrong questions in each chapter of the textbook.
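For reference, requirement (1) maps to roughly this kind of pipeline (a sketch only; the filter values are placeholders, and the assumption that right: 0 marks a wrong answer comes from the record structure above):

db.exercise.aggregate([
    { $match: {
        grade: NumberLong(1),               // placeholder values
        subject: NumberLong(1),
        school: NumberLong(1),
        right: NumberLong(0),               // assumed: 0 = answered wrongly
        date: { $gte: ISODate("2017-01-01T00:00:00.000Z"),
                $lt:  ISODate("2017-02-01T00:00:00.000Z") }
    }},
    { $group: { _id: "$eID", wrongCount: { $sum: 1 } } },  // mistakes per question
    { $sort: { wrongCount: -1 } }
])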

At present I have an index on school, but once there are millions of records for a school the queries above become very slow. I don't know how large sites design this so that such queries stay fast. Advice from anyone with experience would be much appreciated.

First, there is the index problem. The three indexes corresponding to your three queries are:

{grade: 1, subject: 1, date: 1}
{grade: 1, subject: 1, school: 1}
{grade: 1, subject: 1, "tags.tagID": 1}
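In the mongo shell (collection name exercise taken from the question), creating them would look like:

db.exercise.createIndex({ grade: 1, subject: 1, date: 1 })
db.exercise.createIndex({ grade: 1, subject: 1, school: 1 })
db.exercise.createIndex({ grade: 1, subject: 1, "tags.tagID": 1 })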

But indexes alone will not solve your problem completely. Keep one thing in mind: the more data a statistical query has to touch, the slower it is, simply because all of that data must be read every time; it will never be as fast as looking up a handful of records.
Meanwhile, exercise records keep piling up as more people answer questions. If every query re-counts all the data from scratch, the data set only grows and the queries will inevitably get slower and slower. So the real question is how to limit the amount of data each query has to count. Two things to consider:

  1. Limit how far back you count (for example, only the last year); a date-range sketch follows this list.
  2. Pre-aggregate at different granularities (day, month, year).
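For point 1, the cutoff is just an extra date condition in the $match stage, for example (the one-year window and the other filter values are illustrative):

// Only count roughly the last 365 days (cutoff value is illustrative)
var since = new Date(Date.now() - 365 * 24 * 3600 * 1000);
db.exercise.aggregate([
    { $match: { grade: NumberLong(1), subject: NumberLong(1),
                right: NumberLong(0), date: { $gte: since } } },
    { $group: { _id: "$eID", wrongCount: { $sum: 1 } } }
])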

Pre-aggregation is a very common way to deal with growing time-series data. Take the first requirement as an example:

(1) By grade, subject, date, and school, find the list of wrong questions and count how many times each one was answered wrongly.

Suppose there are 10 such wrong-answer records every day; over the past 100 days that is 1,000 records, and your query has to read all 1,000 of them to produce a result.
If instead I pre-aggregate once at the end of each day, then after 100 days there are only 100 records, each holding that day's statistical result. Now I only need to summarize those 100 records to get the answer I want.
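A minimal sketch of such a nightly job, assuming a pre-aggregation collection called exercise_daily (the collection name and the chosen grouping granularity are assumptions, not something from the question):

// Run once at the end of each day: collapse that day's wrong answers
// into one row per (question, grade, subject, school, day).
var dayStart = ISODate("2017-01-23T00:00:00.000Z");
var dayEnd   = ISODate("2017-01-24T00:00:00.000Z");
db.exercise.aggregate([
    { $match: { date: { $gte: dayStart, $lt: dayEnd }, right: NumberLong(0) } },
    { $group: {
        _id: { eID: "$eID", grade: "$grade", subject: "$subject",
               school: "$school", day: dayStart },
        wrongCount: { $sum: 1 }
    }}
]).forEach(function (doc) {
    // Upsert so the job can safely be re-run for the same day.
    db.exercise_daily.updateOne(
        { _id: doc._id },
        { $set: { wrongCount: doc.wrongCount } },
        { upsert: true }
    );
});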
If needed you can go further and pre-aggregate at coarser granularities such as month and year, which gives an even bigger advantage for long-range queries.
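Requirement (1) can then be answered from the daily collection instead of the raw records, e.g. (again assuming the exercise_daily layout sketched above):

db.exercise_daily.aggregate([
    { $match: {
        "_id.grade": NumberLong(1), "_id.subject": NumberLong(1),
        "_id.school": NumberLong(1),
        "_id.day": { $gte: ISODate("2016-10-15T00:00:00.000Z") }  // e.g. last 100 days
    }},
    // Sum the per-day counts to get the total mistakes per question.
    { $group: { _id: "$_id.eID", wrongCount: { $sum: "$wrongCount" } } }
])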