The article table holds 5 million rows, and 100,000 rows are updated every day. From those 100,000 updated rows, 3,000 are randomly selected for data research. How can this be done efficiently?

  mysql, question

This came up in an interview.
My idea was to create another table to store the 100,000 articles updated that day.
That first step is as far as I got; I don't know how to implement the rest.
Even following my idea, once I know which 100,000 rows were updated on a given day, I still have to pull 3,000 of them out of the 5 million-row table.
Does anyone have a good idea?

3,000 out of 100,000 is a 3% probability.

So whenever an article is saved, record the updated article in the cache with 3% probability.

Redis's set type is the best fit for this cache: a set does not store duplicate elements, so an article updated several times will not appear in the list more than once.

The key can be formatted as `analyze:list:{Y-m-d}`.

The cache can then be set to expire after 48 hours, and if necessary, the previous day's set can be archived to the database each day.

To allow for random variance, you can enlarge the 3% to 5%. You will definitely end up with more than 3,000 records, but not too many more, and in the end you only use 3,000 of them anyway.
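
A minimal sketch of this write-path sampling, assuming redis-py; the client setup and the hook function name are illustrative:

```python
import random
from datetime import date

import redis

r = redis.Redis()        # assumes a local Redis instance

SAMPLE_RATE = 0.05       # need 3%, enlarged to 5% as a safety margin
TTL_SECONDS = 48 * 3600  # each day's set lives for 48 hours

def on_article_updated(article_id: int) -> None:
    """Hypothetical hook called whenever an article is saved."""
    if random.random() >= SAMPLE_RATE:
        return  # ~95% of updates are filtered out up front
    key = f"analyze:list:{date.today():%Y-%m-%d}"
    r.sadd(key, article_id)  # the set deduplicates repeated updates
    r.expire(key, TTL_SECONDS)
```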

If every update were recorded, whether in the cache or in the database, most of the records would be useless, so it is better to filter by probability up front.
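
When the research job runs, drawing exactly 3,000 distinct IDs out of the oversampled set is a single command; a self-contained sketch, with the date argument as a placeholder:

```python
import redis

r = redis.Redis()  # same instance the article IDs were recorded to

def draw_sample(day: str, n: int = 3000) -> list[bytes]:
    # SRANDMEMBER with a positive count returns up to n distinct members
    return r.srandmember(f"analyze:list:{day}", n)

sample = draw_sample("2024-01-01")  # hypothetical day to analyze
```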

In fact, you could also record each article's update_time. I think `WHERE update_time >= ? AND update_time <= ? ORDER BY RAND() LIMIT 3000` is almost enough; after all, it only runs once a day. However, since this is an interview, why not follow the interviewer's line of thinking?
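
For completeness, a sketch of that one-shot query approach, assuming pymysql, a hypothetical `articles` table, and placeholder credentials:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="app",
                       password="secret", database="blog")  # placeholders
try:
    with conn.cursor() as cur:
        # ORDER BY RAND() is acceptable here because it runs once a day
        cur.execute(
            "SELECT id FROM articles "
            "WHERE update_time >= %s AND update_time <= %s "
            "ORDER BY RAND() LIMIT 3000",
            ("2024-01-01 00:00:00", "2024-01-01 23:59:59"),  # hypothetical day
        )
        ids = [row[0] for row in cur.fetchall()]
finally:
    conn.close()
```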

Advantages of the cache approach:
1. It works without an update_time field, so it puts no requirements on the existing table structure; adding columns to a production database is troublesome.
2. If the production database is already under heavy load, an ORDER BY RAND() query could bog it down; ideally you would run it only against a read-only replica, which imposes a read/write splitting requirement on the architecture. My design is purely a bypass recorder and needs nothing but Redis.
3. It records only as much as it needs, so there is little additional I/O.