Efficiency of Large-Scale Data Deletion in MongoDB

  mongodb, question

There are two collections:

- data, which stores the data.
- attach, which stores the attachments.

data and attach have a one-to-many relationship, at a ratio of roughly 1:10; that is, one data document may have 10-50 attach documents under it. When data holds tens of millions of documents, attach may hold hundreds of millions.
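
For illustration only, the two documents might look something like this (the field names are hypothetical; the question does not give the actual schemas):

// Hypothetical document shapes; real field names may differ.
db.data.insertOne({
  userId: "u1",
  title: "example document"
});
db.attach.insertOne({
  dataId: ObjectId("..."),      // reference to the parent data document
  userId: "u1",
  size: NumberLong(1048576)     // attachment size in bytes
});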

The problem is that when a user deletes a data document, the corresponding attach documents must be deleted as well. Deleting hundreds of millions of documents is a long-running operation and puts heavy pressure on the database.

So we are considering soft deletion: first mark the data documents as deleted, leave attach alone for the moment, and let a background job clean it up bit by bit later. After all, updating tens of millions of documents does not keep the user waiting long.

There is one problem, though: we need statistics over attach. For example, before deleting, a user's attachments occupy 20 GB; after deleting, we must report the new figure for the space his attachments occupy, otherwise billing will be inaccurate.

So how can this problem be solved?

This is a very common problem. In big-data scenarios, soft deletion is indeed used in place of hard deletion, so there is nothing wrong with the scheme itself; let's establish that first. If you know GridFS, a record in fs.files is in fact equivalent to your data, and fs.chunks is equivalent to your attach. fs.files stores a series of metadata such as the file size, file name, and path.
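
For reference, an fs.files record stores metadata along these lines (the field values here are examples; metadata is free-form and application-defined):

// Typical fs.files record (GridFS metadata document).
{
  _id: ObjectId("..."),
  length: NumberLong(261120),   // file size in bytes
  chunkSize: 261120,            // default chunk size is 255 KB
  uploadDate: ISODate("..."),
  filename: "report.pdf",
  metadata: { path: "/docs" }   // application-defined metadata
}
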
In addition, since deleted files already carry a deletion flag, you can simply filter on that flag when totaling file sizes. For example, if isDeleted: true marks a file as deleted, the query you need is:

db.foo.aggregate([
  {$match: {userId: "<userId>", isDeleted: false}},
  {$group: {_id: "$userId", size: {$sum: "$size"}}}  // assuming the file size is stored in "size"
]);
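
With the 20 GB figure from the question, the output would look like the line below; running the same pipeline again after a soft delete immediately reflects the reduced total, which is exactly what billing needs.

// Example output: 20 GB (21474836480 bytes) of non-deleted attachments.
{ "_id" : "<userId>", "size" : NumberLong("21474836480") }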

To make this query faster, the following aspects can be optimized:

A reasonable index

If you are familiar with databases, it is easy to see that the most suitable index here is:

db.foo.createIndex({userId: 1, isDeleted: 1})
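
You can verify that the index is picked up with the shell's explain helper; the $match stage should show an IXSCAN on this index rather than a COLLSCAN:

// Check the winning plan for the $match stage (expect IXSCAN, not COLLSCAN).
db.foo.explain("executionStats").aggregate([
  {$match: {userId: "<userId>", isDeleted: false}},
  {$group: {_id: "$userId", size: {$sum: "$size"}}}
]);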

Further optimization: a partial index

In fact, since you care about the files that have not been deleted rather than the ones that have, a further optimization is to index only the non-deleted documents, that is, to use a MongoDB partial index:

db.foo.createIndex({
  userId: 1,
  isDeleted: 1
}, {
  partialFilterExpression: {
    isDeleted: false
  }
});
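
One caveat with partial indexes: MongoDB will only use one when the query predicate is guaranteed to be satisfied by the filter expression, so queries must include isDeleted: false explicitly:

// Can use the partial index: the predicate implies isDeleted: false.
db.foo.find({userId: "<userId>", isDeleted: false});

// Cannot use it: this query might have to return deleted documents.
db.foo.find({userId: "<userId>"});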

Further optimization: a covered query

If this query is frequent and its performance matters, a further optimization to consider is a covered query. Other databases have a similar concept: after the lookup, the data is returned directly from the index rather than fetched from the data pages. This often helps with large-scale statistics. The price is higher memory consumption; or rather, you need enough memory to benefit from it. The index and corresponding query should then be:

db.foo.createIndex({
  userId: 1,
  isDeleted: 1,
  size: 1
}, {
  partialFilterExpression: {
    isDeleted: false
  }
});
db.foo.aggregate([
  {$match: {userId: "<userId>", isDeleted: false}},
  {$project: {_id: 0, userId: 1, size: 1}},
  {$group: {_id: "$userId", size: {$sum: "$size"}}}
]);
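
To confirm the query is actually covered, look at executionStats in the explain output: a covered plan reads only index keys, so totalDocsExamined should be 0.

// Expect "totalDocsExamined" : 0 when the query is fully covered by the index.
db.foo.explain("executionStats").aggregate([
  {$match: {userId: "<userId>", isDeleted: false}},
  {$project: {_id: 0, userId: 1, size: 1}},
  {$group: {_id: "$userId", size: {$sum: "$size"}}}
]);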

EDIT

On the question of deletion itself: deletion is indeed a very heavy operation. Every deleted document causes all relevant indexes to be updated, so it will never be very fast. You mentioned "soft delete" with another thread finishing the work later; that is essentially an asynchronous delete, and there is nothing wrong with the idea. But it can be simpler: set an expiration time at the moment of soft deletion and let MongoDB's TTL index complete the physical deletion for you automatically. To save space, this expiration index only needs to cover deleted documents, so it too can be a partial index:

db.foo.createIndex({
  expire: 1
}, {
  partialFilterExpression: {
    isDeleted: true
  },
  expireAfterSeconds: 0
});

The delete operation then becomes simply:

db.foo.update({...}, {$set: {isDeleted: true, expire: new Date()}});

In this way, deleting only sets the expiration time to now (expire immediately); the next time the TTL monitor runs (once a minute by default), it deletes the document for you. The same needs to be done on attach.
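
For completeness, a sketch of the same setup on attach, assuming it carries the same isDeleted flag and expire field as above (dataId is a hypothetical reference field, not given in the question):

// TTL index on attach, again restricted to soft-deleted documents.
db.attach.createIndex({
  expire: 1
}, {
  partialFilterExpression: {
    isDeleted: true
  },
  expireAfterSeconds: 0
});

// Soft-delete all attachments belonging to a deleted data document.
db.attach.update(
  {dataId: ObjectId("...")},                        // hypothetical reference to the parent
  {$set: {isDeleted: true, expire: new Date()}},
  {multi: true}
);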