A Difficult Problem in MongoDB Database Design

  mongodb, question

I honestly don’t know how to describe this in a title.

I have two collections:

Data collection: stores article information such as status, category, time, etc.
Field collection: stores the fields of each article, because an article’s fields are not fixed, e.g. title, content, etc.

A document in the field collection looks like this:

 {
     field_key: 'title',
     data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
     data: 'This is the title',
     data_hash: 'a46c23269ab827c5f878e766984e4716' // hash of the text
 }

The problem is that I need to check whether a given field already exists.

For example, to check whether this title already exists, I run a query like this:

 {
     field_key: 'title',
     data_hash: 'a46c23269ab827c5f878e766984e4716'
 }
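At this scale, that lookup would presumably need a compound index covering both keys to stay fast. A minimal sketch (the collection name field is my assumption):

 // compound index so the existence check is resolved from the index
 db.field.createIndex({ field_key: 1, data_hash: 1 })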

That doesn’t look like a problem, does it? It seems simple. But it isn’t.

When the data volume is large, say tens of millions of documents in the data collection, the field collection can reach hundreds of millions of documents.

If a user deletes the article data with uid=100, the documents in both the field collection and the data collection should be removed.
To keep users from waiting, I initially only soft-deleted the data collection; for example, the flag status=-1 marks a document as deleted.
After all, deleting or updating a field collection with hundreds of millions of documents is a time-consuming operation, which is why the deletion mechanism was designed this way.

So here is the question: the existence check above may match multiple field documents. How do I know whether the article that a matched field belongs to has been deleted?

So for every field the query returns, I have to look up its status in the data collection, which puts a lot of pressure on the database.
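In mongo shell terms, the two-step check I am describing looks roughly like this (the collection names field and data are placeholders for whatever yours are called):

 // step 1: all field documents matching this key + hash
 var matches = db.field.find({
     field_key: 'title',
     data_hash: 'a46c23269ab827c5f878e766984e4716'
 }).toArray();

 // step 2: one extra status lookup per match, which is the expensive part
 var live = matches.filter(function (f) {
     return db.data.countDocuments({ _id: f.data_id, status: { $ne: -1 } }) > 0;
 });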

My solutions:

Option 1: store a key for each field in a Redis set, e.g. md5(field_key + data_hash).
I estimate that hundreds of millions of members in one set will not take up much space, and query performance would certainly improve a lot.
However, a new problem arises. If a user deletes 1,000 articles, I have to query the database for all the affected fields and then remove them from the set. That deletion may have some performance impact, and I am not sure it is cost-effective.
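A sketch of option 1 with the node-redis client; the set name field_keys and all the wiring are my assumptions, not tested code:

 import { createClient } from 'redis';
 import { createHash } from 'crypto';

 const redis = createClient();
 await redis.connect();

 // set member: md5(field_key + data_hash)
 const member = createHash('md5')
     .update('title' + 'a46c23269ab827c5f878e766984e4716')
     .digest('hex');

 // on field creation
 await redis.sAdd('field_keys', member);

 // existence check becomes O(1), no MongoDB query needed
 const exists = await redis.sIsMember('field_keys', member);

 // on article delete: first fetch the affected fields from MongoDB, then
 await redis.sRem('field_keys', member);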

Option 2: change the deletion mechanism and also bulk-update the field collection with status=-1. However, I have not tested whether updating, say, 1 million documents in a field collection of hundreds of millions will block the database and affect the business. So this option is riskier.
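Option 2 would amount to something like this (a sketch; the uid field on the data collection comes from the example above):

 // ids of the user's articles
 var ids = db.data.find({ uid: 100 }, { _id: 1 })
     .toArray()
     .map(function (d) { return d._id; });

 // the untested bulk update: mark every related field document deleted
 db.field.updateMany({ data_id: { $in: ids } }, { $set: { status: -1 } });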

I don’t know if there are any other simple and effective solutions.

-Supplement-

To the answers asking why I split the data into two collections: there is a reason for this design.

I didn’t express myself clearly: “article” is just a simplification of the problem. (If I had written thousands of words of detail, I doubt anyone would have bothered to read the question.)

In reality the data can be articles, product data, or microblog posts; the data models we need to store are diverse.

1. For example, I store microblog data, and a single post may have hundreds of thousands of comments, each with the comment text, comment time, user information, etc.
As long as the 16MB document limit exists, my system is constrained, so I can’t design it as a single document.

2. An article can have several versions, such as version 1, version 2, and version 3, and there is no bound on how many there will be. So 16MB is definitely not enough.

There are various other reasons I won’t explain in detail, but one thing is certain: the two collections must stay separate.

In the end I went with Redis storage. It seems to be the better plan; I can’t think of anything better.

3. As for the concern that this design violates normal form:

Some people may ask: if you store it like this, isn’t it very inconvenient to query all the data of one article?

We also keep a copy of the data in Elasticsearch. The defining characteristic of that copy is data redundancy; it does not follow normal form.

As the answer below describes, we store it in Elasticsearch like this:

 // data document
 {
     // the article's fields
     fields: [
         {
             field_key: 'title',
             data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
             data: 'This is the title',
         },
         {
             field_key: 'content',
             data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
             data: 'This is the content',
         },
         ...
     ]
 }

Only one query is needed and everything comes out, so that is not a problem. In MongoDB itself, we follow normal form as much as possible.


After using Redis storage for a while, I observed that with about 10 million members across all the sets, the storage space is about 1GB.
In other words, at 100 million records it would be roughly 10GB, which would definitely become a performance bottleneck. So this is not a good approach either.

-Supplement-

I suddenly realized I was being stupid. The Redis set: why not store it in MongoDB instead, and just use Redis as a cache?
Then neither performance nor storage is a problem.
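For example, the Redis set could become a small MongoDB collection with a unique index, with Redis caching hot keys in front of it. A sketch (the collection name field_keys is made up; hex_md5() is the legacy mongo shell helper, so substitute your own md5 elsewhere):

 db.field_keys.createIndex({ key: 1 }, { unique: true });

 // on field creation: record md5(field_key + data_hash)
 var key = hex_md5('title' + 'a46c23269ab827c5f878e766984e4716');
 db.field_keys.insertOne({ key: key });

 // existence check; cache the answer in Redis if needed
 var exists = db.field_keys.findOne({ key: key }) !== null;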

It seems we were still designing the schema according to relational-database normal forms. Why not merge the two collections into one?
The number of fields one article has is limited, so 16MB is completely enough for content + fields. In the simplest form:

 // data document
 {
     // the article's fields
     fields: [
         {
             field_key: 'title',
             data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
             data: 'This is the title',
             data_hash: 'a46c23269ab827c5f878e766984e4716' // hash of the text
         },
         {
             field_key: 'content',
             data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
             data: 'This is the content',
             data_hash: '...' // hash of the text
         },
         ...
     ]
 }

Both deletion and querying are then completed in a single operation. Is there any special reason this cannot be done?
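With the merged schema, both operations become single statements. A minimal sketch (collection name data assumed; the multikey index is there to keep the array lookup fast):

 db.data.createIndex({ 'fields.field_key': 1, 'fields.data_hash': 1 });

 // existence check in one query: any live article with this field + hash?
 db.data.findOne({
     status: { $ne: -1 },
     fields: { $elemMatch: { field_key: 'title', data_hash: 'a46c23269ab827c5f878e766984e4716' } }
 });

 // deletion in one statement: the embedded fields go with the article
 db.data.updateMany({ uid: 100 }, { $set: { status: -1 } });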