Does MongoDB prefer putting all data in a single collection?

  mongodb, question

For example, consider a database of user information and user relationships. In SQL, you would create two tables, one for user information and one for user relationships. In MongoDB, is the tendency instead to embed the user relationships in the user information, forming a single document?

Original address: http://pwhack.me/post/2014-06-25-1 (please credit the source when reprinting)

This article is excerpted from Chapter 8 of MongoDB: The Definitive Guide and can completely answer the following two questions:

There are many ways to represent data, and one of the most important issues is how much to normalize it. Normalization means dividing data across multiple collections, with references between the collections. Although many documents may reference a piece of data, that data is stored in only one collection, so to change it you only need to modify the one document that holds it. However, MongoDB does not provide a join facility, so gathering documents from several collections requires multiple queries.

Denormalization is the opposite of normalization: the data each document needs is embedded inside that document. Each document has its own copy of the data instead of all documents referencing a single shared copy. This means that when the information changes, every relevant document must be updated, but a single query is enough to fetch all the data.

Deciding when to normalize and when to denormalize is difficult. Normalization generally makes writes faster, while denormalization makes reads faster. You need to weigh the two carefully against the needs of your own application.

Examples of Data Representation

Suppose you want to store information about students and the courses they take. One representation uses a students collection (one document per student) and a classes collection (one document per course), plus a third collection, studentClasses, that holds the connections between students and courses.

> db.studentClasses.findOne({"studentId": id});
{
    "_id": ObjectId("..."),
    "studentId": ObjectId("..."),
    "classes": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

If you are familiar with relational databases, you have probably built this kind of join table before, although each row would typically hold one student and one course (rather than a list of course “_id”s). Putting the courses in an array is a bit more MongoDB-style, but in practice you usually would not store data this way, because it takes many queries to get to the actual information.

Suppose you want to find the courses a student is taking. You need to query the students collection to find the student, then query studentClasses to find the course “_id”s, and finally query the classes collection to get the course information. That is three round trips to the server just to find out the course information. This is probably not how you want to organize data in MongoDB, unless the student and course information changes frequently and read speed is not a concern.

If you embed course references in student documents, you can save one query:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

The “classes” field is an array holding the “_id”s of the courses John Doe is taking. When you need information about those courses, you can query the classes collection with these “_id”s. This takes only two queries. This organization works very well for data that does not need to be instantly accessible and that changes, but not constantly (“constantly” is a stronger condition than “often”).
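
As a rough illustration of the two-query pattern, here is a minimal JavaScript sketch that uses plain in-memory arrays as stand-ins for the students and classes collections. The string “_id” values, the helper names, and the sample data are assumptions made for this example only. Against a real server, the second step would be comparable to a find on the classes collection with an “$in” filter.

```javascript
// In-memory stand-ins for the classes and students collections (hypothetical data).
const classCollection = [
  { _id: "c1", class: "Trigonometry", credits: 3, room: "204" },
  { _id: "c2", class: "Physics", credits: 3, room: "159" },
  { _id: "c3", class: "Women in Literature", credits: 3, room: "14b" },
];
const studentCollection = [
  { _id: "s1", name: "John Doe", classes: ["c1", "c2"] },
];

// Query 1: fetch the student document, which carries the course "_id" values.
function findStudent(name) {
  return studentCollection.find((s) => s.name === name);
}

// Query 2: fetch every referenced course in one pass,
// comparable to db.classes.find({"_id": {"$in": student.classes}}).
function findClassesFor(student) {
  return classCollection.filter((c) => student.classes.includes(c._id));
}

const john = findStudent("John Doe");
const johnsClasses = findClassesFor(john);
console.log(johnsClasses.map((c) => c.class));
```

The intermediate studentClasses lookup disappears because the references now live in the student document itself.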

If reads need to be optimized further, the data can be fully denormalized: store each course as an embedded document in the “classes” field of the student document, so that a student’s course information can be fetched with a single query:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        {
            "class": "Trigonometry",
            "credits": 3,
            "room": "204"
        },
        {
            "class": "Physics",
            "credits": 3,
            "room": "159"
        },
        {
            "class": "Women in Literature",
            "credits": 3,
            "room": "14b"
        },
        {
            "class": "AP European History",
            "credits": 4,
            "room": "321"
        }
    ]
}

The advantage of this approach is that a single query returns a student’s course information. The disadvantages are that it takes up more storage space and that keeping the data in sync is harder. For example, if Physics becomes a 4-credit course (instead of 3), the document of every student taking Physics must be updated, not just a single “Physics” document.
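
The synchronization cost can be made concrete with a small JavaScript sketch. It models the fully denormalized student documents as an in-memory array (the extra students and the helper name are hypothetical) and counts how many documents a single credit change has to touch.

```javascript
// In-memory stand-in for student documents with embedded course data (hypothetical).
const denormalizedStudents = [
  {
    name: "John Doe",
    classes: [
      { class: "Trigonometry", credits: 3 },
      { class: "Physics", credits: 3 },
    ],
  },
  { name: "Jane Roe", classes: [{ class: "Physics", credits: 3 }] },
  { name: "Sam Poe", classes: [{ class: "Trigonometry", credits: 3 }] },
];

// Change the credits of one course in every embedded copy.
// Returns the number of student documents that had to be rewritten.
function setCreditsEverywhere(className, credits) {
  let touchedDocs = 0;
  for (const student of denormalizedStudents) {
    let changed = false;
    for (const c of student.classes) {
      if (c.class === className) {
        c.credits = credits;
        changed = true;
      }
    }
    if (changed) touchedDocs += 1;
  }
  return touchedDocs;
}

const touchedDocs = setCreditsEverywhere("Physics", 4);
console.log(touchedDocs);
```

With a normalized layout the same change would rewrite exactly one document in the classes collection, regardless of how many students are enrolled.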

Finally, embedding and referencing can be mixed: embed an array of subdocuments holding the frequently used information, and follow the reference to the actual document when more detail is needed:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        {
            "_id": ObjectId("..."),
            "class": "Trigonometry"
        },
        {
            "_id": ObjectId("..."),
            "class": "Physics"
        },
        {
            "_id": ObjectId("..."),
            "class": "Women in Literature"
        },
        {
            "_id": ObjectId("..."),
            "class": "AP European History"
        }
    ]
}

This approach is also a good choice, because the amount of embedded information can change as requirements change. If you want to show more (or less) information on a page, you can put more (or less) of it in the embedded document.

Another important consideration is whether the information is read more often than it is updated. If the data is updated regularly, normalization is the better choice. If it changes infrequently, it is not worth slowing down every read just to make the update process more efficient.

For example, a textbook example of normalization is to store users and their addresses in separate collections. However, people rarely change their addresses, so you should not penalize every read for the unlikely event that someone moves. In this case, the address should be embedded in the user document.

If you decide to denormalize, you need to set up a cron job to make sure that every update reaches all of the documents it should. For example, you might attempt an update that spans multiple documents, and the server might crash before all of them have been updated. You need a way to detect this and retry the unfinished update.
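
One possible shape for such a detect-and-retry job is sketched below in JavaScript. This is only an illustration under stated assumptions: documents live in an in-memory array, the “crash” is simulated with a write limit, and the pending-update log and all function names are hypothetical rather than anything MongoDB provides.

```javascript
// In-memory stand-in for student documents with embedded course data (hypothetical).
const studentDocs = [
  { name: "John Doe", classes: [{ class: "Physics", credits: 3 }] },
  { name: "Jane Roe", classes: [{ class: "Physics", credits: 3 }] },
  { name: "Sam Poe", classes: [{ class: "Physics", credits: 3 }] },
];

// Log of updates that were started but not yet confirmed complete.
const pendingUpdates = [];

// Apply an update to every stale embedded copy. `maxWrites` simulates a crash:
// the function gives up once that many copies have been written.
// Returns true only if no stale copy remains.
function applyUpdate(update, maxWrites = Infinity) {
  let writes = 0;
  for (const doc of studentDocs) {
    for (const c of doc.classes) {
      if (c.class !== update.class || c.credits === update.credits) continue;
      if (writes >= maxWrites) return false; // "crashed" before finishing
      c.credits = update.credits;
      writes += 1;
    }
  }
  return true;
}

// Record the intent first, then try to apply it.
function startUpdate(update, maxWrites) {
  pendingUpdates.push(update);
  if (applyUpdate(update, maxWrites)) {
    pendingUpdates.splice(pendingUpdates.indexOf(update), 1);
  }
}

// What the periodic job would run: retry everything still pending.
// Re-applying is safe because setting the same value twice changes nothing.
function retryPending() {
  for (const update of [...pendingUpdates]) {
    if (applyUpdate(update)) {
      pendingUpdates.splice(pendingUpdates.indexOf(update), 1);
    }
  }
}

startUpdate({ class: "Physics", credits: 4 }, 1); // simulated crash after one write
const staleBeforeRetry = studentDocs.filter((d) => d.classes[0].credits === 3).length;
retryPending();
const staleAfterRetry = studentDocs.filter((d) => d.classes[0].credits === 3).length;
console.log(staleBeforeRetry, staleAfterRetry);
```

The key design choice is that the update is written to the log before any document is touched, so an interrupted run always leaves evidence for the next pass to pick up.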

In general, the more frequently data is generated, the less suitable it is for embedding in other documents. If embedded fields, or the number of embedded fields, can grow without bound, they should be stored in a separate collection and accessed by reference rather than embedded. Things like comment lists or activity lists should be stored in their own collections, not embedded in other documents.

Finally, fields that are integral to the document’s data should be embedded in the document. If a field is almost always excluded from the results when you query for the document, it probably belongs in another collection rather than embedded in the current document.

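Embedding is better for                           Referencing is better for
Small subdocuments                                Large subdocuments
Data that does not change regularly               Data that changes often
Eventual consistency is acceptable                Intermediate states must be consistent
Documents that grow by a small amount             Documents that grow by a large amount
Data that usually needs a second query to fetch   Data that is usually excluded from the results
Fast reads                                        Fast writes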

Suppose we have a users collection. Below are some fields the user documents might need, and whether each should be embedded.

Account preferences

Account preferences are relevant only to the user in question and are likely to be queried together with the rest of the user document. They should therefore be embedded in the user document.

Recent activity

This depends on how much the recent activity grows and changes. If it is a fixed-length field (for example, the last 10 events), it should be embedded in the user document.

Friends

Friend information should not be embedded in the user document, or at least not in full. The section below on social networking applications covers this in more detail.

All user-generated content

This should not be embedded in the user document.
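
For the “Recent activity” case above, a fixed-length embedded field might be maintained as in the following JavaScript sketch. The “recent” field name, the helper, and the sample user are assumptions for illustration. On a real server a similar effect can be achieved with a “$push” update that combines “$each” with “$slice: -10”.

```javascript
// Keep only the newest MAX_RECENT events inside the user document.
const MAX_RECENT = 10;

// Hypothetical user document with an embedded, bounded activity list.
const userDoc = { username: "batman", recent: [] };

// Append an event and trim the array from the front so it never
// grows past MAX_RECENT entries.
function recordActivity(user, event) {
  user.recent.push(event);
  if (user.recent.length > MAX_RECENT) {
    user.recent = user.recent.slice(-MAX_RECENT);
  }
}

for (let i = 1; i <= 12; i++) {
  recordActivity(userDoc, "event " + i);
}
console.log(userDoc.recent.length, userDoc.recent[0]);
```

Because the array is bounded, the user document cannot grow indefinitely, which is what makes embedding acceptable here.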

Cardinality

Cardinality is the number of references one collection holds to another. Common relationships are one-to-one, one-to-many, and many-to-many. Consider a blog application. Each post has a title, which is a one-to-one relationship. Each author can have many posts, which is a one-to-many relationship. Each post can have multiple tags and each tag can be used on multiple posts, so that is a many-to-many relationship.

In MongoDB, it helps to split “many” into two subcategories: “many” and “few”. For example, authors and posts might be a one-to-few relationship: each author writes only a handful of posts. Blog posts and tags might be many-to-few: there are probably far more posts than tags. Blog posts and comments might be one-to-many: each post can have many comments.

Once you have determined whether a relationship is “few” or “many”, the trade-off between embedding and referencing becomes easier. In general, “few” relationships work better embedded and “many” relationships work better as references.

Friends, followers, and other hassles

Keep close friends and keep away from enemies.

Many social applications need to link people, content, followers, friends, and so on. Deciding between embedding and referencing for such highly connected data is not easy. This section covers considerations for social graph data. In general, following, friending, or favoriting can be reduced to a publish-subscribe system: one user subscribes to notifications from another. There are thus two basic operations that need to be efficient: storing subscribers and notifying all subscribers of an event.

There are three common subscription implementations. The first way is to embed the content producer in the subscriber document:

{
    "_id": ObjectId("..."),
    "username": "batman",
    "email": "batman@waynetech.com",
    "following": [
        ObjectId("..."),
        ObjectId("...")
    ]
}

Now, for a given user document, you can use db.activities.find({"user": {"$in": user["following"]}}) to query all the activity that the user is interested in. However, to find every user who is interested in a newly published activity, you have to query the “following” field across all users.
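
The asymmetry of this first layout can be seen in a short JavaScript sketch. The arrays below are in-memory stand-ins for the users and activities collections, and the user names, activity texts, and helper functions are hypothetical.

```javascript
// In-memory stand-ins for the users and activities collections (hypothetical data).
const userDocs = [
  { _id: "batman", following: ["joker", "robin"] },
  { _id: "robin", following: ["batman"] },
  { _id: "alfred", following: ["batman", "robin"] },
  { _id: "joker", following: [] },
];
const activityDocs = [
  { user: "joker", text: "escaped again" },
  { user: "alfred", text: "polished the silver" },
  { user: "robin", text: "went on patrol" },
];

// Cheap direction: read one user's "following" array, then filter activities,
// comparable to the "$in" query shown in the text.
function feedFor(userId) {
  const user = userDocs.find((u) => u._id === userId);
  return activityDocs.filter((a) => user.following.includes(a.user));
}

// Expensive direction: every user document has to be inspected
// to learn who follows a given producer.
function subscribersOf(producerId) {
  return userDocs
    .filter((u) => u.following.includes(producerId))
    .map((u) => u._id);
}

const batmanFeed = feedFor("batman").map((a) => a.user);
const batmanSubscribers = subscribersOf("batman");
console.log(batmanFeed, batmanSubscribers);
```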

Another way is to embed subscribers into producer documents:

{
    "_id": ObjectId("..."),
    "username": "joker",
    "email": "joker@mailinator.com",
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

When this producer publishes something new, we know immediately which users need to be notified. The downside is that to find the list of users that a given user follows, you must query the entire users collection. The advantages and disadvantages are exactly the reverse of the first approach.

Both approaches also share another problem: they make user documents larger and more volatile. Often the “following” and “followers” fields do not even need to be returned: how often do you actually list every follower? Users who frequently follow and unfollow people can also cause a lot of fragmentation. The final option therefore normalizes further and stores the subscriptions in a separate collection, which avoids these drawbacks. This degree of normalization may seem like overkill, but it is useful for a field that changes often and does not need to be returned with the rest of the document. Normalizing the “followers” field in this way makes sense.

Use a collection to save the relationship between publishers and subscribers, where the document structure may look like this:

{
    "_id": ObjectId("..."), // "_id" of the user being followed
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

This keeps user documents lean, but it takes an extra query to get the follower list. Because the size of the “followers” array changes often, you can enable “usePowerOf2Sizes” on this collection while keeping the users collection as small as possible. If you keep the followers collection in another database, you can also compact it without affecting the users collection too much.

Dealing with the Wil Wheaton effect

Whichever strategy you use, embedding only works well with a limited number of subdocuments or references. For very famous users, the document that stores the follower list may overflow. One way to compensate is to use “continuation” documents where necessary. For example:

> db.users.find({"username": "wil"})
{
    "_id": ObjectId("..."),
    "username": "wil",
    "email": "wil@example.com",
    "tbc": [
        ObjectId("123"),    // just for example
        ObjectId("456")     // same as above
    ],
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("123"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}
{
    "_id": ObjectId("456"),
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ...
    ]
}

In this case, the application needs logic to fetch the remaining data through the “tbc” (“to be continued”) array.
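
That application-side logic might look roughly like the following JavaScript sketch. A Map keyed by “_id” stands in for the collection, and the short string ids, follower names, and the helper function are assumptions for illustration.

```javascript
// In-memory stand-in for the collection, keyed by "_id" (hypothetical data).
const followerDocs = new Map([
  ["wil", { _id: "wil", username: "wil", tbc: ["123", "456"], followers: ["a", "b"] }],
  ["123", { _id: "123", followers: ["c", "d"] }],
  ["456", { _id: "456", followers: ["e"] }],
]);

// Collect the followers stored in the main document, then walk the "tbc"
// array and append the followers held in each continuation document.
function allFollowers(userId) {
  const user = followerDocs.get(userId);
  const followers = [...user.followers];
  for (const continuationId of user.tbc || []) {
    const continuation = followerDocs.get(continuationId);
    followers.push(...continuation.followers);
  }
  return followers;
}

const wilFollowers = allFollowers("wil");
console.log(wilFollowers);
```

Each entry in “tbc” costs one extra lookup, so this only pays off for the small number of users whose follower list would not fit in a single document.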

A final word

There is no silver bullet.