Mongodb de-weighting

  mongodb, question

Today’s data are captured by crawlers. Now some data are duplicated.
Then now I want to ask how to do it to get heavier.
I think that as long as we can find the corresponding name is ok
For example. I now have a community_name field.
I would like to inquire about the name list with community_name repeating more than once.
How should I inquire?
Thank you
Document format:

{
 "_id" : ObjectId("5732e6f884e079abfa783703"),
 "buildings_num" : "4",
 "community_name": "Jianghe City",
 "address": "Xin 'an Jiang Yang 'an New Town, with Yang 'an Avenue in the south and binjiang road in the north",
 "lat" : "29.511485",
 "building_year": "2014 Completed in 2014",
 "lng" : " 119.329673",
 "house_num" : 224,
 "id" : 84453,
 "category": "Jiande Business Circle",
 "city": "Hangzhou",
 "lj_id" : "187467387072819",
 "area": "Jiande",
 "average_price" : 8408,
 "property_cost": "2 yuan/m2/month",
 "property_company": "Gold Housekeeper",
 "volume_rate" : "1.98",
 "greening_rate" : 0.33,
 "developers": "Hangzhou Harmonious Real Estate Co., Ltd."
 }

See what you mean to achieve RDBMS similar to

SELECT community_name, COUNT(*)
 FROM table
 GROUP BY community_name
 HAVING COUNT(*) > 1

I don’t know if I understand it correctly. If so, the corresponding method should be to useaggregation framework.

db.coll.aggregate([
 {$ group: {_ id: "$ community_name", count: {$ sum: 1}}},//count the number of occurrences of community _ name
 {$match: {count: {$gt: 1}}} // find records that repeat more than once
 ]);

This query can get faster results with the following indexes:

db.coll.createIndex({community_name: 1});

But even so, the query will traverse all records, not too fast.
In fact, it is rather wasteful to count all the records every time. It is better to make a certain cache after obtaining the results. How to cache depends on how you want to use the counted data.
A better way is to make a judgment before inserting, if the same already exists.community_nameThe record, such as

db.community_name_stat.update({
 community_name: 'xxx'
 }, {
 '$set': {
 count: {'$inc': 1}
 },
 '$setOnInsert': {
 community_name: 'xxx',
 count: 1
 }
 }, {
 upsert: true
 });

In this way, one can be directly obtainedcommunity_name_statSet to get eachcommunity_nameHow many times have they appeared? Of course, the final decision depends on your needs. MongoDB is a very flexible thing, which is also one of its important characteristics different from relational databases. Understanding its various functions and customizing a cost-effective solution for your needs is one of the greatest challenges in using MongoDB.