Today’s data are captured by crawlers. Now some data are duplicated.
Then now I want to ask how to do it to get heavier.
I think that as long as we can find the corresponding name is ok
For example. I now have a community_name field.
I would like to inquire about the name list with community_name repeating more than once.
How should I inquire?
Thank you
Document format:
{
"_id" : ObjectId("5732e6f884e079abfa783703"),
"buildings_num" : "4",
"community_name": "Jianghe City",
"address": "Xin 'an Jiang Yang 'an New Town, with Yang 'an Avenue in the south and binjiang road in the north",
"lat" : "29.511485",
"building_year": "2014 Completed in 2014",
"lng" : " 119.329673",
"house_num" : 224,
"id" : 84453,
"category": "Jiande Business Circle",
"city": "Hangzhou",
"lj_id" : "187467387072819",
"area": "Jiande",
"average_price" : 8408,
"property_cost": "2 yuan/m2/month",
"property_company": "Gold Housekeeper",
"volume_rate" : "1.98",
"greening_rate" : 0.33,
"developers": "Hangzhou Harmonious Real Estate Co., Ltd."
}
See what you mean to achieve RDBMS similar to
SELECT community_name, COUNT(*) FROM table GROUP BY community_name HAVING COUNT(*) > 1
I don’t know if I understand it correctly. If so, the corresponding method should be to useaggregation framework.
db.coll.aggregate([ {$ group: {_ id: "$ community_name", count: {$ sum: 1}}},//count the number of occurrences of community _ name {$match: {count: {$gt: 1}}} // find records that repeat more than once ]);
This query can get faster results with the following indexes:
db.coll.createIndex({community_name: 1});
But even so, the query will traverse all records, not too fast.
In fact, it is rather wasteful to count all the records every time. It is better to make a certain cache after obtaining the results. How to cache depends on how you want to use the counted data.
A better way is to make a judgment before inserting, if the same already exists.community_name
The record, such asdb.community_name_stat.update({ community_name: 'xxx' }, { '$set': { count: {'$inc': 1} }, '$setOnInsert': { community_name: 'xxx', count: 1 } }, { upsert: true });
In this way, one can be directly obtained
community_name_stat
Set to get eachcommunity_name
How many times have they appeared? Of course, the final decision depends on your needs. MongoDB is a very flexible thing, which is also one of its important characteristics different from relational databases. Understanding its various functions and customizing a cost-effective solution for your needs is one of the greatest challenges in using MongoDB.