Asking a DBA to help me look into MongoDB's high-load problem

Tags: mongodb, question

Gurus, I recently took over a new system at my company, and it is a huge pit: a monitoring system backed by MongoDB that watches the interfaces across the whole company and receives the data reported by each business line.
Overall the reported QPS is no higher than 100, yet mongod has been sitting at 98–99.5% CPU with a frightening load, and has been for a long time!

top - 17:33:14 up 204 days, 14:37,  2 users,  load average: 89.99, 96.20, 100.96
 Tasks: 389 total,   1 running, 386 sleeping,   0 stopped,   2 zombie
 Cpu(s): 94.3%us,  3.3%sy,  0.0%ni,  2.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
 Mem:  65855664k total, 60083264k used,  5772400k free,   169764k buffers
 Swap:  8388600k total,        0k used,  8388600k free, 23005656k cached
 
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 59181 root      20   0  114g  33g 8264 S 2348.1 53.5  34480,54 mongod

I don't know how the previous team deployed this either. I'm not familiar with MongoDB and can't find the root cause for now; my only guess is that too many slow queries are creating the pressure. On top of that, the version of mongostat we run online does not have the idx miss% field, which makes it hard to judge whether the indexes are well built.
Below is some data:
1. mongostat output. It shows both reads and writes. The writes are light and not queued, but the reads are badly backed up: the ar column (active readers) stays pinned at roughly 120–195.

insert  query update delete getmore command faults                locked db     qr|qw   ar|aw  netIn netOut  conn             set repl       time
 57    218      3     *0      59   108|0      0 monitor_1120_minute:0.0%       0|0   194|0   984k     1m   318 replica_monitor  PRI   17:17:37
 26    135      4     *0      29    88|0      0 monitor_1016_minute:0.0%       0|0   192|0   453k   778k   320 replica_monitor  PRI   17:17:38
 62    173      4     *0      66   135|0      0 monitor_1008_minute:0.0%       0|0   195|0   535k   829k   318 replica_monitor  PRI   17:17:39
 48    168      1     *0      44   175|0      0 monitor_1261_minute:0.0%       0|0   192|0     4m     5m   316 replica_monitor  PRI   17:17:41
 32    797      2     *0      34   212|0      0 monitor_1204_minute:0.0%       0|0   185|0   354k   654k   312 replica_monitor  PRI   17:17:42
 25    452      7     *0      29   101|0      0 monitor_1204_minute:0.0%       0|0   185|0   182k   608k   311 replica_monitor  PRI   17:17:43
 14    124      5     *0      14   104|0      0 monitor_1005_minute:0.0%       0|0   177|0   123k   521k   299 replica_monitor  PRI   17:17:44
 22    103      3     *0      27    64|0      0 monitor_1197_minute:0.0%       0|0   163|0    80k   334k   287 replica_monitor  PRI   17:17:45
 20    110      1     *0      21    93|0      0 monitor_1001_minute:0.0%       0|0   156|0    82k   504k   281 replica_monitor  PRI   17:17:46
 14    107      3     *0      16   104|0      0 monitor_1056_minute:0.0%       0|0   154|0   114k   474k   278 replica_monitor  PRI   17:17:47
 insert  query update delete getmore command faults                locked db     qr|qw   ar|aw  netIn netOut  conn             set repl       time
 21    127      5     *0      23    93|0      0 monitor_1022_minute:0.0%       0|0   143|0    83k   425k   267 replica_monitor  PRI   17:17:48
 20    112      6     *0      25   110|0      0 monitor_1131_minute:0.0%       0|0   139|0    91k   523k   261 replica_monitor  PRI   17:17:49
 15     95      1     *0       6    91|0      0 monitor_1036_minute:0.0%       0|0   130|0    54k   322k   252 replica_monitor  PRI   17:17:51
 16    110      7     *0      23   118|0      0 monitor_1113_minute:0.0%       0|0   131|0   152k   804k   257 replica_monitor  PRI   17:17:52
 18    115      2     *0      20   131|0      0 monitor_1125_minute:0.0%       0|0   130|0    72k   375k   254 replica_monitor  PRI   17:17:53
 22     96      2     *0      19    75|0      0 monitor_1316_minute:0.0%       0|0   117|0    61k   323k   236 replica_monitor  PRI   17:17:54
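To quantify the read queueing rather than eyeball it, the ar value can be pulled out of each data row. This is a rough sketch only: it assumes the 16-column layout in the sample above (mongostat's columns vary across versions, and the `parseMongostat` helper name is my own):

```javascript
// Extract the number of active readers (the "ar" half of the ar|aw column)
// from one mongostat data row, assuming the 16-column layout shown above:
// insert query update delete getmore command faults locked-db qr|qw ar|aw ...
function activeReaders(row) {
  var cols = row.trim().split(/\s+/);
  return parseInt(cols[9].split("|")[0], 10); // ar|aw is the 10th column
}

// Worst-case active readers over a batch of captured rows.
function maxActiveReaders(rows) {
  return Math.max.apply(null, rows.map(activeReaders));
}
```

Run against the rows above, this would report active readers peaking near 195 while qr|qw stays at 0|0, i.e. the reads are all admitted but very slow to finish.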

2. This is a simplified currentOp call that prints the fields item.op, item.secs_running, item.client, item.desc, and item.ns for each in-progress operation. You can see that many queries run for a long time. 10.1.16.223 is the local machine and 10.1.16.28 is a second machine. Only operations that have been running for more than 1 second are printed.

replica_monitor:PRIMARY> db.currentOp().inprog.forEach(function(item){ if (item.secs_running > 1) { print(item.op, item.secs_running, item.client, item.desc, item.ns); } })
 query 2 10.1.16.28:55143 conn533341052 monitor_1219_minute.diy_10_1_137_186
 query 4 10.1.16.223:13316 conn533340660 monitor_1093_minute.col_server
 query 4 10.1.16.223:13367 conn533340690 monitor_1178_minute.col_server
 query 2 10.1.16.223:13553 conn533340935 monitor_1226_minute.diy_10_1_1_227
 query 2 10.1.16.28:55254 conn533341125 monitor_1261_minute.diy_10_1_136_199
 query 5 10.1.16.223:13034 conn533340328 monitor_1131_minute.col_10_1_137_196
 query 4 10.1.16.223:13345 conn533340676 monitor_1146_minute.col_server
 query 2 10.1.16.28:54989 conn533340916 monitor_1075_minute.col_server
 query 7 10.1.16.28:53040 conn533339313 monitor_1056_minute.col_10_1_2_134
 query 4 10.1.16.223:13320 conn533340663 monitor_1017_minute.col_server
 query 2 10.1.16.223:13824 conn533341185 monitor_1131_minute.col_10_1_115_129
 query 2 10.1.16.223:13579 conn533340952 monitor_1237_minute.diy_10_1_18_33
 query 5 10.1.16.28:53729 conn533339516 monitor_1434_minute.col_10_1_112_37
 query 3 10.1.16.28:54891 conn533340771 monitor_1209_minute.col_10_1_17_123
 query 4 10.1.16.223:13364 conn533340687 monitor_1169_minute.col_server
 query 2 10.1.16.223:13741 conn533341103 monitor_1271_minute.col_10_1_16_109
 query 5 10.1.16.28:53426 conn533339973 monitor_1131_minute.col_10_1_137_196
 query 3 10.1.16.28:54987 conn533340914 monitor_1013_minute.col_server
 query 3 10.1.16.28:53490 conn533339992 monitor_1342_minute.col_10_1_113_35
 query 5 10.1.16.28:53745 conn533340486 monitor_1446_minute.col_10_1_3_61
 query 3 10.1.16.28:54885 conn533340768 monitor_1204_minute.col_10_1_114_102
 query 4 10.1.16.223:13359 conn533340682 monitor_1160_minute.col_server
 query 3 10.1.16.28:54984 conn533340911 monitor_1003_minute.col_server
 query 2 10.1.16.223:13732 conn533341096 monitor_1261_minute.col_10_1_114_102
 query 3 10.1.16.28:54973 conn533340900 monitor_1113_minute.col_server
 query 4 10.1.16.223:13165 conn533340559 monitor_1367_minute.col_10_1_137_67
 query 3 10.1.16.28:54979 conn533340906 monitor_1004_minute.col_server
 query 4 10.1.16.223:13350 conn533340679 monitor_1139_minute.col_server
 query 3 10.1.16.28:54971 conn533340898 monitor_1120_minute.col_server
 query 4 10.1.16.223:13311 conn533340655 monitor_1140_minute.col_server
 query 2 10.1.16.28:55039 conn533340980 monitor_1169_minute.diy_10_1_19_99
 query 8 10.1.16.223:12862 conn533340167 monitor_1204_minute.col_10_1_114_105
 query 3 10.1.16.28:53129 conn533339357 monitor_1200_minute.col_10_1_137_144
 query 3 10.1.16.223:13224 conn533340585 monitor_1185_minute.col_10_1_137_117
 query 3 10.1.16.223:13067 conn533340351 monitor_1339_minute.col_10_1_168_182
 query 4 10.1.16.223:13310 conn533340654 monitor_1120_minute.col_server
 query 3 10.1.16.28:54983 conn533340910 monitor_1136_minute.col_server
 query 4 10.1.16.223:13326 conn533340667 monitor_1003_minute.col_server
 query 3 10.1.16.28:53178 conn533339383 monitor_1226_minute.diy_10_1_18_119
 query 3 10.1.16.28:54969 conn533340896 monitor_1036_minute.col_server
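For repeated checks, the one-liner above can be wrapped into a small helper. The sketch below uses only plain JavaScript on the inprog array (the `slowOps` and `formatOp` names are my own; the threshold of 1 second and the printed fields match the output above):

```javascript
// Format one in-progress operation the same way as the one-liner above.
function formatOp(item) {
  return [item.op, item.secs_running, item.client, item.desc, item.ns].join(" ");
}

// Return the operations running longer than `seconds`, slowest first.
function slowOps(inprog, seconds) {
  return inprog
    .filter(function (item) { return item.secs_running > seconds; })
    .sort(function (a, b) { return b.secs_running - a.secs_running; })
    .map(formatOp);
}

// In the mongo shell:
//   slowOps(db.currentOp().inprog, 1).forEach(function (line) { print(line); });
```

Sorting slowest-first makes the worst offenders (the 7–8 second queries above) jump to the top of the list.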

I'm not familiar with Mongo either, and I don't know where to start pinpointing the problem. Gurus, any guidance would be appreciated.

The reason the CPU won't come down is that the queries are too slow. Turn on the slow query log (the profiler) to see which collections are hit and which fields appear in the query criteria, then add indexes on those fields.
mongotop shows the read and write time per collection; the collections that dominate the time are basically the ones that need indexes.
Note that when adding an index in the foreground, the collection is blocked and incoming data may be lost.
MongoDB's official site has an ops platform (MMS) that can be tried out; it provides many metric views, which is convenient.
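To act on this advice, a starting point is to enable the profiler and then summarize the slow-query log per collection to decide where to add indexes first. The shell commands are shown as comments; the `slowestByNamespace` helper is my own sketch and relies only on the standard `ns` and `millis` fields of system.profile documents:

```javascript
// In the mongo shell, log operations slower than 100 ms:
//   db.setProfilingLevel(1, 100)
// Then pull recent slow operations:
//   var slow = db.system.profile.find().sort({ ts: -1 }).limit(1000).toArray();

// Group profile documents by namespace and total their execution time,
// so the most expensive collections float to the top.
function slowestByNamespace(profileDocs) {
  var totals = {};
  profileDocs.forEach(function (doc) {
    totals[doc.ns] = (totals[doc.ns] || 0) + doc.millis;
  });
  return Object.keys(totals)
    .map(function (ns) { return { ns: ns, totalMillis: totals[ns] }; })
    .sort(function (a, b) { return b.totalMillis - a.totalMillis; });
}

// Usage in the shell: slowestByNamespace(slow)[0] is the collection to index first.
```

Note that with per-minute databases like monitor_1131_minute here, the profiler has to be enabled per database, so it may be worth starting with the databases that appear most often in the currentOp output.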
Here is a link: http://www.ttlsa.com/mms/mms- …