Hi experts, I recently took over a new service at my company and it's a huge pit: a MongoDB-backed monitoring system that watches the interfaces across the company's whole system and receives the data each business reports.
Overall the reporting QPS is no higher than 100, yet mongod sits at 98~99.5% CPU with a frightening load average, and it has stayed that way for a long time!
top - 17:33:14 up 204 days, 14:37, 2 users, load average: 89.99, 96.20, 100.96
Tasks: 389 total, 1 running, 386 sleeping, 0 stopped, 2 zombie
Cpu(s): 94.3%us, 3.3%sy, 0.0%ni, 2.1%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 65855664k total, 60083264k used, 5772400k free, 169764k buffers
Swap: 8388600k total, 0k used, 8388600k free, 23005656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
59181 root 20 0 114g 33g 8264 S 2348.1 53.5 34480,54 mongod
I don't know how the previous team deployed this either. I'm not familiar with MongoDB and can't locate the problem for now; my only guess is that too many slow queries are putting the server under heavy pressure. On top of that, the mongostat on our online version has no idx miss% field, which makes it hard to judge whether the indexes are built properly.
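The only workaround I can think of is to run explain() on one of the queries and see whether it uses an index. Something like this, where col_server is one of our real collections but the {ts: ...} condition is just a placeholder, since I don't know the real query shapes yet:
replica_monitor:PRIMARY> db.col_server.find({ts: {$gte: 1450000000}}).explain()
If the output shows a BasicCursor (or a COLLSCAN stage on newer versions), the query scans the whole collection without any index; a BtreeCursor means an index is used, though a big gap between nscanned and n still points at a poor index.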
Below is some of the data I collected:
1. mongostat data. You can see from this that there are both reads and writes. The writes show almost no queuing or blocking, but reads are badly backed up: the ar (active readers) column hovers between about 120 and 200 the whole time.
insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
57 218 3 *0 59 108|0 0 monitor_1120_minute:0.0% 0|0 194|0 984k 1m 318 replica_monitor PRI 17:17:37
26 135 4 *0 29 88|0 0 monitor_1016_minute:0.0% 0|0 192|0 453k 778k 320 replica_monitor PRI 17:17:38
62 173 4 *0 66 135|0 0 monitor_1008_minute:0.0% 0|0 195|0 535k 829k 318 replica_monitor PRI 17:17:39
48 168 1 *0 44 175|0 0 monitor_1261_minute:0.0% 0|0 192|0 4m 5m 316 replica_monitor PRI 17:17:41
32 797 2 *0 34 212|0 0 monitor_1204_minute:0.0% 0|0 185|0 354k 654k 312 replica_monitor PRI 17:17:42
25 452 7 *0 29 101|0 0 monitor_1204_minute:0.0% 0|0 185|0 182k 608k 311 replica_monitor PRI 17:17:43
14 124 5 *0 14 104|0 0 monitor_1005_minute:0.0% 0|0 177|0 123k 521k 299 replica_monitor PRI 17:17:44
22 103 3 *0 27 64|0 0 monitor_1197_minute:0.0% 0|0 163|0 80k 334k 287 replica_monitor PRI 17:17:45
20 110 1 *0 21 93|0 0 monitor_1001_minute:0.0% 0|0 156|0 82k 504k 281 replica_monitor PRI 17:17:46
14 107 3 *0 16 104|0 0 monitor_1056_minute:0.0% 0|0 154|0 114k 474k 278 replica_monitor PRI 17:17:47
insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
21 127 5 *0 23 93|0 0 monitor_1022_minute:0.0% 0|0 143|0 83k 425k 267 replica_monitor PRI 17:17:48
20 112 6 *0 25 110|0 0 monitor_1131_minute:0.0% 0|0 139|0 91k 523k 261 replica_monitor PRI 17:17:49
15 95 1 *0 6 91|0 0 monitor_1036_minute:0.0% 0|0 130|0 54k 322k 252 replica_monitor PRI 17:17:51
16 110 7 *0 23 118|0 0 monitor_1113_minute:0.0% 0|0 131|0 152k 804k 257 replica_monitor PRI 17:17:52
18 115 2 *0 20 131|0 0 monitor_1125_minute:0.0% 0|0 130|0 72k 375k 254 replica_monitor PRI 17:17:53
22 96 2 *0 19 75|0 0 monitor_1316_minute:0.0% 0|0 117|0 61k 323k 236 replica_monitor PRI 17:17:54
2. A trimmed-down run of the currentOp command, printing the fields item.op, item.secs_running, item.client, item.desc, and item.ns for each in-progress operation. You can see that many queries take a long time. 10.1.16.223 is the local machine and 10.1.16.28 is the secondary. Only queries that have been running for more than 1 second are printed:
replica_monitor:PRIMARY> db.currentOp().inprog.forEach(function(item){if(item.secs_running>1){print(item.op,item.secs_running,item.client,item.desc,item.ns); }})
query 2 10.1.16.28:55143 conn533341052 monitor_1219_minute.diy_10_1_137_186
query 4 10.1.16.223:13316 conn533340660 monitor_1093_minute.col_server
query 4 10.1.16.223:13367 conn533340690 monitor_1178_minute.col_server
query 2 10.1.16.223:13553 conn533340935 monitor_1226_minute.diy_10_1_1_227
query 2 10.1.16.28:55254 conn533341125 monitor_1261_minute.diy_10_1_136_199
query 5 10.1.16.223:13034 conn533340328 monitor_1131_minute.col_10_1_137_196
query 4 10.1.16.223:13345 conn533340676 monitor_1146_minute.col_server
query 2 10.1.16.28:54989 conn533340916 monitor_1075_minute.col_server
query 7 10.1.16.28:53040 conn533339313 monitor_1056_minute.col_10_1_2_134
query 4 10.1.16.223:13320 conn533340663 monitor_1017_minute.col_server
query 2 10.1.16.223:13824 conn533341185 monitor_1131_minute.col_10_1_115_129
query 2 10.1.16.223:13579 conn533340952 monitor_1237_minute.diy_10_1_18_33
query 5 10.1.16.28:53729 conn533339516 monitor_1434_minute.col_10_1_112_37
query 3 10.1.16.28:54891 conn533340771 monitor_1209_minute.col_10_1_17_123
query 4 10.1.16.223:13364 conn533340687 monitor_1169_minute.col_server
query 2 10.1.16.223:13741 conn533341103 monitor_1271_minute.col_10_1_16_109
query 5 10.1.16.28:53426 conn533339973 monitor_1131_minute.col_10_1_137_196
query 3 10.1.16.28:54987 conn533340914 monitor_1013_minute.col_server
query 3 10.1.16.28:53490 conn533339992 monitor_1342_minute.col_10_1_113_35
query 5 10.1.16.28:53745 conn533340486 monitor_1446_minute.col_10_1_3_61
query 3 10.1.16.28:54885 conn533340768 monitor_1204_minute.col_10_1_114_102
query 4 10.1.16.223:13359 conn533340682 monitor_1160_minute.col_server
query 3 10.1.16.28:54984 conn533340911 monitor_1003_minute.col_server
query 2 10.1.16.223:13732 conn533341096 monitor_1261_minute.col_10_1_114_102
query 3 10.1.16.28:54973 conn533340900 monitor_1113_minute.col_server
query 4 10.1.16.223:13165 conn533340559 monitor_1367_minute.col_10_1_137_67
query 3 10.1.16.28:54979 conn533340906 monitor_1004_minute.col_server
query 4 10.1.16.223:13350 conn533340679 monitor_1139_minute.col_server
query 3 10.1.16.28:54971 conn533340898 monitor_1120_minute.col_server
query 4 10.1.16.223:13311 conn533340655 monitor_1140_minute.col_server
query 2 10.1.16.28:55039 conn533340980 monitor_1169_minute.diy_10_1_19_99
query 8 10.1.16.223:12862 conn533340167 monitor_1204_minute.col_10_1_114_105
query 3 10.1.16.28:53129 conn533339357 monitor_1200_minute.col_10_1_137_144
query 3 10.1.16.223:13224 conn533340585 monitor_1185_minute.col_10_1_137_117
query 3 10.1.16.223:13067 conn533340351 monitor_1339_minute.col_10_1_168_182
query 4 10.1.16.223:13310 conn533340654 monitor_1120_minute.col_server
query 3 10.1.16.28:54983 conn533340910 monitor_1136_minute.col_server
query 4 10.1.16.223:13326 conn533340667 monitor_1003_minute.col_server
query 3 10.1.16.28:53178 conn533339383 monitor_1226_minute.diy_10_1_18_119
query 3 10.1.16.28:54969 conn533340896 monitor_1036_minute.col_server
I'm not familiar with MongoDB either and have no idea where to start pinpointing the problem, so I'm asking the experts here for guidance.
The CPU can't come down because the queries are too slow. Turn on the slow query log (the profiler) to see which collections are being queried and which fields appear in the query criteria, then build indexes on those criteria.
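For example, something along these lines in the shell. The 100 ms threshold is just a starting point, and note that the profiler is per database, so repeat this for each busy one (monitor_1131_minute is taken from the mongostat output above):
replica_monitor:PRIMARY> use monitor_1131_minute
replica_monitor:PRIMARY> db.setProfilingLevel(1, 100)
replica_monitor:PRIMARY> db.system.profile.find().sort({millis: -1}).limit(5).pretty()
Level 1 records only operations slower than the threshold; the last line lists the five slowest captured operations, and their ns and query fields tell you which collection and which fields to index.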
mongotop shows how much read and write time each collection consumes; the collections that dominate the time are basically the ones that need indexes.
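For example, sampling every 5 seconds:
$ mongotop 5
Each row shows the total, read, and write time spent per collection over the interval, so the hot collections stand out quickly.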
Note that a foreground index build blocks the collection, so reported data can be lost while it runs; build the index in the background instead.
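A sketch of a background build; col_server is a real collection from the currentOp output above, but the ts field is only a guess, so take the actual fields from the slow query log:
replica_monitor:PRIMARY> db.col_server.ensureIndex({ts: 1}, {background: true})
A background build takes longer, but the collection stays available for reads and writes on the primary while it runs.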
MongoDB's official website has an ops platform (MMS) that you can try out. It comes with lots of metric dashboards, which is much more convenient.
Here's a link: http://www.ttlsa.com/mms/mms- …