Introduction: Without the core system source code, the time-consuming method of modifying the source code printing cannot be used. Through tcpdump, wireshark, gdb, 010 editor, flame diagram, ida, database sql time-consuming statement, oracle ash report, loadrunner and other tools, the problems that the server tps cannot go up and the C program process hangs randomly were found out, and solved smoothly with a lot of gains.
The company recently launched a new system with the following main structure:
After the test environment system was deployed, two problems occurred:
1.loadrunner can’t load tps, and the single machine of java interface tps can only load more than 100 tps, which increases the time consumption from 100ms for a single visit to about 1s for 110 concurrent visits.
2. The C server 1 is often hung up from time to time during pressure measurement.
For some reasons, the program related to this project C has no source code, only installation and deployment documents. In order to solve the above two problems, several of our colleagues and our colleagues in Chongqing participated in the problem investigation and resolution together. Because there is no source code, the middle has experienced many twists and turns, after a month of efforts, finally solved the above two problems, the whole investigation process learned a lot of knowledge.
Analysis tools used
5. Flame diagram,
7. The database captures sql time-consuming statements.
8.oracle ash reports,
A few words summary
1. There is a performance bottleneck in C program client socket long connection calling C server. After analyzing the transmission protocol by tcpdump,wireshark binary, java is used to call C server. Single tps is doubled and performance is increased by 3 times
2. The existence of for update statement in database statement leads to concurrency failure. After analysis, sequence is adopted to replace for update statement in business, and the relevant logic of binary modification for update statement is directly modified to sequence through 010 editor. The system has the ability to synchronously upgrade tps of expansion server.
3. There is a bottleneck when the database insert statements are concurrent. The problem is solved by expanding the size of the oracle redo log log, and tps40% is continuously improved.
4. The process of the program hangs up randomly. Through gdb analysis of the core dump file, it is found that there is a problem with gethostbyname used in the program under the concurrent situation, and a temporary solution is adopted.
1. First Bottleneck Location
At the beginning of troubleshooting, loadrunner pressed the java interface. When the number of concurrent users gradually increased from 0 to 110, tps would no longer increase to about 100, and the response time would increase from 100ms to 1 s. At this time, our analysis focuses on who is the current main bottleneck
Looking at the architecture diagram again, all five nodes in the diagram may be bottleneck points. At this time, we grab the time-consuming sql through dba management authority of the database. If we don’t grab it, we first eliminate the database problem. In java, we print step-by-step time-consuming logs, and locate jni to call C client, which takes the most time. At this time, the bottleneck point is initially located at the three nodes of C client, C server 1 and C server 2.
Because there is no source code, we use tcpdump to grab the package and analyze it on c server 1.
tcpdump -i eth0 -s 0 -w aa.txt host java客户端ip
Wireshark analysis was used to analyze the captured package.
The time consumption of the server through tracking flow -TCP flow analysis has not changed too much, because the C client and the C server are long connections, and multiple requests may share one connection, so the data analyzed at this time may not be accurate, so we use loadrunner pressure measurement, other conditions remain unchanged, one C server 1 and two C servers 1 respectively check the time consumption changes.
Other conditions remain unchanged, one java server and two java servers respectively check the time-consuming changes.
The final location is the problem of C client. (ps: When wireshark analyzed the data, he also understood the tcp delay acknowledgement with Master Qin Di)
2. transform client c
C client and C server communicate through a long connection. It is very difficult to directly transform C code. So we are going to replace C with java. This requires analyzing what protocol is used when communicating between C and rewriting it with java according to this protocol. Based on previous experience, we have deduced the coding protocol, and wireshark analysis is used to analyze the binary to confirm that this is indeed the coding.
According to this protocol coding, after java rewriting, tps was increased to 210 (twice), and the time consumption was reduced to 330ms (one third of the original) under the same condition of 110 concurrent users.
3. Second bottleneck positioning.
After the second optimization step, tps has increased by two times, but at this time, if tomcat is expanded and C server is expanded, tps will remain at about 210 and will not increase. Therefore, we continue to locate new bottlenecks. At this time, looking for dba needs a statement to view oracle’s time-consuming sql in real time.
select (select b.SQL_TEXT from v$sqlarea b where b.SQL_ID=a.SQL_ID ) sqltxt, (select c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=a.SQL_ID ) sqlfulltxt, a.username, a.LAST_CALL_ET,a.MACHINE ,a.command, a.EVENT, a.SQL_ID ,a.SID,a.SERIAL#, 'alter system kill session ''' || a.SID ||','||a.SERIAL# ||''';' as killstment from v$session a where a.STATUS = 'ACTIVE' and a.USERNAME not in ('SYS', 'SYSTEM') order by a.LAST_CALL_ET desc ,a.username,a.MACHINE ,a.command, a.EVENT, a.SQL_ID ,a.SID;
The value of LAST_CALL_ET column of partial requests can reach 6 seconds when there is a large amount of sql concurrency for update. for update causes all requests to be executed serially, affecting concurrency. After analyzing the business logic, we temporarily replaced the for update statement with sequence, but we didn’t have the source code and couldn’t modify it. then we directly modified the binary file through 010 editor and found the for update statement through 010 editor query, which was successfully replaced.
After replacement, the tps of 4 C servers reached 580, up 2.7 times (580/210), and the system initially had the ability to scale out
4. Third bottleneck positioning.
After the previous transformation, the tps of the system increased by 2.7 times when there were 4 C servers, but it did not increase to 4 times (210*4=840). There was no linear increase, indicating that there were still other bottlenecks. It was also found through sql given by dba that insert statements occasionally took a long time. In about 1s, the EVENT waiting event was IO event. The DBA colleague modified the redo log file size (this is the test environment Oracle, which was not optimized before). From the default 50M to 1G, tps was increased to 640 (it has not been increased to 4x, which means there is still a bottleneck. It may still be the database, but because the database cannot capture millisecond-level time-consuming sql for the time being, it has not continued to track down)
After these performance improvements and our test server configuration is not high, if the online server performance is estimated to reach 1000tps, which basically meets the immediate needs, we will not continue performance optimization.
5. The program process hangs up randomly.
In the process of pressure measurement, the C server process is often hung up randomly. Through tail -f /var/log/messages, it is found that the core dump file is generated, but it is automatically deleted by the system. Dong Jian found the way to open the core dupm file, as follows:
Check whether it is 0. if it is 0, it means that the coredump file is set to 0 and needs to be modified to unlimited.
ulimit -c unlimited
ProcessUnpackaged = yes
When the modified process crashes again, the core dump file is generated and enters the core dump directory for debugging.
Gdb script path coredump
Bt displays stack information
Continue to execute the following command
set print pretty on info local //显示当前函数中的局部变量信息。
View the value of the internal variable through the p command
Found null value for variable thishost->h_addr_list
We analyze that there may be methods that are not thread safe when concurrent requests occur, resulting in this value being null, thus causing the process crash to continue debugging.
Set logging on in gdb to output debugging information to file
Thread applies allbt outputs all thread information.
grep --color -i clientconnect -C5 gdb.txt
There are indeed two threads accessing concurrently
Through the ida tool to decompile so, the following statement is finally found to have problems in concurrency, and variables in thishost may be initialized to null by another thread in an instant.
thishost = gethostbyname((const char *)hostname); ip = inet_ntoa(*(struct in_addr *)*thishost->h_addr_list);
According to the characteristics of our project, because we didn’t call remotely, c server 1 and c server 2 were both deployed on the same server, so we temporarily wrote the address to 127.0.0.1 by modifying the binary, and ip = inet_ntoa ((struct in_addr)*thishost->h_addr_list); It was changed to an empty command, and there was no system crash after redeployment.
Author’s brief introduction: yang Zhen-Yixin engineer, former micro blog feed engineer, interested in source code learning; Dong Jian-Yixin engineer, former microblog engineer, focuses on big data and highly available technologies
The original text was published in a highly available architecture.