Up to Five Nines of Availability: High-Availability Architecture Design of a Payment System

I. Background

Most Internet applications and large enterprise applications are expected to run 24/7 with as little interruption as possible, but achieving completely uninterrupted operation is extremely difficult. For this reason, application availability is generally measured on a scale from three nines to five nines.

Availability    Calculation               Unavailable time per year (minutes)
99.9%           0.1% * 365 * 24 * 60      525.6
99.99%          0.01% * 365 * 24 * 60     52.56
99.999%         0.001% * 365 * 24 * 60    5.256
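
The arithmetic in the table can be reproduced with a few lines of Java. This is only an illustrative sketch; the class and method names are made up for this example and are not part of the Yixin payment system.

```java
/** Converts an availability percentage into a yearly downtime budget. */
final class DowntimeBudget {
    // 365 days * 24 hours * 60 minutes = 525,600 minutes in a (non-leap) year
    private static final double MINUTES_PER_YEAR = 365 * 24 * 60;

    /** availabilityPercent is, for example, 99.9 for "three nines". */
    static double minutesPerYear(double availabilityPercent) {
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return unavailableFraction * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double a : new double[] {99.9, 99.99, 99.999}) {
            System.out.printf("%.3f%% -> %.3f minutes per year%n", a, minutesPerYear(a));
        }
    }
}
```

At five nines the whole yearly budget is a little over five minutes, which is why detection and handling have to happen within seconds.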

For an application whose functionality and data volume keep growing, maintaining high availability is not easy. To achieve it, the Yixin payment system has done a great deal of exploration and practice in three areas: avoiding single points of failure, keeping the application itself highly available, and handling growth in transaction volume.

Excluding unexpected failures of external dependencies, such as network problems or large-scale unavailability of third-party payment providers and banks, the service availability of the Yixin payment system can reach 99.999%.

This article focuses on how to improve the availability of the application itself. How to avoid single points of failure and how to handle transaction volume growth will be discussed in other articles in this series.

To improve the availability of an application, the first thing to do is to avoid failures as much as possible, but avoiding them completely is impossible. The Internet is a place where the “butterfly effect” occurs easily: any seemingly minor incident with a probability close to zero may happen and then be amplified without limit.

Everyone knows that RabbitMQ itself is very stable and reliable. In the beginning, the Yixin payment system ran a single RabbitMQ node, and it never had an operational failure, so everyone assumed it was unlikely to cause problems.

Then one day, the hardware of the physical host running that node failed because of age and lack of maintenance. RabbitMQ could no longer provide service, and the system became unavailable instantly.

A fault in itself is not the frightening part. What matters most is finding and resolving it in time. The Yixin payment system requires itself to detect faults within seconds and to diagnose and resolve them quickly, so as to reduce their negative impact.

II. Issues

History is a good teacher. First, let’s briefly review some of the problems the Yixin payment system has encountered:

(1) When integrating a newly connected third-party channel, a new developer, through lack of experience, overlooked the importance of setting a timeout. Because of this small detail, all transactions in that third-party channel’s queue were blocked, which also affected transactions in other channels.

(2) The Yixin payment system is deployed in a distributed manner and supports gray (canary) releases, so there are many environments and deployed modules and the setup is complex. At one point a new module was added. Because there are multiple environments and each environment has two nodes, the database ran out of connections after the new module went online, which affected the functions of other modules.

(3) Another timeout problem. Timeouts at one third party tied up all of the configured worker threads, leaving no threads to process other transactions.

(4) Third party A provides authentication, payment and other interfaces. A sudden increase in the Yixin payment system’s transaction volume on one of these interfaces triggered the DDoS protection that A had in place on the network operator side. Because the egress IP of a data center is usually fixed, the network operator mistook the traffic for an attack, and A’s authentication and payment interfaces became unavailable at the same time.

(5) Another database problem, also caused by a sudden increase in the Yixin payment system’s transaction volume. The colleague who created a certain sequence gave it an upper limit of 999,999,999, while the corresponding database field is 32 digits long. When transaction volume was small, the values generated by the system fit the 32-digit field and the sequence did not gain a digit. As transaction volume grew, the sequence quietly gained digits, until 32 digits were no longer enough to store the value.

Problems like these are very common in Internet systems and are easy to overlook, so knowing how to avoid them is very important.

III. Solutions

Below we look at the changes made to the Yixin payment system from three aspects.

3.1 Avoid failures as much as possible

3.1.1 Design of Fault Tolerant System

One example is rerouting. When a user pays, the user does not care which channel the money goes through, only whether the payment succeeds or fails. The Yixin payment system is connected to more than 30 channels, and a payment through channel A may fail. In that case the system needs to reroute the payment dynamically to channel B or channel C, so that the failure is avoided from the user’s point of view and the payment is fault tolerant.
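
A minimal sketch of this rerouting idea is shown below. The `Channel` interface, the three-valued `Outcome` and all names are assumptions made for illustration, not the actual Yixin interfaces. The sketch moves to the next channel only after a definite failure and stops on an unknown result such as a timeout, in line with the rule stated in the Q&A that a transaction whose result is unknown must not be rerouted.

```java
import java.util.List;

final class RerouteSketch {
    enum Outcome { SUCCESS, FAILED, UNKNOWN }

    /** Hypothetical abstraction over one payment channel. */
    interface Channel {
        String name();
        Outcome pay(String orderId, long amountInCents);
    }

    /** Final outcome plus the name of the last channel that was tried. */
    static final class Result {
        final Outcome outcome;
        final String channel;
        Result(Outcome outcome, String channel) {
            this.outcome = outcome;
            this.channel = channel;
        }
    }

    /** Test helper: a channel that always answers with the same outcome. */
    static Channel fixed(String name, Outcome outcome) {
        return new Channel() {
            public String name() { return name; }
            public Outcome pay(String orderId, long amountInCents) { return outcome; }
        };
    }

    static Result payWithReroute(List<Channel> channels, String orderId, long amountInCents) {
        Result last = new Result(Outcome.FAILED, null);
        for (Channel channel : channels) {
            Outcome outcome = channel.pay(orderId, amountInCents);
            last = new Result(outcome, channel.name());
            if (outcome != Outcome.FAILED) {
                // SUCCESS: done. UNKNOWN (for example a timeout): stop here,
                // because the money may already have moved on this channel.
                return last;
            }
            // Definite failure: it is safe to try the next channel.
        }
        return last;
    }
}
```

With channels A (fails) and B (succeeds), the call returns a `SUCCESS` result on channel B. If A instead returns `UNKNOWN`, the loop stops at A and the order stays open until its real status is known.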

There is also fault tolerance for OOM (out of memory), similar to what Tomcat does. A system can always run out of memory. If the application reserves some memory for itself at startup, then when an OOM occurs it can catch the error and release the reserved memory, giving itself room to recover.

3.1.2 Fail Fast at Certain Steps (the “Fail-Fast Principle”)

The fail-fast principle says that when any step of the main flow has a problem, the whole flow should be ended quickly and cleanly, instead of waiting until the negative impact has already occurred.

A few examples:

(1) When the payment system starts, it needs to load some queue information and configuration information into the cache. If loading fails or the queue configuration is wrong, request processing will fail later. The best way to handle this is to let the JVM exit immediately when loading fails, so that an instance that cannot serve requests never comes up.

(2) The maximum response time for real-time transaction processing in the payment system is 40s. If it is exceeded, the front-end system stops waiting, releases the thread, and tells the merchant that the transaction is being processed. The final result is then obtained through a notification or through an active query by the business line.

(3) The Yixin payment system uses Redis as a cache database, for functions such as real-time alarm instrumentation and duplicate checking. If connecting to Redis takes longer than 50ms, the Redis operation is abandoned automatically. In the worst case, the impact of such an operation on a payment is 50ms, which is within the range the system allows.
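
The time-boxing in examples (2) and (3) can be sketched with the standard `java.util.concurrent` API. This is an illustrative pattern, not the Yixin implementation: the class name, the pool size of 4 and the fallback value are all assumptions, and a real Redis client would normally be given its own connect and read timeouts as well.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCall {
    // Small fixed pool of daemon threads; the size 4 is an arbitrary example value.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, runnable -> {
        Thread thread = new Thread(runnable, "bounded-call");
        thread.setDaemon(true); // do not keep the JVM alive for abandoned calls
        return thread;
    });

    /** Runs the task, but gives up and returns the fallback once the time budget is spent. */
    static <T> T callOrFallback(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call and abandon it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed; the caller continues without it
        }
    }
}
```

A call such as `BoundedCall.callOrFallback(() -> lookup(key), 50, null)`, where `lookup` is a hypothetical cache read, costs the payment flow at most roughly the 50ms budget, whatever happens on the cache side.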

3.1.3 Design a system with self-protection capability

Systems generally have third-party dependencies, such as databases and third-party interfaces. When developing a system, you should remain skeptical of third parties, so that a problem on their side does not set off a chain reaction that brings the system down.

(1) Split the message queues

The Yixin payment system provides merchants with a variety of payment interfaces. Common ones include quick payment, personal online banking, corporate online banking, refund, revocation, batch payment, batch withholding, single payment, single withholding, voice payment, balance inquiry, ID card authentication, bank card authentication and card authentication. There are more than 30 corresponding payment channels, including WeChat Pay, Apple Pay and Alipay, and hundreds of merchants are connected. Across these three dimensions, the Yixin payment system splits its message queues so that different businesses, third parties, merchants and payment types do not affect one another. The following figure shows how some of the service message queues are split:

(2) Restrict the use of resources

Limiting resource usage is the most important point in designing a highly available system, and also one that is easily overlooked. Resources are finite, and overusing them naturally leads to application downtime. To this end, the Yixin payment system has done the following:

  • Limit connections

With distributed scale-out, the number of database connections has to be planned rather than maximized without limit. Database connections are finite, so all modules must be considered together, especially the increase that scale-out brings.
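
A rough way to reason about this is to multiply out the worst case before a new module goes online. The sketch below assumes that every node of every module in every environment opens its own connection pool against the same database; the class name and all figures are hypothetical.

```java
/** Back-of-the-envelope check of worst-case database connections. */
final class ConnectionBudget {
    /** Worst case: every pool on every node is fully used at the same time. */
    static int worstCaseConnections(int modules, int environments,
                                    int nodesPerEnvironment, int poolSizePerNode) {
        return modules * environments * nodesPerEnvironment * poolSizePerNode;
    }

    /** True if the worst case still leaves the reserved headroom (admin tools, jobs) free. */
    static boolean fits(int worstCase, int databaseMaxConnections, int reservedHeadroom) {
        return worstCase <= databaseMaxConnections - reservedHeadroom;
    }
}
```

For example, 10 modules in 3 environments with 2 nodes each and a pool of 20 connections per node add up to 1,200 connections in the worst case, which does not fit a database limit of 1,000. Adding an eleventh module raises the worst case by another 120.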

  • Limit memory usage

Excessive memory usage leads to frequent GC and to OOM. Memory usage mainly comes from the following two sources:

A: Collections that hold too much data;

B: Objects that are no longer needed but are never released. For example, an object placed in a ThreadLocal is not reclaimed until the thread exits.
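
Point B matters most with pooled threads, which rarely exit. A common remedy, sketched here with hypothetical names, is to call `ThreadLocal.remove()` in a `finally` block so that the value does not outlive the task that set it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalCleanup {
    private static final ThreadLocal<String> CURRENT_ORDER = new ThreadLocal<>();

    /** Handles one order and always clears the ThreadLocal afterwards. */
    static String handle(String orderId) {
        CURRENT_ORDER.set(orderId);
        try {
            return "processed " + CURRENT_ORDER.get();
        } finally {
            // Without this, the value stays reachable for as long as the pooled thread lives.
            CURRENT_ORDER.remove();
        }
    }

    /** Returns what a later task on the same pooled thread still sees in the ThreadLocal. */
    static String leftoverAfterHandling(String orderId) {
        ExecutorService single = Executors.newSingleThreadExecutor();
        try {
            Callable<String> work = () -> handle(orderId);
            Callable<String> peek = CURRENT_ORDER::get;
            single.submit(work).get();
            return single.submit(peek).get(); // runs on the same reused thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            single.shutdown();
        }
    }
}
```

Because `handle` removes the value, the second task on the reused thread sees `null` rather than the previous order.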

  • Restrict thread creation

Creating threads without limit eventually makes the thread count uncontrollable, especially when thread creation is hidden inside the code.

When the system’s sy value (CPU time spent in the kernel) is too high, Linux is spending too much time on thread switching. In Java, the main cause is that too many threads have been created and these threads constantly block (waiting for locks or IO) and change execution state, which produces a large number of context switches.

In addition, creating a thread in a Java application uses physical memory outside the JVM heap, so too many threads also consume too much physical memory.

Threads are best created through a thread pool, to avoid the context switching caused by too many threads.
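
As an illustration, a pool with a fixed number of threads and a bounded queue caps both thread count and backlog, and rejects work explicitly once both are full. The sizes below are example values, not the settings of the Yixin payment system.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPool {
    /** Fixed number of threads, bounded queue, and explicit rejection when both are full. */
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                        // core size == max size: never grows
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded backlog
                new ThreadPoolExecutor.AbortPolicy());   // throws RejectedExecutionException
    }

    /** Submits blocking tasks and counts how many the pool refuses. */
    static int countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = create(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a call that is stuck
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the stuck tasks finish
            pool.shutdown();
        }
        return rejected;
    }
}
```

With 2 threads and a queue of 3, ten stuck tasks lead to five rejections: two tasks run, three wait, and the rest are refused immediately instead of creating more threads.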

  • Restrict concurrency

Anyone who has worked on payment systems will know that some third-party payment companies impose limits on concurrency. The number of concurrent transactions a third party allows is set according to actual transaction volume, so if concurrency is not controlled and every transaction is sent straight to the third party, the third party will simply reply “please reduce the submission frequency”.

Special attention is therefore needed in both the system design phase and the code review phase to keep concurrency within the range the third party allows.
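
One simple way to enforce such a cap inside a single JVM is a `java.util.concurrent.Semaphore` per third-party channel. The sketch below is illustrative: the class name is made up, and a clustered deployment would additionally have to divide the allowed concurrency among its nodes.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Caps the number of in-flight calls to one third-party channel. */
final class ChannelConcurrencyLimiter {
    private final Semaphore permits;

    ChannelConcurrencyLimiter(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns busyValue without contacting the third party. */
    <T> T callOrReject(Supplier<T> call, T busyValue) {
        if (!permits.tryAcquire()) {
            return busyValue;
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting locally is cheaper than sending the request and receiving “please reduce the submission frequency” from the third party.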

The Yixin payment system has made three kinds of changes to achieve availability. The first, avoiding failures as much as possible, is covered above. The other two are covered next.

3.2 Timely Detection of Faults

Faults arrive without warning, like raiders slipping into a village. When the first line of defense, prevention, has been breached, the second line has to be raised in time: detecting faults to protect availability. This is where the alarm and monitoring system comes in. In a car without an instrument panel, the driver cannot know the speed, the fuel level, or whether the turn signal is on, and no matter how skilled the driver is, that is quite dangerous. In the same way, a system needs to be monitored, and ideally it raises an alarm before danger arrives, so that a problem can be solved before it causes real harm.

3.2.1 Real-time Alarm System

Without real-time alarms, uncertainty about the system’s running state can lead to disasters that cannot be quantified. The requirements for the Yixin payment system’s monitoring system are as follows:

  • Real-time: monitoring at second-level granularity;
  • Comprehensive: all system services are covered, with no blind spots;
  • Practical: warnings are divided into multiple levels, so that monitoring staff can make accurate decisions according to severity;
  • Diverse: warnings are delivered in both push and pull modes, including SMS, email and a visual interface, so that monitoring staff can find problems in time.

Alarms are mainly divided into single-machine alarms and cluster alarms; the Yixin payment system is deployed as a cluster. Real-time warning is mainly achieved through statistical analysis of real-time instrumentation data from the various business systems, so the difficulty lies mainly in the instrumentation and in the analysis system.

3.2.2 Instrumentation Data

To achieve real-time analysis without affecting the response time of the trading system, each module of the Yixin payment system writes instrumentation data to Redis in real time. The data is then aggregated into the analysis system, which analyzes it and raises alarms according to rules.

3.2.3 Analysis System

The hardest part of the analysis system is deciding on the business alarm points, for example which alarms must be escalated as soon as they appear and which only need attention. The analysis system is described in detail below:

(1) System Operation Architecture

(2) System Operation Process

(3) System Business Monitoring Points

The business monitoring points of the Yixin payment system have been accumulated bit by bit during daily operation. They fall into two major categories: the alarm type and the attention type.

A: Alarm type

  • Network anomaly warning;
  • Single-order timeout (order not completed) warning;
  • Real-time transaction success rate warning;
  • Abnormal status warning;
  • Return-file failure warning;
  • Failed notification warning;
  • Abnormal failure warning;
  • Frequent response code warning;
  • Reconciliation inconsistency warning;
  • Special status warning;

B: Attention type

  • Abnormal trading volume warning;
  • Trading volume exceeding 5 million (500W) warning;
  • SMS backfill timeout warning;
  • Illegal IP warning;

3.2.4 Non-Business Monitoring Points

Non-business monitoring points are those monitored from the operations perspective, including network, hosts, storage and logs. The details are as follows:

(1) Service Availability Monitoring

JVM information is collected, such as the number and duration of Young GC/Full GC runs, heap memory usage and the thread stacks of the top 10 most time-consuming threads, as well as the length of cache buffers.

(2) Traffic Monitoring

A monitoring agent is deployed on each server to collect traffic data in real time.

(3) External System Monitoring

The stability of third parties and of the network is observed through periodic probing.

(4) Middleware Monitoring

  • For MQ consumer queues, queue depth is analyzed in real time by RabbitMQ detection scripts.
  • For the database, the xdb plug-in is installed to monitor database performance in real time.

(5) Real-time log monitoring

Distributed logs are collected through rsyslog and then monitored and analyzed in real time by the analysis system. Finally, the results are shown to users on purpose-built visual pages.

(6) System Resource Monitoring

Zabbix is used to monitor CPU load, memory utilization, inbound and outbound traffic of each network card, read/write rate and read/write operations per second (IOPS) of each disk, space utilization of each disk, and so on.

The above is what the Yixin payment system’s real-time monitoring does. It is divided into two parts: business monitoring and operations monitoring. Although the system is distributed, every warning point responds at second level. There is one more difficulty with the alarm points of the business system: some events do not necessarily indicate a problem when reported in small numbers, but do indicate one when reported in large numbers. In other words, quantitative change leads to qualitative change.

Take network anomalies as an example. A single occurrence may be network jitter, but repeated occurrences call for checking whether the network really has a problem. The Yixin payment system’s alarm rules for network anomalies are as follows:

  • Single-channel network anomaly warning: 12 consecutive network anomalies on channel A within 1 minute trigger the warning threshold;
  • Multi-channel network anomaly warning 1: within 10 minutes, 3 network anomalies in each minute, involving 3 channels, trigger the warning threshold;
  • Multi-channel network anomaly warning 2: within 10 minutes, a total of 25 network anomalies involving 3 channels trigger the warning threshold.
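
The single-channel rule can be sketched as a sliding-window counter. The class below is a simplified, single-process illustration with hypothetical names; it models “consecutive” by clearing the window whenever a successful call is seen, and the caller supplies the timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Fires when `threshold` consecutive anomalies fall inside one time window. */
final class SlidingWindowAlarm {
    private final int threshold;
    private final long windowMillis;
    private final Deque<Long> anomalyTimes = new ArrayDeque<>();

    SlidingWindowAlarm(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    /** Records one anomaly; returns true if the warning threshold is reached. */
    synchronized boolean recordAnomaly(long nowMillis) {
        anomalyTimes.addLast(nowMillis);
        // Drop anomalies that have fallen out of the window.
        while (!anomalyTimes.isEmpty() && nowMillis - anomalyTimes.peekFirst() >= windowMillis) {
            anomalyTimes.removeFirst();
        }
        return anomalyTimes.size() >= threshold;
    }

    /** A successful call breaks the run of consecutive anomalies. */
    synchronized void recordSuccess() {
        anomalyTimes.clear();
    }
}
```

With `new SlidingWindowAlarm(12, 60_000)`, one isolated anomaly (network jitter) never fires, while twelve anomalies in a row inside one minute do. The multi-channel rules would need one such counter per channel plus an aggregate across channels.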

3.2.5 Logging and Analysis System

For a large system, recording and analyzing a huge volume of logs every day is hard. The Yixin payment system handles an average of 2 million orders per day, and a transaction flows through more than a dozen modules. Assuming 30 log entries per order, that is on the order of 60 million log entries every day.

Log analysis in the Yixin payment system serves two purposes: real-time warning of log anomalies, and providing order tracks for operators.

(1) Real-time Log Warning

Real-time log warning means scanning all real-time transaction logs for the keywords Exception or Error and raising an alarm when one is found. The advantage is that any abnormal behavior in the code is discovered immediately. The Yixin payment system handles it as follows: rsyslog collects the logs, the analysis system scans them in real time, and a warning is then raised in real time.
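
The keyword check itself is simple. The sketch below shows one possible matching rule using `java.util.regex`; the exact pattern used by the Yixin analysis system is not given in this article, so the pattern and the class name are assumptions.

```java
import java.util.regex.Pattern;

final class LogKeywordAlert {
    // Matches a word ending in "Exception" or "Error", such as
    // SocketTimeoutException or OutOfMemoryError (case-sensitive).
    private static final Pattern ALERT_KEYWORD = Pattern.compile("\\b\\w*(Exception|Error)\\b");

    static boolean shouldAlert(String logLine) {
        return ALERT_KEYWORD.matcher(logLine).find();
    }
}
```

Requiring the keyword at the end of a word keeps identifiers such as `ErrorCode` from raising an alarm, at the cost of missing lowercase messages like "error while parsing".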

(2) Order Track

For a trading system, knowing the status flow of an order in real time is essential. The Yixin payment system originally recorded order tracks in the database, but after running for some time the table grew too large because of the sharp increase in order volume, which made it hard to maintain.

The current practice is that each module prints the order track to the log, in a format that follows the structure of the database table. Once the logs are printed, rsyslog collects them. The analysis system captures these standardized logs in real time, parses them, stores them in the database by day, and displays them to operators in a visual interface.

Log printing specifications are as follows:

2016-07-22 18:15:00.512 | | Pool-73-Thread-4 | | Channel Adapter | | Channel Adapter-After Sending Three Parties | | CEX 16XXXXXXX 5751 | | 16201XXXXXX 337 | | | | | 04 | | 9000 | | [Settlement Platform Message] Processing | | 0000105 | | 98X543210 | | GHT | | 03 | |11||2016-07-22 18:15:00.512|| sheets | | | 01 | | tunnel query | | true | | | | pending | | | | | 8cff785d-0d01-4ed4-b771-cb0b1faa7f95 | | 10.999.140.101 | | o001 | | | | 0.01 | | | | | | | | | | | |http://10.100.444.59:8080/regression/notice||||240||2016-07-20 19:06:13.000xxxxxxx||2016-07-22 18:15:00.170||2016-07-22 18:15:00.496xxxxxxxxxxxxxxxxxxxx

The brief log visualization track is as follows:

In addition to the two functions above, the logging and analysis system also lets users download and view transaction request and response messages.

3.2.6 7*24 hour monitoring room

The alarm items described above are delivered to operators in both push and pull modes: SMS and email as push, and reports as pull. In addition, because a payment system is more critical than most other Internet systems, the Yixin payment system is watched from a 7*24-hour monitoring room to ensure its safety and stability.

3.3 Timely Handling of Faults

After a fault occurs, especially in production, the first thing to do is not to find the cause but to handle the fault as quickly as possible and restore availability. Common faults in the Yixin payment system and the measures taken are as follows:

3.3.1 Automatic Repair

For automatic repair, the most common faults in the Yixin payment system are caused by unstable third parties. In this case the system automatically reroutes, as described above.

3.3.2 Service Degradation

Service degradation means shutting down certain functions to keep the core functions usable when a fault occurs and cannot be repaired quickly. For example, when a merchant runs a sales promotion through the Yixin payment system and its transaction volume becomes too large, the system adjusts that merchant’s traffic in real time, degrading service for that merchant so that other merchants are not affected. There are many scenarios like this, and the specific service degradation functions will be introduced in later articles in this series.
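
Per-merchant traffic adjustment can be sketched as a rate limiter whose limit is changeable per merchant at runtime. The fixed one-second window, the class name and the numbers below are illustrative assumptions, not the Yixin implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Per-merchant requests-per-second cap that can be adjusted while running. */
final class MerchantThrottle {
    private static final class Window {
        long startMillis;
        int count;
    }

    private final int defaultLimitPerSecond;
    private final Map<String, Integer> limits = new ConcurrentHashMap<>();
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    MerchantThrottle(int defaultLimitPerSecond) {
        this.defaultLimitPerSecond = defaultLimitPerSecond;
    }

    /** Degrades (or restores) one merchant; other merchants keep their own limits. */
    void setLimit(String merchantId, int limitPerSecond) {
        limits.put(merchantId, limitPerSecond);
    }

    /** Returns true if the request may proceed in the current one-second window. */
    boolean tryAccept(String merchantId, long nowMillis) {
        int limit = limits.getOrDefault(merchantId, defaultLimitPerSecond);
        Window window = windows.computeIfAbsent(merchantId, id -> new Window());
        synchronized (window) {
            if (nowMillis - window.startMillis >= 1000) {
                window.startMillis = nowMillis; // start a new window
                window.count = 0;
            }
            if (window.count >= limit) {
                return false; // over the merchant's current limit
            }
            window.count++;
            return true;
        }
    }
}
```

Lowering the limit for the promoting merchant with `setLimit` takes effect on the next request, while every other merchant continues under the default limit.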

IV. Q&A

Q1: Can you tell us the details of that RabbitMQ outage and how it was handled?

A1: The RabbitMQ outage is what triggered our thinking about system availability. RabbitMQ itself did not go down (RabbitMQ was still stable); the physical machine it ran on went down. The problem was that RabbitMQ was deployed as a single point at the time, and everyone assumed RabbitMQ would not go down, so the host it ran on was overlooked. The lesson we drew is that no part of the business may be a single point, including application servers, middleware and network equipment. And single points should not only be considered component by component: for example, the whole service can be duplicated and A/B tested, and of course there are also dual data centers.

Q2: Are development and operations combined at your company?

A2: Our development and operations are separate. Today’s talk looks at the availability of the system as a whole, so it covers a lot of development and some operations. I have witnessed the whole journey of the Yixin payment system described here.

Q3: Is your back end all Java? Have you considered other languages?

A3: Most of our current systems are written in Java, with a few in Python, PHP and C++. It depends on the type of business. Java is the best fit for us at this stage, and other languages may be considered as the business expands.

Q4: About being skeptical of third-party dependencies: can you give a concrete example of how to do it? What if the third party becomes completely unusable?

A4: Systems generally have third-party dependencies, such as databases and third-party interfaces. When developing a system, you should remain skeptical of third parties, so that a problem on their side does not set off a chain reaction that brings the system down. Problems in a system snowball. For example, if we had only one code-scanning channel, we would have no options when it had a problem, so we were skeptical from the start and connected multiple channels. If an anomaly occurs, the real-time monitoring system triggers an alarm and then automatically switches the routing channel, keeping the service available. Second, we split asynchronous messages by payment type, merchant and transaction type, so that an unpredictable anomaly in one type of transaction does not affect the others. It is like a multi-lane expressway, where the fast and slow lanes do not affect each other. The overall idea is fault tolerance + splitting + isolation, applied case by case.

Q5: After a payment times out, or when there is a network problem, can situations arise where the money has been paid but the order is lost? How do you handle disaster recovery and data consistency? Do you replay logs or repair data?

A5: The most important thing in payments is safety, so we handle order status conservatively. For orders affected by a network anomaly, we set the status to “processing” and then reach eventual consistency with the bank or third party through active queries or by passively receiving notifications. Besides order status, a payment system also has to deal with response codes. Banks and third parties reply with response codes, and the mapping from response code to order status must also follow a conservative strategy, so that funds are never overpaid or underpaid. In short, fund safety comes first, and all strategies are based on the whitelist principle.
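
The whitelist principle for response codes can be illustrated as follows. Only codes that are explicitly listed map to a final status, and anything else, including a missing code, stays in “processing” until a query or notification settles it. The codes "0000" and "2001" and all names are invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Maps third-party response codes to order status using a whitelist. */
final class ResponseCodeMapper {
    enum OrderStatus { SUCCESS, FAILED, PROCESSING }

    private final Map<String, OrderStatus> whitelist;

    ResponseCodeMapper(Map<String, OrderStatus> explicitlyKnownCodes) {
        this.whitelist = new HashMap<>(explicitlyKnownCodes);
    }

    /** Unknown or missing codes never become a final status. */
    OrderStatus toStatus(String responseCode) {
        if (responseCode == null) {
            return OrderStatus.PROCESSING;
        }
        return whitelist.getOrDefault(responseCode, OrderStatus.PROCESSING);
    }

    /** Example configuration with made-up codes. */
    static ResponseCodeMapper example() {
        Map<String, OrderStatus> known = new HashMap<>();
        known.put("0000", OrderStatus.SUCCESS); // hypothetical "paid" code
        known.put("2001", OrderStatus.FAILED);  // hypothetical "definitely rejected" code
        return new ResponseCodeMapper(known);
    }
}
```

The conservative default means a newly introduced or misread code can delay an order, but it cannot by itself mark an unpaid order as paid or a paid order as failed.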

Q6: You mentioned that if a payment channel times out, the routing strategy sends the payment to another channel. According to the channel diagram, there are different payment methods, such as Alipay or WeChat Pay. If I only want to pay through WeChat, why not simply retry instead of switching to another channel? Or does “channel” here mean the requesting node?

A6: First, rerouting cannot be done after a timeout, because with a socket timeout we cannot tell whether the transaction reached the third party, or whether it succeeded or failed. If it had in fact succeeded and we tried again, and the retry also succeeded, the money would be paid twice. The company cannot accept a loss of funds like this. Second, whether rerouting applies depends on the type of service. For a single collection or payment transaction, the user does not care which channel the money goes through, so it can be rerouted. For code-scanning payments, a user who scans with WeChat will certainly end up paying through WeChat. However, we have many intermediate channels through which WeChat payments go out, and we can route among those intermediate channels, so for the user it is still a WeChat payment in the end.

Q7: Can you give an example of the automatic repair process? What are the details of going from detecting instability to rerouting?

A7: Automatic repair means fault tolerance through rerouting. This is a very good question: how do we decide to reroute once instability is found? Before rerouting, it must be certain that the transaction being rerouted did not succeed; otherwise the result is a duplicate payment or duplicate charge. At present our system decides on rerouting in two ways: after the fact and during the transaction. After the fact: for example, if the real-time warning system finds that a channel has been unstable within the last 5 minutes, subsequent transactions are routed to other channels. During the transaction: we analyze the failure response code returned for each order, classify the response codes by status, and reroute only those that are clearly safe to resend. These are just two points; there are many other business details that I will not go into for reasons of space. The overall idea is that there must be a real-time in-memory analysis system that makes decisions within seconds, and that real-time analysis is combined with offline analysis to support those decisions. Our real-time second-level warning system does this.

Q8: Do merchants run promotions regularly? How much higher is the peak during a promotion than normal? Do you run technical drills? What is the priority for degradation?

A8: For promotions, we generally communicate with merchants in advance so that we know the promotion time and volume, and then prepare specifically. The promotion peak is far higher than normal traffic. A promotion usually lasts more than 2 hours, but some, such as sales of financial products, are concentrated within 1 hour, so the peak is very high. As for drills, once we know a merchant’s promotion volume we estimate the processing capacity of the system and then run a drill in advance. Degradation priority is mainly set per merchant. Merchants connect to us for many payment scenarios, such as wealth management, collection and payment, quick payment and code scanning, so our overall principle is that different merchants must not affect one another: one merchant’s promotion must not affect other merchants.

Q9: How are the logs collected by rsyslog stored?

A9: Good question. At first our order-track logs were recorded in a database table. We then found that an order flows through many modules, producing about 10 track records per order. At 4 million transactions a day, the table became a problem: even with splitting it would affect database performance, and for an auxiliary function that is not acceptable. We found that writing logs is better than writing to the database, so we print the real-time log to disk in a table-like format. Only the real-time log is kept there, so the volume is not large, and it sits in a fixed directory on the log server. Because the logs are produced on distributed machines, they are collected into one central place, which is stored as a mounted volume. Programs written by a dedicated operations team then parse these table-format logs in real time and display them on the operations page, so the order tracks operators see are almost real time. The storage you are concerned about is actually not a problem, because we separate real-time logs from offline logs, and offline logs older than a certain period are rotated and eventually deleted.

Q10: How do system monitoring and performance monitoring work together?

A10: As I understand it, system monitoring includes system performance monitoring. Performance monitoring is one part of overall monitoring, so there is no coordination problem. Performance monitoring has multiple dimensions, such as application level, middleware and containers. For non-business monitoring of the system, see the corresponding section of this article.

Author: Feng Zhongqi

Source: Yixin Institute of Technology