Part I: Architecture and Components
I. Development of Data Platform
1.1 Background Introduction
With the advent of the data age, growth in data volume and complexity has driven the rapid development of data engineering. To meet the wide range of data acquisition and computation requirements, many solutions have emerged in the industry. Most of these solutions, however, follow the same principles:
- Reduce data processing costs
- Improve the efficiency of data use and computation
- Provide a unified programming paradigm
The data service platform of Yi Ren Loan also follows these three principles. I had the privilege of experiencing the entire development process of the Yi Ren Loan data platform, Genie. Looking at both Yi Ren Loan and the wider industry, the development of Genie can be seen as a microcosm of how industrial data platforms have evolved.
Google’s three major papers and the release of the Apache Hadoop open source ecosystem marked the point at which big data processing technology entered “ordinary people’s homes”. Hadoop’s components run on cheap commodity machines and its code is open source, so it was widely embraced by companies. What did those companies do with it in the beginning?
The answer is data warehouse.
Note: Google’s three major papers are: Bigtable: A Distributed Storage System for Structured Data; The Google File System; MapReduce: Simplified Data Processing on Large Clusters.
Therefore, the basic architecture of early data platforms was Sqoop + HDFS + Hive, because this was the cheapest and most efficient way to build a data warehouse. At this stage, the data warehouse could only answer what happened in the past (the offline phase), because Sqoop’s offline extraction generally adopts a t+1 snapshot scheme: it only has yesterday’s data.
Then, as demand for real-time data grew, complex operations such as joins and aggregations over incremental data needed to be done in real time. At this point the data platform adds a distributed stream computing architecture, such as Storm, Flink, or Spark Streaming. The data warehouse can now answer what is happening (the real-time phase).
Since the computation logic of the offline processing flow (e.g. Sqoop + HDFS + Hive) and the real-time processing flow (e.g. binlog + Spark Streaming + HBase) overlap heavily, and combining the two can support analysis over the full, up-to-date data set, several architectures emerged, such as the early Lambda and Kappa. With historical and real-time data combined, the data warehouse at this point can answer what will eventually happen (the prediction phase).
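The division of labor just described, a batch view for history plus a speed view for fresh increments, can be sketched in a few lines of Python (the view names and metrics are illustrative, not Genie's actual interfaces):

```python
def merge_views(batch_view, speed_view):
    """Answer a query by combining per-metric counts from both layers."""
    merged = dict(batch_view)                 # history through t+1
    for key, delta in speed_view.items():     # today's t+0 increments
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch view: recomputed offline each night. Speed view: updated per event.
batch_view = {"loans_approved": 1200, "loans_rejected": 300}
speed_view = {"loans_approved": 45, "loans_applied": 80}
print(merge_views(batch_view, speed_view))
```

The point of the separation is that the batch layer can be slow but thorough, while the speed layer only has to cover the small window since the last batch run.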
By this point, the data platform can no longer be described simply as a data warehouse. It works closely with various business departments (such as marketing, telesales, and operations) to create many data products. At this stage, the data warehouse (data platform) has entered the active decision-making phase.
In practice, the order in which companies develop the real-time and prediction stages differs; predictions can also be made from historical data alone.
1.2 Data Platform Positioning
A data platform should be an important part of a company’s infrastructure. Many Internet companies once followed the trend of building big data clusters, only to find it difficult to realize their real value. The most important reason is usually the positioning of data usage and of the data platform itself. The current positioning of the data platform covers the following points:
- Decision enabling
Empowering the decision-making level: executives can quickly understand the company’s operations through BI reports, because the data does not lie.
- Business Data Analysis/Business Data Products
The platform provides ad hoc real-time analysis to help analysts quickly analyze the business, locate problems, and give feedback.
- Computational storage
Business data products can also make full use of the platform’s computing and storage resources to build data products, such as recommendation and intelligent marketing.
This improves the efficiency of data processing, saving the time and cost of data mining and processing.
The early staffing structure of most companies is as follows:
Operations, marketing, and the decision-making level use the platform directly, mostly to view BI reports. Business analysts sort out the business requirements and hand them to data warehouse engineers, and the professional warehouse engineers then add the new requirements to the existing company-level data warehouse. The data engineering team is mainly responsible for operating and maintaining the cluster.
1.3 Shortcomings of Initial Architecture
There is no need to describe this early structure at length; let’s go straight to its shortcomings.
- When decision-makers use the reports, they find them always one beat behind, and there are always new demands. The reason is simple: the business of an Internet company is not as stable as that of traditional industries (such as banking or insurance), because Internet companies develop quickly and their business iterates quickly as well.
- Business analysis generates a constant stream of ad hoc requirements, for reasons similar to the first point.
- Data warehouse engineers are run ragged. The data warehouse is huge and heavy, difficult to adapt flexibly, and always full of surprises.
- Cluster jobs are difficult to operate and maintain, and the coupling between jobs is too high. For example, if table A of business A fails to finish on time, all downstream jobs across the company are directly affected.
1.4 Common Solutions
I believe many companies have encountered these headaches, and the solutions are usually similar. In general terms:
- Build a product-oriented data service platform.
- The data warehouse team shifts its energy to more fundamental, lower-level data issues, such as data quality, data usage specifications, data security, and model architecture design.
- Business analysts use the platform directly to build business data marts, improving agility and relevance.
- The main responsibility of data engineering is no longer cluster operation and maintenance, but building the data service platform and business data products.
The advantages of this are:
- The bottleneck problem of data warehouse is solved.
- It is more efficient to let those who are most familiar with their own data set up their own data marts.
- Business data products can use the data service platform directly, improving efficiency and reducing company costs.
II. Architecture and Characteristics of the Data Platform Genie
2.1 Genie architecture
Yi Ren Loan is an Internet finance company. Given its financial nature, it has higher requirements for platform security, stability, and data quality than other Internet companies. At present, the total data volume at Yi Ren Loan is at the PB level, with a daily increment at the TB level. In addition to structured data, there are also logs, voice, and other data types. Data applications fall into two categories, operations and marketing, such as intelligent telesales and intelligent marketing. The data service platform must ensure that thousands of batch jobs run on time every day, guarantee the efficiency and accuracy of real-time computation for data products, and also guarantee the effectiveness of a large number of ad hoc queries every day.
The above is the technical architecture diagram of the platform’s bottom layer. Overall it is a Lambda architecture. The Batch layer is responsible for computing t+1 data; most regular reports and the main data warehouse/mart tasks are processed at this layer. The Speed layer is responsible for computing real-time incremental data; the real-time data warehouse, incremental real-time data synchronization, and data products mainly use this layer’s data. The Batch layer uses Sqoop to synchronize data to the HDFS cluster on a schedule, then uses Hive and Spark SQL for computation. For the Batch layer, stability matters more than speed, so we mainly optimize for stability. The output of the Batch layer is the Batch view. The Speed layer has a longer data link and a relatively more complex architecture than the Batch layer.
DBus and Wormhole are Yixin open source projects, mainly used as data pipelines. The basic principle of DBus is to read the database’s binlog for real-time incremental data synchronization; the main problem it solves is non-intrusive incremental synchronization. There are of course other schemes, such as filtering on a timestamp column or adding triggers, which can also achieve incremental synchronization, but the pressure on and intrusiveness to the business database are too great. Wormhole’s basic principle is to consume the incremental data synchronized by DBus and write it to different storage systems, supporting both homogeneous and heterogeneous synchronization.
Generally speaking, the Speed layer synchronizes data into our various distributed databases, collectively called the Speed view. We then abstract the metadata of the Batch and Speed views into a layer called the Service layer, which provides unified services externally through NDB. Data has two main attributes: data = when + what. In the time dimension (when), data is immutable; all inserts, deletes, and updates actually generate new data. In everyday use we often only pay attention to the what attribute, but in fact when + what together determine a piece of data’s unique, immutable identity. Along the time dimension, then, we can divide data into the t+1 data of the Batch view and the t+0 data of the Speed view. This is the intent of the standard Lambda architecture: separating offline and real-time computation. Our Lambda architecture has some differences, which I will not go into here.
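The “data = when + what” idea can be illustrated with a toy append-only log in Python (the Event shape and field names are assumptions for illustration, not DBus’s actual record format):

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    when: int            # e.g. a binlog position or commit timestamp
    key: str
    what: Optional[str]  # None models a delete

def current_state(log):
    """Fold an immutable, append-only event log into the latest 'what' per key."""
    state = {}
    for ev in sorted(log, key=lambda e: e.when):   # order by "when"
        if ev.what is None:
            state.pop(ev.key, None)                # a delete is just a new event
        else:
            state[ev.key] = ev.what                # an update is just a new event
    return state

log = [Event(1, "u1", "active"), Event(2, "u1", "frozen"), Event(3, "u2", "active")]
print(current_state(log))  # {'u1': 'frozen', 'u2': 'active'}
```

Nothing in the log is ever modified in place; the mutable “current” view is always derivable from the immutable (when, what) pairs.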
Cluster resources are limited, so placing the offline and real-time computing architectures in one cluster inevitably leads to resource preemption. Since each company’s computing and storage plans may differ, I will take our plan as an example, in the hope of offering a useful starting point.
To solve preemption, we first need a clear understanding of it. From the usage perspective, if the platform is multi-tenant, preemption can occur between tenants. From the data architecture perspective, if offline and real-time computation are not deployed separately, preemption can also occur between them. It should be emphasized that preemption involves not only CPU and memory, but also network and disk I/O.
At present, open source resource scheduling systems such as YARN and Mesos are not very mature in resource isolation; they can only do light isolation of CPU and memory (YARN in Hadoop 3.0 has added disk and network I/O isolation). Since our jobs basically run “everything on YARN”, we modified YARN ourselves. The modification is similar to the official cgroup-based solution. Cgroups should also be used to isolate service processes, for example when a DataNode and a NodeManager run on the same machine.
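As a rough illustration of the cgroup approach, CPU isolation on YARN is typically enabled with yarn-site.xml properties along these lines (property names follow the Hadoop documentation; the values here are placeholders, not our production configuration):

```xml
<!-- Sketch only: enable the LinuxContainerExecutor with cgroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>80</value>
</property>
```

The last property caps how much of the machine’s physical CPU all YARN containers may use in total, leaving headroom for co-located services such as the DataNode.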
The figure above illustrates the composition of the data platform Genie and the data usage process. First, the usage process: all data (structured and unstructured) is standardized in the data warehouse, with unified units, dictionaries, data formats, naming, and so on. This unified, standardized data then feeds the data marts, directly or indirectly, as their entrance. The business coupling between data marts is very low, so the data coupling is also low, which avoids the job coupling problem described earlier. Each business’s data applications use its own data mart directly.
2.2 Functional Modules of Genie
Having covered Genie’s composition, let us look at its seven subsystems:
- Meta data: metadata management is the core of the core; the metadata service is the foundation of the data platform, and almost every function depends on it.
- Authority: unified permissions, centrally managed and flexibly configured. Permissions here include data access permissions.
- Monitor: monitoring; cluster usage can be tallied by tenant dimension, among other statistics.
- Triangle: a self-developed scheduling system that is distributed, service-oriented, highly available, and user-friendly. The diagram above shows the architecture of the Triangle scheduling system, which is a master/slave architecture overall. The Job Runtime Dir concept refers to packaging the complete environment a job needs to run, such as a Python environment.
- Data Dev: the figure above shows the data development process. The data development platform is a one-stop platform for online development and testing; it is safe and fast, and supports SQL, Python, and Spark Shell.
- Data Pipeline: used for configuring and managing offline and real-time data pipelines. Either an offline or a real-time warehousing configuration can be completed within one minute.
- Data Knowledge: used for data lineage queries and data metric management.
There is no best architecture, only a more suitable one. Every company’s situation and business model are different. Even if ETL, data warehousing, and machine learning are all in use, how large are the data warehouse requirements? What are the machine learning application scenarios? What are the real-time requirements of ETL? These details involve many complicated, objective constraints.
Two factors are crucial in choosing a technical architecture: scenario and cost. In short, identify what needs to be done, and implement it in a low-cost way without over-design. If the scenario is complex, it can be abstracted and subdivided along multiple dimensions, such as time (problems already solved, current problems, and problems we may face in the future). Cost likewise has many dimensions to weigh: development cycle, operation and maintenance complexity, stability, the existing team’s technology stack, and so on.
In the next part, we will continue to explain Yi Ren Loan’s PaaS data service platform Genie from the angles of “technical details of the real-time data warehouse” and “a brief introduction to the data platform’s functions”. Stay tuned.
Part II: Technical Details and Functions
Reading Guidance: In the previous part, we got a basic understanding of the characteristics of Genie, Yi Ren Loan’s data platform, and of the platform’s development process. In this part, we will first focus on the technical details of the real-time data warehouse, and then introduce the functions of the data platform. Let’s take a look together.
IV. Technical Details of Real-time Data Warehouse
The offline data warehouse holds t+1 data, meaning it processes the previous day’s data. Generally speaking, the offline synchronization strategy is to synchronize once a day, and usually a full synchronization: a daily mirror (snapshot) of the full business database.
Besides timeliness, another issue is that only one mirrored data state exists, so to know the historical change process of a value one has to build a zipper table (very time- and resource-consuming). There are many ways to implement a real-time data warehouse, but most of them arrive at the same destination by different routes.
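To make the zipper-table idea concrete, here is a minimal Python sketch that collapses daily full snapshots into validity ranges (the 9999-12-31 sentinel and tuple layout are illustrative conventions, and deletions are ignored for brevity):

```python
def build_zipper(snapshots):
    """snapshots: {date: {key: value}}. Returns rows (key, value, start, end)
    where each row is valid from start (inclusive) to end (exclusive)."""
    zipper = []        # the zipper table being built
    open_rows = {}     # key -> index of its currently open row
    for date in sorted(snapshots):
        for key, value in snapshots[date].items():
            if key in open_rows and zipper[open_rows[key]][1] == value:
                continue                       # unchanged: keep the open row
            if key in open_rows:               # value changed: close the old row
                k, v, start, _ = zipper[open_rows[key]]
                zipper[open_rows[key]] = (k, v, start, date)
            zipper.append((key, value, date, "9999-12-31"))
            open_rows[key] = len(zipper) - 1
    return zipper

snaps = {"2019-01-01": {"u1": "A"}, "2019-01-02": {"u1": "A"},
         "2019-01-03": {"u1": "B"}}
print(build_zipper(snaps))
```

Instead of storing one full mirror per day, history queries only scan the (usually far smaller) set of change intervals.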
A real-time data warehouse has two characteristics: first, it ingests real-time data; second, it returns results in near real time. Admittedly, a well-optimized offline warehouse can also achieve the second point. This raises two questions: why use real-time data, and why build a real-time data warehouse?
In recent years, data engineers have put a lot of effort into improving data timeliness. Of course, it is scenarios and requirements that drive the development of real-time synchronization and processing technology. Competition in China’s Internet industry is fierce, and improving user conversion rates has become especially critical.
Data-related products such as user profiles, recommendation systems, funnel analysis, intelligent marketing, etc. cannot be separated from the processing and calculation of real-time data.
The most direct way to obtain real-time data is to query the business database directly, which has obvious advantages and disadvantages. Some logic requires multi-source queries and joins across databases, for which direct access to the business databases is not feasible. It is therefore necessary to synchronize multi-source data into one central place. This synchronization process is very challenging, involving data timeliness, intrusiveness to business systems, data security, and data consistency, among other difficulties.
Therefore, we need a data synchronization tool with the following characteristics:
- Can synchronize production database data and log data in near real time
- Non-intrusive to the production databases and application servers
- Can distribute the synchronized data to different storage systems
- Ensures no data loss throughout, and supports batch re-synchronization at any time
DBus and Wormhole developed by Yixin Agile Big Data Team can well meet the above four points.
DBus extracts data from the database binlog, which generally has low latency; this ensures both real-time behavior and zero intrusion into the production database.
In fact, using logs to build robust data systems is a very common solution: HBase uses a WAL for reliability, MySQL uses the binlog for primary/replica synchronization, Raft uses logs to ensure consistency, and Apache Kafka is itself built around the log.
DBus makes good use of the database binlog and performs a unified schema transformation into its own log standard, so that it can support multiple data sources. DBus positions itself as a commercial-grade data bus system: it extracts data from the source and delivers it to Kafka in real time.
Wormhole is responsible for writing data onward to other storage systems. Kafka thus becomes a true data bus: a Wormhole sink can start consuming Kafka data from any point, so data can be replayed well.
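The replay property described here can be illustrated with a toy sink in Python, using a plain list as a stand-in for a Kafka topic (the class and method names are invented for illustration; they are not Wormhole’s API):

```python
class ReplayableSink:
    """A sink that tracks its offset in the log, so it can rewind and replay."""

    def __init__(self):
        self.offset = 0      # next log position to consume
        self.store = []      # stand-in for the target storage

    def consume(self, log):
        """Apply all records from the saved offset onward."""
        for record in log[self.offset:]:
            self.store.append(record)
        self.offset = len(log)

    def resync_from(self, log, offset):
        """Rewind and replay, e.g. after the target store was rebuilt."""
        self.offset = offset
        self.store = self.store[:offset]
        self.consume(log)

log = ["insert u1", "update u1", "insert u2"]
sink = ReplayableSink()
sink.consume(log)
sink.resync_from(log, 1)   # replay everything from offset 1 onward
print(sink.store)
```

Because the bus retains the records, re-synchronization is just a seek plus a replay; nothing has to be re-extracted from the production database.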
Genie’s real-time architecture is as follows:
With DBus and Wormhole, we can easily synchronize data from the production standby databases to our Cassandra cluster in real time, and then provide users with SQL-based computation through Presto.
Through this simple architecture, we efficiently completed the construction of the real-time data warehouse and implemented the company’s real-time reporting platform and several real-time marketing data products.
The reasons why we chose Presto can be summarized as follows:
- Presto offers an interactive computation and query experience
- Presto supports horizontal scaling (Presto on YARN via Slider)
- Supports standard SQL and is easy to extend
- Proven in production at Facebook, Uber, Netflix, etc.
- Open source and written in Java, which matches our team’s technology stack; custom functions are easy to develop
- Supports multi-data-source joins with logic push-down; Presto can connect to Cassandra, HDFS, and more
- Pipelined execution reduces unnecessary I/O overhead
Presto has a master/slave architecture; I will not go into the overall details here. Presto has a data storage abstraction layer that supports SQL computation over different data stores. It provides a metadata API, a data location API, and a data stream API, and supports self-developed pluggable connectors.
In our scheme it is Presto on Cassandra, because Cassandra offers better availability than HBase and is more suitable for ad hoc query scenarios. In CAP terms, HBase leans toward consistency (C), while Cassandra leans toward availability (A). Cassandra is an excellent, easy-to-use database that uses the Log-Structured Merge-Tree as the core data structure for its storage and indexes.
V. Overall Data Processing Architecture
The above roughly introduced Yi Ren Loan’s real-time data processing architecture. Now let’s look at the overall data processing architecture.
In the overall Lambda architecture, the Speed layer is assembled from DBus and Wormhole into a real-time data bus, and the Speed layer can directly support real-time data products. DataLake is an abstract concept; in our implementation we mainly use HDFS + Cassandra to store data, with Hive and Presto as the main compute engines, and metadata integrated and served through the platform’s unified metadata service, thus realizing a complete DataLake. The DataLake’s main application scenarios are advanced, flexible analysis and query workloads such as ML.
The difference between DataLake and data warehouse is that DataLake is more agile and flexible, focusing on data acquisition, while data warehouse focuses on standards, management, security and fast indexing.
VI. Functional Modules of the Data Platform Genie
The entire Genie data service platform consists of seven sub-platform modules:
- Query and pivot analysis
- Data knowledge
- Real-time report
- Data development
- Job scheduling
- Authority management
- Cluster monitoring management
Let’s introduce some of these modules.
6.1 Data Query Module
- Users can query data from the data warehouse, data marts, and the real-time data warehouse
- Fine-grained permission management is implemented by parsing SQL
- Provides multiple query engines
- Data export
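As a toy illustration of permission checking by SQL analysis, the sketch below pulls table names out of FROM/JOIN clauses with a regular expression (a real implementation would use a proper SQL parser; all names here are hypothetical):

```python
import re

# Naive pattern: a table name follows FROM or JOIN. Real SQL needs a parser.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def referenced_tables(sql):
    """Return the set of table names the statement reads from."""
    return set(TABLE_REF.findall(sql))

def check_permissions(sql, allowed):
    """Reject the query if it touches any table the user cannot read."""
    denied = referenced_tables(sql) - set(allowed)
    if denied:
        raise PermissionError(f"no access to: {sorted(denied)}")
    return True

sql = "SELECT a.id FROM mart.orders a JOIN mart.users b ON a.uid = b.id"
print(sorted(referenced_tables(sql)))  # ['mart.orders', 'mart.users']
```

Checking the parsed table set before execution lets the platform enforce table-level (or finer) grants regardless of which query engine runs the statement.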
6.2 Data Knowledge Module
- Metadata monitoring and management
- Provides management and query functions for the metadata of the whole company
- Monitors metadata changes and sends alert emails
- Data lineage analysis query engine
- SQL analysis engine
- Analyzes all jobs/tables/fields in the warehouse
- Provides lineage analysis and impact analysis
6.3 Data Report Module
- Real-time data warehouse
- Presto on Cassandra queries Cassandra directly
- Hundreds of tables synchronized in real time (DBus + Wormhole)
- Da Vinci report platform (Da Vinci url)
- Nearly a thousand reports are in use across the whole company
6.4 Data Development Module
- Data programming with Genie-ide
- Provides Genie-ide for developing data programs
- Provides a network drive for saving and managing scripts
- Supports real-time testing and release
- One-click offline warehousing
- One-click real-time warehousing
6.5 Job Scheduling Module (Triangle)
- Micro-service architecture design; each module is a service
- Provides RESTful interfaces to facilitate secondary development and integration with other platforms
- Provides a health-monitoring job management console
- Provides public and private jobs
- Logical isolation between job flows
- Concurrency control and failure policy management
VII. Functions of Data Platform Genie
The above briefly introduced the functional modules of the data platform Genie. What, specifically, can the Genie platform do?
First, an offline or real-time warehousing pipeline can be configured within one minute (data warehouse, data mart);
Second, real-time report presentation and push (BI analysis) can be configured directly after real-time warehousing;
Third, real-time data supports multiple secure, permission-controlled access methods: API, Kafka, JDBC (business data products);
Fourth, one-stop data development supports Hive, Spark SQL, Presto on Cassandra, and Python (data development);
Fifth, the service-oriented scheduling system supports access by external systems (basic technical component).
Author: Sun Lizhe
Source: Yixin Institute of Technology