Introduction: this article will discuss an important and common big data infrastructure platform, namely “real-time data platform”, in two parts.
In the last design chapter, we first introduce the real-time data platform from two dimensions: viewing the real-time data platform from the perspective of modern data warehouse architecture and viewing the real-time data processing from the perspective of typical data processing; Then we will discuss the overall design framework of the real-time data platform, consideration of specific issues and solutions.
In the next technical article, we will further give the technical selection and related components of the real-time data platform, and discuss which application scenarios are applicable to different modes. I hope that through the discussion of this article, readers can get a rule-based, real-time data platform construction scheme that can be actually landed.
I. Background of Relevant Concepts
1.1 Viewing Real-time Data Platform from the Perspective of Modern Data Warehouse Architecture
Modern data warehouse is developed from traditional data warehouse. Compared with traditional data warehouse, modern data warehouse has both similarities and many development points. First of all, let’s look at the module architecture of the traditional data warehouse (Figure 1) and the modern data warehouse (Figure 2):
Fig. 1 conventional data warehouse
Fig. 2 modern data warehouse
The traditional data warehouse is very familiar to everyone and will not be introduced here too much. Generally speaking, the traditional data warehouse can only support data processing with time delay of T+1 day. The data processing process is mainly ETL and the final output is mainly report form.
The modern data warehouse is built on the traditional data warehouse. At the same time, more diversified data sources are imported and stored, more diversified data processing methods and aging (T+0 day aging is supported), more diversified data usage methods and more diversified data terminal services are added.
Modern data warehouse is a very big topic, here we will show its new characteristic ability in the way of concept module. First, let’s look at the summary of Melissa Coates in Figure 3:
In the summary of Melissa Coates in Figure 3, we can conclude that the reason why modern digital warehouse is “modern” is that it has a series of characteristics such as multi-platform architecture, data virtualization, near real-time analysis of data, agile delivery method, etc.
On the basis of referring to Melissa Coates’s summary of modern data warehouse and with our own understanding, we have also summarized and extracted several important capabilities of modern data warehouse, namely:
- Real-time data (real-time synchronization and streaming capability)
- Data Virtualization (Virtual Mix and Unified Service Capability)
- Data Popularization (Visualization and Self-service Configuration Capability)
- Data Collaboration (Multi-Tenant and Division of Labor Collaboration)
1) Real-time data (real-time synchronization and streaming capability)
Real-time data refers to data from generation (update to business database or log) to final consumption (data report, dashboard, analysis, mining, data application, etc.), supporting millisecond/second/minute delay (strictly speaking, second/minute is quasi-real-time, which is collectively referred to as real-time). This involves how to extract the data from the data source in real time. How to flow in real time; In order to improve timeliness and reduce end-to-end delay, it is also necessary to have the ability to support calculation processing in the flow process. How to drop the warehouse in real time; How to provide subsequent consumption in real time. Real-time synchronization refers to the end-to-end synchronization from multiple sources to multiple targets. Streaming refers to logical conversion processing on streams.
However, we need to know that not all data processing and calculation can be carried out on the stream, and our aim is to reduce end-to-end data delay as much as possible. We need to cooperate with other data flow processing methods here, and we will discuss them further later.
2) Data Virtualization (Virtual Mix and Unified Service Capability)
Data virtualization refers to a technology for users or user programs to face a unified interaction mode and query language without paying attention to the physical library, dialect and interaction mode (heterogeneous system/heterogeneous query language) where the data is actually located. The user experience is to operate against a single database, but in fact this is a virtualized database and the data itself is not stored in the virtual database.
Virtual miscalculation refers to the ability of virtualization technology to support transparent miscalculation of heterogeneous system data. Unified service refers to providing unified service interfaces and methods for users.
Figure 4 Data Virtualization
(Figures 1-4 are all selected from “Designing a Modern Data Warehouse+Data Lake”-Melissa Coates, Solution Architect, Blue Granite)
3) data popularization (visualization and self-service configuration capability)
Ordinary users (data practitioners without professional big data technology background) can use data to complete their own work and requirements through visual user interface, self-service through configuration and SQL, and do not need to pay attention to the underlying technical issues (through cloud computing resources, data virtualization and other technologies). The above is our interpretation of the popularization of data.
For the interpretation of Data Democratization, see also the following link:
This paper mentions how the technical level supports the popularization of data, and gives several examples: Data virtualization software, Data federation software, Cloud storage, Self-service BI applications, etc. Data virtualization and data federation are essentially similar technical solutions, and the concept of self-service BI is mentioned.
4) Data Collaboration (Multi-Tenant and Division of Labor Collaboration)
Should technicians know more about business, or should business personnel know more about technology? This has always been a controversial issue in enterprises. However, we believe that modern BI is a process that allows in-depth collaboration. Technical personnel and business personnel can use their respective strengths on the same platform to complete daily BI activities through division of labor and collaboration. This puts forward higher requirements for the platform’s multi-tenant ability and division of labor and cooperation ability. A good modern data platform can support better data cooperation ability.
We hope that we can design a modern real-time data platform to meet the above-mentioned capabilities of real-time, virtualization, civilians, collaboration and so on, and become a very important and essential component of modern digital warehouse.
1.2 Viewing Real-time Data Processing from the Perspective of Typical Data Processing
Typical data processing can be divided into OLTP, OLAP, streaming, ad hoc, machine learning, etc. The definition and comparison of OLTP and OLAP are given here:
(figure 5 is selected from the article “relational databases are not designed for mixed work loads”-mattallen)
From a certain point of view, OLTP activities mainly occur in the business transaction database and OLAP activities mainly occur in the data analysis database. So, how does the data flow from OLTP library to OLAP library? If this data flow requires high timeliness, the traditional T+1 batch ETL method cannot be satisfied.
We call the OLTP to OLAP flow process Data Pipeline, which refers to all flow and processing links between the production end and the consumption end of data, including data extraction, data synchronization, on-stream processing, data storage, data query, etc. There may be very complicated data processing conversion (such as conversion from repeated semantic multi-source heterogeneous data sources to unified Star Schema, conversion from detailed tables to summary tables, and combination of multi-entity tables into wide tables, etc.). How to support the real-time Pipeline processing capability has become a challenging topic. We describe this topic as the “OLPP, Online Pipeline Processing) problem.
Therefore, the real-time data platform discussed in this paper hopes to solve the OLPP problem from the perspective of data processing and become a solution to the problem of missing real-time flow from OLTP to OLAP. Next, we will discuss how to design such a real-time data platform from the architecture level.
II. Architecture Design Scheme
2.1 Positioning and Targets
The Real-time Data Platform (hereinafter referred to as RTDP) aims to provide end-to-end real-time data processing capability (millisecond/second/minute delay), can extract real-time data from multiple data sources, and can provide real-time data consumption for multiple data application scenarios. As a part of modern data warehouse, RTDP can support real-time, virtualization, civilians, collaboration and other capabilities, making real-time data application development lower in threshold, faster in iteration, better in quality, more stable in operation, simpler in operation and maintenance, and stronger in capability.
2.2 Overall Design Framework
Concept module architecture is the hierarchical architecture and capability combing of the concept layer of real-time data processing Pipeline. It is universal and referential in itself, more like a requirement module. Fig. 6 shows the overall conceptual module architecture of RTDP. the meaning of each module can be explained by itself, and will not be described in detail here.
Figure 6 RTDP Overall Concept Module Architecture
Next, we will make further design discussions according to the above figure and give high-level design ideas from the technical level.
Figure 7 Overall Design Idea
As can be seen from fig. 7, we have unified and abstracted the four levels of conceptual module architecture:
- Unified data acquisition platform
- Unified streaming platform
- Unified Computing Service Platform
- Unified Data Visualization Platform
At the same time, the principle of opening the storage layer is also maintained, which means that users can select different storage layers to meet the needs of specific projects without damaging the overall architecture design. Users can even select multiple heterogeneous storage to provide support in the Pipeline at the same time. The following is an explanation of the four abstract layers.
1) Unified Data Acquisition Platform
The unified data acquisition platform can not only support full extraction from different data sources, but also support enhanced extraction. Among them, incremental extraction of business database will choose to read the database log to reduce the reading pressure on the business database. The platform can also process the extracted data uniformly and then publish it to the data bus in a uniform format. Here we choose a custom standardized UMS(Unified Message Schema (UMS) as the data level protocol between the unified data acquisition platform and the unified streaming platform.
UMS comes with Namespace information and Schema information, which is a self-localization and self-interpretation message protocol format. The advantages of this method are:
- The whole architecture does not need to rely on an external metadata management platform;
- The message is decoupled from the physical medium (here the physical medium refers to the topic of Kafka, the Stream of Spark Streaming, etc.), so the parallel of multiple message flows and the free drift of message flows can be supported through the physical medium.
The platform also supports multi-tenant system and configurable simple processing and cleaning capability.
2) Unified Streaming Platform
The unified streaming processing platform will consume messages from the data bus and can support UMS protocol messages as well as common JSON format messages. At the same time, the platform also supports the following capabilities:
- Support visualization/configuration/／SQL to reduce flow logic development/deployment/management threshold
- Support configuration mode idempotent falling into multiple heterogeneous target libraries to ensure the final consistency of data
- Support multi-tenant system to achieve project-level isolation of computing resources/table resources/user resources, etc.
3) Unified Computing Service Platform
The unified computing service platform is a realization of data virtualization/data federation. The platform supports push-down computation and pull-mix computation of multiple heterogeneous data sources internally, and also supports external unified service interface (JDBC／REST) and unified query language (SQL). Because the platform can provide unified convergent services, it can build modules such as unified metadata management/data quality management/data security audit/data security policy based on the platform. The platform also supports multi-tenant system.
4) Unified Data Visualization Platform
The unified data visualization platform, together with multi-tenant and perfect user system/authority system, can support the division of labor and cooperation ability of cross-department data practitioners, and enable users to give full play to their respective strengths to complete the application of the last ten kilometers of the data platform in a visual environment through close cooperation.
The above is based on the overall module architecture, with unified abstract design and open storage options to improve flexibility and requirement adaptability. Such RTDP platform design embodies the real-time/virtualization/civilians/collaboration capabilities of modern digital warehouses, and covers the end-to-end OLPP data flow link.
2.3 Specific Issues and Considerations
Next, we will discuss the problem considerations and solutions that this design needs to face from different dimensions based on the overall architecture design of RTDP.
1) Functional considerations
Functional considerations mainly discuss such a question: can real-time Pipeline handle all ETL complex logic?
We know that streaming computing engines such as Storm／Flink are processed on a per-line basis. For Spark Streaming streaming streaming engine, it is processed by each mini-batch. For offline batch running tasks, it is processed according to daily data. Therefore, the processing range is one dimension (range dimension) of data.
In addition, streaming processing is oriented to incremental data. If the data source is from a relational database, then incremental data often refers to incremental change data (addition, deletion and revision); The opposite batch processing is directed at snapshot data. Therefore, the presentation form is another dimension of data (change dimension).
The change dimension of a single piece of data can be projected and converged into a single snapshot, so the change dimension can be converged into a range dimension. Therefore, the essential difference between streaming processing and batch processing is that the unit of streaming processing is “limited range” and the unit of batch processing is “full table range” due to the different dimensions of data range. “Full Table Range” data can support various SQL operators, while “Limited Range” data can only support some SQL operators. The specific support is as follows:
Left join: support. “Restricted Range” can be left join external lookup table (by pushing down, similar to hashjoin effect)
Rightjoin: not supported. Every time all lookup table data is retrieved from lookup, this calculation is neither feasible nor reasonable.
Interjoin: support. Can be converted to left join ＋filter, can support
OuterJoin: Not supported. There is a right join, so it is not reasonable.
- Union: support. Can be applied to pull back local range data for window aggregation operation.
- Agg: not supported. Union can be used for local window aggregation, but full table aggregation cannot be supported.
- Filter: support. No shuffle, very suitable.
- Map: support. No shuffle, very suitable.
- Project: support. No shuffle, very suitable.
Join often needs shuffle operation, which costs the most computing resources and time, while left join on the stream converts join operation into queue operation of hashjoin, and distributes the centralized data computing resources and time for batch processing join in the data flow process, join(left join on the stream is the most cost-effective computing method.
Complex ETL is not a single operator, but is often composed of multiple operators. It can be seen from the above that simple streaming processing cannot support all ETL complex logic well. So how to support more complex ETL operators in the real-time Pipeline while maintaining timeliness? This requires the mutual conversion capability of “limited range” and “full table range” processing.
Imagine that the streaming processing platform can support suitable processing on the stream, and then drop different heterogeneous databases in real time. The computing service platform can batch mix multi-source heterogeneous databases at regular intervals (the time setting can be every few minutes or less), and send each batch of computing results to the data bus to continue to flow. Thus, the streaming processing platform and the computing service platform form a computing closed loop, each doing good operator processing, and the data performs various operator transformations in the process of triggering flow at different frequencies. In theory, this architecture mode can support all ETL complex logic.
Figure 8 Evolution of Data Processing Architecture
Fig. 8 shows the evolution of data processing architecture and an architecture mode of OLPP. Wormhole and moonbox are our open source streaming processing platform and computing service platform respectively, which will be described in detail later.
2) Quality considerations
The above figure also leads to two mainstream real-time data processing architectures: Lambda architecture and Kappa architecture. the introduction of the two architectures has a lot of information on the internet, which will not be repeated here. Lambda architecture and Kappa architecture each have their own advantages and disadvantages, but they both support the final consistency of data and ensure the data quality to some extent. how to learn from each other’s strong points to form a certain fusion architecture in Lambda architecture and Kappa architecture will be discussed in detail in the new article.
Of course, data quality is also a very big topic, only supporting re-run and recharge cannot completely solve all the data quality problems, only the engineering scheme of supplementing data is given from the technical framework level. We will also discuss a new topic about the quality of big data.
3) Stability considerations
This topic involves, but is not limited to, the following points. Here are some simple ideas to deal with:
- High availability HA
The whole real-time Pipeline link should select highly available components to ensure the overall high availability in theory. Supporting data backup and replay mechanism on data key links; Supporting Dual Run Fusion Mechanism on Service Critical Links
- SLA guarantee
On the premise of ensuring high availability of clusters and real-time Pipeline, it supports dynamic expansion and automatic drift of data processing flow.
- Elasticity against fragility
Resilient Expansion of Resources Based on Rules and Algorithms
Support event triggered action engine failure handling
- Monitoring and early warning
Multi-aspect monitoring and early warning capabilities at cluster facility level, physical pipeline level and data logic level
- Automatic operation and maintenance
Can capture and archive missing data and handle exceptions, and has a regular automatic retry mechanism to repair problem data
- Upstream metadata change resistance
Upstream Business Library Requires Compatibility Metadata Changes
Real-time Pipeline processes explicit fields
4) Cost considerations
This topic involves, but is not limited to, the following points. Here are some simple ideas to deal with:
- Human cost
To Reduce the Labor Cost of Talents by Supporting the Popularization of Data Application
- Resource cost
Reducing Resource Waste Caused by Static Resource Occupation by Supporting Dynamic Resource Utilization
- Operational costs
Reduce operation and maintenance costs by supporting mechanisms such as automatic operation and maintenance/high availability/elasticity and anti-vulnerability
- Cost of trial and error
Reduce trial and error costs by supporting agile development/rapid iteration
5) Agile Considerations
Agile Big Data is a complete set of theoretical system and methodology. It has been described in the previous section. From the perspective of data usage, agile considerations mean: configurability, SQL and popularization.
6) management considerations
Data management is also a very big topic, here we will focus on two aspects: metadata management and data security management. If unified management of metadata and data security is a very challenging topic under the environment of modern multi-warehouse and multi-data storage selection, we will consider these two aspects separately and give built-in support in each link platform on the real-time Pipeline, and at the same time we can support the docking of external unified metadata management platform and unified data security policy.
In this paper, we discussed the concept background and architecture design of RTDP. In the architecture design scheme, we especially emphasized the positioning and objectives of RTDP, the overall design architecture, and the specific issues and considerations involved. Some topics are very big and can be discussed in a separate article, but on the whole, we have given a set of RTDP design ideas and plans. In the next technical article, we will concretize the RTDP architecture design, give the recommended technology selection and our open source platform scheme, and discuss the application of different modes of RTDP in combination with the requirements of different scenarios.
Author: Lu Shanwei
Source: Yixin Institute of Technology