How to Design a Real-Time Data Platform (Technology)

  Big data, data modeling

Song of Agility

I count, therefore I am | DBus

Everyone Plays with Streaming | Wormhole

Think of me as a database | Moonbox

Beauty for the Last Ten Kilometers | Davinci

Introduction: The Real-time Data Platform (RTDP) is an important and common piece of big data infrastructure. In the previous (design) part, we introduced RTDP from the perspectives of modern data warehouse architecture and typical data processing, and presented its overall architecture design. In this (technology) part, we introduce the technology selection and related components of RTDP from a technical point of view, and discuss the usage modes suited to different application scenarios. RTDP’s agile road opens here ~

Further reading: Taking an enterprise real-time data platform as an example to understand what agile big data is.

How to Design a Real-time Data Platform (Design)

I. Introduction to the Technical Selection

In the design part, we presented the overall architecture of RTDP (Figure 1). In this technical part, we recommend a set of technical components for that architecture, briefly introducing each one, with particular attention to the four platforms we abstracted and implemented: the unified data collection platform, the unified stream processing platform, the unified computing service platform, and the unified data visualization platform. We also discuss end-to-end Pipeline topics such as function integration, data management, and data security.

Figure 1 RTDP Architecture

1.1 Overall Technical Selection

Figure 2 Overall Technical Selection

First, let’s briefly walk through Figure 2:

  • Data sources and clients list the common data source types found in most data application projects.
  • The data bus platform DBus, as the unified data collection platform, is responsible for connecting to the various data sources. DBus extracts data incrementally or in full, performs some basic processing, and finally publishes the processed messages on Kafka.
  • Kafka, as the distributed messaging system, connects message producers and consumers with distributed, highly available, high-throughput, publish-subscribe capabilities.
  • The streaming platform Wormhole, as the unified stream processing platform, is responsible for on-stream processing and for connecting to the various target data stores (Sinks). Wormhole consumes messages from Kafka, supports configuring SQL to implement on-stream processing logic, and supports configurable writes into different Sinks with an eventual consistency (idempotence) guarantee.
  • In the data computing and storage layer, the RTDP architecture keeps the choice of technical components open. Users can select appropriate storage based on actual data characteristics, computation mode, access pattern, data volume, and other factors to solve their specific data project problems. RTDP also supports choosing several different data stores at once, supporting diverse project requirements more flexibly.
  • The computing service platform Moonbox, as the unified computing service platform, is responsible on the heterogeneous data storage side for integration, computation pushdown optimization, and mixed computation across heterogeneous data stores (data virtualization technology), and on the data presentation and interaction side for converging unified metadata queries, unified data computation and distribution, a unified data query language (SQL), and a unified data service interface.
  • The visualization application platform Davinci, as the unified data visualization platform, supports various data visualization and interaction requirements in a configurable way and can be integrated into other data applications to cover the visualization part of their requirements. In addition, it allows different data practitioners to collaborate on the platform on all kinds of daily data applications. Other data-consuming terminal systems, such as the data development platform Zeppelin and the data algorithm platform Jupyter, are not covered in this article.
  • Cross-cutting topics such as data management, data security, development, operation and maintenance, and the driving engine can be integrated and extended through the service interfaces of DBus, Wormhole, Moonbox, and Davinci to support end-to-end control and governance requirements.

Next, we further detail the technical components and cross-cutting topics shown in the figure above, introduce the functional features of the components, focus on the design philosophy of our self-developed components, and discuss the cross-cutting topics.

1.2 Introduction to the Technical Components

1.2.1 Data Bus Platform DBus

Figure 3 DBus of RTDP Architecture

1.2.1.1 DBus Design Philosophy

1) Design philosophy from an external perspective

  • Responsible for connecting to different data sources and extracting incremental data in real time. For databases, an operation-log-based extraction method is used; for log types, connecting to multiple Agents is supported.
  • All messages are published on Kafka in the unified UMS message format. UMS is a standardized JSON format that carries its own metadata. Through UMS, logical messages are decoupled from physical Kafka Topics, so the same Topic can carry multiple UMS message tables.
  • Supports pulling the full data of a database and seamlessly merging it with incremental data into the UMS message stream, transparently to downstream consumers.

2) Design philosophy from an internal perspective

  • A Storm-based computing engine formats the data, keeping end-to-end message latency as low as possible.
  • Standardizes and formats data from different sources into UMS messages, which includes:

Generating a unique, monotonically increasing id for each message, stored in the system field ums_id_

Determining the event timestamp of each message, stored in the system field ums_ts_

Determining the operation type of each message (insert, update, or delete), stored in the system field ums_op_

  • Changes to database table structures are sensed in real time and managed by version number, so that downstream consumers have a clear view of upstream metadata changes.
  • When publishing to Kafka, messages are guaranteed to be strongly ordered (though not absolutely ordered) with at-least-once semantics.
  • A heartbeat table mechanism ensures end-to-end liveness probing of messages.
1.2.1.2 DBus Functional Features
  • Supports configurable full data extraction
  • Supports configurable incremental data extraction
  • Supports configurable online log formatting
  • Supports visual monitoring and alerting
  • Supports configurable multi-tenant security control
  • Supports grouping sub-table data into a single logical table
1.2.1.3 DBus Technical Architecture

Figure 4 DBus Data Flow Architecture Diagram

For more technical details and user interface of DBus, please refer to:

GitHub:https://github.com/BriData

1.2.2 Distributed Message System Kafka

Kafka has become the de facto standard distributed messaging system for big data streaming. Of course, Kafka keeps evolving and improving, and now also offers certain storage and stream processing capabilities. There is already plenty of material covering Kafka’s own functionality and technology, so this article will not elaborate on Kafka’s capabilities themselves.

Here we will specifically discuss the topics of Metadata Management and Schema Evolution on Kafka.

Figure 5

Image source: http://cloudurable.com/images …

Figure 5 shows Confluent’s solution around Kafka, which introduces a metadata management component: Schema Registry. This component is mainly responsible for managing the metadata and Topic information of messages flowing through Kafka and for providing a series of metadata management services. It is introduced so that Kafka consumers can know what data flows on each Topic, along with the metadata of that data, and analyze and consume it effectively.

Any data pipeline, no matter what system it flows through, has a metadata management problem for that data link, and Kafka is no exception. Schema Registry is a centralized metadata management solution for Kafka data pipelines, and on top of it Confluent provides a corresponding Kafka data security mechanism and schema evolution mechanism.
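To make this concrete, here is a minimal sketch of registering an Avro schema with Schema Registry through its REST API (POST /subjects/{subject}/versions). The host, subject name, and schema are illustrative; only standard JDK HTTP classes are used.

```scala
// Hedged sketch: registering an Avro schema with Confluent Schema Registry via
// its REST API. The host and subject are illustrative placeholders.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val schemaJson =
  """{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}"}"""

val request = HttpRequest.newBuilder(
    URI.create("http://schema-registry:8081/subjects/user-value/versions"))
  .header("Content-Type", "application/vnd.schemaregistry.v1+json")
  .POST(HttpRequest.BodyPublishers.ofString(schemaJson))
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // e.g. {"id":1}: the id assigned to the registered schema
```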

For more information on Schema Registry, please refer to:

Kafka Tutorial: Kafka, Avro Serialization and the Schema Registry

http://cloudurable.com/blog/k …

So how does the RTDP architecture solve the problems of Kafka message metadata management and schema evolution?

1.2.2.1 Metadata Management
  • DBus automatically records database metadata changes it senses in real time and provides them as a service
  • DBus automatically records the metadata of online formatted logs and provides it as a service
  • DBus publishes unified UMS messages on Kafka, and a UMS message carries its own metadata, so downstream consumers do not need to call a centralized metadata service: they can read the data’s metadata directly from the UMS message
1.2.2.2 Schema Evolution
  • UMS messages carry the Namespace information of the Schema. A Namespace is a 7-segment locator string that uniquely identifies any table in any life-cycle stage, acting like the IP address of a data table, in the following form:

[Datastore].[Datastore Instance].[Database].[Table].[TableVersion].[Database Partition].[Table Partition]

Example: Oracle.Oracle01.db1.table1.v2.dbpar01.tablepar01

Here [TableVersion] represents the Schema version number of the table. If the data source is a database, this version number is maintained automatically by DBus (see the parsing sketch at the end of this list).

  • In the RTDP architecture, Kafka’s downstream is consumed by Wormhole. When Wormhole consumes UMS, it treats [TableVersion] as *, meaning that when the upstream Schema of a table changes, the version number is automatically incremented, but Wormhole ignores this version change and consumes the incremental/full data of all versions of the table. So how does Wormhole support compatible schema evolution? On-stream processing SQL and output fields can be configured in Wormhole. When an upstream Schema change is a “compatible change” (such as adding fields or widening a field’s type), it does not affect the correct execution of the Wormhole SQL. When an incompatible change occurs upstream, Wormhole reports an error, and manual intervention is then required to adapt the logic to the new Schema.
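To make the Namespace addressing concrete, here is a minimal sketch that splits a 7-segment namespace string into its components; the case class and field names are illustrative, not Wormhole’s actual internals.

```scala
// Hedged sketch: parsing a 7-segment UMS Namespace of the form
// [Datastore].[Instance].[Database].[Table].[TableVersion].[DbPartition].[TablePartition]
case class TableNamespace(dataStore: String, instance: String, database: String,
                          table: String, tableVersion: String,
                          dbPartition: String, tablePartition: String)

def parseNamespace(ns: String): TableNamespace = {
  // throws a MatchError if the string does not have exactly 7 segments
  val Array(ds, inst, db, tbl, ver, dbPar, tblPar) = ns.split("\\.")
  TableNamespace(ds, inst, db, tbl, ver, dbPar, tblPar)
}

parseNamespace("Oracle.Oracle01.db1.table1.v2.dbpar01.tablepar01")
// a consumer like Wormhole would match tableVersion against "*" to accept all versions
```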

As can be seen above, Schema Registry and DBus+UMS are two different approaches to metadata management and schema evolution, each with its own advantages and disadvantages. Table 1 gives a simple comparison.

Table 1 Schema Registry vs. DBus+UMS

Here is an example of UMS:

Figure 6 UMS Message Example
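As a textual companion to Figure 6, the hedged Scala sketch below models the UMS structure described in section 1.2.1.1. Apart from the three system fields (ums_id_, ums_ts_, ums_op_) and the Namespace, all field names and values are illustrative rather than the exact wire format; see the DBus documentation on GitHub for the authoritative definition.

```scala
// Hedged model of a UMS message: protocol header, schema (with namespace and
// fields), and payload tuples. Business fields "id"/"name" are illustrative.
case class UmsProtocol(`type`: String, version: String)
case class UmsField(name: String, `type`: String, nullable: Boolean)
case class UmsSchema(namespace: String, fields: Seq[UmsField])
case class UmsTuple(tuple: Seq[String])
case class Ums(protocol: UmsProtocol, schema: UmsSchema, payload: Seq[UmsTuple])

val example = Ums(
  protocol = UmsProtocol(`type` = "data_increment_data", version = "1.3"),
  schema = UmsSchema(
    namespace = "mysql.mysql01.db1.table1.v2.dbpar01.tablepar01",
    fields = Seq(
      UmsField("ums_id_", "long",     nullable = false), // unique, monotonically increasing id
      UmsField("ums_ts_", "datetime", nullable = false), // event timestamp
      UmsField("ums_op_", "string",   nullable = false), // i / u / d
      UmsField("id",      "long",     nullable = false),
      UmsField("name",    "string",   nullable = true))),
  payload = Seq(
    UmsTuple(Seq("1000001", "2018-09-01 12:00:00.000", "i", "1", "alice")),
    UmsTuple(Seq("1000002", "2018-09-01 12:00:05.000", "u", "1", "bob"))))
```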

1.2.3 Streaming Platform Wormhole

Figure 7 Wormhole of RTDP Architecture

1.2.3.1 Wormhole Design Philosophy

1) Design philosophy from an external perspective

  • Consumes UMS messages and custom JSON messages from Kafka
  • Responsible for connecting to the different target data stores (Sinks) and achieving eventual consistency on the Sink through idempotent logic
  • Supports implementing on-stream processing logic through SQL configuration
  • Provides the Flow abstraction. A Flow is uniquely defined by a Source Namespace and a Sink Namespace, and can carry processing logic; it is a logical abstraction of stream processing. Decoupled from the physical Spark Streaming and Flink jobs, a single Stream can serve multiple Flows, and a Flow can drift between different Streams at will.
  • Supports the Kappa architecture based on Backfill and the Lambda architecture based on Wormhole Job

2) Design philosophy from an internal perspective

  • Processes data streams on the Spark Streaming and Flink computing engines. Spark Streaming supports high throughput, batch Lookup, batch Sink writes, etc.; Flink supports scenarios such as low latency and CEP rules.
  • Implements idempotent write logic for different Sinks through ums_id_ and ums_op_ (see the sketch after this list)
  • Optimizes Lookup logic through computation pushdown
  • Abstracts several unified concepts to support functional flexibility and design consistency:

A unified DAG higher-order fractal abstraction

A unified general streaming message protocol abstraction (UMS)

A unified logical data table namespace abstraction (Namespace)

  • Abstracts several interfaces to support extensibility:

SinkProcessor: support for extending to more Sink types

Swifts interface: support for custom on-stream processing logic

UDF: support for more on-stream processing UDFs

  • Collects dynamic metrics and statistics of streaming jobs in real time through Feedback messages
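As referenced in the idempotence point above, here is a hedged sketch of Wormhole-style idempotent Sink logic: an event is applied only when its ums_id_ is newer than what the Sink already holds, so replayed (at-least-once) messages converge to the same final state. An in-memory map stands in for the real Sink; a production version would also keep tombstones for deletes so stale events cannot resurrect a row.

```scala
import scala.collection.mutable

case class UmsEvent(key: String, umsId: Long, umsOp: String, value: String)

object IdempotentSink {
  // key -> (last applied ums_id_, current value)
  private val store = mutable.Map.empty[String, (Long, String)]

  def apply(e: UmsEvent): Unit = store.get(e.key) match {
    case Some((appliedId, _)) if appliedId >= e.umsId => () // duplicate or stale: skip
    case _ => e.umsOp match {
      case "d"       => store.remove(e.key)               // delete
      case "i" | "u" => store(e.key) = (e.umsId, e.value) // insert / update
      case _         => ()                                // unknown op: ignore
    }
  }
}
```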
1.2.3.2 Wormhole Functional Features
  • Supports visual, configurable, SQL-based development and implementation of streaming projects
  • Supports management, operation and maintenance, diagnosis, and monitoring of dynamic stream processing
  • Supports unified structured UMS messages and custom semi-structured JSON messages
  • Supports processing of insert/update/delete three-state event message streams
  • Supports a single physical Stream processing multiple logical Flows in parallel
  • Supports on-stream Lookup Anywhere and Pushdown Anywhere
  • Supports event-timestamp-based stream processing driven by business policies
  • Supports UDF registration management and dynamic loading
  • Supports concurrent idempotent writes into multiple target data systems
  • Supports multi-level data quality management based on incremental messages
  • Supports stream processing and batch processing based on incremental messages
  • Supports both Lambda and Kappa architectures
  • Supports seamless integration with third-party systems and can serve as a third-party system’s stream processing engine
  • Supports private cloud deployment, security and permission control, and multi-tenant resource management
1.2.3.3 Wormhole Technical Architecture

Figure 8 Wormhole Data Flow Architecture Diagram

For more details of Wormhole technology and user interface, please refer to:

GitHub:https://github.com/edp963/wor …

1.2.4 Selection of Common Data Computing and Storage

The RTDP architecture takes an open, integrative attitude toward the selection of data computing and storage. Different data systems have their own strengths and suitable scenarios, and no single system fits all storage and computing scenarios. Therefore, when a suitable, mature, and mainstream data system appears, Wormhole and Moonbox extend and integrate support for it as needed.

Here are some common categories of selection:

  • Relational databases (Oracle/MySQL, etc.): suitable for complex relational computation on small data volumes
  • Distributed columnar storage systems

Kudu: scan-optimized, suitable for OLAP analysis and computation scenarios

HBase: random reads and writes, suitable for serving data services

Cassandra: high-performance writes, suitable for high-frequency writing of massive data

ClickHouse: high-performance computation, suitable for insert-only scenarios (update and delete operations to be supported later)

  • Distributed file systems

HDFS/Parquet/Hive: append-only, suitable for batch computation over massive data

  • Distributed document systems

MongoDB: balanced capabilities, suitable for moderately complex computation on large data volumes

  • Distributed index systems

Elasticsearch: indexing capability, suitable for fuzzy queries and OLAP analysis scenarios

  • Distributed pre-computation systems

Druid/Kylin: pre-computation, suitable for high-performance OLAP analysis scenarios

1.2.5 Computing Service Platform Moonbox

Figure 9 Moonbox of RTDP Architecture

1.2.5.1 Moonbox Design Philosophy

1) Design philosophy from an external perspective

  • Responsible for connecting to different data systems and supporting unified ad hoc mixed computation across heterogeneous data systems
  • Provides three client invocation methods: RESTful service, JDBC connection, and ODBC connection
  • Converges metadata access, the SQL query language, and permission control into a single unified entry point
  • Provides two write modes for query results: Merge and Replace
  • Provides two interaction modes: Batch mode and Adhoc mode
  • Implements data virtualization and multi-tenancy, so Moonbox can be regarded as a virtual database (see the JDBC sketch below)
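As referenced in the last point, here is a minimal sketch of the “virtual database” experience through Moonbox’s JDBC interface. Only the standard java.sql API is used; the driver URL format and table names are placeholders, so consult the Moonbox documentation for the actual connection details.

```scala
import java.sql.DriverManager

// placeholder URL format; see the Moonbox GitHub docs for the real one
val conn = DriverManager.getConnection(
  "jdbc:moonbox://moonbox-host:10010/default", "user", "password")
val rs = conn.createStatement().executeQuery(
  // one SQL statement joining tables that physically live in different systems
  "SELECT o.id, c.name FROM mysql_db.orders o JOIN hive_db.customers c ON o.cid = c.id")
while (rs.next()) println(s"${rs.getLong("id")}\t${rs.getString("name")}")
conn.close()
```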

2) Design philosophy from an internal perspective

  • SQL is parsed and, through the standard Catalyst processing and analysis pipeline, logic execution subtrees are generated for pushdown to the underlying data systems; the pushdown results are then pulled back for mixed computation and returned
  • Supports a two-level Namespace (database.table) to provide a virtual database experience
  • Implements the distributed service module Moonbox Grid to provide high availability and high concurrency
  • Provides a fast execution channel for logic that can be fully pushed down (no mixed computation needed)
1.2.5.2 Moonbox Functional Features
  • Supports seamless mixed computation across heterogeneous systems
  • Supports unified SQL syntax for querying, computing, and writing
  • Supports three invocation methods: RESTful service, JDBC connection, and ODBC connection
  • Supports two interaction modes: Batch mode and Adhoc mode
  • Supports the Cli Command tool and Zeppelin
  • Supports a multi-tenant user permission system
  • Supports table-level, column-level, read, write, and UDF permissions
  • Supports YARN Scheduler resource management
  • Supports metadata services
  • Supports scheduled tasks
  • Supports security policies
1.2.5.3 Moonbox Technical Architecture

Figure 10 Moonbox Logical Modules

For more technical details and user interface of Moonbox, please refer to:

GitHub:https://github.com/edp963/moo …

1.2.6 Visualization Application Platform Davinci

Figure 11 Davinci of RTDP Architecture

1.2.6.1 Davinci Design Philosophy

1) Design philosophy from an external perspective

  • Responsible for all kinds of data visualization and display functions
  • Supports JDBC data sources
  • Provides an egalitarian user system: each user can create their own Orgs, Teams, and Projects
  • Supports writing data processing logic in SQL and drag-and-drop editing of visualizations, providing a multi-user environment for social division of labor and collaboration
  • Provides a variety of chart interaction and customization capabilities to meet different data visualization requirements
  • Provides the ability to be embedded and integrated into other data applications

2) Design philosophy from an internal perspective

  • Built around View and Widget: a View is a logical view of data, and a Widget is a visual view of data
  • Through user-defined selection of categorical, ordinal, and quantitative data, charts are rendered automatically according to reasonable visualization logic
1.2.6.2 Davinci Functional Features

1) Data sources

  • Supports JDBC data sources
  • Supports CSV file upload

2) Data views

  • Supports defining SQL templates
  • Supports SQL highlighting
  • Supports SQL testing
  • Supports write-back operations

3) Visualization components

  • Supports predefined charts
  • Supports controller components
  • Supports free styles

4) Interaction capabilities

  • Supports full-screen display of visualization components
  • Supports local controllers on visualization components
  • Supports filtering linkage between visualization components
  • Supports group controllers over visualization components
  • Supports local advanced filters on visualization components
  • Supports paging and sliders for large data volumes

5) Integration capabilities

  • Supports CSV download from visualization components
  • Supports public sharing of visualization components
  • Supports authorized sharing of visualization components
  • Supports public sharing of dashboards
  • Supports authorized sharing of dashboards

6) Security and permissions

  • Supports data row- and column-level permissions
  • Supports LDAP login integration

For more technical details and user interface of Davinci, please refer to:

GitHub:https://github.com/edp963/dav …

1.3 Discussion of Cross-cutting Topics

1.3.1 Data Management

1) Metadata management

  • DBus can obtain data source metadata in real time and provide query services
  • Moonbox can obtain data system metadata in real time and provide query services
  • For the RTDP architecture, the metadata of both real-time and ad hoc data sources can be collected by calling the RESTful services of DBus and Moonbox, and an enterprise-level metadata management system can be built on top of this (a collection sketch follows this list)
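A minimal sketch of the collection step just described, assuming hypothetical RESTful endpoints on DBus and Moonbox (the real API paths are documented in each project’s GitHub repository):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val http = HttpClient.newHttpClient()

def get(url: String): String =
  http.send(
    HttpRequest.newBuilder(URI.create(url)).GET().build(),
    HttpResponse.BodyHandlers.ofString()).body()

// hypothetical endpoints for illustration only
val sourceTableMetadata = get("http://dbus-host:8080/api/metadata/tables")
val systemCatalogMetadata = get("http://moonbox-host:8080/api/metadata/catalog")
// merge both feeds into the metadata store of your choice
```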

2) Data quality

  • Wormhole can be configured to land messages into HDFS (hdfslog) in real time. Wormhole Job based on hdfslog supports the Lambda architecture; Backfill based on hdfslog supports the Kappa architecture. Whichever architecture is chosen, the Sink can be refreshed periodically through scheduled tasks to ensure the eventual consistency of the data. Wormhole also feeds back information about on-stream processing exceptions or Sink write exceptions to the Wormhole system in real time, and provides RESTful services for third-party applications to call and handle.
  • Moonbox can perform ad hoc mixed computation over heterogeneous systems, which makes it a convenient “Swiss Army knife”. Scheduled SQL script logic can be written through Moonbox to compare data between the heterogeneous systems of interest, or to compute statistics over the table fields of interest, and a data quality detection system can be developed on top of these capabilities (a sketch follows this list).
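For example, a scheduled data quality check could use Moonbox’s unified SQL entry to compare row counts between a source table and its Sink copy. The sketch below is illustrative: the connection URL, table names, and alerting hook are all placeholders.

```scala
import java.sql.DriverManager

def rowCountDiff(): Long = {
  val conn = DriverManager.getConnection(
    "jdbc:moonbox://moonbox-host:10010/default", "user", "password") // placeholder URL
  try {
    val rs = conn.createStatement().executeQuery(
      """SELECT (SELECT COUNT(*) FROM mysql_db.orders) -
        |       (SELECT COUNT(*) FROM kudu_db.orders) AS diff""".stripMargin)
    rs.next()
    rs.getLong("diff")
  } finally conn.close()
}

// alert when source and sink diverge beyond the allowed replication lag window
if (math.abs(rowCountDiff()) > 0) println("data quality alert: row count mismatch")
```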

3) Data lineage analysis

  • Wormhole’s on-stream processing logic is usually expressed in SQL, which can be collected through its RESTful services
  • Moonbox controls the unified entry point for data queries, and all of its logic is SQL, which can be collected from the Moonbox logs
  • For the RTDP architecture, the SQL of both real-time processing logic and ad hoc processing logic can therefore be gathered by calling Wormhole’s RESTful service and collecting Moonbox’s logs, and an enterprise-level data lineage analysis system can be built on top of this

1.3.2 Data Security

Figure 12 RTDP Data Security

The figure above shows how, in the RTDP architecture, the four open source platforms cover the end-to-end data flow, and each node addresses several aspects of data security, ensuring the end-to-end data security of real-time data pipelines.

In addition, since Moonbox has become the unified entry point for application-level data access, its operation audit logs capture a great deal of security-relevant information, and a data security early warning mechanism can be built around them, working toward an enterprise-level data security system.

1.3.3 Development, Operation and Maintenance

1) Operation and maintenance management

  • Operation and maintenance of real-time data processing has always been a pain point. DBus and Wormhole provide operation and maintenance capabilities through visual UIs, making manual operation and maintenance much easier.
  • DBus and Wormhole provide RESTful services for health checks, operation management, Backfill, Flow drift, and more, on top of which an automated operation and maintenance system can be developed.

2) Monitoring and alerting

  • Both DBus and Wormhole provide visual monitoring interfaces showing throughput and latency at the logical table level in real time
  • DBus and Wormhole provide RESTful services for heartbeat, stats, status, and more, on top of which an automated alerting system can be developed

II. Discussion of Usage Modes and Scenarios

In the previous chapter we introduced the design philosophy and functional features of each technical component of the RTDP architecture, so readers should by now have a concrete picture of how the architecture is implemented. So what common data application scenarios can the RTDP architecture solve? Next, we discuss several usage modes and the demand scenarios each is suited for.

2.1 Synchronization Mode

2.1.1 Mode Description

Synchronization mode is a usage mode in which data is only synchronized between heterogeneous data systems in real time, with no processing logic configured on the stream.

Specifically, DBus is configured to extract data from the source in real time onto Kafka, and Wormhole is configured to write the data from Kafka into the Sink storage in real time. Synchronization mode mainly provides two capabilities:

  • Subsequent data processing logic no longer runs on the business standby database, reducing the load on it
  • It becomes possible to synchronize data from different physical business standby databases into the same physical data store in real time

2.1.2 Technical Difficulties

The implementation is relatively simple.

IT implementers do not need to understand the common pitfalls of stream processing, nor design or implement on-stream processing logic; they only need to understand the basic configuration of flow control parameters.

2.1.3 Operation and Maintenance Management

Operation and maintenance management is relatively simple.

Manual operation and maintenance is required. However, since there is no on-stream processing logic, flow control is easy to manage without accounting for the resource consumption of on-stream processing, so a relatively stable synchronization pipeline configuration can be found. Scheduled end-to-end data comparisons are also easy to run to ensure data quality, because the source and target data are exactly the same.

2.1.4 Applicable Scenarios

  • Real-time data synchronization and sharing across departments
  • Decoupling of transactional and analytical databases
  • Building the real-time ODS layer of a data warehouse
  • Self-service development of simple real-time reports by users
  • Etc.

2.2 Streaming Computation Mode

2.2.1 Mode Description

Streaming computation mode builds on synchronization mode by additionally configuring processing logic on the stream.

In the RTDP architecture, on-stream processing logic is configured and supported mainly on the Wormhole platform. On top of the capabilities of synchronization mode, streaming computation mode mainly provides two more:

  • On-stream computation spreads the concentrated computational cost of batch processing continuously and incrementally over the stream, greatly reducing the latency of result snapshots
  • On-stream Lookup provides a new option for mixed computation across heterogeneous systems

2.2.2 Technical Difficulties

The implementation is relatively difficult.

Users need to understand what on-stream processing can and cannot do well, and how to transform full computation logic into incremental computation logic. The resource consumption of the on-stream processing logic itself, and of the external data systems it depends on, must also be considered, and more parameters need to be tuned and configured.
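As an illustration of turning a full computation into an incremental one, the hedged sketch below maintains a running aggregate (SELECT category, SUM(amount) GROUP BY category) from insert/update/delete events instead of rescanning the full table; the event shape and field names are illustrative.

```scala
import scala.collection.mutable

case class RowEvent(op: String, category: String, oldAmount: Double, newAmount: Double)

object IncrementalSum {
  private val sums = mutable.Map.empty[String, Double].withDefaultValue(0.0)

  def update(e: RowEvent): Unit = e.op match {
    case "i" => sums(e.category) += e.newAmount               // new row adds its amount
    case "d" => sums(e.category) -= e.oldAmount               // deleted row retracts it
    case "u" => sums(e.category) += e.newAmount - e.oldAmount // update applies the delta
  }

  def snapshot: Map[String, Double] = sums.toMap // low-latency result snapshot
}
```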

2.2.3 Operation and Maintenance Management

Operation and maintenance management is relatively difficult.

Manual operation and maintenance is required, and it is harder than in synchronization mode, mainly in aspects such as flow control parameter tuning, the inability to run simple end-to-end data comparisons, the choice of an eventual consistency strategy for result snapshots, and the choice of an on-stream Lookup time alignment strategy.

2.2.4 Applicable Scenarios

  • Data application projects or reports with strict low-latency requirements
  • Scenarios requiring low-latency calls to external services (e.g., calling an external rule engine on the stream, using online algorithm models, etc.)
  • Building real-time fact table + dimension table wide tables for the data warehouse
  • Real-time multi-table fusion, splitting, cleansing, and standardized Mapping scenarios
  • Etc.

2.3 Rotation Mode

2.3.1 Mode Description

Rotation mode builds on streaming computation mode by alternating between stream computation and batch computation: short-period scheduled tasks run further computation on the database, and the results are published back onto Kafka for the next round of on-stream computation.

In the RTDP architecture, the integration path Kafka → Wormhole → Sink → Moonbox → Kafka can implement rotation computation of any number of rounds at any frequency. On top of the capabilities of streaming computation mode, the main capability rotation mode provides is the theoretical ability to support arbitrarily complex streaming computation logic at low latency.
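The Moonbox → Kafka leg of this loop can be sketched as follows: a scheduled task computes a result (for example through Moonbox JDBC, as sketched in section 1.2.5) and publishes it back onto Kafka for the next round of on-stream computation. The standard kafka-clients producer API is used; the topic name and payload are illustrative.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "kafka-host:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
val resultJson = """{"orderId": 1, "dailyTotal": 42.0}""" // built from the Moonbox result set
producer.send(new ProducerRecord("rotation.round2.input", "order-1", resultJson))
producer.close()
```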

2.3.2 Technical Difficulties

The implementation is difficult.

Introducing Moonbox’s capabilities alongside Wormhole adds more variables to consider than in streaming computation mode, such as the choice among multiple Sinks, the frequency of Moonbox computation, and how to split the computation between Wormhole and Moonbox.

2.3.3 Operation and Maintenance Management

Operation and maintenance management is difficult.

Manual operation and maintenance is required. Compared with streaming computation mode, more data system factors must be considered, more parameters must be configured and optimized, and data quality management and diagnostic monitoring are harder.

2.3.4 Applicable Scenarios

  • Low-latency, multi-step, complex data processing logic scenarios
  • Building a company-level real-time data flow processing network

2.4 Intelligent Mode

2.4.1 Mode Description

Intelligent mode is a usage mode in which rules or algorithm models are applied to improve efficiency and effectiveness.

Aspects that can be made intelligent:

  • Intelligent drifting of Wormhole Flows (intelligent automated operation and maintenance)
  • Intelligent optimization of Moonbox pre-computation (intelligent automated optimization)
  • Intelligent conversion of full computation logic into streaming computation logic, deployed onto Wormhole + Moonbox (intelligent automated development and deployment)
  • Etc.

2.4.2 Technical Difficulties

In theory this mode is the simplest to use, but the enabling technology is the hardest to implement effectively.

Users only need to complete the development of offline logic; the rest is left to intelligent tools for development, deployment, optimization, and operation and maintenance.

2.4.3 Operation and Maintenance Management

Zero operation and maintenance.

2.4.4 Applicable Scenarios

All scenarios.

This concludes, for now, our discussion of the topic “how to design a real-time data platform”. Starting from the conceptual background, we discussed architecture design, then introduced the technical components, and finally discussed usage modes and scenarios. Since every topic touched on here is very large, this article only gives a brief introduction and discussion. From time to time we will explore a specific topic in detail to share our practice and experience, and to invite valuable ideas and discussion. If you are interested in the four open source platforms of the RTDP architecture, please find us on GitHub to learn about usage and exchange suggestions.

Author: Lu Shanwei

Source: Yixin Institute of Technology