Source: The 2nd Technical Salon of the Yixin Institute of Technology, Live Online | Construction Practice of Yixin's Agile Data Middle Platform
Speaker: Lu Shanwei, head of the data middle platform team at Yixin
Introduction: In 2017 Yixin open-sourced a series of big data tools, including the familiar DBus, Wormhole, Moonbox, and Davinci, which have received wide attention and positive feedback in the technology community. How are these tools applied inside Yixin? How do they relate to Yixin's data middle platform? How does the platform drive everyday data business scenarios?
This talk answers these questions, focusing on the design, architecture, and application scenarios of Yixin's agile data middle platform, and proposes an approach to building an agile data middle platform for reference and discussion. The following is a transcript of the talk.
2. Top-level design of Yixin's agile data middle platform
3. From middleware tools to platforms
4. Typical case analysis
Video playback address: https://v.qq.com/x/page/r0874 …
PPT download address: https://pan.baidu.com/s/1jRum …
At present, the concept of the "middle platform" is very popular, including the data middle platform, AI middle platform, business middle platform, technology middle platform, and so on. Dr. Jing Yuxin shared Yixin's AI middle platform at the first technical salon of the Yixin Institute of Technology; in this salon, I will share "Construction Practice of Yixin's Agile Data Middle Platform".
Why add "agile" to our data middle platform? Friends who know us will know that my team is the Yixin Agile Big Data Team. We advocate an agile culture and integrate agile thinking into system construction. We have also developed four open-source platforms: DBus, Wormhole, Moonbox, and Davinci. Yixin's data middle platform was developed and built by our agile big data team on top of these four open-source platforms, so we call it the "agile data middle platform".
This talk is divided into three parts:
- Top-level design of Yixin's agile data middle platform. A data middle platform is a company-level platform system, so it cannot be designed only at the technical level; the top-level design must also cover processes, standardization, and so on.
- From middleware tools to platforms: how Yixin designed and built the agile data middle platform.
- Typical case analysis: the data applications and practices supported by Yixin's agile data middle platform.
2. Top-level design of Yixin's agile data middle platform
2.1 Characteristics and Requirements
There is no standard solution for building a data middle platform, and no single data middle platform suits every company. Each company should develop a data middle platform that fits its own business scale and current data needs.
Before introducing the top-level design of Yixin's agile data middle platform, let's first look at its background:
- Many business sectors and lines. Yixin's business can be broadly divided into four major sectors: inclusive finance, wealth management, asset management, and financial technology, with nearly 100 business and product lines.
- Many technology stacks. Different business teams have different data requirements. Driven by these objective requirements and subjective preferences, they choose different data components, including MySQL, Oracle, HBase, Kudu, Cassandra, Elasticsearch, MongoDB, Hive, Spark, Presto, Impala, ClickHouse, etc.
- Diverse data needs. The diversity of business lines leads to diverse data requirements, including reports, visualization, data services, push, migration, synchronization, data applications, etc.
- Changeable data needs. To keep up with rapid Internet-driven change, business teams' data requirements also change constantly; new data requirements and data applications often need to be delivered within a week.
- Data management considerations. Data meta-information must be searchable, data definitions and processes standardized, and data management controllable.
- Data security considerations. As a company with both Internet and financial attributes, Yixin has high requirements on data security and permissions. We have done a lot of work on data security, including multilevel data security policies, traceable data links, and no leakage of sensitive data.
- Data permission considerations. Work on data permissions includes table-level, column-level, and row-level permissions, organizational structure, roles, and automation of permission policies.
- Data cost considerations. Including cluster cost, operation and maintenance cost, labor cost, time cost, risk cost, etc.
2.2 Positioning
Every company positions its data middle platform differently. Some companies are focused on a single main line of business; when building a data middle platform, they may need a vertical platform that reaches the front line and supports it more directly.
As mentioned earlier, Yixin has many business lines with no single dominant business, which effectively makes every business line a main business. Against this background, we need a platform-style data middle platform that supports the needs and operation of all business lines.
Figure 1 Positioning
As shown in the figure above, the green part is the Yixin agile data middle platform, which we call the "ADX data middle platform"; the "A" stands for "Agile", because we want to build it into a platform system that serves all business lines and helps the business grow.
The agile data middle platform sits in the middle, with the various data clusters at the bottom and the data teams of each business domain at the top. By integrating and processing data from the clusters, the middle platform provides self-service, real-time, unified, service-oriented, manageable, and traceable data services to the business-domain data teams.
The three blue panels on the right are the Data Management Committee, the Data Operations Team, and the Data Security Team. As mentioned earlier, Yixin has very high data security requirements, so a dedicated data security team plans the company's data security processes and policies. The Data Management Committee is responsible for data standardization and requirement flow, compensating for the limited driving force of a purely technology-driven middle platform and ensuring that data assets are effectively accumulated and presented.
Our positioning of the Yixin agile data middle platform is: moving from reuse of data technology and computing capability to reuse of data assets and data services, the agile data middle platform lets data empower the business directly, with greater value bandwidth, in a fast, accurate, and economical way.
2.3 Value
The value of the Yixin agile data middle platform is concentrated in three aspects: fast, accurate, and economical.
Figure 2 Value
| Existing problem | "Fast" in the agile data middle platform |
| --- | --- |
| Customized requirements cause repeated development | Platformization: transparently packaged, reusable technical components |
| Implementation teams need to be scheduled for every requirement | Self-service: simple configuration, months => days |
| T+1 latency cannot support real-time, fine-grained operations | Real-time: drives business growth, days => minutes |

| Existing problem | "Accurate" in the agile data middle platform |
| --- | --- |
| Data is stored in different places, retrieved in different ways, and cleaned with different logic | Unification: unified data lake ingestion and export |
| Data islands have not been integrated | Management: metadata, data maps, data lineage |
| Demand-driven implementation cannot accumulate data assets | Capitalization: model management makes data trustworthy; standardized model processing accumulates data assets |

| Existing problem | "Economical" in the agile data middle platform |
| --- | --- |
| Time cost: requirement scheduling and repeated development | Self-service: saving time saves cost |
| Labor cost: repeated development, little reuse | Platformization: high reuse of mature technical components |
| Hardware cost: waste caused by misuse of cluster resources | Refinement: cluster resources can be estimated, inspected, and quantified |
2.4 Module Architecture Dimension
Figure 3 Module Architecture Dimensions
As the figure shows, the construction of the Yixin agile data middle platform also follows the consensus of "small front office, big middle platform". The entire middle section is the agile data middle platform itself; the green part on the left views the platform from the data dimension, and the blue part on the right views it from the platform dimension.
- Data dimension. All kinds of internal and external data are first collected into the data source layer, then stored in a unified, real-time, standardized, and secure manner to form the data lake layer. The data lake processes and systematizes this raw data, converting it into data assets. The data asset layer includes the data warehouse system, indicator system, tag system, feature system, master data, and so on. Finally, the accumulated, reusable data assets are provided to the data application layer for BI, AI, and data products.
- Platform dimension. Each blue box represents a technical module, and the entire Yixin agile data middle platform is composed of these modules. The DataHub data hub helps users complete self-service data application, publishing, desensitization, cleaning, and services. The DataWorks data workshop supports self-service querying, processing, and visualization of data. There are also the DataStar data model, DataTag data tagging, DataMgt data management, and ADXMgt middle-platform management modules.
It is worth mentioning that these modules were not developed from scratch but built on our existing open-source tools. First, building on mature middleware tools saves development time and cost. Second, the open-source tools become engines that can work together to support a larger one-stop platform.
2.5 Data Capability Dimension
Figure 4 Data Capability Dimension
The architecture modules above can be layered again along the capability dimension, with each layer containing several capabilities. As shown in the figure, you can clearly see which data capabilities the construction of the data middle platform requires, which functional modules these capabilities correspond to, and which problems each can solve. The details are not expanded here.
3. From middleware tools to platforms
3.1 ABD overview
Figure 5 ABD overview
The middleware tools are the four open-source platforms DBus, Wormhole, Moonbox, and Davinci. They are abstracted under the concept of Agile BigData (ABD), forming the ABD platform stack. The agile data middle platform is called ADX (Agile Data X Platform). In other words, we went through a process from ABD to ADX.
At first, by abstracting and summarizing the commonalities of business requirements, we incubated several general-purpose middleware tools to solve different classes of problems. As requirements became more complex, we tried combining these tools and found in practice that certain combinations kept recurring. At the same time, users prefer self-service: they want to use the tools directly rather than select and combine them themselves each time. For these two reasons, we packaged the open-source tools together.
DBus (data bus platform) is a DBaaS (Data Bus as a Service) platform solution.
DBus is dedicated to providing real-time data collection and distribution solutions for big data project development and operations personnel. The platform uses a highly available streaming computation framework to provide real-time transmission of massive data and reliable multi-channel message subscription and distribution. Through simple and flexible configuration, the data generated by various IT systems in business processes is collected, uniformly processed, and converted into the JSON-based UMS format for different downstream consumers to subscribe to. DBus can serve as a data source for data warehouses, big data analysis platforms, real-time reports, real-time marketing, and other businesses.
Open source address:https://github.com/BriData
Figure 6 DBus Function and Location
As shown in the figure, DBus can non-invasively interface with data sources of various databases, extract incremental data in real time, perform unified cleaning and processing, and store in Kafka in UMS format.
The functions of DBus also include batch extraction, monitoring, distribution, multi-tenancy, and clear rules for configuration. The specific functional characteristics are shown in the figure.
The bottom right corner of the above figure shows a screenshot of DBus. Users can use a visual page on DBus to pull incremental data, configure logs and cleaning methods, and complete real-time data extraction.
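To make the UMS format more concrete: it is a JSON envelope that carries the schema once alongside one or more rows of incremental data. The sketch below is only an approximation of the general shape described in the DBus documentation; the exact field names and layout may differ by version, and the namespace and business columns are invented for the example.

```python
import json

# Illustrative UMS-style message. The ums_id_/ums_ts_/ums_op_ system
# columns (message id, event time, operation i/u/d) follow the general
# shape described in the DBus docs; treat this as an approximation,
# not the authoritative format.
ums_message = {
    "protocol": {"type": "data_increment_data", "version": "1.3"},
    "schema": {
        "namespace": "mysql.orders_db.orders",  # source.db.table (invented)
        "fields": [
            {"name": "ums_id_", "type": "long", "nullable": False},
            {"name": "ums_ts_", "type": "datetime", "nullable": False},
            {"name": "ums_op_", "type": "string", "nullable": False},
            {"name": "order_id", "type": "long", "nullable": False},
            {"name": "amount", "type": "decimal", "nullable": True},
        ],
    },
    "payload": [
        {"tuple": ["1001", "2019-06-01 12:00:00.000", "i", "42", "99.50"]},
        {"tuple": ["1002", "2019-06-01 12:00:01.000", "u", "42", "89.50"]},
    ],
}

# Downstream consumers read the schema once, then decode each tuple.
encoded = json.dumps(ums_message)
decoded = json.loads(encoded)
print(len(decoded["payload"]))  # 2 rows in this message
```

Because the schema travels with the data, a consumer can decode rows from any table without out-of-band metadata, which is what makes the unified subscription model work.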
Figure 7 DBus Architecture
As can be seen from the above architecture diagram, DBus includes several different processing modules and supports different functions. (GitHub has specific introduction and will not be expanded here. )
Wormhole (streaming platform) is a SPaaS (Streaming Processing as a Service) platform solution.
Wormhole is dedicated to providing data streaming solutions for big data project development and management operations personnel. The platform focuses on simplifying and unifying the development and management process, providing a visual operation interface, a business development method based on configuration and SQL, shielding the details of underlying technology implementation, greatly reducing the development threshold, and making the development and management of big data streaming projects more lightweight, agile, controllable and reliable.
Open source address:https://github.com/edp963/wor …
Figure 8 Wormhole Function and Location
DBus stores real-time data in Kafka in UMS format; Wormhole is what we use to consume this real-time streaming data.
Wormhole supports the configuration of streaming processing logic and can write processed data to different data stores. The above figure shows many features of Wormhole, and we are still developing more new features.
In the lower right corner of the figure is a screenshot of Wormhole. As a streaming platform, Wormhole does not develop its own streaming engine; it relies mainly on two streaming computation engines, Spark Streaming and Flink. Users select one of them, such as Spark, configure the streaming logic, determine how side data is looked up, and express the logic by writing SQL. If CEP is involved, it runs on Flink.
As can be seen, the threshold for using Wormhole is configuration plus SQL. This is in line with our long-standing concept of enabling users to work with big data themselves in an agile way.
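Wormhole delegates actual execution to Spark or Flink. Purely to illustrate the "configuration plus SQL" usage model, the toy below applies a user-configured SQL statement to each micro-batch of events, with an in-memory SQLite database standing in for the streaming engine; the table name `stream` and all data are invented for the example.

```python
import sqlite3

def process_batch(rows, transform_sql):
    """Apply a user-supplied SQL transform to one micro-batch of events.

    rows: list of (user_id, amount) tuples. transform_sql reads from a
    table named `stream` and returns the processed result set.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE stream (user_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO stream VALUES (?, ?)", rows)
    result = con.execute(transform_sql).fetchall()
    con.close()
    return result

# The "configuration": business logic is just SQL, no engine code.
sql = "SELECT user_id, SUM(amount) FROM stream GROUP BY user_id ORDER BY user_id"
batch = [(1, 10.0), (2, 5.0), (1, 2.5)]
print(process_batch(batch, sql))  # [(1, 12.5), (2, 5.0)]
```

The point of the pattern is that the user never touches engine APIs: swapping SQLite for Spark Streaming changes the execution layer, not the user-facing SQL.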
Figure 9 Wormhole Architecture
The figure above shows Wormhole's architecture diagram, which contains many functional modules. A few of them:
- Wormhole supports idempotent writes to heterogeneous sinks, which helps users solve data consistency problems.
- Anyone who has used Spark Streaming knows that one Spark Streaming application usually does only one thing. Wormhole abstracts a layer of "logical Flows" on top of Spark Streaming's physical computation pipeline: a Flow defines from where, to where, and what happens in between. With this decoupling, Wormhole supports running multiple Flows with different business logic in a single physical Spark Streaming pipeline. Theoretically, given 1,000 different source tables and 1,000 different streaming transformations, you can obtain 1,000 different result tables by launching just one Spark Streaming application in Wormhole and running 1,000 logical Flows inside it. Of course, doing so may increase each Flow's latency, since they all share one pipeline, but the setup is very flexible: a Flow can monopolize a VIP Stream, while Flows with small traffic or loose latency requirements can share a Stream. This flexibility is a great feature of Wormhole.
- Wormhole has its own directive and feedback system. Users can change logic dynamically online without restarting or stopping the flow, and receive job status and feedback results in real time.
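The Flow multiplexing idea can be sketched in a few lines: one physical pipeline dispatches each incoming record to every logical Flow whose source namespace matches, so many independent pieces of business logic share a single stream. This is a conceptual toy, not Wormhole's actual implementation; all class and namespace names are invented.

```python
# Toy model of logical-Flow multiplexing inside one physical pipeline.
class Flow:
    def __init__(self, source_ns, sink_ns, transform):
        self.source_ns = source_ns   # "from where"
        self.sink_ns = sink_ns       # "to where"
        self.transform = transform   # "what happens in between"

class PhysicalStream:
    def __init__(self):
        self.flows = []
        self.sinks = {}              # sink namespace -> written results

    def register(self, flow):
        self.flows.append(flow)

    def process(self, record):
        ns, value = record
        for flow in self.flows:
            if flow.source_ns == ns:     # route by source namespace
                out = flow.transform(value)
                self.sinks.setdefault(flow.sink_ns, []).append(out)

stream = PhysicalStream()
stream.register(Flow("db.orders", "dw.orders_eur", lambda v: round(v * 0.9, 2)))
stream.register(Flow("db.orders", "dw.orders_raw", lambda v: v))
stream.register(Flow("db.clicks", "dw.clicks", lambda v: v + 1))

for rec in [("db.orders", 100.0), ("db.clicks", 7), ("db.orders", 50.0)]:
    stream.process(rec)

print(stream.sinks["dw.orders_eur"])  # [90.0, 45.0]
```

The trade-off mentioned in the text is visible here: every registered Flow shares the `process` loop, so a slow transform delays the whole pipeline, which is why a latency-sensitive Flow may deserve its own dedicated Stream.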
Moonbox (computation service platform) is a DVtaaS (Data Virtualization as a Service) platform solution.
Moonbox targets data warehouse engineers, data analysts, data scientists, and similar roles. Based on the design ideas of data virtualization, it is dedicated to providing batch computation service solutions. Moonbox shields the physical and usage details of the underlying data sources, giving users an experience like a virtual database. Using a unified SQL language alone, users can transparently run mixed computations across heterogeneous data systems. In addition, Moonbox provides basic support for data services, data management, data tools, and data development, enabling more agile and flexible data application architectures and logical data warehouse practices.
Open source address:https://github.com/edp963/moo …
Figure 10 function and location of moonbox
After being stream-processed by Wormhole, the data from DBus may land in different data stores, and we need to compute across all of them. Moonbox supports seamless mixed computation over multi-source heterogeneous systems. The figure above shows Moonbox's functional features.
The usual "ad hoc query" is not truly ad hoc, because it requires users to load the data into Hive beforehand; that is a preset job. Moonbox does not need the data moved to one place in advance, which makes truly ad hoc queries possible. The data can remain scattered across different stores; when a user needs it, one SQL statement is enough. Moonbox automatically parses the SQL, determines which tables live where, plans the execution, and returns the result.
Moonbox provides standard REST APIs, JDBC, ODBC, and so on, so it can also be regarded as a virtual database.
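Conceptually, what a cross-source SQL engine like Moonbox does resembles the toy below: figure out which table lives in which system, push down whatever each system can compute locally, and combine the partial results. All data, names, and the two-step plan are invented for illustration; Moonbox itself does this via Spark-based query planning.

```python
# Toy federated query: two "heterogeneous" sources held as plain dicts.
users_in_mysql = {1: "alice", 2: "bob"}           # e.g. a MySQL table
orders_in_es = [(1, 30.0), (2, 12.0), (1, 8.0)]   # e.g. an ES index

def federated_total_by_user():
    """Answer 'total order amount per user name' without first copying
    both tables into one store."""
    # Step 1: "push down" the aggregation to the orders source.
    totals = {}
    for user_id, amount in orders_in_es:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    # Step 2: join the partial result with the users source.
    return sorted((users_in_mysql[uid], amt) for uid, amt in totals.items())

print(federated_total_by_user())  # [('alice', 38.0), ('bob', 12.0)]
```

The user-visible contract is the key point: one logical query over tables that physically live in different systems, with the engine deciding what runs where.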
Figure 11 Moonbox architecture
The figure above shows Moonbox's architecture. Moonbox's computation engine is also based on the Spark engine rather than developed in-house. Moonbox extends and optimizes Spark, adding many enterprise database capabilities, such as users, tenants, permissions, and stored-procedure-like features.
As the figure shows, Moonbox's server side is a distributed architecture, so it is also highly available.
Davinci (visualization application platform) is a DVaaS (Data Visualization as a Service) platform solution.
Davinci is dedicated to providing one-stop data visualization solutions for business personnel, data engineers, data analysts, and data scientists. It can be deployed independently on a public or private cloud, or integrated into a third-party system as a visualization plug-in. With simple configuration on the visualization UI, users can serve a variety of data visualization applications and support visualization features such as advanced interaction, industry analysis, pattern exploration, and social intelligence.
Open source address:https://github.com/edp963/dav …
Figure 12 Davinci function and location
Davinci is a visualization tool, and its functional characteristics are shown in the figure.
Figure 13 Davinci architecture
At the design level, Davinci has its own complete and consistent internal logic, including Source, View, and Widget, supporting a wide range of data visualization applications.
Figure 14 Davinci rich client application
Davinci is a rich-client application, so what matters most is its front-end experience, richness, and ease of use. Davinci supports both chart-driven and pivot-driven Widget editing. The figure above is an example of the pivot-driven style: the horizontal and vertical axes are pivots that cut the whole chart into cells, and each cell can use a different chart type.
3.2 ABD architecture
Figure 15 ABD architecture
In the ABD era, we supported various data application requirements by combining the four open-source tools ourselves. As shown in the figure above, the whole end-to-end process is strung together. This architecture diagram reflects our concept of "opening up the whole link by converging at one end and staying open at the other".
- Converge. Capabilities such as collection, storage, streaming, computation, services, and queries need to converge into one platform.
- Stay open. In a complex business environment, the data sinks are varied and cannot be unified; no single storage or data system can meet all requirements and free everyone from technology selection. This end is therefore kept open: everyone can freely choose open-source tools and components, and we adapt to and stay compatible with them.
3.3 ADX overview
At a certain stage of development, we needed a one-stop platform to package the basic components so that users could finish data-related work more simply on it; this started the construction phase of the ADX data middle platform.
Figure 16 ADX overview
The figure above is an overview of ADX, equivalent to a first-level function menu. After logging into the platform, users can do the following:
- Project board: view your project's board, including health statistics and other metrics.
- Project management: handle project-related management, including asset management, permission management, approval management, etc.
- Data management: manage data, such as viewing metadata and data lineage.
- Data application: once the project is configured and the data is understood, real work can begin. For security and permission reasons, not everyone can use the stored data directly, so a data application must be filed first. The blue modules on the right are the five functional modules of the ADX data middle platform highlighted in this talk. Data application is mostly realized by the DataHub data hub, which supports self-service application, publishing, standardization, cleaning, desensitization, and so on.
- Ad hoc query, batch job and streaming job are implemented based on DataWorks data workshop.
- The data model is implemented based on DataStar, a model management platform.
- The application market includes data visualization (after data is processed, the final presentation can be configured as charts, dashboards, etc.; Davinci may be used here), common analysis methods such as tag-based profiling and behavior analysis, an intelligent toolbox (helping data scientists with data set analysis, mining, and algorithm-model work), intelligent services, intelligent dialogue (such as intelligent chatbots), and so on.
3.3.1 ADX-DataHub data hub
Figure 17 DataHub workflow
The dashed blue box above shows DataHub's process architecture, and the orange boxes are our open-source tools, where "tria" stands for Triangle, a job scheduling tool developed by another Yixin team.
DataHub is not a simple encapsulation of links; it lets users obtain better services at a higher level. For example, if a user needs a second-accurate snapshot at some historical moment, or wants real-time incremental data for streaming processing, DataHub can provide it.
How does it do this? By turning the open-source tools into engines and integrating them. For example, different data sources are extracted in real time through DBus and, after Wormhole streaming processing, land in the HDFS log data lake. We keep all real-time incremental data there, which means we can obtain every historical change from it, synchronized in real time. Moonbox then defines logic on top, so when a user asks for a snapshot or incremental data as of some historical moment, it can be computed and served immediately. If real-time reports are needed, a real-time snapshot of the data must be maintained in some store; here we chose Kudu.
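The "snapshot at an arbitrary historical moment" capability falls out of keeping every change record: replay all changes up to the requested timestamp, keeping only the latest version of each key. A minimal sketch of the idea (the record layout is invented for the example):

```python
def snapshot_at(changes, ts):
    """Rebuild a table snapshot as of timestamp `ts` from an ordered
    change log. Each change is (ts, op, key, value) where op is
    'i' (insert), 'u' (update), or 'd' (delete)."""
    state = {}
    for change_ts, op, key, value in changes:
        if change_ts > ts:
            break                  # the log is ordered by time
        if op == "d":
            state.pop(key, None)   # delete removes the key
        else:
            state[key] = value     # insert/update keeps latest value
    return state

log = [
    (1, "i", "row1", 100),
    (2, "i", "row2", 200),
    (3, "u", "row1", 150),
    (4, "d", "row2", None),
]
print(snapshot_at(log, 2))  # {'row1': 100, 'row2': 200}
print(snapshot_at(log, 4))  # {'row1': 150}
```

The same replay, run continuously instead of up to a fixed timestamp, is what maintaining a real-time snapshot in a store like Kudu amounts to.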
Streaming processing has many advantages, but also weak points such as higher operation and maintenance costs and weaker stability. Considering these problems, we built Sqoop into DataHub as Plan B. If the real-time path fails at night, DataHub can automatically switch to Plan B and use traditional Sqoop to support the next day's T+1 reports. Once the problem is found and fixed, Plan B is suspended again.
Suppose users have their own data sources in Elasticsearch or MongoDB and want to publish them through DataHub to share with others. We should not physically copy the Elasticsearch or MongoDB data to one place: first, the data is NoSQL and relatively large; second, users may want others to run fuzzy queries against the Elasticsearch data, so keeping the data in Elasticsearch is likely better. In this case, we make a logical publication through Moonbox, and the user does not perceive the process.
To sum up, DataHub organically integrates and encapsulates the common usage patterns of the open-source platforms internally, and externally provides consistent, convenient data acquisition and publishing services. Its users can play a variety of roles:
- Data owners can approve data requests here.
- Data engineers can apply for data and, once approved, process it here.
- App users can view Davinci reports.
- Data analysts can connect their own tools directly to DataHub to fetch data and then analyze it.
- Data users who want to build their own data products can get interfaces from DataHub.
Figure 18 DataHub architecture
As shown in the figure, open DataHub to see its architecture design. From the perspective of functional modules, DataHub implements different functions based on different open source components. Including batch collection, streaming collection, desensitization, standardization, etc., and can also output subscriptions based on different protocols.
DataHub is also closely related to the other components: the data it outputs is consumed by DataWorks, and it relies on the middle-platform management (ADXMgt) and data management (DataMgt) modules to meet its needs.
3.3.2 ADX-DataLake real-time data lake
In the broad sense, a "data lake" means putting all data together, focusing first on storage and collection, and then providing different usage methods for different data at use time.
What we refer to here is a narrow data lake, which supports collecting only two types of data, structured data sources and natural-language text, with a unified storage method.
Figure 19 DataLake
In other words, our real-time data lake is restricted. All structured data sources and natural language texts of the company will be consolidated into UbiLog in real time, and ADX-DataHub will provide unified external access. The access and use of UbiLog can only be output through the capabilities provided by ADX, thus ensuring multi-tenancy, security and authority control.
3.3.3 ADX-DataWorks data workshop
The main data processing is done in DataWorks.
Figure 20 DataWorks workflow
Look at DataWorks' workflow as shown in the figure. First, DataWorks receives data from DataHub. DataWorks supports real-time reports; internally we use Kudu, and because we have solidified this pattern, users do not need to make their own technology choices and can write their logic directly on top. For example, a real-time DM or batch DM may be a valuable, reusable data asset; we hope other businesses can reuse it, so it can be published through DataHub and other businesses can apply to use it.
Thus the data middle platform built from DataHub, DataWorks, and the other components achieves data sharing and data operation. The middle platform contains storage components such as Kudu, Kafka, Hive, and MySQL, but users do not need to choose among them: we have made sensible choices and packaged them into a platform that can be used directly.
On the left side of the figure is the data modeler, who manages and develops models in DataStar and is mainly responsible for creating the logic and models used in DataWorks. Data engineers are, needless to say, the most common users of DataWorks. End users can use Davinci directly.
Figure 21 DataWorks architecture
As shown in the figure, open DataWorks to see its architecture, and DataWorks also supports various functions through different modules. There will be more articles and sharing about this part of the content in the future, which will not be introduced in detail here.
3.3.4 ADX-DataStar Data Model
Figure 22 DataStar Workflow
DataStar is about the data indicator model, or data assets. Every company has its own internal data modeling process and tools. DataStar can be divided into two parts:
- Model design, management, and creation: managing the model life cycle and accumulating process know-how.
- From the DW (data warehouse) layer to the DM (data mart) layer, a configuration-based mode is supported: the corresponding SQL logic is automatically generated underneath, without users writing it themselves.
DataStar is based on a star model of DW-layer fact and dimension tables, which can be accumulated over time. We believe that going from the DW layer to the DM or APP layer requires no hand-written SQL development: by simply selecting dimensions and configuring indicators, everything can be configured visually.
This changes what is required of users: a modeler or business person, given the basic data layer, configures the indicators they want according to their own requirements. In the whole process, the data implementers only need to care about the ODS-to-DW layers.
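Generating the DW-to-DM SQL from a dimensions-plus-metrics configuration is mechanical. A minimal sketch of the idea follows; the configuration shape, table, and column names are all invented for illustration and are not DataStar's actual API.

```python
def build_dm_sql(fact_table, dimensions, metrics):
    """Turn a list of dimensions plus a {metric_name: (aggregate, column)}
    configuration into the GROUP BY query a user would otherwise
    hand-write."""
    select_metrics = [
        f"{agg}({col}) AS {name}" for name, (agg, col) in metrics.items()
    ]
    select_clause = ", ".join(dimensions + select_metrics)
    group_clause = ", ".join(dimensions)
    return (f"SELECT {select_clause} FROM {fact_table} "
            f"GROUP BY {group_clause}")

sql = build_dm_sql(
    "dw.fact_orders",
    dimensions=["dt", "region"],
    metrics={"order_cnt": ("COUNT", "order_id"),
             "gmv": ("SUM", "amount")},
)
print(sql)
```

Since every DM-layer indicator reduces to dimensions plus aggregates over a star schema, a visual configurator on top of a generator like this is enough for a business person to produce the query without writing SQL.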
3.3.5 ADXMgt/DataMgt middle-platform management/data management
Figure 23 ADXMgt/DataMgt
The middle-platform management module focuses on tenant management, project management, resource management, permission management, approval management, and so on. The data management module focuses mainly on topics at the data management or data governance layer. From their different dimensions, these two modules support the three main components in the middle and impose rule constraints on them.
3.4 ADX architecture
Figure 24 ADX Architecture
The relationships among the modules of the ADX data middle platform are shown in the figure. At the bottom are the five open-source tools, and each module is an organic integration and encapsulation of them. As can be seen from the figure, the components are closely related: the dashed black lines represent dependency relationships, and the green lines represent data flows.
4. Typical case analysis
As mentioned above, we have organically integrated and packaged the open-source tools to build a more modern, self-service, and complete one-stop data platform. How does this platform serve the business? This section lists five typical cases.
4.1 Case 1-Self-service Real-time Report
The data team of a business-domain group needs to urgently produce a batch of reports. They do not want to wait for scheduling; they want to complete the reports themselves, and some reports require T+0 timeliness.
- The business group's data team has limited engineering capability and can only write simple SQL. Previously, requirements were either handed over to BI and scheduled, or reports were built by connecting tools directly to the business standby database, or made in Excel.
- Data sources may come from heterogeneous databases, and there was no good platform supporting self-service data retrieval.
- The timeliness requirements are very high, and the data processing logic needs to run on the stream.
Figure 25 Workflow of Self-service Real-time Report
Using the ADX data middle platform to solve the self-service real-time report problem:
- The data engineer logs into the platform, creates a new project and applies for data resources.
- The data engineer searches for and selects tables through metadata, chooses the usage mode in DataWorks, fills in other information, and applies for the tables that are needed. For example, I need to use 100 tables, of which 70 are used in T+1 mode and 30 in real-time mode.
- By default, the middle platform applies standardized desensitization and encryption policies. After receiving the applications, the middle platform administrator approves them in turn according to those policies.
- After approval, the middle platform automatically provisions and delivers the requested data resources. The data engineer can then use them for self-service query, development, configuration, SQL orchestration, batch or streaming processing, DV configuration, etc.
- Finally, the self-service report or dashboard is delivered to users.
- Every role interacts with the others through the one-stop data middle platform under a unified process. All actions are recorded and can be audited.
- The platform's full self-service capability greatly accelerates the business's digital drive. There is no waiting: after a short training, each person can complete a real-time report by themselves within 3-5 days without asking anyone for help.
- Platform support staff no longer need to be deeply involved and are no longer a bottleneck.
This scenario requires many data capabilities, including: ad hoc query, batch processing, real-time processing, reports and dashboards, data permissions, data security, data management, tenant management, project management, job management, and resource management.
4.2 Case 2-Collaboration Model Indicators
Business lines need to build their own basic data marts to share with other businesses or front-line systems.
- How to effectively build and manage data models.
- How to support not only the construction of data models in one’s own domain, but also the sharing of data models.
- How to solidify the process of data sharing and publishing, and realize unified technical and security control.
- How to operate the data so that data assets are effectively accumulated and managed.
Figure 26 Collaboration Model Indicator Workflow
Using the ADX data middle platform to solve the collaborative model indicator problem:
- The data modeler logs into the platform, creates a new project, and applies for resources; then searches for and selects tables, designs a DW model of one or several dimension tables, and pushes it to a DataWorks project.
- The data engineer selects the required source tables, completes the ODS-to-DW ETL development based on the DataStar project, then submits the job, publishes it to DataHub and runs it.
- The data modeler continuously configures, maintains and manages the DW/APP-level indicator sets visually, including dimension aggregation, calculation, etc.
- This is a typical case of data asset management and operation. Unified, collaborative model and indicator management ensures model maintenance, indicator configuration and quality traceability.
- DataStar also supports conformed dimension sharing, data dictionary standardization, business-line organization, etc. It can further flexibly support the construction and accumulation of the company's unified basic data layer.
The capabilities required in this case include: data service capability, ad hoc query capability, batch processing capability, data authority capability, data security capability, data management capability, data asset capability, tenant management capability, project management capability, job management capability and resource management capability.
4.3 Case 3-Agile Analysis Mining
The business area group data analysis team needs to conduct fast data analysis and mining on its own.
- The analysis team uses different tools, such as SAS, R, Python, SQL, etc.
- Analysis teams often need raw (non-desensitized) data and full historical data for analysis.
- The analysis team hopes to get the required data quickly (often without knowing in advance exactly which data is needed) and to focus promptly and efficiently on the analysis itself.
Figure 27 Agile Analysis Mining Workflow
Solving the agile analysis and mining problem with the ADX data middle platform:
- The data analyst logs into the platform, creates a new project and applies for resources; searches for and selects tables as required, chooses their accustomed tool and usage mode, fills in other information, and submits the application.
- All parties review and approve in turn according to the policies.
- After approval, the data analyst obtains the resources and uses the tools for self-service analysis.
- Moonbox itself is a data virtualization solution, very suitable for ad hoc reading and computation over heterogeneous data sources, and it can save data analysts a lot of data engineering work.
- DataHub/DataLake provides a data lake kept in real-time sync with both full and incremental data, and can apply configurable security policies such as desensitization and encryption, providing safe, reliable and comprehensive data support for analysis scenarios.
- Moonbox also provides the mbpy (Moonbox Python) library specifically so that Python users can quickly and seamlessly view data, perform ad hoc computation and run common algorithms under security control.
Figure 28 Agile Analysis Mining Example
For example, a user opens Jupyter, imports the mbpy library, logs into Moonbox, and views the tables the administrator has authorized for him. He can analyze and compute on those tables without caring where the data actually live, which is a seamless experience.
As shown above, there are two tables: one with more than 50 million rows stored in Kudu, the other with more than 6 million rows stored in Oracle. The data live in heterogeneous systems, and Kudu itself does not support SQL. With Moonbox we express the logic as if all the data were in one virtual database, and it takes only 1 minute and 40 seconds to compute the result.
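The virtual-database idea behind that example can be illustrated with a toy federation in plain Python. This is only a conceptual sketch; Moonbox actually plans SQL and pushes computation down to the sources:

```python
# Two heterogeneous "sources": imagine one table lives in Kudu, the other in Oracle.
kudu_orders = [
    {"user_id": 1, "amount": 120},
    {"user_id": 2, "amount": 80},
    {"user_id": 1, "amount": 40},
]
oracle_users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]

def federated_join(left, right, key):
    """Join rows from two sources as if they sat in one virtual database."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# Cross-source join and aggregation, without the caller knowing where data lives.
joined = federated_join(kudu_orders, oracle_users, "user_id")
totals = {}
for row in joined:
    totals[row["name"]] = totals.get(row["name"], 0) + row["amount"]
print(totals)  # {'alice': 160, 'bob': 80}
```

For the user, the storage systems disappear behind one logical schema; the engine decides how and where each part of the query runs.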
The capabilities required in this case include: analysis and drilling capability, data service capability, algorithm model capability, ad hoc query capability, multidimensional analysis capability, data authority capability, data security capability, data management capability, tenant management capability, project management capability and resource management capability.
4.4 Case 4-Scenario Multi-screen Linkage
To support all-round scenario-based digital drive, it is sometimes necessary to link large, medium, small and smart screens. The large screen is a projected wall display, the medium screen is the computer screen, the small screen is the mobile phone screen, and the smart screen is the chat client screen.
- Because of their different positioning, display sizes and interactions, the screens require different degrees of visualization and customization, which brings a certain development workload.
- The screens also need to be strictly consistent at the data permission level.
- Among them, the smart screen needs NLP, chatbot, task-bot and other intelligent capabilities, as well as the ability to generate charts dynamically.
- Davinci's Display function supports the customization requirements of large and small screens well through configuration.
- Davinci's unified data permission system keeps data permission conditions consistent across the screens.
- ConvoAI's chatbot/NLP capability supports an intelligent micro-BI capability, i.e. the smart screen.
Figure 29 Davinci Display edit page
The above image shows Davinci's Display editing page. You can freely define the desired display style by selecting different components, adjusting transparency, placing them anywhere, adjusting foreground and background, scaling colors, etc.
Figure 30 Davinci configuration large screen
The above picture is an example of a Davinci-configured large screen (the picture comes from the practice of a Davinci open source community user, and the data has been processed). It shows that a large screen can be configured in Davinci without any development.
Figure 31 Davinci configuration small screen
The above figure shows an example of a Davinci-configured small screen. The picture comes from Yixin's Zun Wei Hui event: on-site staff check real-time data through mobile phones to understand the on-site situation.
Figure 32 Smart Screen
The above figure shows an example of a smart screen. Our company has a chatbot based on ConvoAI, which interacts with users through a chat window and returns results, including charts, according to user needs.
4.5 Case 5-Data Security and Management
Figure 33 Data Security Management Workflow
This case is relatively simple. A complete data middle platform has not only application-user scenarios but also management-user scenarios; typical management users are the data security team and the data committee.
- The data security team needs to manage security policies, scan for sensitive fields, approve data resource applications, etc. The Yixin agile data middle platform provides an automatic scanning function and returns the scan results to the security team for timely confirmation. The security team can also define several layers of security policies, view audit logs, investigate data flow links, etc.
- The data committee needs to do data research, view data maps, analyze data lineage, and formulate standardized, procedural cleansing rules. They can also log into the data middle platform to complete these tasks.
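To make the sensitive-field scanning idea concrete, here is a minimal sketch. The regex patterns and the `scan_column` helper are illustrative assumptions; the actual ADX scanning rules and API are not described in this talk:

```python
import re

# Illustrative scanning rules; a real security team would maintain these
# as configurable policies rather than hard-coded patterns.
SENSITIVE_PATTERNS = {
    "phone": re.compile(r"^1\d{10}$"),          # mainland mobile number
    "id_card": re.compile(r"^\d{17}[\dXx]$"),   # national ID number
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def scan_column(column_name, sample_values, threshold=0.8):
    """Flag a column as sensitive if most sampled values match a pattern."""
    hits = {}
    for label, pattern in SENSITIVE_PATTERNS.items():
        matched = sum(1 for v in sample_values if pattern.match(str(v)))
        if sample_values and matched / len(sample_values) >= threshold:
            hits[label] = matched
    return hits

# Sample values drawn from a column; dirty records lower the match ratio.
flags = scan_column("contact", ["13812345678", "13987654321", "n/a"], threshold=0.5)
print(flags)  # {'phone': 2}
```

Flagged columns would then go to the security team for confirmation, after which the confirmed policy drives desensitization and encryption downstream.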
This sharing introduced the top-level design and positioning, the internal module architecture and functions, and typical application scenarios and cases of the Yixin agile data middle platform. Based on Yixin's business requirements and the development history of its data platforms, we organically combined and encapsulated the five open source tools, applied the agile big data concept, and built a one-stop agile data middle platform suited to Yixin's own business, which is being applied and implemented in both business and management. We hope it brings inspiration and reference to everyone.
Q: Can an enterprise rely solely on open source community tools to build a data middle platform?
A: A data middle platform should be built according to the enterprise's actual situation and goals. Some open source tools are already mature, and there is no need to reinvent the wheel; at the same time, some enterprises need custom development for their own environment and needs. Therefore, in general, a data middle platform combines open source tool selection with in-house development of common components suited to the enterprise's own situation.
Q: What detours and pitfalls should be avoided when building a data middle platform?
A: Compared with a pure technology platform, a data middle platform requires more capability building that directly enables the business, such as data asset accumulation, data service construction, abstraction of data processing processes, and enterprise data standardization and security management. These cannot be driven bottom-up by technology alone; they require consensus and support at the corporate and business levels, and should be built iteratively around actual business needs. This combined top-down and bottom-up iterative approach effectively avoids unnecessary short-sightedness and over-design.
Q: After the data middle platform is built, how will its maturity and effectiveness be evaluated?
A: The value of a data middle platform is measured by the business objectives it drives. Qualitatively, the question is whether "fast, accurate and economical" results have been achieved. Quantitatively, maturity can be evaluated through indicators such as platform component reuse, data asset reuse, and data service reuse.
Q: How is the metadata of the platform managed?
A: Metadata is a topic in its own right. From the classification of metadata categories, to how to collect and maintain the various kinds of metadata, to how to build metadata applications on top of that information, it deserves a complete sharing of its own. For the metadata management of Yixin ADX, we follow exactly that line of thinking. First we sorted out a panoramic classification of metadata. Then, and this is very important, we "build the metadata system driven by business pain points": we prioritize according to the company's most urgent metadata needs. At the technical level, we can collect the basic technical metadata of the various data sources through Moonbox, and generate execution lineage based on Moonbox's SQL parsing capability. Finally, metadata application modules are developed iteratively, one by one, according to actual business pain points, for example how a structural change in an upstream source table affects downstream data applications (lineage impact analysis), or how a downstream data problem is traced back along the upstream data flow link (data quality diagnosis analysis).
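The execution-lineage idea can be sketched as follows. Moonbox relies on full SQL parsing; this toy uses a regex and only handles simple `INSERT INTO ... SELECT` statements, so it is an illustration of the concept rather than the actual implementation:

```python
import re

def table_lineage(sql):
    """Extract (input tables, output tables) from a simple SQL statement.
    A toy stand-in for real SQL parsing: covers INSERT INTO ... SELECT ... FROM/JOIN."""
    sql = sql.lower()
    outputs = re.findall(r"insert\s+into\s+([\w.]+)", sql)
    inputs = re.findall(r"(?:from|join)\s+([\w.]+)", sql)
    return sorted(set(inputs)), sorted(set(outputs))

# A made-up ODS-to-DW job; each (input -> output) pair becomes a lineage edge.
sql = """
INSERT INTO dw.loan_fact
SELECT o.id, u.region FROM ods.orders o JOIN ods.users u ON o.user_id = u.id
"""
ins, outs = table_lineage(sql)
print(ins, outs)  # ['ods.orders', 'ods.users'] ['dw.loan_fact']
```

Collecting such edges across all executed jobs yields the lineage graph that impact analysis and quality diagnosis traverse.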
Q: What methodology do data modelers follow? How does dimensional modeling relate to the data warehouse?
A: Our modeling methodology is also guided by the well-known "The Data Warehouse Toolkit". According to Yixin's actual situation, we have simplified, standardized and generalized Kimball's dimensional modeling, and we have also drawn on the experience of Alibaba's OneData system; we claim little originality here. DataStar's more important goal is to attract and help data modelers easily and effectively: it makes model construction unified, online and process-managed. At the same time, it strives to reduce the burden on ETL developers by moving personalized indicator work from the DW layer to the DM/APP layers, where non-data developers handle it through configuration and self-service. Overall, DataStar aims at governance and efficiency improvement.
Q: Is the Triangle task scheduling system open source?
A: Triangle is developed and maintained by another team. They have an open source plan, but we are not yet sure when it will be open sourced.
Q: When will Davinci be released?
A: This is an eternal question. Thank you for your continuous attention to and recognition of Davinci. We have plans to push Davinci toward Apache incubation, so please keep supporting Davinci and help make it the best choice among open source visualization tools.
Q: Does the data service control all data reads and writes? Ideally, all business access goes through data services, which makes data management, link management and data maps easier. The problem is that, in many cases, if the business side knows the connection information it can connect to the source directly. How do you prevent the business side from bypassing the data service API and connecting directly?
A: Yes, the goal of DataHub is to unify and converge data collection, data application, data publishing and data service, so that data security management, link management and standardization are easier to realize. Preventing the business side from bypassing DataHub and connecting directly to the source database is mostly a matter of management process control. As for DataHub itself, because it encapsulates a real-time data lake, it has capabilities that directly connecting to the business standby database cannot offer, and as DataHub's experience and functions keep improving, I believe business teams will be increasingly willing to get their data from DataHub.
Q: Does DBus support PostgreSQL data sources?
A: DBus currently supports MySQL, Oracle, DB2, logs and MongoDB data sources. Because of the characteristics of MongoDB's logs, DBus can only receive incomplete incremental logs (only the updated columns are output), which places high demands on strictly ordered consumption; internally we do not have many scenarios where DBus ingests MongoDB. The community has requested DBus support for PostgreSQL and SQL Server, which is theoretically extensible. At present, however, the team is fully devoted to building the data middle platform; if you need more data source types to be supported, you can contact our team directly to discuss it.
Q: The bottom layer of Moonbox is hybrid computation implemented with Spark SQL, which requires a lot of resources. How is it optimized?
A: Moonbox's hybrid computation engine is based on Spark, and some optimization work has been done on Spark. The biggest optimization is supporting more computation pushdown. Spark itself has data federation capability, but it only supports pushing down some operators, such as Projection and Predicate. Moonbox goes beyond Spark to push down more operators, such as Aggregation, Join and Union; when parsing SQL it builds a strategic pushdown execution plan according to the computational characteristics of each data source, so that data sources do as much of the computation they are suited for as possible and the hybrid computation cost inside Spark is reduced.
Moonbox also supports the case where the SQL contains no cross-source hybrid logic and the data source is suitable for executing the whole statement: Moonbox can then bypass Spark and push the entire SQL down to the data source. In addition, Moonbox supports Batch, distributed Interactive and Local Interactive computing modes, each with different optimizations and strategies.
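The pushdown decision can be pictured as capability matching: each source advertises the operators it can execute, and the planner pushes down the largest supported bottom-up prefix of the plan. The following is a conceptual sketch with assumed capability sets, not Moonbox's real planner:

```python
# Capabilities each source advertises (illustrative values, not a real catalog).
SOURCE_CAPABILITIES = {
    "mysql": {"projection", "predicate", "aggregation", "join"},
    "kudu": {"projection", "predicate"},
}

def plan_pushdown(source, plan_operators):
    """Split a bottom-up operator list into a pushed-down part (run at the
    source) and a residual part (run in the compute engine, e.g. Spark)."""
    caps = SOURCE_CAPABILITIES[source]
    pushed = []
    for op in plan_operators:
        if op in caps:
            pushed.append(op)
        else:
            break  # first unsupported operator stops the pushdown
    return pushed, plan_operators[len(pushed):]

pushed, residual = plan_pushdown("kudu", ["projection", "predicate", "aggregation"])
print(pushed, residual)  # ['projection', 'predicate'] ['aggregation']
```

When the residual part comes back empty, the whole query can be handed to the source, which corresponds to the bypass-Spark case described in the answer.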
Q: How do offline computing and real-time computing work together? Offline computing can use tiered storage; how can real-time computing achieve tiering?
A: One way to tier real-time computation is through Kafka. Of course, if the timeliness requirement on the tiered real-time data is not too strict (e.g. minutes), you can also choose real-time-capable NoSQL storage such as Kudu. As for how offline and real-time computing cooperate: with Moonbox, no matter where the results of batch and streaming computation are stored, they can be mixed seamlessly. It is fair to say that Moonbox simplifies and smooths away much of the complexity of data flow architectures.
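The Kafka-based layering idea (an ODS topic cleansed into a DW topic) can be illustrated with a toy in-memory pipeline; a real deployment would use Kafka topics and a stream processor such as Wormhole:

```python
# Toy stand-ins for Kafka topics: each layer is just a list of events here.
ods_topic = [
    {"user_id": 1, "event": "loan_apply", "amount": "1200"},
    {"user_id": 2, "event": "loan_apply", "amount": "bad-value"},
]

def ods_to_dw(event):
    """Cleansing step between layers: cast types, drop bad records."""
    try:
        return {"user_id": event["user_id"], "amount": float(event["amount"])}
    except ValueError:
        return None  # a real pipeline would route this to a dead-letter topic

# Streaming consumers of ods_topic would publish cleansed records to dw_topic.
dw_topic = [e for e in (ods_to_dw(ev) for ev in ods_topic) if e is not None]
print(dw_topic)  # [{'user_id': 1, 'amount': 1200.0}]
```

Each layer boundary is one such transformation; chaining them gives ODS, DW and APP "tiers" of topics analogous to offline warehouse layers.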
Q: What is the positioning of the data middle platform, and is it just another buzzword? In Yixin, what is the relationship between the data middle platform and the traditional backend?
A: The positioning of Yixin's data middle platform was discussed at the beginning of the talk. Simply put, it provides unified management and transparency toward the lower level, standardization and process toward the middle level, and self-service toward the upper level. Buzzwords come in two kinds: some waves leave mostly lessons, while others bring real progress. As for the relationship between the data middle platform and the traditional backend, I take "traditional backend" here to mean the business backend. A good business backend cooperates with and supports the data middle platform, while a poor one leaves more data-level challenges for the middle platform to face and solve.
Q: With heterogeneous data spread across so many storage components, how do you ensure the efficiency of personalized queries?
A: This question presumably refers to Moonbox's architecture and how it ensures ad hoc query efficiency. For pure ad hoc queries (computing results directly from the source data), query efficiency will not exceed that of an in-memory MPP query engine. For us, Moonbox mainly serves as the unified batch computing entry, unified ad hoc query entry, unified data service, unified metadata collection, unified data permissions, unified lineage generation, unified data toolbox, etc. If you are looking for millisecond/second query latency, you can either use a precomputation engine such as Kylin or Druid, or a store such as ES or ClickHouse, but all of these presuppose that the base data is ready. Therefore, our middle platform links support physically writing DW/DM data into ES and ClickHouse after ETL and unified DataHub publishing, which guarantees "personalized" query efficiency to a certain extent. From the point of view of Moonbox alone, minute/hour-level precomputation over heterogeneous storage with the results written to ClickHouse can support minute/hour data latency with millisecond/second query latency.
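The precompute-then-serve pattern mentioned here (Kylin/Druid style, or writing DW results into ClickHouse) boils down to aggregating once on a schedule and answering queries from the stored aggregate. A minimal sketch with made-up data:

```python
from collections import defaultdict

raw_rows = [
    {"day": "2019-06-01", "region": "north", "amount": 100},
    {"day": "2019-06-01", "region": "south", "amount": 70},
    {"day": "2019-06-02", "region": "north", "amount": 30},
]

def precompute(rows, dims, metric):
    """Minute/hour-level batch step: group once, store the aggregate."""
    cube = defaultdict(float)
    for row in rows:
        cube[tuple(row[d] for d in dims)] += row[metric]
    return dict(cube)

cube = precompute(raw_rows, dims=["day", "region"], metric="amount")

def query(cube, key):
    """Millisecond-level serving step: a plain lookup, no scan of raw data."""
    return cube.get(key, 0.0)

print(query(cube, ("2019-06-01", "north")))  # 100.0
```

The trade-off is exactly the one described in the answer: query latency drops to lookup time, but data freshness is bounded by the precomputation schedule.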
Q: When new data enters the system, is the whole process from data collection to storage controlled by the developer, or by a dedicated data administrator through the interface combinations of the various components?
A: If the new data source comes from the standby database of a business database and DBus has already been connected to that standby database, a dedicated DBus administrator configures and publishes a new ODS table on the DBus management interface for downstream users to apply for and use on DataHub. If the new data source comes from the business's own NoSQL store, the business staff can initiate the data publishing process on DataHub themselves, after which downstream users can see it in the metadata and apply for and use it on DataHub.
"Data collection to storage" also divides into real-time collection, batch collection, logical collection, etc. These common data source types, data docking methods and usage patterns are all encapsulated and integrated by DataHub. For both data owners and data users, it is a one-stop DataHub interface; all data link patterns, automated processes and best-practice technology selections are transparently encapsulated inside DataHub, which is precisely the value of moving from tools to a platform.
Source: Yixin Institute of Technology