Author Introduction: feng han: Chief Engineer of Yixin Database Development and Management
Acme ug presidium member, CCIA (China computer industry association) executive director, Oracle ACE, DBAplus co-founder, ODF advisory panel member, ACMUG, acme ug, dbgeek writer, author of “SQL optimization best practices”. In his early years, he was engaged in software development, but later he switched to the database field due to his personal interest. With many years of experience in front-line database architecture, design and development, he has served as chief DBA and database architect of many companies.
First, what is modeling?
Data is almost always used for two purposes:Preservation of Operational RecordsAndMaking Analytical Decision. In short, the operating system stores data and the typing system uses data.
- The former generally only reflects the latest status of data and is processed according to the transaction of a single record. The core of its optimization is to process transactions faster.
- The latter usually reflects the state change of the data over a period of time and processes the data in large quantities. Its core is high performance and multi-dimensional data processing.
Generally, we refer to the operating system as OLTP (on-line transaction processing)-online transaction processing, and the analytical system as OLAP (on-line analytical processing)-online analytical processing.
According to these two different data uses, how to organize the data to better meet the needs of data usage. This involves the problem of data modeling. That is, a data organization method (model) is designed to meet different scenarios. In OLTP scenarios, entity relationship model (ER) is commonly used to store data, thus solving the problem of data redundancy and consistency in transaction processing. In OLAP scenarios, there are many modeling methods: ER model, star model and multi-dimensional model. The following are separately explained:
ER model in OLAP is different from OLTP. The essential difference is the abstraction of the subject from the perspective of the enterprise, rather than the abstraction of the entity-object relationship for a specific business process.
Star model is an implementation of dimension model on relational database. The model indicates that each business process contains a fact table, which stores the numerical measurement of events, and a plurality of dimension tables surrounding the fact table, which contain the actual text environment when the events occur. This star-like structure is usually called a “star connection.” It focuses on how users can complete requirements analysis more quickly and has better response performance to large-scale complex queries. On the basis of star model, snowflake model can be further derived in complex scenes.
Multidimensional model is another implementation of dimensional model. When the data is loaded into the OLAP multidimensional database, the storage index of the data adopts the format and technology involved in dimensional data. Performance aggregation or pre-calculation summary tables are usually established and managed by the multidimensional database engine. Due to the use of precomputation, indexing strategy and other optimization methods, multidimensional database can achieve high performance queries.
Among the three methods, star model is used more frequently, and this method is also explained in detail below.
Second, dimensional modeling
1. Basic concepts
Many concepts are involved in the modeling process. The following is a scene to illustrate them. For example, in the common e-commerce ordering process, each user submitting an order (only one item) corresponds to an order record.
[Business Process]: Place an Order
“Granularity”: each order (split into individual items)
“Dimension”: region, age, channel, etc. (from an analysis perspective)
[Facts/Measures]: Order Amount, etc. (Data Available for Analysis)
2. Modeling Steps
Collection of Business Requirements and Data Implementation
Before starting the dimension modeling work, we need to understand the business requirements and the actual situation as the underlying source data. Find requirements through communication with business parties, review existing reports, etc. to understand their objectives based on key performance indicators, competitive business issues, decision-making process, and support analysis of requirements. At the same time, the actual situation of the data can be communicated with experts in the database system to understand the feasibility of accessing the data.
Select a business process
Business process is an operational activity completed by an organization. Business process time establishes or acquires performance metrics and converts them into facts in the fact table. Most facts show concern for the results of a business process. The choice of process is very important, because the process defines specific design goals and the definition of granularity, dimension and fact.
Declaring granularity is an important step in dimension design. Granularity is used to determine what rows in a fact table represent. Granularity must be declared before selecting a dimension or fact, because each candidate dimension or fact must be consistent with the defined granularity. Atomic granularity is the lowest level of granularity when obtaining data from a given business process. It is strongly recommended to start the design by focusing on atomic-level granularity data, because atomic granularity data can withstand unexpected user queries.
Confirm Dimension (Describe Environment)
Dimensions provide the background of “who, what, where, when, why, how” involved in a business process event. The dimension table contains descriptive attributes needed by the analysis application for filtering and classifying facts. Firmly grasping the granularity of the fact table can distinguish all possible dimensions.
Confirm facts (for measurement)
In fact, the measures related to business process events are basically expressed in data values. There is a one-to-one relationship between a fact table row and metric events described according to the granularity of the fact table, so the fact table corresponds to a physically observable event. In the fact table, all facts are only allowed to be consistent with the granularity of the declaration.
Deployment Method-Star Model or Multidimensional Model
Select a landing method for the dimension model. The star model can be selected and deployed on the relational database through the fact table and the dimension table related through the primary and foreign keys. You can also choose a multi-dimensional model and land it in a multi-dimensional database.
3. Modeling Specification
Based on dimensional modeling, a series of terms are defined to describe modeling objects. The following figure is taken from Alibaba’s Road to Big Data Practice.
Refers to the collection of abstract business processes or dimensions oriented to business analysis. When dividing the data domain, it can not only cover all the current business requirements, but also be included into the existing data domain and expand the new data domain without influence when the new business enters.
Refers to the business activities of an enterprise. The following documents, payments and refunds are all business processes. Please note that the business process is an inseparable behavior event. In common terms, the business process is an event in an enterprise’s activities.
The time range or point in time used to specify data statistics, such as the last 30 days, natural week, up to the current day, etc.
It is an abstract division of modifiers and is subordinate to a business domain.
Refers to the business scenario definition abstraction of indicators other than statistical dimensions. Modifiers belong to a modification type.
Atomic indicators and metrics have the same meaning. Metrics based on a certain business event behavior are inseparable indicators in the business definition and nouns with clear business meaning, such as payment amount.
A dimension is an environment of measurement and is used to reflect a class of attributes of a business. The collection of such attributes forms a dimension, which can also be called an entity object. The dimension belongs to a data field, such as geographical dimension (which includes contents of country, region, province and city level) and time dimension (which includes contents of year, quarter, month, week and day level).
Dimension attributes belong to a dimension, such as the country name, country ID, province name, etc. in the geographic dimension belong to dimension attributes.
Derivative index = one atomic index+multiple modifiers (optional)+time period. It can be understood as delineating the statistical scope of atomic index business.
Three, the design points
1. Dimension table design
Dimension is the foundation and soul of dimension modeling. In dimension modeling, the measurement is called “facts” and the environment is described as “dimensions”. Dimensions are various environments needed for analyzing facts. The columns that a dimension contains that represent the dimension are called dimension attributes. Dimension attribute is the basic source of query constraints, grouping and report label generation, and is the key to data usability. Dimensions generally function as query constraints, subtotals, and sorting. The design process of dimension is the process of determining dimension attributes. How to generate dimension attributes and the advantages and disadvantages of the generated dimension attributes determine the convenience of dimension usage and become the key to the ease of use of data warehouse. As Kimball said, the capability of data warehouse is directly proportional to the quality and depth of dimension attributes.
In the whole design process, the following principles should be followed:
- Dimension attributes are as rich as possible to lay the foundation for data usage.
- Give a detailed and meaningful text description.
- Precipitate common dimension attributes to pave the way for establishing consistent dimensions.
- Strictly distinguish facts and dimensions, and distinguish them by using scenes.
2. Design of Fact Sheet
As the core of data warehouse dimension modeling, fact table is designed closely around business process. Business process is expressed by obtaining metrics describing business process, including referenced dimensions and metrics related to business process. In the design process, different types of fact tables can be selected, and they have their own applicable scenarios.
In the whole design process, the following principles should be followed:
- Select a suitable fact table type.
- The facts are as complete as possible and include all the facts of the entire business process.
- Ensure that every fact measure is consistent, and repeated calculations will yield the same results. Try to record some “atomic” facts instead of the processed results.
- Some “dimension degeneration attributes” can be properly done to improve the query performance of fact tables.
- In order to improve the aggregation performance, it is appropriate to do some roll-up aggregation fact sheets.
Source: Public Number-feng han Channel, welcome to pay attention.