[Yixin Open Source] Moonbox_v0.3_beta Major Release: Rebuilt Architecture, Faster and Decoupled

  Big data, Open source software

Introduction: Data virtualization has always been a focus of the Agile Big Data team. Moonbox is designed on this basis and is dedicated to providing batch computing service solutions. Today, Moonbox is pleased to release version 0.3 beta (to review v0.2, see "Introduction to the Moonbox computing service platform"). Read the full text to understand Moonbox and see what is new in version 0.3 along with the editor.

I. Moonbox positioning

Before we get to know the new version of Moonbox, let’s recall the positioning of Moonbox.

Moonbox is a DVtaaS (Data Virtualization as a Service) platform solution. It is based on the design idea of data virtualization and is committed to providing batch computing service solutions. Moonbox shields users from the physical and usage details of the underlying data sources, giving them an experience similar to using a single virtual database. With one unified SQL language, users can transparently run mixed computation across heterogeneous data systems and write the results out. In addition, Moonbox provides basic support for data services, data management, data tools, and data development, which supports more agile and flexible data application architectures and logical data warehouse practice.

II. Moonbox functions

Data virtualization is an important design principle of Moonbox. On this basis, Moonbox implements many functions. Let’s take a look at what Moonbox does:


Moonbox has established a complete user system and introduced the concept of an Organization to divide user space. The system administrator (the ROOT account) can create multiple Organizations and specify each Organization’s administrators (SA). An Organization can have one or more SAs, and an SA is responsible for creating and managing ordinary users.

Moonbox abstracts the abilities of ordinary users into six attributes: whether they can execute Account management statements, whether they can execute DDL statements, whether they can execute DCL statements, and, for each of these three statement types, whether they can authorize other users to execute it. By freely combining these attributes, a user system model can be built to meet various roles and requirements, and multi-tenancy can be realized.
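The six attributes can be pictured as six independent boolean flags. The following is a minimal sketch under that assumption; the class name, field names, and the `can_grant` helper are hypothetical and are not taken from Moonbox's catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAbilities:
    """Six boolean attributes of an ordinary user (names are illustrative)."""
    account: bool = False        # may execute Account management statements
    ddl: bool = False            # may execute DDL statements
    dcl: bool = False            # may execute DCL statements
    grant_account: bool = False  # may authorize others to execute Account statements
    grant_ddl: bool = False      # may authorize others to execute DDL statements
    grant_dcl: bool = False      # may authorize others to execute DCL statements


def can_grant(granter: UserAbilities, ability: str) -> bool:
    """Return True if `granter` may authorize another user for `ability`."""
    if ability not in ("account", "ddl", "dcl"):
        raise ValueError(f"unknown ability: {ability}")
    return getattr(granter, f"grant_{ability}")


# Two example roles built by combining attributes.
developer = UserAbilities(ddl=True)
security_admin = UserAbilities(dcl=True, grant_dcl=True)
```

Here `developer` can run DDL statements but cannot pass that ability on, while `security_admin` can both run DCL statements and authorize others to do so.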

Extended SQL

Moonbox unifies the query language into Spark SQL, uses Spark underneath for computation, and extends it with a set of DDL and DCL statements. These cover creating, deleting, and authorizing users; granting access to data tables or data columns; mounting and unmounting physical data sources or data tables; creating and deleting logical databases; creating and deleting UDFs/UDAFs; and creating and deleting scheduled tasks.

Optimization strategy

Moonbox performs mixed computation based on Spark. Spark SQL supports multiple data sources, but when pulling data from them it only pushes down the project and filter operators, without considering the computing capabilities of each data source.

For example, Elasticsearch is very friendly to aggregation operations. If aggregation operations can be pushed down to Elasticsearch for calculation, it will be much faster than pulling all the data back to Spark for calculation.

Another example is the limit operator: if it is pushed down to the data source, it can greatly reduce the amount of data returned and save time on both data retrieval and computation.

Moonbox further optimizes the LogicalPlan that the Spark Optimizer has already optimized. It splits out the subtrees that can be pushed down according to rules, maps each subtree into the query language of its data source, and pulls the pushed-down results back into Spark to participate in further computation.

In addition, if the LogicalPlan can be pushed down as a whole, Moonbox does not use Spark for the computation at all. Instead, it uses the data source client directly to run the query statement generated from the LogicalPlan, which avoids the overhead of starting a distributed job and saves distributed computing resources.
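The splitting step described above can be sketched with a toy plan tree. This is only an illustration under stated assumptions: the `Node` class, the `CAPABILITIES` table, and the operator names are hypothetical and do not reflect Moonbox's actual rules or Spark's LogicalPlan classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    op: str                       # e.g. "scan", "filter", "aggregate", "limit", "join"
    source: Optional[str] = None  # data source name, set on "scan" and "pushdown" nodes
    children: List["Node"] = field(default_factory=list)


# Hypothetical table of operators that each data source can execute itself.
CAPABILITIES: Dict[str, Set[str]] = {
    "elasticsearch": {"scan", "project", "filter", "aggregate", "limit"},
    "mysql": {"scan", "project", "filter", "aggregate", "limit", "join"},
    "hdfs": {"scan"},
}


def single_source(node: Node) -> Optional[str]:
    """Return the one data source able to run the whole subtree, else None."""
    if node.op == "scan":
        return node.source
    sources = {single_source(child) for child in node.children}
    if len(sources) != 1:
        return None  # children live in different sources (or there are none)
    src = sources.pop()
    if src is None or node.op not in CAPABILITIES.get(src, set()):
        return None  # this operator has to stay in Spark
    return src


def split(node: Node) -> Node:
    """Wrap every maximal pushable subtree in a 'pushdown' node."""
    src = single_source(node)
    if src is not None:
        return Node("pushdown", source=src, children=[node])
    return Node(node.op, node.source, [split(child) for child in node.children])


def runs_without_spark(plan: Node) -> bool:
    """True when the whole plan can be handed to a single data source."""
    return split(plan).op == "pushdown"
```

For a join between an aggregated Elasticsearch table and a limited HDFS table, `split` keeps the join and the limit in Spark, wraps the Elasticsearch aggregate in one `pushdown` node, and wraps only the HDFS scan in another. A plan that touches MySQL alone becomes a single `pushdown` root, which corresponds to the case where Spark is skipped entirely.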

Column permission control

Moonbox defines DCL statements to implement column-level permission control of data. The Moonbox administrator grants a user access to a data table or data column through a DCL statement, and Moonbox saves the permission relationship between the user and the tables and columns in the catalog. When a user submits a SQL query, Moonbox intercepts it, parses the SQL into a LogicalPlan, and checks whether the plan references any unauthorized tables or columns. If it does, an error is returned to the user.
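The check itself can be sketched as a set comparison between the columns a query references and the columns granted to the user. In the sketch below, the `Grants` layout, the `"*"` marker for a table-level grant, and both function names are assumptions made for illustration; extracting the referenced columns from a real LogicalPlan is left out.

```python
from typing import Dict, Set, Tuple

# (user, table) -> granted columns; "*" stands for a table-level grant.
Grants = Dict[Tuple[str, str], Set[str]]


def unauthorized_columns(
    user: str, referenced: Dict[str, Set[str]], grants: Grants
) -> Dict[str, Set[str]]:
    """Return the referenced columns, per table, that `user` may not read."""
    denied: Dict[str, Set[str]] = {}
    for table, columns in referenced.items():
        allowed = grants.get((user, table), set())
        if "*" in allowed:
            continue  # whole table is granted
        missing = columns - allowed
        if missing:
            denied[table] = missing
    return denied


def check_query(user: str, referenced: Dict[str, Set[str]], grants: Grants) -> None:
    """Raise PermissionError if the query touches anything not granted."""
    denied = unauthorized_columns(user, referenced, grants)
    if denied:
        details = ", ".join(
            f"{table}({', '.join(sorted(cols))})" for table, cols in sorted(denied.items())
        )
        raise PermissionError(f"user {user} has no permission on: {details}")
```

A query that references only granted columns passes silently, while one that also touches an ungranted column is rejected with a message naming the table and column.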

Various forms of UDF/UDAF

Moonbox supports creating UDFs/UDAFs from jar packages as well as directly from source code written in Java or Scala, which makes UDF development and verification more convenient.

Scheduled task

Moonbox provides scheduled tasks. Users define a scheduled task with DDL statements and specify its scheduling policy as a crontab expression. In the background, Moonbox embeds Quartz to trigger the tasks on schedule.

Multiple clients

Moonbox supports access through command line tools, JDBC, REST, ODBC, etc.

Support for multiple data sources

Moonbox supports a variety of data sources, including MySQL, Oracle, SQL Server, ClickHouse, Elasticsearch, MongoDB, Cassandra, HDFS, Hive, Kudu, etc., and supports custom extensions.

Two task modes

Moonbox supports Batch and Interactive task modes. Batch mode supports Spark Yarn Cluster Mode, while Interactive mode supports Spark Local and Spark Yarn Client Mode.

Cluster working mode

Moonbox works in master-slave cluster mode and supports active/standby switchover of the master.

III. Moonbox_v0.3 vs. v0.2

Moonbox_v0.3 has made several important changes based on v0.2, including:

Removing the Redis dependency

In v0.2, the query result is written into Redis and the client then fetches it from Redis; v0.3 returns the result directly to the client.

Changing the data transmission mode

In v0.2, the client obtains result data over REST; v0.3 uses Netty plus protobuf to transfer the result data.

Reworking the Moonbox Master election strategy

The Moonbox Master changes from an Akka singleton to using ZooKeeper for master election and information persistence.

Decoupling Moonbox Worker from Spark

In v0.2, the Spark application driver runs directly inside the Worker. In v0.3, the Spark application driver runs in a separate new process, so the Worker is decoupled from Spark, and one Worker node can run multiple Spark application drivers as well as other applications.
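The process-level decoupling can be pictured with a small sketch in which a worker only starts and tracks child processes. Everything here is hypothetical: the `Worker` class and `placeholder_driver` helper are illustrative names, and the child command is a stand-in, since the source does not say how Moonbox actually launches the Spark driver.

```python
import subprocess
import sys
from typing import Dict, List


class Worker:
    """Tracks applications that each run in their own OS process."""

    def __init__(self) -> None:
        self.apps: Dict[str, subprocess.Popen] = {}

    def launch(self, app_id: str, command: List[str]) -> None:
        """Start `command` as a child process registered under `app_id`."""
        if app_id in self.apps:
            raise ValueError(f"application already running: {app_id}")
        self.apps[app_id] = subprocess.Popen(command)

    def wait_all(self) -> Dict[str, int]:
        """Block until every application exits and return the exit codes."""
        return {app_id: proc.wait() for app_id, proc in self.apps.items()}


def placeholder_driver(exit_code: int) -> List[str]:
    """Stand-in command; a real deployment would start a Spark driver here."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
```

Because each application lives in its own process, the worker process needs no Spark classes itself, a failing application does not bring the worker down, and several applications can run side by side on one node.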

IV. Typical Moonbox scenarios

Finally, to help everyone get to know Moonbox better, let’s introduce some typical Moonbox application scenarios.

Building real-time ETL based on DBus, Wormhole, Kudu, Moonbox

DBus writes database changes to Kafka in real time. Wormhole consumes Kafka for stream processing: it either looks up other tables on the stream to build wide tables, or applies part of the processing logic, and writes the results to Kudu. Moonbox is then used to query Kudu and to save or display the results.

Batch operation

Batch jobs can be run using the batch job scripts provided by Moonbox, the asynchronous REST interface, or scheduled tasks.

Visualization of ad hoc queries based on Davinci and Moonbox

By putting Moonbox’s JDBC driver into Davinci’s lib directory, Moonbox can be queried like any other database, and the results can be displayed graphically.

SAS query

SAS users can use ODBC to connect to Moonbox for data query, and can push the calculation directly to Moonbox for distributed calculation.

Convenient Data Operation Toolbox

Because Moonbox can connect to a variety of data sources and can use Spark to perform mixed computation among them, it can be used for many convenient operations. For example, a single SQL statement can import the data of a table in one data source into another data source, or compare the data of two tables.

There are more usage scenarios waiting for you to discover!

As data virtualization draws more and more attention, a reliable tool has become a common need for everyone exploring it. Moonbox is such a tool. What are you waiting for? Give it a try!

Project Open Source Address





Author: Wang Hao

Source: Agile Big Data

Yixin Institute of Technology