Exploring Stream Computing


I. Static Data and Stream Data

Static data: data at rest, such as the large volume of historical data stored in a data warehouse system built to support decision analysis.

Stream data: data that arrives continuously in the form of large, fast, time-varying streams (e.g., logs generated in real time, or real-time user transaction records).

Stream data has the following characteristics:

(1) Data arrives rapidly and continuously, and its total size is potentially unbounded.
(2) There are many data sources, and the formats are complex.
(3) The volume of data is large, but long-term storage is not a major concern: once processed, data is either discarded or archived in a data warehouse.
(4) The focus is on the overall value of the data rather than on individual records.
(5) Data may arrive out of order or incomplete, and the system cannot control the order in which newly arriving data elements are processed.

In the traditional data processing flow, data is always collected first and stored in a database; the data in the database is then processed.

Stream computing: to preserve the timeliness of data, the acquired data is consumed and processed in real time.

II. Batch Computing and Stream Computing

Batch computing: there is ample time to process static data, as with Hadoop; the real-time requirement is low.

Stream computing: massive data is acquired in real time from different data sources and analyzed and processed in real time to extract valuable information (real-time, multiple data structures, massive volume).

Stream computing adheres to one basic premise: the value of data decreases with the passage of time, as with a user click stream. Therefore, an event should be processed immediately when it occurs rather than cached for batch processing. Stream data has complex formats, numerous sources, and huge volume, so it is not well suited to batch computing; real-time computing must be adopted, with response times on the order of seconds. In short, batch computing focuses on throughput, while stream computing focuses on real-time latency.

Characteristics of stream computing:

1. Real-time and unbounded data flow. Stream computing is real-time and streaming: stream data is subscribed to and consumed by the stream computing system in chronological order. Because data is generated persistently, the data stream keeps flowing into the stream computing system over a long period. For example, as long as a website does not shut down, its click log stream is always being generated and flowing into the computing system. For a streaming system, therefore, data is real-time and never terminates (it is unbounded).

2. Continuous and efficient computation. Stream computing is an “event-triggered” computation mode, and the trigger source is the unbounded stream data described above. As soon as new stream data enters the system, a computation task is triggered and executed immediately, so the whole computation runs continuously.

3. Streaming and real-time data integration. The result that a piece of stream data triggers can be written directly to the destination data store; for example, computed report data can be written directly to an RDS database for report display. Thus, like the stream data itself, the results of stream computation are continuously written to the destination store.

III. Stream Computing Frameworks

To process stream data in a timely manner, a low-latency, scalable, and highly reliable processing engine is needed. A stream computing system should meet the following requirements:

High performance: the basic requirement for processing big data, e.g., hundreds of thousands of records per second.

Massive scale: supports data at TB or even PB scale.

Real time: guarantees low latency, at the second or even millisecond level.

Distributed: supports the basic architecture of big data, which must scale out smoothly.

Ease of use: enables rapid development and deployment.

Reliability: processes stream data reliably.

At present, stream computing frameworks and platforms fall into three common categories:

commercial stream computing platforms, open-source stream computing frameworks, and stream computing frameworks developed by companies to support their own business.
(1) Commercial platforms: InfoSphere Streams (IBM) and StreamBase (TIBCO).
(2) Open-source stream computing frameworks, represented by Storm (Twitter) and S4 (Yahoo!).
(3) Frameworks developed by companies to support their own business: Puma (Facebook), DStream (Baidu), and the Galaxy stream data processing platform (Taobao).

IV. The Stream Computing Framework Storm

Storm is a distributed real-time big data processing framework open-sourced by Twitter. As stream computing is applied more and more widely, Storm’s popularity and role grow day by day. The following introduces Storm’s core components and a comparison with other mainstream engines.

Storm’s Core Components

Nimbus: the Master of Storm, responsible for resource allocation and task scheduling. A Storm cluster has only one Nimbus.

Supervisor: the Slave of Storm, responsible for accepting tasks assigned by Nimbus and managing the Workers on its node. A Supervisor node contains multiple Worker processes.

Worker: a worker process; each Worker runs multiple Tasks.

Task: a task. Each Spout and Bolt in a Storm cluster is executed by several Tasks, and each Task corresponds to one thread of execution.

Topology: a computing topology. A Storm topology encapsulates the logic of a real-time computing application. Its role is very similar to a MapReduce Job, with one key difference: a MapReduce Job always ends once its result is obtained, whereas a topology keeps running in the cluster until you terminate it manually. A topology can also be understood as a topological structure composed of a series of Spouts and Bolts connected by stream groupings.

Streams: the stream is the core abstraction in Storm. A data stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed environment. A data stream is defined by a schema that names the fields of the tuples in the stream.

Spout: a data source, the origin of the data streams in a topology. A Spout typically reads tuples from an external data source and emits them into the topology. Depending on requirements, a Spout can be defined as either a reliable or an unreliable data source. A reliable Spout can re-emit a tuple when its processing fails, ensuring that every tuple is processed correctly; an unreliable Spout performs no further handling of a tuple once it has been emitted. One Spout can emit multiple data streams.
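To make this concrete, here is a minimal sketch of an unreliable Spout, assuming the Storm 2.x Java API (org.apache.storm.*); the class name, sample sentences, and field name are illustrative choices, not from the article. Because the emits carry no message ID, Storm does not track or replay these tuples.

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A minimal, unreliable Spout: it emits tuples without message IDs,
// so Storm will not track or replay them if processing fails.
public class RandomSentenceSpout extends BaseRichSpout {
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call.
        String sentence = SENTENCES[random.nextInt(SENTENCES.length)];
        collector.emit(new Values(sentence));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The schema of this Spout's output stream: one field, "sentence".
        declarer.declare(new Fields("sentence"));
    }
}
```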

Bolts: all data processing in a topology is done by Bolts. Through data filtering, function application, aggregations, joins, database interaction, and other operations, Bolts can fulfill almost any data processing requirement. A single Bolt can perform a simple stream transformation; more complex transformations usually require multiple Bolts and are completed in multiple steps.
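Continuing the sketch above, a simple stream transformation can be expressed as a Bolt. BaseBasicBolt is used here because it acknowledges each input tuple automatically; again, this assumes the Storm 2.x Java API, and the names are illustrative.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A minimal Bolt performing a simple stream transformation: it splits
// each incoming sentence into words and emits one tuple per word.
// BaseBasicBolt acks each input tuple automatically after execute() returns.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```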

Stream grouping: deciding the input streams of each Bolt is an important part of defining a topology. A stream grouping defines how a data stream is partitioned among a Bolt's different Tasks. Storm has eight built-in stream grouping methods; the sketch below shows two of the most common.
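The following hypothetical topology wires together the Spout and Bolt sketched above (with a small word-count Bolt defined inline) and shows two of the eight built-in groupings: shuffleGrouping and fieldsGrouping. It assumes the Storm 2.x Java API, where LocalCluster is AutoCloseable; none of the component names come from the article.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A hypothetical word-count topology illustrating stream groupings.
public class WordCountTopology {

    // A tiny per-word counter Bolt, defined inline for illustration.
    public static class WordCountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getStringByField("word");
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout with a parallelism hint of 2 (two executor threads).
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);

        // shuffleGrouping: tuples are distributed randomly and evenly
        // across the split Bolt's Tasks.
        builder.setBolt("split", new SplitSentenceBolt(), 4)
                .shuffleGrouping("sentences");

        // fieldsGrouping: tuples with the same "word" value always go to
        // the same Task, so each per-word count stays in one place.
        builder.setBolt("count", new WordCountBolt(), 4)
                .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2); // two Worker processes when run on a cluster

        // Run in-process for local testing; on a real cluster one would
        // use StormSubmitter.submitTopology(...) instead.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(30_000);
        }
    }
}
```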

Reliability: Storm can guarantee that every tuple emitted by a Spout is fully processed. It does so by tracking the tuple tree that grows out of each Spout tuple to determine whether that tuple has completed processing. Every topology has a “message timeout” parameter: if Storm does not detect that a Spout tuple has been fully processed within the timeout, it marks the tuple as failed and replays it later.
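Here is a sketch of how a Bolt participates in this mechanism, again assuming the Storm 2.x Java API: emitting with the input tuple as the anchor extends the tuple tree, and ack/fail report the outcome back to Storm. The timeout mentioned above corresponds to the topology.message.timeout.secs setting, which defaults to 30 seconds.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A Bolt that participates in Storm's reliability mechanism explicitly.
// Emitting with the input tuple as the first argument "anchors" the new
// tuples to it, extending the tuple tree; ack/fail report the outcome.
public class ReliableSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                collector.emit(input, new Values(word)); // anchored emit
            }
            collector.ack(input);  // this tuple was fully processed here
        } catch (Exception e) {
            collector.fail(input); // ask Storm to replay the Spout tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```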

[Figure 1: Storm Core Components]

[Figure 2: Storm Programming Model]

Comparison of Mainstream Computing Engines

At present, the popular real-time processing engines are Storm, Spark Streaming, and Flink. Each engine has its own characteristics and application scenarios; the comparison below gives a simple overview of the three.

[Figure 3: Performance Comparison of Mainstream Engines]

Summary:

The emergence of stream computing broadens our ability to meet complex real-time computing requirements, and Storm, as a sharp tool for stream computing, greatly facilitates our applications. Stream computing engines are still evolving: JStorm, Blink, and other engines developed on the basis of Storm and Flink have greatly improved performance in many respects. Stream computing deserves our continued attention.


Source: Yixin Institute of Technology
Author: Yao Yuan