Main Points of Using Flume

What is Flume

  • A reliable, scalable system for handling large volumes of data. It acts as a buffer between data producers and the data's final destination, balancing producers against consumers and keeping the flow stable.

  • The main destinations are HDFS and HBase.

  • Apache Kafka and Facebook's Scribe are similar systems.

Why Flume

  • Storing data in HDFS or HBase is not as simple as just calling an API. Various complications have to be considered, such as the volume of concurrent writes, the load on HDFS and HBase, network latency, and so on.

  • Flume is designed as a flexible distributed system: it provides customizable pipelines, guards against data loss, and offers persistent (durable) channels.

Flume composition

The agent is Flume's basic unit (each agent includes sources, channels, and sinks); a minimal agent configuration is sketched below.

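A minimal single-agent pipeline can be sketched in a Flume properties file, for example as below. The agent and component names (a1, r1, c1, k1) and the port are placeholders; the snippet wires a NetCat source to a logger sink through a memory channel.

  # example.conf: a minimal source -> channel -> sink pipeline (names are placeholders)
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # NetCat source listening on a local port
  a1.sources.r1.type = netcat
  a1.sources.r1.bind = localhost
  a1.sources.r1.port = 44444

  # in-memory buffer between source and sink
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 1000

  # logger sink prints events to the agent's log
  a1.sinks.k1.type = logger

  # wire the pieces together
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1

An agent with such a file is typically started with flume-ng agent --conf conf --conf-file example.conf --name a1.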

Source, responsible for receiving data into the agent (a sample source configuration follows the list below)

  • Source interceptor, modifies or drops events as they enter the agent

  • Built-in source

Avro Source
Exec Source
Spooling Directory Source
NetCat Source
Sequence Generator Source
Syslog Sources
Syslog TCP Source
Multiport Syslog TCP Source
Syslog UDP Source
HTTP Source
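
As an illustrative sketch, a Spooling Directory Source and an Avro Source could be configured as below; the directory, port, and names are placeholders, and both sources are assumed to feed the channel c1 from the earlier sketch.

  a1.sources = r1 r2

  # spooldir source: ingest files dropped into a watched directory
  a1.sources.r1.type = spooldir
  a1.sources.r1.spoolDir = /var/log/flume-spool      # placeholder path
  a1.sources.r1.fileHeader = true                    # record the source file name in a header
  a1.sources.r1.channels = c1

  # avro source: receive events sent from another agent's Avro sink
  a1.sources.r2.type = avro
  a1.sources.r2.bind = 0.0.0.0
  a1.sources.r2.port = 4141
  a1.sources.r2.channels = c1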

Channel, a buffer, is responsible for holding the data the source has received until the sink has successfully written it out (a sample channel configuration follows the list below).

  • Channel selector (applies conditions to each event to decide which of the channels attached to the source the event should be written to; see the multiplexing example after the list below)

  • Built-in channel

Memory Channel
File Channel
JDBC Channel
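
The sketch below (still using the placeholder names from above) declares a persistent file channel next to a memory channel and adds a multiplexing channel selector that routes events by the value of a "state" header; paths and header values are illustrative only.

  a1.channels = c1 c2

  # file channel: persists events on disk so they survive an agent restart
  a1.channels.c1.type = file
  a1.channels.c1.checkpointDir = /var/flume/checkpoint   # placeholder paths
  a1.channels.c1.dataDirs = /var/flume/data

  # memory channel: faster, but events are lost if the agent dies
  a1.channels.c2.type = memory
  a1.channels.c2.capacity = 10000
  a1.channels.c2.transactionCapacity = 100

  # multiplexing selector: pick the channel based on the "state" header
  a1.sources.r1.channels = c1 c2
  a1.sources.r1.selector.type = multiplexing
  a1.sources.r1.selector.header = state
  a1.sources.r1.selector.mapping.important = c1
  a1.sources.r1.selector.mapping.normal = c2
  a1.sources.r1.selector.default = c2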

  • Channel processor (handles writing events into the channel)

Sink, responsible for removing data from the channel and delivering it to the destination or to the next agent (a sample configuration, including a sink group, follows the list below)

  • Sink operator (handles the distribution of events for processing)

  • Sink group (contains multiple sinks)

  • Sink processor (takes data from the channel and writes it to the destination; within a sink group it selects which sink to use, e.g. for failover or load balancing)

  • Built-in sink

HDFS Sink
Logger Sink
Avro Sink
IRC Sink
File Roll Sink
Null Sink
HBaseSinks
ElasticSearchSink
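
As a sketch, the snippet below configures an HDFS sink plus an Avro sink that forwards events to a downstream agent, and combines them in a sink group with a failover sink processor; the HDFS path, hostname, and priority values are placeholders.

  a1.sinks = k1 k2

  # HDFS sink: write events into time-bucketed directories
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H   # placeholder path
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.rollInterval = 300
  a1.sinks.k1.hdfs.useLocalTimeStamp = true
  a1.sinks.k1.channel = c1

  # Avro sink: forward events to the next agent in the chain
  a1.sinks.k2.type = avro
  a1.sinks.k2.hostname = collector.example.com       # placeholder host
  a1.sinks.k2.port = 4141
  a1.sinks.k2.channel = c1

  # sink group with a failover processor: prefer HDFS, fall back to the Avro hop
  a1.sinkgroups = g1
  a1.sinkgroups.g1.sinks = k1 k2
  a1.sinkgroups.g1.processor.type = failover
  a1.sinkgroups.g1.processor.priority.k1 = 10
  a1.sinkgroups.g1.processor.priority.k2 = 5
  a1.sinkgroups.g1.processor.maxpenalty = 10000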

Events

Flume represents data as events. An event consists of a body (a byte array) and headers (a map of key/value pairs carrying routing information and other metadata).
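
As a rough illustration of this structure, the HTTP Source's default JSON handler accepts events in the shape below (header values and body text are placeholders):

  [{
    "headers" : { "timestamp" : "1434324343", "host" : "web01.example.com" },
    "body" : "a log line as plain text"
  }]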

Interceptor

  • Built-in interceptor (a sample interceptor chain is configured after the list below)

Timestamp Interceptor
Host Interceptor
Static Interceptor
UUID Interceptor
Morphline Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor
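
A typical interceptor chain on a source might be configured as in the sketch below: a timestamp interceptor (whose header the HDFS sink's time-based path escapes rely on), a host interceptor, and a regex filtering interceptor that drops DEBUG lines; the regex itself is a placeholder.

  a1.sources.r1.interceptors = i1 i2 i3

  # timestamp interceptor: adds a "timestamp" header to each event
  a1.sources.r1.interceptors.i1.type = timestamp

  # host interceptor: adds the agent's host/IP as a header
  a1.sources.r1.interceptors.i2.type = host

  # regex filtering interceptor: drop events whose body matches the pattern
  a1.sources.r1.interceptors.i3.type = regex_filter
  a1.sources.r1.interceptors.i3.regex = ^DEBUG.*
  a1.sources.r1.interceptors.i3.excludeEvents = true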

Applicable scenario

  • Data can be represented as multiple independent records

  • Real-time push of continuous, high-volume data streams (if only a few GB of data arrive every few hours, HDFS can handle the writes directly and Flume does not need to be deployed)