What is Flume
A reliable and scalable big data handling system that acts as a buffer between data producers and the data's final destination, balancing producers and consumers and keeping the flow stable.
The main destinations are HDFS and HBase.
Apache Kafka and Facebook's Scribe are similar systems.
Storing data in HDFS or HBase is not as simple as just calling an API. We have to consider various complicated scenarios, such as the volume of concurrent writes, the load on HDFS and HBase, network latency, and so on.
Flume is designed as a flexible distributed system with customizable pipelines; its persistent channels ensure no data loss.
The agent is its basic unit (each agent includes a source, a channel, and a sink).
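As a minimal sketch, not taken from these notes (the agent name a1, the netcat source, and the port are illustrative), a properties-file configuration wires the three components together:

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # source: listen for newline-separated text on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # channel: in-memory buffer (a file channel would make it persistent)
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # sink: write events to the agent's log
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

Started with something like flume-ng agent --conf-file example.conf --name a1, this is the smallest complete pipeline.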
Source, responsible for receiving data into the agent.
Source interceptor, which can modify or drop events in flight (for example, the Regex Filtering Interceptor and the Regex Extractor Interceptor; see the sketch after this list).
Spooling Directory Source
Sequence Generator Source
Syslog TCP Source
Multiport Syslog TCP Source
Syslog UDP Source
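As a hedged illustration of a source plus an interceptor (the directory path and the regex are assumptions for the example), a Spooling Directory Source can be paired with a Regex Filtering Interceptor that drops DEBUG lines:

    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/flume-spool
    a1.sources.r1.channels = c1

    # interceptor: discard events whose body matches ^DEBUG
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = regex_filter
    a1.sources.r1.interceptors.i1.regex = ^DEBUG
    a1.sources.r1.interceptors.i1.excludeEvents = true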
Channel, a buffer, responsible for holding the events the source has received until the sink has successfully written them out.
Channel selector, which applies conditions to each event to decide which of the channels attached to the source it should be written to (see the multiplexing sketch after this list).
Channel processor, which handles writing events to the channel.
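For instance, a multiplexing channel selector (the header name state and the channel mappings are illustrative assumptions) routes each event by a header value:

    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = state
    # events with header state=CA go to c1, state=NY to c2
    a1.sources.r1.selector.mapping.CA = c1
    a1.sources.r1.selector.mapping.NY = c2
    a1.sources.r1.selector.default = c1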
Sink, responsible for removing events from the channel and delivering them to the destination or to the next agent.
Sink runner, which drives event handling and dispatch for a sink or sink group.
Sink group, which contains multiple sinks.
Sink processor, which decides which sink in a group takes events from the channel, enabling load balancing or failover (see the failover sketch below).
File Roll Sink
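As a sketch of a sink group with a failover processor (the HDFS paths and priority values are assumptions for the example), two HDFS sinks can be grouped so the second takes over if the first fails:

    a1.sinks = k1 k2
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    a1.sinks.k1.channel = c1
    a1.sinks.k2.type = hdfs
    a1.sinks.k2.hdfs.path = hdfs://backup-namenode/flume/events
    a1.sinks.k2.channel = c1

    # failover processor: prefer k1, fall back to k2
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    a1.sinkgroups.g1.processor.maxpenalty = 10000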
Flume represents data as an event, which consists of a body (a byte array) and a header (a map of strings); the data can be represented as multiple independent records.
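In the Java client SDK, an event is built from exactly these two parts; the header keys and body text here are made-up examples:

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    public class EventSketch {
        public static void main(String[] args) {
            // body: a byte array; headers: a string-to-string map
            Event event = EventBuilder.withBody(
                    "one independent log record".getBytes(StandardCharsets.UTF_8),
                    Map.of("host", "web01", "timestamp", "1700000000000"));
            System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
        }
    }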
Flume targets real-time pushing of continuous, massive data streams (if only a few GB of data arrive every few hours, writing to HDFS directly does no harm and Flume need not be deployed).