quinta-feira, 23 de junho de 2016

How do Data Ingestion with Apache Flume sending events to Apache Kafka

What is data ingestion?

Date ingestion is the process of getting data from somewhere like logging, database for immediate use or for storage.

These data could be sent in real time or in Batches.
Ingestion by Real time data is sent one by one directly from the data source, as in batch mode data are taken in batches in pre-defined intervals.

Usually when ingestion date of talks is likely to be related to Big Data, a large volume of data and assigned to it a few concerns us may come in mind, such as volume, variety, velocity and Veracity.
For this work Data Ingestion, a consolidated tool and widely used is Apache Flume.

Apache Flume 

Apache flume is an open source tool, reliable, high-availability for aggregation, collection and able to move large amounts of data from many different sources into a centralized data storage.
Based on data flow streaming is robust, fault-tolerant with high reliability mechanisms.

A Flume event is defined as a data flow unit. A Flume agent is a process of the JVM that hosts the components through which events flow from an external source to the next destination.

 The idea in this post is to represent the flow data model below focusing on the necessary settings for this flow happen, abstracting some concepts that should be checked in the flume if necessary documentation. 

Tools required to test the sample:

Configuration files:

After downloading the Apache flume should be created a configuration file in the folder conf/ "flume-sample.conf"

This file can basically be divided for better understanding of 6 parts: 

1.Agent Name:
a1.sources = r1
a1.sinks = sample 
a1.channels = sample-channel

2.Source configuration:
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /log-sample/my_log_file.log  
a1.sources.r1.logStdErr = true

3.Sink type
a1.sinks.sample.type = logger

4.Buffers events in memory to channel
a1.channels.sample-channel.type = memory
a1.channels.sample-channel.capacity = 1000
a1.channels.sample-channel.transactionCapacity = 100

5. Bind the source and sink to the channel
a1.sources.r1.channels.selector.type = replicating
a1.sources.r1.channels = sample-channel

6.Related settings Kafka, topic, and host channel where it set the source 
a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sample.topic = sample_topic
a1.sinks.sample.brokerList =
a1.sinks.sample.requiredAcks = 1
a1.sinks.sample.batchSize = 20
a1.sinks.sample.channel = sample-channel

Final result of the file below:

To run Apache Flume with this configuration you must run the following command within the Flume folder:

sh bin/flume-ng agent --conf conf --conf-file conf/flume-sample.conf  -Dflume.root.logger=DEBUG,console --name a1 -Xmx512m -Xms256m

Where :
  • Indicate where the file that was set flume-sample.conf
  • "- - Name" is the agent name equals to a1
  • Dflume.root.logger is the form that will be logged on the console 

Before it is necessary to raise the Apache Kafka (concepts related to Kafka are not part of this post the focus here is only the data ingestion with flume)

After downloading the Kafta with the default settings, you can see the flow work.
Execute the following commands: 

1. START Zookepper -  sudo bin/zookeeper-server-start.sh config/zookeeper.properties&
2. START Kafka Server - sudo bin/kafka-server-start.sh config/server.properties&
3. CREATE TOPIC - bin/kafka-topics.sh --zookeeper --create --replication-factor 1 --partitions 1 --topic sample_topic

In this command should show the data being extracted from the Source (my_log_file.log) and are being sent to the sample_topic topic.

4. START COMSUMER - bin/kafka-console-consumer.sh --zookeeper --topic sample_topic --from-beginning

Thus the proposed data flow is complete and providing data to other consumers in real time.


sexta-feira, 17 de junho de 2016

Understanding Reactor Pattern - Thread-based and Event-driven

To handle web requests, there are two competitive web architectures thread-based one and event-driven one.

Thread-based Architecture

The most intuitive way to implement a multi-threaded server is to follow the process/thread-per-connection approach.

It is appropriate for sites that need to avoid threading for compatibility with non-thread-safe libraries. 

It is also the best Multi-Processing Modules for isolating each request, so that a problem with a single request will not affect any other.

Processes are too heavyweight with slower context switching and memory-consuming. Therefore, the thread-per-connection approach comes into being for better scalability, though programming with threads is error-prone and hard-to-debug.

In order to tune the number of threads for the best overall performance and avoid thread-creating/destroying overhead, it is a common practice to put a single dispatcher thread  in front of a bounded blocking queue and a thread pool. The dispatcher blocks on the socket for new connections and offers them to the bounded blocking queue. Connections exceeding the limitation of the queue will be dropped, but latencies for accepted connections become predictable. A pool of threads polls the queue for incoming requests, and then process and respond.

Unfortunately, there is always a one-to-one relationship between connections and threads. Long-living connections like Keep-Alive connections give rise to a large number of worker threads waiting in the idle state for whatever it is slow, e.g. file system access, network, etc. In addition, hundreds or even thousands of concurrent threads can waste a great deal of stack space in the memory.

 Event-driven Architecture

Event-driven approach can separate threads from connections, which only uses threads for events on specific callbacks/handlers.

An event-driven architecture consists of event creators and event consumers. The creator, which is the source of the event, only knows that the event has occurred. Consumers are entities that need to know the event has occurred. They may be involved in processing the event or they may simply be affected by the event.

The Reactor Pattern

The reactor pattern is one implementation technique of the event-driven architecture. In simple words, it uses a single threaded event loop blocking on resources emitting events and dispatches them to corresponding handlers/callbacks.

There is no need to block on I/O, as long as handlers/callbacks for events are registered to take care of them. Events are like incoming a new connection, ready for read, ready for write, etc.

Those handlers/callbacks may utilize a thread pool in multi-core environments.

This pattern decouples modular application-level code from reusable reactor implementation.

There are two important participants in the architecture of Reactor Pattern:

1. Reactor

A Reactor runs in a separate thread and its job is to react to IO events by dispatching the work to the appropriate handler. It’s like a telephone operator in a company who answers the calls from clients and transfers the communication line to the appropriate receiver.

2. Handlers

A Handler performs the actual work to be done with an I/O event similar to the actual officer in the company the client who called wants to speak to.

Reactor responds to I/O events by dispatching the appropriate handler. Handlers perform non-blocking actions.

The intent of the Reactor pattern is: 

The Reactor architectural pattern allows event-driven applications to demultiplex and dispatch service request that are delivered to an application from on or more clients.

One Reactor will keep looking for events and will inform the corresponding event handler to handle it once the event gets triggered.

 The Reactor Pattern is a design pattern for synchronous demultiplexing and order of events as they arrive.

It receives messages/requests/connections coming from multiple concurrent clients and processes these post sequentially using event handlers.

The purpose of the Reactor design pattern is to avoid the common problem of creating a thread for each message/request/connection.

In Summary: Servers has to handle more than 10,000 concurrent clients and Threads can not scale the connections using Tomcat /Glassfish/ Jboss /HttpClient.

Then receives events from a set of handles and distributes them sequentially to the corresponding event handlers.

So, the application using the reactor only needs to use a thread to handle events simultaneously arriving.

Basically the standard Reactor allows a lead application with simultaneous events, while maintaining the simplicity of single threading.

A demultiplexer is a circuit that has an input and more than one output.
It is a circuit used when you want to send a signal to one of several devices.
This description sounds similar to the description given to a decoder, a decoder, but is used to select between many devices while a demultiplexer is used to send a signal, among many devices.

A Reactor allows multiple tasks which block to be processed efficiently using a single thread.
Reactor manages a set of event handlers and executes a cycle.

When I called to perform a task that connects with a new or available handler becoming active.
When called to perform a task, it connects with the handler that is available and makes it as active.

The cycle of events:

1 - Finds all handlers that are active and unlocked or delegates this for a dispatcher implementation.

2 - Execute each of these handlers sequentially found until complete or reach a point where they block.
Handlers completed inactivate or assets for reuse, allowing the event cycle to continue.

3 - Repeats from Step One (1)

Why matter now a day? 

Because the Reactor pattern is used by Node.js, Vert.x, Reactive Extensions ,Netty, Ngnix and others. So if you like identify pattern to know how thinks works behind the scenes, is important pay attention in this pattern.


quinta-feira, 9 de junho de 2016

Docker Hub, save and share your docker images

The Docker Hub is a cloud-based registration service for the construction of container applications or services.

It provides a centralized resource for image container discovery, distribution and change management, user and team collaboration and workflow automation throughout the development pipeline.

Docker hub provides the following features:

     Image Repository -Find, manage and push and pull of community images, official and private image libraries.

     Automated Build - automatically create new images when you make changes to a code that is on GitHub or bit bucket.

     Webhooks - A feature of Automated builds, Webhooks let you trigger actions after a successful push’s too a repository.

     Organization - Create working groups to manage user access to repositories of images.

Public Repository:

Sending image to own repository:

Here I got an existed image.

I logged in my docker hub account.

I put the my tag to send to my account

Now I have two images, the original and my own version.

In the end I did push to repository.

If I check my docker hub account, is possible seeing the image that I sent.


The webhook is an HTTP callback triggered by a specific event.
You can use a webhook to notify people, services and other applications after a new image is sent to your repository.

To start adding webhooks, scroll to the desired repository in Hub and click "Webhooks" under the "Settings" box.

The webhook is called only after a successful push is made.

Calls webhook are HTTP POST requests with a JSON payload similar to the example shown below.


After create my webhook I will have this image below:


Just for test this webhook I used this site: http://requestb.in

This site will provide one URL to put on WebHook URL and this site provide another URL to check Request received like this:

Automated build

You can build your images automatically from a compilation context stored in a repository.

A building context is a Docker File and any files in the specific location.

For an automated build the building context is a repository where it sends a Docker File.

Use of automated builds requires that you have an account on Docker Hub and hosted repository provider  GitHub or BitBucket.

If you already have on your Github or BitBucket account, you must have chosen the type of public and private connection.


After each commit that happen on git repository mapped is possible see this table with status about each image build with code committed. 


And is possible combine Automated Build with WebHooks and the result for this could be a deploy in someplace.

These functionalities for Webhooks and Automated Build are limited in private mode, for each user is available 1 Private Repo & Parallel Build, and this pipeline not happen immediately for free account.

My Slides