Chapter 1 - Introduction to KAFKA
Objectives
Key objectives of this chapter
What are Microservices?
Messaging Architectures
What is Kafka?
Need for Kafka
Where is Kafka useful?
Architecture
Core concepts in Kafka
Overview of ZooKeeper
Cluster, Kafka Brokers, Producer, Consumer, Topic
1.1 Microservices
Small, autonomous services which work well together.
Being able to change individual components independently.
Independent processes
Communicate over APIs, rather than using databases directly
High degree of autonomy
Small, focused on doing one thing well
A form of SOA. Typical SOA-based applications used to be monolithic.
The microservices concept facilitates the adoption of Agile software
development.
1.2 Microservices vs Classic SOA
SOA                     Microservices
XML                     JSON
Complex to integrate    Easy to integrate
Heavy                   Lightweight
HTTP/SOAP               HTTP/REST
1.3 Traditional Enterprise Application Architecture
Classical architecture
Typical 3 layers:
◊ client-side UI (Browser, HTML + JS)
◊ a database (RDBMS, NoSQL …)
◊ server-side application (Java, .NET, PHP, …)
Any changes to the system involve building and deploying a new version
of the application. Changes are expensive.
Scaling requires scaling the entire application, rather than only the parts
that need more resources.
Long release cycles.
1.4 Sample Microservices Architecture
Applications often start as monoliths, then scale and evolve toward a
microservice architecture.
Applications are decomposed to components – smaller independent
service applications.
Components are loosely coupled.
1.5 Microservices Architecture – Pros
Multiple developers and teams can deliver relatively independently of each
other
Can be written in different programming languages
Can be managed by different teams
Can use different data storage technologies
Centralized management is minimal
Independently deployable by fully automated deployment machinery
Works well with Continuous Delivery
Allows frequent releases while keeping the rest of the system available
and stable
1.6 Messaging Architectures – What is Messaging?
Application-to-application communication
Supports asynchronous operations.
Message:
A message is a self-contained package of business data and network
routing headers.
1.7 Messaging Architectures – Steps to Messaging
Messaging connects multiple applications in an exchange of data.
Messaging uses an encapsulated asynchronous approach to exchange
data through a network.
A traditional messaging system has two models of abstraction:
◊ Queue – a message channel where each message is received by exactly
one consumer, in a point-to-point message-queue pattern. If no consumer
is available, the message is retained until a consumer processes it.
◊ Topic - a message feed that implements the publish-subscribe pattern
and broadcasts messages to consumers that subscribe to that topic.
A single message is transmitted in five steps:
◊ Create
◊ Send
◊ Deliver
◊ Receive
◊ Process
1.8 Messaging Architectures – Messaging Models
1. Point to Point
2. Publish and Subscribe
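The two messaging models can be contrasted with a minimal in-memory sketch. This is not how any real broker is implemented; the class names `PointToPointQueue` and `PubSubTopic` and the sample messages are hypothetical, chosen only to show who receives each message.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Messages are retained until some consumer receives them.
        self._messages.append(message)

    def receive(self):
        # Removing the message means no other consumer will see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Each message is broadcast to every current subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()    # one consumer gets the message
second = queue.receive()   # nothing is left for a second consumer

topic = PubSubTopic()
inbox_a, inbox_b = [], []
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price-update")  # both subscribers get a copy
```

In the queue case the second `receive()` finds nothing, while in the topic case both inboxes end up holding the same message.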
1.9 What is Kafka?
In modern applications, real-time information is continuously generated by
applications (publishers/producers) and routed to other applications
(subscribers/consumers)
Apache Kafka is an open source, distributed publish-subscribe messaging
system.
Kafka integrates producers and consumers so that neither application has
to be rewritten in order to exchange information.
Kafka overcomes the challenges of consuming real-time data whose volume
may grow by orders of magnitude beyond that of the original data.
Kafka also supports parallel data loading in the Hadoop systems.
1.10 What is Kafka? (Contd.)
Kafka is a distributed publish-subscribe messaging system written in
Scala and Java that runs on the Java Virtual Machine (JVM), with client
libraries for many languages.
Kafka has traditionally relied on another service, ZooKeeper (a distributed
coordination system), to function; newer releases can also run without it
in KRaft mode.
Kafka has high throughput and is built to scale out in a distributed model
on multiple servers.
Kafka persists messages on disk and can be used for batched
consumption as well as real-time applications.
1.11 Kafka Overview
When used in the right way and for the right use case, Kafka has unique
attributes that make it a highly attractive option for data integration.
Data Integration is the combination of technical and business processes
used to combine data from disparate sources into meaningful and valuable
information.
A complete data integration solution encompasses discovery, cleansing,
monitoring, transformation, and delivery of data from a variety of sources.
Messaging is a key data integration strategy employed in many distributed
environments such as the cloud.
Messaging supports asynchronous operations, enabling you to decouple a
process that consumes a service from the process that implements the
service.
1.12 Kafka Overview (Contd.)
1.13 Need for Kafka
High Throughput
◊ Supports hundreds of thousands of messages per second on modest
hardware
Scalability
◊ Highly scalable distributed system that can be expanded with no downtime
Replication
◊ Messages can be replicated across a cluster, which supports multiple
subscribers and allows consumers to be rebalanced in case of failure
Durability
◊ Provides support for persistence of messages to disk which can be
further used for batch consumption
Stream Processing
◊ Kafka can be used along with real-time stream processing frameworks
such as Spark, Flink, and Storm
Data Loss
◊ Kafka with proper configurations can ensure zero data loss
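As a rough illustration of what "proper configurations" can involve, the sketch below collects a few durability-related settings in plain Python dictionaries. `acks`, `enable.idempotence`, and `min.insync.replicas` are real Kafka setting names; the values, the `replication_factor` key (the replication factor is chosen when a topic is created), and the helper function are assumptions made for this example, not a complete or recommended configuration.

```python
# Producer-side settings.
producer_config = {
    "acks": "all",               # wait until all in-sync replicas have the write
    "enable.idempotence": True,  # avoid duplicates when sends are retried
}

# Topic-side settings; the values are illustrative assumptions.
topic_settings = {
    "replication_factor": 3,     # hypothetical key: copies of each partition
    "min.insync.replicas": 2,    # replicas that must confirm a write
}


def acked_write_has_multiple_copies(producer, topic):
    """Rough check: is an acknowledged write stored on at least two brokers?"""
    return (
        producer.get("acks") == "all"
        and topic.get("min.insync.replicas", 1) >= 2
        and topic.get("replication_factor", 1) >= topic.get("min.insync.replicas", 1)
    )


safe = acked_write_has_multiple_copies(producer_config, topic_settings)
risky = acked_write_has_multiple_copies({"acks": "1"}, topic_settings)
```

With `acks` set to `"1"` only the partition leader confirms the write, so the check reports that an acknowledged message may exist on a single broker.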
1.14 Kafka Architecture
1.15 Core concepts in Kafka
Topic
◊ A category or feed to which messages are published
Producer
◊ Publishes messages to Kafka Topic
Consumer
◊ Subscribes and consumes messages from Kafka Topic
Broker
◊ A server in the Kafka cluster; handles hundreds of megabytes of reads
and writes per second
1.16 Kafka Topic
User defined category where the messages are published
For each topic, Kafka maintains a partitioned log
Each partition contains an ordered, immutable sequence of messages, where
each message is assigned a sequential ID number called the offset
Writes to a partition are generally sequential, which reduces the number
of hard disk seeks
Consumers can read a partition from the beginning, or rewind or skip to
any point in it by supplying an offset value
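The idea of a partition as an append-only log addressed by offset can be sketched in a few lines. This is a simplified in-memory model, not Kafka's storage format; the class name `PartitionLog` and its methods are hypothetical.

```python
class PartitionLog:
    """Append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Writes always go to the end, so they are sequential.
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to this message

    def read(self, offset=0, max_messages=10):
        # Readers choose where to start; reading does not remove anything.
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
offsets = [log.append(m) for m in ["a", "b", "c", "d"]]
from_start = log.read()                       # read from the beginning
skipped = log.read(offset=2)                  # skip ahead to offset 2
rewound = log.read(offset=1, max_messages=2)  # rewind and re-read
```

Because reading never deletes a message, the same data can be read again from any earlier offset.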
1.17 Kafka Producer
An application that publishes messages to a topic in the Kafka cluster
Can be of any kind, such as a front end or a streaming application
When writing a message, the producer can attach a key to it
All messages with the same key are guaranteed to arrive in the same
partition
Supports both async and sync modes
Publishes messages as fast as the brokers in the cluster can handle
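One way to see why a key pins messages to a partition is to hash the key and take the result modulo the number of partitions. The sketch below assumes that approach; `zlib.crc32` is only a stand-in hash chosen for this example, and `choose_partition` and the sample events are hypothetical.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # Hash the key and map it onto one of the partitions.
    # zlib.crc32 is a stand-in; it is not the hash Kafka's client uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


num_partitions = 3
partitions = {p: [] for p in range(num_partitions)}

events = [("user-1", "login"), ("user-2", "login"),
          ("user-1", "click"), ("user-1", "logout")]
for key, value in events:
    partitions[choose_partition(key, num_partitions)].append((key, value))

# Every "user-1" event landed in the same partition, in publish order.
user1_partition = choose_partition("user-1", num_partitions)
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]
```

In this scheme the mapping depends on the partition count, so the same key may map to a different partition if partitions are added later.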
1.18 Kafka Consumer
An application that subscribes to topics and consumes messages from
brokers in the Kafka cluster
Can be of any kind, such as real-time consumers or NoSQL consumers
A consumer group can be configured with multiple consumers that share the
work of consuming a topic
Each consumer in a group reads from a unique subset of the partitions in
each topic the group subscribes to
Within a group, messages with the same key arrive at the same consumer
Supports both queuing and publish-subscribe
Consumers must track how many messages they have consumed (their offset
in each partition)
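How consumer groups give both queuing and publish-subscribe behaviour can be sketched by assigning partitions to group members. The round-robin strategy here is just one simple choice, and `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Queue-like behaviour: one group, work is split between its members.
one_group = assign_partitions(partitions, ["c1", "c2", "c3"])

# Publish-subscribe behaviour: each group covers every partition on its own.
billing = assign_partitions(partitions, ["billing-1"])
audit = assign_partitions(partitions, ["audit-1", "audit-2"])
```

Within one group no partition is shared between consumers, while separate groups each read all partitions independently.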
1.19 Kafka Broker
A Kafka cluster comprises one or more servers
Each server in the cluster is called a broker
Handles hundreds of megabytes of writes from producers and reads from
consumers
Retains all published messages, whether or not they have been consumed
If retention is configured for n days, a published message stays available
for consumption for n days and is then discarded
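Time-based retention can be illustrated with a small sketch in which age alone decides whether a message is kept. The function `apply_retention`, the dates, and the message texts are hypothetical, and the model is simplified: a real broker removes data in larger chunks (log segments) rather than message by message.

```python
from datetime import datetime, timedelta


def apply_retention(log, now, retention_days):
    """Keep only messages published within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [(ts, msg) for ts, msg in log if ts >= cutoff]


now = datetime(2024, 1, 10)
log = [
    (datetime(2024, 1, 1), "old, already consumed"),
    (datetime(2024, 1, 8), "recent, already consumed"),
    (datetime(2024, 1, 9), "recent, never consumed"),
]
kept = apply_retention(log, now, retention_days=7)
```

Whether a message has been consumed plays no part in the decision; only its age relative to the retention period matters.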
1.20 Kafka Cluster
A Kafka cluster is a fast, highly scalable messaging system
A publish-subscribe messaging system
Can be used effectively in place of brokers such as ActiveMQ and RabbitMQ,
and of other messaging based on Java Message Service (JMS) or Advanced
Message Queuing Protocol (AMQP)
Can be integrated with the Hadoop ecosystem
The cluster can be expanded with ease
Effective for applications that involve large-scale message processing
1.21 Why Kafka Cluster?
Why is Kafka preferred over more traditional JMS- and AMQP-based
brokers?
◊ Kafka can easily handle hundreds of thousands of messages per
second, which makes it a high-throughput messaging system
◊ The cluster can be expanded with no downtime, making Kafka highly
scalable
◊ Messages are replicated, which provides reliability and durability
◊ Fault-tolerant
1.22 Sample Multi-Broker Cluster
1.23 Overview of ZooKeeper
An open source Apache project
Provides a centralized infrastructure and services that enable
synchronization across a cluster
Common objects used across large cluster environments, such as
configuration and a hierarchical namespace, are maintained in ZooKeeper
ZooKeeper services are used by large-scale applications to coordinate
distributed processing across large clusters
1.24 Kafka Cluster & ZooKeeper
1.25 Kafka Integration
Databases: MongoDB/CosmosDB/CouchDB/Oracle
Big Data: Hadoop, Spark
Logging: Logstash (ELK stack)
IoT
1.26 Who Uses Kafka?
1.27 Courses
WA2708 – Kafka for Application Modernization
WA2684 – Developing Microservices
1.28 Summary
Kafka is a distributed publish-subscribe messaging system written in
Scala and Java that runs on the Java Virtual Machine (JVM), with client
libraries for many languages.
Kafka has traditionally relied on another service, ZooKeeper (a distributed
coordination system), to function; newer releases can also run without it
in KRaft mode.
Kafka has high throughput and is built to scale out in a distributed model
on multiple servers.
Kafka persists messages on disk and can be used for batched
consumption as well as real-time applications.