Distributed Streaming Platform Development


Table of Contents

Introduction

Messaging System

Point to Point Messaging System

Publish-Subscribe Messaging System

What is Kafka

Terminologies used in Kafka

Background of Kafka

Advantages & Disadvantages

Architecture

Core APIs in Apache Kafka

Workflow

Prerequisites

Deployment & Testing

Configuration of Kafka

Creation of Topics

Send & Consume Messages

Multi-Broker Cluster

Flume & Kafka (Flafka)

Results in web Interface of HDFS

Conclusion

Use Cases of Kafka

Kafka Messaging

Website Activity Tracking

Kafka Metrics

Kafka Log Aggregation

Stream Processing

Kafka Event Sourcing

Companies using Kafka

References

Introduction

Apache Kafka® is a distributed streaming platform capable of handling trillions of events a day.

Messaging System

A messaging system is used when data needs to be transferred from one application to another. There are two main messaging patterns:

  • Point to Point
  • Publish-Subscribe (pub-sub)

Point to Point Messaging System

In this pattern, messages are placed in a queue. Each message can be read by only one consumer; once a consumer reads a message, it is removed from the queue. Multiple consumers can read from the same queue, but each message is delivered to exactly one of them.

Publish-Subscribe Messaging System

In this pattern, messages are published to a topic. Consumers subscribe to one or more topics and receive all the messages published to them.

What is Kafka

Apache Kafka is a fast, fault-tolerant, scalable publish-subscribe messaging system, well suited to new-generation distributed applications. It is reliable and also supports automatic recovery from broker failures.

Kafka is generally used for two classes of applications:

  • To build real-time streaming data pipelines that reliably move data between systems and applications.
  • To build real-time streaming applications that transform or react to streams of data.

Terminologies used in Kafka

       Broker: Each server in an Apache Kafka cluster is referred to as a broker.

       Topics: A topic stores the feed of messages sent by the source/producer.

       Cluster: A cluster is a group of nodes/computers serving a common purpose. In Kafka, each of these computers runs one broker instance.

       Node in Kafka: A node is a single computer in the Kafka cluster.

       Leader: The broker responsible for the reads and writes of a given partition of a topic is called the leader.

       Replicas: A replica is essentially a “backup” of a partition. Kafka uses replicas to prevent data loss; a replica is not involved in client read/write operations.

       Partitions: Each broker hosts some partitions. Each of these partitions can be either the leader or a replica for its topic.

       Producers: Producers are the processes that publish/send messages to Kafka topics.

       Consumers: Consumers are the processes that subscribe to topics and read the feed of messages sent by producers.

       Log Anatomy: A log is another way to view a partition. Producers write messages to the log, and consumers can read that data from the log at any time.

       Message: A message is a unit of information sent by a producer and consumed by a consumer in Kafka.

       Follower in Kafka: A follower is a node that follows the instructions of the leader broker. The main role of a follower is to take over automatically as leader if the current leader fails.


       Kafka Data Log: The main use of the Kafka data log is to preserve messages for a configurable retention period, so that consumers can read them whenever they wish. If a consumer stays down for longer than the retention period it will lose messages, but as long as it comes back within the retention window it can resume reading from its last offset. (A retention configuration sketch is given after the end of this list.)

       Connector API: This API focuses on building and running reusable producers/consumers (connectors) that connect existing applications and data systems to Kafka topics.
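
As an illustration of how this retention is controlled, the relevant properties live in config/server.properties. The lines below are only a sketch: log.retention.hours=168 (seven days) is the value shipped in the default file, the other values are examples.

# config/server.properties (retention sketch)
log.retention.hours=168                   # how long messages are kept before deletion
#log.retention.bytes=1073741824           # optional size-based retention cap per partition
log.segment.bytes=1073741824              # size of a single log segment file
log.retention.check.interval.ms=300000    # how often the broker checks for segments to delete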

Background of Kafka

LinkedIn developed Apache Kafka in 2010 to address problems with low-latency ingestion of the huge volumes of event data generated by its portal, which had to be fed into a lambda architecture for processing real-time events.

Technologies existed for processing data in batch, but deploying them was risky because the results had to be shared with downstream users, and none of the known technologies was a good fit for real-time processing. Kafka was made public in 2011 to resolve this deployment issue, and soon afterwards it evolved very quickly from a messaging queue into a complete event streaming platform. Kafka now supports many popular industrial applications, including those of Twitter, LinkedIn, Netflix, Mozilla and Oracle.

As data volumes continue to grow in the field of big data, there are two main challenges:

  1. Collecting high volumes of data
  2. Analysing the collected data

Addressing these challenges requires a strong messaging system, and Apache Kafka has proved its ability here; it is overtaking popular alternatives such as RabbitMQ, ActiveMQ and the messaging offerings of Amazon Web Services.

Advantages & Disadvantages

PROS

  • High throughput: Kafka can handle thousands of messages per second.
  • Low latency: Kafka delivers messages with very low latency, in the range of milliseconds.
  • Fault-tolerant: Kafka is resistant to machine or node failure within the cluster.
  • Persistence & durability: Message replication is one of the key features that makes Kafka durable and very effective in preventing data loss.
  • Scalability: Kafka can be scaled out to multiple nodes, so there is very little chance of downtime.
  • Distributed: Kafka's distributed architecture (replication and partitioning) is what makes it scalable.
  • High concurrency: Kafka sustains its high throughput while serving many producers and consumers at the same time.
  • Consumer friendly: Kafka can be integrated with a wide variety of consumers.
  • Batch handling capable (ETL-like functionality): Kafka can be used for batch-processing use cases.
  • Real-time handling: Kafka can also serve as a real-time data pipeline.

CONS

  • Cannot be fully monitored & managed: Kafka lacks a full set of management and monitoring tools, which makes enterprise users wary of choosing it and supporting it in the long run.
  • Issues with message tweaking: if messages need to be modified (tweaked) in flight, performance goes down.
  • No wildcard topic selection: Kafka only matches exact topic names; the lack of wildcard selection makes it less effective in some use cases.
  • Reduced performance under compression: node memory is used heavily when brokers and consumers start compressing messages as their size grows, even though individual message size is not an issue; compression also happens as data flows through the pipeline, which affects throughput and performance.
  • Behaves clumsily at scale: when the number of queues in a Kafka cluster increases, it starts behaving somewhat clumsily and slowly.

Architecture

Fig: Apache Kafka architecture

Core APIs in Apache Kafka

       Producer API: This API is used to send data streams from applications to topics.

       Consumer API:  Consumer API is used to subscribe to topics and read the messages sent by the producer.

       Streams API: The Streams API consumes an input stream from one or more topics, transforms the data, and produces an output stream to one or more output topics.

       Connector API: This API focuses on building and running reusable producers/consumers (connectors) that connect existing applications and data systems to Kafka topics.
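
As a quick illustration, the command-line tools shipped with Kafka are thin wrappers over these APIs and can be used to try them out; the host, port and topic name below are placeholders, and the Streams API has no simple CLI tool as it is used from application code:

bin/kafka-console-producer.sh --broker-list <hostname>:<port> --topic <topic-name>                         # Producer API
bin/kafka-console-consumer.sh --bootstrap-server <hostname>:<port> --topic <topic-name> --from-beginning   # Consumer API
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties       # Connector API (file source example shipped with Kafka)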

Workflow

Prerequisites

  • GNU/Linux is supported as a development and production platform. Windows is also supported. We will be using CentOS as our platform.
  • Java™ should be installed
  • ssh should be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons

Deployment & Testing

The main topics covered in this section are as follows.

1) Configuration of Kafka

2) Creation of Topics

3) Send & Consume Messages

4) Multi-Broker Cluster

5) Flume & Kafka (Flafka)

Configuration of Kafka

 To configure Kafka, we need to follow the below steps

1) Download the 2.2.0 release of Kafka from the Apache site:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.2.0/kafka_2.12-2.2.0.tgz

2) Un-tar the tgz file
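
For example, assuming the archive was downloaded to the current directory:

tar -xzf kafka_2.12-2.2.0.tgz
cd kafka_2.12-2.2.0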

Fig: Un-tar the tgz file of kafka

3) Move it to the /opt folder as kafka and give ownership to the Hadoop user, since we will be using the Flume architecture later.
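
A sketch of this step; the hadoop user and group names are assumptions based on the Hadoop/Flume setup used later:

sudo mv kafka_2.12-2.2.0 /opt/kafka       # relocate the extracted release to /opt/kafka
sudo chown -R hadoop:hadoop /opt/kafka    # hand ownership to the Hadoop user (assumed user/group)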

Fig: Change Ownership & Group of the folder to Hadoop.

4) Edit the configuration file zookeeper.properties in the config folder of Kafka and point its data/log directory to a newly created folder under kafka/sibin.

Fig: Main properties that need to be checked.
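
The relevant part of config/zookeeper.properties would look roughly like this; the data directory path is an assumption based on the kafka/sibin folder mentioned above:

# config/zookeeper.properties (sketch)
dataDir=/opt/kafka/sibin/zookeeper    # where ZooKeeper keeps its snapshots/logs (assumed path)
clientPort=2181                       # port on which clients connect
maxClientCnxns=0                      # disable the per-IP connection limit (non-production setting)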

5) For management and coordination, Kafka uses ZooKeeper, which is already packaged with the Kafka download. Whenever a new broker is added or a running broker fails, ZooKeeper sends a notification, and the producers and consumers coordinate their work with the remaining available brokers accordingly. We can start ZooKeeper as shown below.
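
For example, using the convenience script shipped with Kafka:

bin/zookeeper-server-start.sh config/zookeeper.properties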

6) Once the ZooKeeper server is started, we can start the Kafka server using the kafka-server-start.sh shell script in the bin folder of Kafka.
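
For example:

bin/kafka-server-start.sh config/server.properties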

Fig: Starting Kafka Server

7) Verify the successful start of the Kafka server. We should see a message showing that the Kafka server with id=0 has started.

Fig: Result of the Kafka Server start command.

Creation of Topics

Let’s create a topic ‘SibinTest’ with a single partition and one replica. The bin/kafka-topics.sh command-line tool is now able to connect directly to brokers with --bootstrap-server instead of ZooKeeper. The old --zookeeper option is still available for now.

The syntax for creating a topic:

bin/kafka-topics.sh --create \
  --bootstrap-server <hostname>:<port> \
  --replication-factor <number-of-replicating-servers> \
  --partitions <number-of-partitions> \
  --topic <topic-name>

The --bootstrap-server option takes a comma-separated list of host and port pairs: the addresses of the Kafka brokers in a “bootstrap” Kafka cluster that a Kafka client connects to initially in order to bootstrap itself.

A host and port pair uses : as the separator.

e.g. localhost:9092
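
Concretely, the topic used in this report can be created with a command along these lines, assuming a single broker listening on localhost:9092:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic SibinTest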

Fig: Topic Creation in kafka

In order to list all the topics, we pass the --list parameter instead of --create.
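
For example:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092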

Fig: Listing of Topics created in kafka

Send & Consume Messages

The Producer API from Kafka helps to package messages and deliver them to the Kafka server. We use the kafka-console-producer.sh command-line tool for sending messages.

As we have created the topic named SibinTest, we will start the producer and send messages to that topic.
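
A sketch of the producer invocation (in Kafka 2.2 the console producer still takes its broker list via --broker-list; each line typed at the prompt is sent as one message, and the messages shown are just examples):

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SibinTest
>Hello Kafka
>This is a test message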

Fig: Executing Producer & sending Messages to the topic

Once the messages are sent, we can open a new terminal, start the consumer and connect it to the topic SibinTest.
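
For example:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic SibinTest --from-beginning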

Fig: The messages sent from Producer is consumed and displayed by the consumer

Now let’s experiment with sending a message while keeping both the producer and consumer terminals open.

Fig: Sending Message when Consumer is listening

 

Fig: Consumer displays the message sent when listening.

Multi-Broker Cluster

Let’s experiment further by setting up a multi-broker cluster and sending messages through it.

The first step is to create two copies of the server.properties file under the config folder of Kafka, as shown below.
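
For example (the file names are just a convention):

cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties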

Fig: Copy command to create copies of the server.properties file.

Next, edit the configuration files for the two new brokers: set the listeners, broker id and log directories as shown below. The broker.id property must be unique within the Kafka cluster, and the port and log directory must be overridden as well.
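
A sketch of the overridden properties; the ports and log directories are example values:

# config/server-1.properties (second broker)
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

# config/server-2.properties (third broker)
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2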

Fig: Configuration for the 2nd broker

Fig: Configuration for the 3rd broker.

Start both brokers as before using kafka-server-start.sh; the result is as follows.
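
For example:

bin/kafka-server-start.sh config/server-1.properties &
bin/kafka-server-start.sh config/server-2.properties &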

Fig: Started the Kafka Server (Broker) 2

Fig: Started the Kafka Server (Broker) 3

After starting all the brokers, we create a new topic ‘my-replicated-topic-SibinTest’ with replication, as below.
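
For example, with three brokers now available:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 1 --topic my-replicated-topic-SibinTest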

Fig: Create new topic with replication as we have multiple brokers

Now, in order to see what each broker is doing, we can describe the topics.
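
For example:

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic SibinTest
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic-SibinTest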

Fig: Description of our previous topic ‘SibinTest’

Fig: Description of newly created topic my-replicated-topic-SibinTest

The first line of the output gives a summary of all partitions, and each additional line gives information about a single partition. We see only one such line here because the topic has only one partition.

leader is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.

replicas is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.

isr is the set of in-sync replicas. This is the subset of the replicas list that is currently alive and caught up to the leader.

In our example, for the newly created topic ‘my-replicated-topic-SibinTest’, the partition count is 1, Leader: 2 (broker 2 is in the lead role), Replicas: 2,1,0 and Isr: 2,1,0.

Let’s try sending messages and consuming them as discussed earlier.

Fig: Sending message through newly created topic.

Fig: Result of the Consumer of my-replicated-topic-SibinTest.

Flume & Kafka (Flafka)

Implementing Flume along with Kafka to store data in HDFS

To configure Flume with a Kafka source, we create a new Flume configuration file named kafkaflume.conf and set it up as shown below; the highlighted properties are specific to the Kafka mapping.
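
A minimal sketch of such a kafkaflume.conf is shown below. The topic, HDFS path pattern and file prefix follow the values referenced in the results section; the broker address, channel sizing and the remaining sink settings are assumptions:

# kafkaflume.conf (sketch)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source: reads messages from the Kafka topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = my-replicated-topic-SibinTest
a1.sources.r1.channels = c1

# In-memory channel between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# HDFS sink: writes events under KafkaFlumeDate/<topic>/<date>
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /KafkaFlumeDate/%{topic}/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = test-kf-events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1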

Fig: kafkaflume.conf file.

The next step is to start the Flume agent a1 using kafkaflume.conf as the configuration file.
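
For example, assuming kafkaflume.conf is placed in Flume's conf directory:

bin/flume-ng agent --conf conf --conf-file conf/kafkaflume.conf --name a1 -Dflume.root.logger=INFO,console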

Fig: Starting the Flume Agent a1

Once the Flume agent is started, we can try sending messages through the Kafka producer.

Fig: Sending Messages through Kafka producer to HDFS.

Results in web Interface of HDFS

Fig: Created folder KafkaFlumeDate in HDFS, as specified in the Flume conf

Fig: Under the KafkaFlumeDate folder, a folder named after the topic “my-replicated-topic-SibinTest” was created, as specified in the Flume conf

Fig: Under the “my-replicated-topic-SibinTest” folder, a folder for the current date was created, as specified in the Flume conf

Fig: Based on the messages sent, two files were created with the prefix test-kf-events, as specified in the Flume conf

The messages stored in each file are as follows.

Fig: Messages consumed and stored by HDFS

Conclusion

Kafka is a great architecture for streaming data and has an active community. There is no major reason to move from Kafka to something else, except perhaps Kinesis on AWS; but using Kinesis makes you dependent on AWS services and commits you to staying on AWS. There are some things Kafka does not do well on its own, such as cleaning and aggregating data, but these can be achieved by integrating Apache Spark Streaming. Another drawback is that Kafka is not intended for long-term storage of data; for that, something reliable like Amazon S3 can be used. The Kafka community has plans to address all of these gaps.

Use Cases of Kafka

There are a lot of Use Cases of Apache Kafka

Kafka Messaging

Message brokers are used for a variety of reasons, for example to decouple processing from data producers or to buffer unprocessed messages. Compared with most other messaging systems, Kafka offers better throughput and durability thanks to capabilities such as partitioning and replication, and it is highly fault tolerant.


Website Activity Tracking

Kafka's original use case at LinkedIn was to rebuild the user activity tracking pipeline in a real-time environment. LinkedIn site activity is published to the Kafka ecosystem with one topic per activity type, where activity includes user searches, page views and the comments that users make.

Kafka Metrics

Kafka is also widely used for aggregating statistics from distributed applications. It captures the feeds from various applications and produces centralized feeds of operational data.

Kafka Log Aggregation

Kafka can be used across an organization to collect logs from the various services, systems and applications in use, transform them into a standard format, and make them available to multiple consumers.

Stream Processing

Frameworks such as Storm and Spark Streaming can read data from a topic, process it and write the result to a new topic that is then available to users and applications; Kafka is very useful in such stream-processing pipelines because it is highly durable.

Kafka Event Sourcing

Event sourcing is a style of application design in which all changes of state are logged as a time-ordered sequence of records. Kafka is an excellent backend for such an application because it supports very large stored logs.

Companies using Kafka

LinkedIn

Cloud Physics

VividCortex

VisualDNA

Parsely

Yahoo

Graylog2

Trivago

Sematext

Datadog

Twitter

Yieldbot

Ants.vn

Wize Commerce

Netflix

LivePerson

IFTTT

Quixey

Square

Retention Science

Homeadvisor

LinkSmart

Spotify

Strava

Skyscanner

LucidWorks Big Data

Pinterest

Outbrain

IBM Message Hub

Shopify

Uber

SwiftKey

iPinYou

Cerner

Goldman Sachs

Yeller

MailChimp

Oracle

Tumblr

Emerging Threats

Simple

Oracle Golden Gate

PayPal

Hotels.com

Gnip

CloudFlare

Box

Helprace

Loggly

Mate1.com Inc.

Airbnb

Exponential

RichRelevance

 Boundary

Mozilla

Livefyre

SocialTwist

Ancestry.com

Cisco

Exoscale

Countandra

DataSift

Etsy

Cityzen Data

FlyHajj.com

Spongecell

Tagged

Criteo

uSwitch

Wooga

Foursquare

The Wikimedia Foundation

InfoChimps

AddThis

StumbleUpon

OVH

Visual Revenue

Urban Airship

Coursera

Helpshift

Oolya

Metamarkets
