Friday, January 4, 2019

JTD-DesignerSeries-18-Kafka-101


A Brief Primer
Data Integration over the last two decades has been classified as Batch (ETL and data warehousing) and Realtime (EAI with ESBs and queues), with relational DBs as the persistence technology. However, new trends are demanding a complete revamp of these traditional solutions.
a) Relational DBs are being augmented with NoSQL distributed databases like MongoDB and Cassandra.
b) IoT solutions are connecting billions of devices and sensors, and use cases like website activity tracking produce large datasets that have to be stored so that Big Data solutions like Hadoop can provide analytical capabilities.
c) Ubiquitous publishing of streams needs solutions like data pipelines that can cleanse and transform data in real time before generating the final content.

Data Integration in the modern landscape is being revamped with a streaming platform that is real-time and scalable.




[Diagram: Transition to a Streaming Platform]



So, a Data Integration solution in the modern landscape has the following needs:

a) It should be able to process high-volume, diverse data in an event-centric way, so that you can process structured as well as unstructured data from different channels like web, mobile, IoT sensors, and APIs.

b) It should enable a forward-compatible architecture, which allows you to add stream processing applications that process the same data differently to support different use cases.

c) The streaming platform should be able to scale out and provide low latency.


Apache Kafka - A Distributed Streaming Platform

With this modern central streaming platform, all data is represented as streams, and streams of data are stored and processed through the platform to serve modern data integration needs.

a) It can serve as a real-time, scalable messaging bus, allowing applications to publish and subscribe to streams of records.

b) It can store streams of data in a fault-tolerant, durable way and allows for features like message ordering and replay from a point in time.

c) It can process streams of data and allows you to create data pipelines that feed clean data into all data processing destinations like Hadoop and NoSQL DBs. Streaming apps can implement the transformation needs of a destination application.

Kafka is based on the concept of a persistent, replicated, write-ahead, append-only record log, where every record is identified by a unique index called an offset. Writes are immutable and append-only, whereas readers can use the offset to index into the record set and read messages in order.
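
To make offsets concrete, here is a minimal sketch in Java using the standard KafkaConsumer client: it rewinds one partition to an earlier offset and re-reads records in order, which is how replay from a point in time works. The broker address, topic name ("clickstream"), and group id are hypothetical placeholders, not values from this post.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "replay-demo");               // placeholder consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign one partition explicitly so the read position is under our control.
            TopicPartition partition = new TopicPartition("clickstream", 0);  // hypothetical topic
            consumer.assign(Collections.singletonList(partition));

            // Rewind to an earlier offset (here, the very beginning) to replay the log.
            consumer.seek(partition, 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Records come back in offset order within the partition.
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}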

Kafka runs as a cluster of one or more servers that can span multiple data centers. It stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
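
Topics can be created with the kafka-topics.sh tool shown at the end of this post, or programmatically. Below is a minimal sketch using the Java AdminClient; the topic name, partition count, and replication factor are arbitrary examples chosen for illustration.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // "clickstream" with 3 partitions and replication factor 1 is just an example.
            NewTopic topic = new NewTopic("clickstream", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}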


Kafka APIs
a) Producer API - Allows applications to publish streams of records to one or more Kafka topics (see the producer/consumer sketch after this list).

b) Consumer API - Allows an application to subscribe to one or more topics and process the stream of records produced to them.

c) Streams API - Allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams (see the Streams sketch after this list).

d) Connector API - Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
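
To make the Producer and Consumer APIs concrete, here is a minimal sketch in Java. The broker address, topic name, and group id are placeholders, and error handling, retries, and configuration tuning are omitted.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceAndConsume {
    public static void main(String[] args) {
        // Producer API: publish a keyed record to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");  // placeholder
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("clickstream", "user-42", "page-view"));
        }

        // Consumer API: subscribe and process the stream of records.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");  // placeholder
        consumerProps.put("group.id", "analytics-app");            // placeholder consumer group
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("clickstream"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s offset=%d%n",
                        record.key(), record.value(), record.offset());
            }
        }
    }
}

A second application that needs to process the same records differently can simply subscribe with a different group.id; Kafka delivers the full stream to each consumer group independently, which is what enables the forward-compatible architecture mentioned earlier.

And here is a minimal Streams API sketch, assuming the kafka-streams dependency is on the classpath: it reads one topic, applies a simple transformation, and writes the result to another topic, which is the basic shape of the data pipelines described above. Topic names and the application id are again placeholders.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UppercasePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipeline");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read "clickstream", transform each value, and write to "clickstream-clean".
        builder.stream("clickstream")
               .mapValues(value -> value.toString().toUpperCase())
               .to("clickstream-clean");

        // A real application would also register a shutdown hook to close the streams instance.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}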

Updates to Follow

Jeetans-MacBook-Pro:dirKafka home$ kafka-topics.sh
Create, delete, describe, or change a topic.
Option                                   Description                            
------                                   -----------                            
--alter                                  Alter the number of partitions,