Wednesday, February 13, 2019

JTD-DesignerSeries-19-Dataweave-101


A brief Context
An integration platform, at its core, is a transformation engine: it should support robust data mapping features and enable streamlined communication between the components of the systems and processes it connects. MuleSoft traditionally supported mapping tasks through its component architecture and designed DataMapper as its first pass at data transformation. As the platform matured, MuleSoft coupled the transformation engine with the Mule runtime and designed DataWeave, an elegant, lightweight expression language based on a JSON-like syntax.


DataWeave Environment
DataWeave's design-time capabilities let you define message metadata, which DataSense detects automatically. It also accepts several MIME types set at design time on different message processors and defaults to application/java. Explicitly setting the MIME type on a variable or payload processor parses the Java string into the corresponding structured content. A simple data mapping file has a header and an expression: the header specifies the DataWeave version and the output MIME type, and the mapping is generally a semi-literal expression of a DataWeave object, as in the sketch below.
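
A minimal DataWeave 1.0-style sketch of such a file, assuming a hypothetical incoming JSON payload that is an array of objects with firstname and lastname fields:

%dw 1.0
%output application/json
---
{
  people: payload map {
    name: $.firstname ++ " " ++ $.lastname
  }
}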

DataWeave Fundamentals
a) DataWeave supports not just simple mappings but also advanced transformation use cases such as normalization, grouping, joins, partitioning, pivots, and filtering.
b) DataWeave expressions can be of simple types such as strings and booleans, arrays ([1,2,3]), and objects (key/value pairs). Keys and values in an object can themselves be expressions: the key expression must return a string literal, while the value can be any expression.
c) payload is a reserved keyword that references the data coming into the Transform component; the dot operator can be used to access payload elements, which can be arrays or other objects.
d) The map operator lets you iterate over each element of an array, and '$' references the current element.
e) The filter keyword can be applied to the payload or after a map operator to narrow the dataset to the desired records, or after the transform expression to render the desired results.
f) The when operator supports conditional logic that operates on the values of incoming fields.
g) The groupBy operator aggregates results based on the value of an incoming field. Each aggregated record can then be iterated over again with another map operator.
h) The DataWeave engine normalizes variable references to payload, flowVars, sessionVars, recordVars, inboundProperties, or outboundProperties into a canonical DataWeave format with MIME type application/dw, irrespective of the MIME type of the incoming data.
i) The DataWeave engine keeps rendering as a separate step: it takes the transformed canonical DataWeave format and renders it in the output MIME type specified in the header. The consolidated sketch after this list pulls map, filter, when, and groupBy together.
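
A consolidated DataWeave 1.0-style sketch of these operators, assuming a hypothetical JSON payload that is an array of order records with id, status, amount, and region fields:

%dw 1.0
%output application/json
---
{
  // filter keeps only matching records; map then iterates, with $ as the current element
  openOrders: payload filter ($.status == "OPEN") map {
    orderId: $.id,
    // when/otherwise applies conditional logic on the value of an incoming field
    size: "large" when $.amount > 100 otherwise "small"
  },
  // groupBy aggregates the records by the value of the region field
  ordersByRegion: payload groupBy $.region
}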
 
Expressions
There are seven basic types of expression in DataWeave:
a) Literal: "Hello World", [1,2,3], {firstname: "John", lastname: "Doe"}
b) Variable Reference: payload
c) Semi Literal: {people: payload map {name: $.firstname ++ $.lastname}}
d) Selector: lets you navigate to data elements using the dot operator on objects (payload.firstname) and index or range parameters on arrays ([1..-1]). You can use multi-value selectors to iterate over nested objects and collect the results into an array (see the sketch after this list).
e) Function Call
f) Flow Invocation
g) Compound: when you combine operators with the expressions above, you end up writing compound expressions that support advanced transformation use cases like iteration, filtering, and conditional logic.
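
A short DataWeave 1.0-style sketch of selector and compound expressions, assuming a hypothetical payload of the form {people: [{firstname, lastname, active}, ...]}:

%dw 1.0
%output application/json
---
{
  // single-value selector using the dot operator and an index
  first: payload.people[0].firstname,
  // range selector: every element from the second through the last
  rest: payload.people[1..-1],
  // compound expression: selectors combined with filter, map and ++
  labels: payload.people filter ($.active) map ($.firstname ++ " " ++ $.lastname)
}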

Pointers
a) https://github.com/madajee/dataweave-2-101.git
b) https://blogs.mulesoft.com/dev/getting-started-with-dataweave-part-1/
c) https://blogs.mulesoft.com/dev/getting-started-with-dataweave-part-2/

Friday, January 4, 2019

JTD-DesignerSeries-18-Kafka-101


A brief Prep-up
Data integration over the last two decades has been classified as batch (ETL and data warehouses) and real-time (EAI with ESBs and queues), with relational databases as the persistence technology. However, new trends are demanding a complete revamp of these traditional solutions.
a) Relational databases are being augmented with distributed NoSQL databases such as MongoDB and Cassandra.
b) IoT solutions are connecting billions of devices and sensors, and use cases like website activity tracking produce large datasets that must be stored so that big data solutions like Hadoop can provide analytical capabilities.
c) Ubiquitous publishing of streams calls for data pipelines that can cleanse and transform data in real time before producing the final content.

Data integration in the modern landscape is being revamped around a streaming platform that is real-time and scalable.




[Diagram: Transition - Streaming Platform]



So, a data integration solution in the modern landscape has the following needs:

a) It should be able to process high-volume, diverse data in an event-centric way, so that you can handle structured as well as unstructured data from different channels such as web, mobile, IoT sensors, and APIs.

b) It should enable a forward-compatible architecture, allowing you to add stream processing applications that process the same data differently to support different use cases.

c) The streaming platform should be able to scale out and provide low latency.


Apache Kafka - A Distributed Streaming Platform

With this modern central streaming platform, all data is represented as streams, and those streams are stored and processed through the platform to meet modern data integration needs.

a) It can serve as a real-time, scalable messaging bus that allows applications to publish and subscribe to streams of records.

b) It can store streams of data in a fault-tolerant, durable way and allows for features like message ordering and replay from a point in time.

c) It can process streams of data and lets you create data pipelines that feed clean data into data processing destinations like Hadoop and NoSQL databases. Streaming applications can implement the transformation needs of a destination application.

Kafka is based on the concept of a persistent, replicated, write-ahead, append-only record log, in which every record is identified by a unique index called an offset. Writes are immutable and are only ever appends, whereas readers can use the offset to index into the record set and read messages in order.

Kafka runs as a cluster of one or more servers that can span multiple data centers. It stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp. The sketch below makes the offset idea concrete.
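
A minimal sketch using the Java consumer client, assuming a locally running broker at localhost:9092 and a hypothetical topic named clickstream; it manually assigns a partition, rewinds to a chosen offset, and reads records in order from that point:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // manually assign one partition of the (hypothetical) clickstream topic
            TopicPartition partition = new TopicPartition("clickstream", 0);
            consumer.assign(Collections.singletonList(partition));
            // rewind to a known offset to replay the immutable log from that point
            consumer.seek(partition, 42L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}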


Kafka APIs
a) Producer API - Allows applications to publish streams of records to one or more Kafka topics (a minimal producer sketch follows this list).

b) Consumer API - Allows an application to subscribe to one or more topics and process the stream of records produced to them.

c) Streams API - Allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

d) Connector API - Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
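
To make the Producer API concrete, here is a minimal sketch with the Java client under the same assumptions as before (local broker at localhost:9092, hypothetical clickstream topic, illustrative key and value). Records that share a key land on the same partition, which is what preserves per-key ordering:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // key "user-1" and value "page-view" are illustrative only
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("clickstream", "user-1", "page-view");
            RecordMetadata metadata = producer.send(record).get();   // block until the broker acks
            System.out.printf("written to partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    }
}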

Updates to Follow

Jeetans-MacBook-Pro:dirKafka home$ kafka-topics.sh
Create, delete, describe, or change a topic.
Option                                   Description                            
------                                   -----------                            
--alter                                  Alter the number of partitions,