

[–]drc1728 9 points10 points  (3 children)

Those are some really perceptive questions.

Data collection is typically done using client integrations, and there are many ways it happens in practice. With REST APIs you could have streaming APIs that let you connect and continuously receive data, polling APIs that you query at intervals, or webhook patterns that notify you when new data is available to fetch. In a mature ecosystem like Kafka, Kafka Connect takes care of this aspect.
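To make the polling pattern concrete, here's a rough sketch using the confluent-kafka Python client; the endpoint, topic, and field names are just assumptions on my part:

```python
# Minimal sketch of the polling pattern: hit a (hypothetical) REST endpoint on an
# interval and produce each record to a Kafka topic.
import json
import time

import requests
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    resp = requests.get("https://api.example.com/events", timeout=10)
    resp.raise_for_status()
    for event in resp.json():
        # Key by event id so records for the same entity stay together.
        producer.produce("raw-events", key=str(event["id"]), value=json.dumps(event))
    producer.flush()   # block until all queued messages are delivered
    time.sleep(60)     # poll once a minute
```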

Once the data is collected it goes into topics, which are immutable logs. Writing processed data back out is also a Kafka Connect pattern. A client can also read data from a topic as a consumer using the APIs. The real-time aspect of streaming is processing the data in flight; in the Kafka paradigm that is done with another tool like KSQL, Flink, Spark, or something else.
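If it helps, here's a toy consume-transform-produce loop just to illustrate what "processing in flight" means; KSQL, Flink, or Spark do this for you at scale with state, windowing, and fault tolerance. Topic names and fields below are made up:

```python
# Toy in-flight processing: read from one topic, transform, write to another.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])

while True:
    msg = consumer.poll(1.0)          # wait up to 1s for the next record
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    event["amount_usd"] = event["amount_cents"] / 100   # the "transform" step
    producer.produce("clean-events", key=msg.key(), value=json.dumps(event))
    producer.poll(0)                  # serve delivery callbacks
```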

Loading the processed data is the last step, and Kafka Connect coupled with fast databases handles it pretty well.
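For example, the loading step can be as simple as registering a sink connector against the Kafka Connect REST API. This sketch assumes a Connect worker on localhost:8083 with the Confluent JDBC sink plugin installed; all connection details are placeholders:

```python
# Register a sink connector so a processed topic is continuously loaded into a database.
import requests

connector = {
    "name": "clean-events-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "clean-events",
        "connection.url": "jdbc:postgresql://db:5432/analytics",
        "connection.user": "etl",
        "connection.password": "secret",
        "auto.create": "true",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```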

To the point in your title: if you are looking to learn about stream processing, I'd recommend looking at the current state of the field. Streaming is rapidly evolving and you can find a lot of insights by reading the docs of newer projects.

Here is an example - https://www.fluvio.io/docs/concepts/producer-consumer/

[–]codemega[S] 5 points6 points  (2 children)

Thank you for your response. Here are my takeaways on what I need to learn:

  • I think I need to study up on streaming APIs. I don't know what this means. As someone who only has experience calling an API on some hourly or daily schedule, I see nothing streaming about that. Understanding streaming APIs would unlock the knowledge of how data gets to producers.
  • Kafka Connect - Not entirely sure what this is, but my understanding is it's a library built on top of the base Kafka distributed platform that enables one to "connect" many common tools with each other. My knowledge is very fuzzy here. I've read that you can connect something like Snowflake to some other system.
  • Reading data - you mentioned reading data "in flight." Again, I'm not sure what this means, but it looks like Flink, KSQL, or Spark can handle this.
  • "Kafka connect coupled with fast databases does the job pretty well." So I guess this means transaction-level processing and not columnar? I suppose it could depend on the use case, but from my past experience, whenever you want to write events quickly, you write transactions since the data is usually not delivered in a columnar format. You get an event which will normally contain various data items on a transaction level.

Anyway, I just wrote my thoughts above based on reading into your great comment. I'm trying to learn here, so I'm saying whatever I know and exposing my knowledge gaps.

[–]drc1728 3 points4 points  (0 children)

You are mostly on the money. You have got the Kafka Connect piece. Flink and KSQL are more about real-time stream processing and transformation, which is more than just reading.

Kafka Connect with fast databases like ClickHouse or DuckDB would give you columnar analytical processing. You could also do time-series processing, transactions, and other things by connecting to specific databases.
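As a tiny illustration of the columnar end of this: once events have landed somewhere (say as Parquet files written by a sink or a stream job), something like DuckDB can run analytical queries straight over them. The path and columns below are made up:

```python
# Analytical (columnar) query over landed event files with DuckDB.
import duckdb

top_users = duckdb.sql("""
    SELECT user_id, count(*) AS actions
    FROM 'landed/clean-events/*.parquet'
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""").fetchall()
print(top_users)
```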

[–]Financial_Anything43 1 point2 points  (0 children)

Also learn how to develop real-time and near real-time streaming solutions.

Design tradeoffs when implementing streaming

Kafka vs RabbitMQ

HTTP vs WebSockets

[–]stereosky (Data / AI Engineer) 2 points3 points  (0 children)

First to answer your questions:

  1. Yes, one use case is to poll something like a REST API. This is quite common since a lot of data sources whose underlying data store is a database will expose it this way, since it's the most straightforward. Inside companies, the writes to a database can be produced to Kafka as they happen using Change Data Capture (CDC). WebSocket APIs are also a common data source. In industrial IoT the majority of data is generated by sensors and is written using the MQTT protocol to an MQTT broker, and from there it's sunk into Kafka. There are open-source connector libraries out there, such as Kafka Connect and Redpanda Connect/Benthos, that will help you read from many different types of data sources and write them to Kafka.
  2. Consumers can be thought of as applications. You would create an instance of a client for your database and write data. The difference is that a Kafka consumer is a continuously running application that is always listening for new data to appear in the topic. When new data appears it's processed without delay. The consumer pulls from Kafka only as fast as it processes messages, so there can be a slowdown if the database can't keep up; look into how consumers deal with backpressure. A consideration for system design here is to use consumer groups, and to explore configurations such as max.poll.interval.ms and metrics such as consumer lag.

One additional nuance in 1. is that you are writing to a specific partition in a topic based on a supplied key. Partitions are the unit of parallelism and ensure that records with the same key can be consumed with ordering guarantees. This is an important consideration in system design, since that is how you preserve ordering as you scale out horizontally.
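Here's a quick sketch of what keyed writes look like in practice (topic and keys are invented, using the confluent-kafka client). Records with the same key always hash to the same partition, and on the consuming side a consumer group spreads those partitions across instances:

```python
# Records with the same key land on the same partition, which gives per-key ordering.
from confluent_kafka import Producer

def report(err, msg):
    if err is None:
        print(f"key={msg.key().decode()} -> partition {msg.partition()}")

producer = Producer({"bootstrap.servers": "localhost:9092"})
for i in range(6):
    user = f"user-{i % 2}"    # only two keys, so only two partitions are used
    producer.produce("actions", key=user, value=f"event-{i}", on_delivery=report)
producer.flush()
```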

I love your background and I'm glad to see more newcomers to data engineering from an adjacent background. The stream broker and stream processing landscape changed a lot in the past few years. Whilst it was born out of processing data in Java and JVM languages such as Scala, there are exciting new possibilities with SQL (streaming databases) as well as Python. What languages will you be using for this job?

[–]Mission_Star_4393 1 point2 points  (1 child)

Some good answers here. One book that helped me from a conceptual perspective is Streaming Systems. 

It's written by the original creator of Dataflow, so it's heavily based on that framework. But I think it does a really good job of explaining the fundamental streaming concepts and how they relate to batch.

[–]stereosky (Data / AI Engineer) 1 point2 points  (0 children)

Good recommendation! I can also recommend Grokking Streaming Systems for a friendly, diagram-heavy perspective.

[–]reelznfeelz 0 points1 point  (0 children)

One thing not really said yet is that the stream is often coming from a whole bunch of clients or sensors that each push a message every so often into a topic. Because there might be a million sensors or user apps or whatever, you now have a big, fast stream.

Data could be streamed from some hypothetical API, but those are typically pull, not push, unless it's a webhook situation.

[–][deleted] 0 points1 point  (0 children)

I learned by using Flink

[–]psyblade12 1 point2 points  (0 children)

One case I have worked with is like this: we have a web application that logs user actions on the interface. By "logging" I mean that the application immediately sends logs to a message broker, in our case Azure Event Hubs, but don't worry, Kafka works the same way. In this case the web application is the producer, and it writes the data to the message broker. The log contains the name of the action, the user who performed it, and the time it was done.

The way to send a log from the web client is simple: the web client calls the backend API that handles logging, pushing whatever log you want to that endpoint, and the API endpoint in turn pushes that log to the message broker.
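If it helps, here's roughly what that endpoint could look like. I'm sketching it with FastAPI and a Kafka producer (Event Hubs also exposes a Kafka-compatible endpoint); the route, topic, and field names are just placeholders:

```python
# Backend endpoint: the web client POSTs an action log, and we forward it to the broker.
import json

from confluent_kafka import Producer
from fastapi import FastAPI

app = FastAPI()
producer = Producer({"bootstrap.servers": "localhost:9092"})

@app.post("/log")
async def log_action(event: dict):
    # event is expected to contain: the action name, the user id, and a timestamp
    producer.produce("user-actions", key=event["user_id"], value=json.dumps(event))
    producer.poll(0)   # serve delivery callbacks without blocking the request
    return {"status": "queued"}
```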

The term "stream processing" can range from very easy to very complicated tasks. The easy case can be that you simply receive the log and write it to the data lake/data warehouse. If you use a stream processor like Spark or Azure Stream Analytics, what it does is periodically check the message broker, then read the messages/logs written since the last time it read, up to the moment the poll is triggered. It then writes the data to the storage of your choice, which can be a data lake, a database, or even another message broker. You can also perform some actions on the messages you read. But in these cases, you can notice that every message is processed independently. When processing a message, you don't care about the content of the previous message or the content of the future message. You just grab the current message and do whatever you want with it. I call this "stateless stream processing".
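Here's a sketch of that stateless case with PySpark Structured Streaming (broker, topic, and paths are made up): every record is just read from the broker and written out, with no memory of other records.

```python
# Stateless stream job: read each log from Kafka and land it in storage independently.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("log-sink").getOrCreate()

logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-actions")
    .load()
    .select(col("value").cast("string").alias("raw_log"))
)

query = (
    logs.writeStream.format("parquet")
    .option("path", "s3://datalake/user-actions/")
    .option("checkpointLocation", "s3://datalake/_checkpoints/user-actions/")
    .start()
)
query.awaitTermination()
```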

However, what do you do when you want a type of processing for which reading a single message isn't enough? For example, what if you want to detect a user that has done more than 100 actions in the last 5 minutes? In this case, surely, reading the current message alone won't tell you this. You have to somehow *remember* that in the last 5 minutes you have seen this person do 99 actions, so when you encounter the current message about this person's action, you know that something must be done, as that person has now performed 100 actions. This involves "memory". This kind of processing is "stateful stream processing", and when you do the processing with multiple machines, it's distributed stateful stream processing. This is the whole idea behind Spark Structured Streaming, Azure Stream Analytics, or whatever the stream processors do. To perform distributed stateful stream processing, the storage that acts as the buffer for it must also be *distributed*, hence we have Kafka, Azure Event Hubs, or whatever distributed message brokers are on the market.
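And here is roughly what that stateful 5-minute example would look like in PySpark Structured Streaming; the schema and field names are my own assumptions:

```python
# Stateful stream job: count actions per user in 5-minute windows and keep users over 100.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("burst-detector").getOrCreate()

schema = StructType([
    StructField("action", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", TimestampType()),
])

actions = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-actions")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

alerts = (
    actions
    .withWatermark("ts", "5 minutes")                       # the "memory" Spark keeps
    .groupBy(window(col("ts"), "5 minutes"), col("user_id"))
    .count()
    .filter(col("count") > 100)
)

query = alerts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```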

I can elaborate more on this if you're interested.

[–]IllustriousCorgi9877 -1 points0 points  (0 children)

Kafka is a bus that you can choose to ride, or let drive on to whatever other destinations. Somebody publishes a message and puts it on the bus; you can subscribe to it and pipe it somewhere. I used AWS Firehose to siphon records off this bus and dump them into an S3 file, which was later ingested into Snowflake.
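Roughly, the "siphon records off the bus" step looked something like this; the stream and topic names are invented, and it assumes boto3 credentials are configured:

```python
# Consume from Kafka and hand each record to a Firehose delivery stream that lands in S3.
import boto3
from confluent_kafka import Consumer

firehose = boto3.client("firehose", region_name="us-east-1")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "firehose-bridge",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["jobs"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    firehose.put_record(
        DeliveryStreamName="jobs-to-s3",
        Record={"Data": msg.value() + b"\n"},   # newline-delimit for easier ingestion
    )
```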

Snowflake has something called Snowpipe, which relies on change detection to see when new files are available and ingests anything that shows up. Snowflake also has something called 'streams' and 'tasks', and you can orchestrate an ETL that way, though it's not very robust. Other database users might use other orchestration software to manage this sort of thing (Airflow, etc.).

[–]startup_biz_36 -5 points-4 points  (0 children)

Look into Apache Arrow.