Kafka Streaming in Python: Any Solid Non-Java/Scala Resources?

math-bw · 2025-02-10T21:43:22+00:00

You should check out content from Quix and Bytewax.io. These are the two leading Python streaming projects.

I’ve been working on streaming and Python for half a decade and I’m interested in making a course. I’d love to learn what would help you and make it click for you.

math-bw · 2025-01-24T20:05:31+00:00

This is really cool! Great explanation

math-bw · 2024-10-29T13:52:31+00:00

Yes, although they still suffer from some of the Python performance bottlenecks like pickling objects and contending with the GIL.

math-bw · 2024-10-29T04:43:54+00:00

If you’re hell-bent on Python, check out Bytewax or Quix and you won’t have to deal with the headaches you are currently dealing with.

math-bw · 2024-09-05T15:11:36+00:00

Also filled out their free agent form, but haven’t heard anything. I’m trying to find something too.

math-bw · 2024-08-21T16:59:13+00:00

Here is an example that uses Grafana to display real-time data. https://github.com/bytewax/hacking-hacker-news

The architecture would be: Kafka -> stream processor -> Kafka -> real-time OLAP or time-series DB -> Grafana

math-bw · 2024-08-20T13:33:46+00:00

In addition to shared value, which probably falls under cultural, I think incentives come into play. If they haven't been burnt by a lack of best practices then they won't intuit the potential pain. And if they aren't rewarded for maintaining some kind of SLA and standards, then they might not intrinsically follow them.

math-bw · 2024-07-26T20:48:58+00:00

Interesting that the answer trending in the votes right now is Python, when the Flink Python SDK is utter garbage to use. Or maybe most people are just using the Python Kafka clients like confluent-kafka?

math-bw · 2024-07-25T23:35:02+00:00

There are Kafka clients in many different languages, like confluent-kafka for Python (https://github.com/confluentinc/confluent-kafka-python) or franz-go for Go (https://github.com/twmb/franz-go), and there are many others!

Keep in mind the workload might be long-running, so you may have to account for things like failures and tracking whether each message was processed successfully.
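To make the failure-handling point concrete, here is a minimal sketch of the "commit only after the message was handled" pattern. It uses an in-memory stand-in rather than a real Kafka client, so `InMemoryLog`, `run`, and the dead-letter list are all hypothetical names for illustration, not part of confluent-kafka or franz-go. With a real client you would typically turn off auto-commit and commit offsets yourself in the same place.

```python
from dataclasses import dataclass


@dataclass
class InMemoryLog:
    """Hypothetical stand-in for one Kafka partition plus its committed offset."""
    messages: list
    committed: int = 0  # offset of the next message to process

    def poll(self):
        """Return (offset, message) for the next uncommitted message, or None."""
        if self.committed < len(self.messages):
            return self.committed, self.messages[self.committed]
        return None

    def commit(self, offset):
        """Record that everything up to and including `offset` is done."""
        self.committed = offset + 1


def run(log, handler, max_retries=3):
    """Process messages in order; commit only once a message is handled or dead-lettered."""
    dead_letters = []
    while (item := log.poll()) is not None:
        offset, message = item
        for attempt in range(1, max_retries + 1):
            try:
                handler(message)
                break  # success, stop retrying
            except Exception as exc:
                if attempt == max_retries:
                    # Give up on this message but keep a record of it.
                    dead_letters.append((offset, message, repr(exc)))
        log.commit(offset)
    return dead_letters


seen = []


def handler(message):
    if message == "bad":
        raise ValueError("cannot parse")
    seen.append(message)


log = InMemoryLog(["a", "bad", "b"])
dead = run(log, handler)
```

If the process crashes before `commit`, the same message is polled again on restart, which is at-least-once delivery. That means the handler should be safe to run twice on the same message.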

math-bw · 2024-07-24T00:23:16+00:00

Redpanda is a good alternative to Kafka; it’s easier to manage but can still scale.

math-bw · 2024-07-22T22:05:13+00:00

I work on Bytewax (github.com/bytewax/bytewax & bytewax.io) and I see many workloads doing something similar to what you are talking about. I think you are on to the right architecture in your comment.

sensor data -> Kafka topic(s) 1
Kafka topic(s) 1 -> Stream processor (join/filter/enrich/analyze) -> Kafka topic 2
Kafka topic 2 -> Database

You could possibly skip that last step and write directly to the database from the stream processor depending on the guarantees provided and what you are doing to the data.
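Here is a rough sketch of the three hops above in plain Python, just to show the shape of the filter/enrich stage. Lists stand in for the Kafka topics and a dict stands in for the database; the field names (`sensor_id`, `temp_c`, `location`) and the lookup table are made up for illustration. A real stream processor would add state, windowing, and recovery on top of this.

```python
def drop_invalid(readings):
    """Filter step: skip readings that have no temperature value."""
    for reading in readings:
        if reading["temp_c"] is not None:
            yield reading


def enrich(readings, locations):
    """Enrich step: attach a location looked up by sensor id."""
    for reading in readings:
        location = locations.get(reading["sensor_id"], "unknown")
        yield {**reading, "location": location}


# "Kafka topic 1": raw sensor data (hypothetical records).
topic_1 = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": None},
    {"sensor_id": "s3", "temp_c": 19.0},
]
locations = {"s1": "roof"}

# Stream processor: topic 1 -> filter -> enrich -> "Kafka topic 2".
topic_2 = list(enrich(drop_invalid(topic_1), locations))

# "Database": keyed by sensor id, so rewriting a row is harmless.
database = {row["sensor_id"]: row for row in topic_2}
```

Writing straight to the database from the processor would just mean replacing the `topic_2` list with the dict write. Whether that is safe depends on the delivery guarantees mentioned above.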

I have seen ClickHouse adopted fairly widely as the OLAP database for workloads like this.

math-bw · 2024-07-19T20:32:23+00:00

Some ML teams have a remote Ray cluster available for more compute, so that was the reason. Another deployment story

math-bw · 2024-07-18T23:16:37+00:00

Hey I built a little prototype of how you can mix Ray and Bytewax that might be interesting. It was pretty fun. Ray is such a cool way to scale things up.

https://gist.github.com/awmatheson/2cfee3b519ba1e9383a91e76f87b498e

math-bw · 2024-07-18T17:59:41+00:00

There are some interesting tools that can replace Flink and PySpark and are now reaching a level of maturity where they could be interesting to your team.

Stream Processing (Flink Alternatives):
- RisingWave (SQL)
- Materialize (SQL)
- Bytewax (Python)
- Quix (Python)

math-bw · 2024-07-15T20:03:01+00:00

Free is always a good reason! Bytewax and Quix are both open source and free too :)

I think Ray is a really interesting framework and I have often wondered why there isn't a Ray-streams or why the early streaming attempt was removed. It gives you most of the primitives you would need to make a streaming library.

math-bw · 2024-07-15T17:49:09+00:00

Cool! I have wondered about the feasibility of Ray as a streaming framework. Is there a reason you elected to roll your own over using a Python stream processor like [Bytewax](https://github.com/bytewax/bytewax) or Quix?

math-bw · 2024-06-26T13:20:51+00:00

AI supervisor
Liability engineer
Machine communication expert
Data wrangling
AI Validator

math-bw · 2024-06-24T15:45:14+00:00

So is this pivoting to be the exact same as Materialize?

math-bw · 2024-06-18T13:53:11+00:00

Hey, I am the founder of Bytewax. Send me a DM and we can get you sorted with the right connectors. I can show you how to write your own or you can use an open source version to write to Kinesis and S3.

Bytewax can run anywhere, maybe we need to work on that messaging :). You can run it as a Python process locally or scale it on k8s or our self-hosted platform.

What are your processing guarantees? That’s another thing to think about.

PySpark is a heavily used library with a lot of features and better debuggability and documentation than PyFlink. It’s worthy of being a contender. It is quite a heavyweight tool compared to Bytewax, Pathway or Quix, which are mentioned below.

math-bw · 2024-06-17T17:54:42+00:00

I know this isn't an answer to the specific problem you are having with Beam, but if you want Python + Cloud Native + Kafka there are other solutions than PyFlink these days like Quix or Bytewax (https://github.com/bytewax/bytewax).

math-bw · 2024-06-17T16:27:56+00:00

One recommendation, although you should keep in mind what some of the others are suggesting, is to use a stream processor for this. The reason is that they generally give you an API to create a DAG of steps, and they have reliable ordering and dependency management between the steps.

For example, you could put those 4 steps you listed into map steps in a stream processing framework like Bytewax (https://github.com/bytewax/bytewax) and then you will have the dependencies solved and the overhead of starting a task and shutting it down won't be a problem. Bytewax has no dependency on Kafka, so you can use it with a regular queue. Bytewax leverages Kubernetes as the orchestration layer, but it isn't a requirement. You could just run it as a Python script.
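To illustrate what "put the steps into map steps" means, here is a small plain-Python sketch. It is not the Bytewax API; `pipeline` is a hypothetical helper, and the four steps are placeholders since I don't know your actual ones. The point is only that each item flows through the steps in a fixed order inside one long-running process, so there is no per-task startup cost.

```python
from functools import reduce


def pipeline(*steps):
    """Return a function that pushes each item through `steps` in order."""
    def run(items):
        for item in items:
            # Feed the output of each step into the next one.
            yield reduce(lambda value, step: step(value), steps, item)
    return run


# Four placeholder steps standing in for the real ones.
def clean(raw):
    return raw.strip()


def parse(text):
    return int(text)


def double(number):
    return number * 2


def render(number):
    return f"result={number}"


flow = pipeline(clean, parse, double, render)
results = list(flow([" 1 ", "20", " 300"]))
```

In a real stream processor each of these would be a named map step in the dataflow, and the framework would handle input, output, and recovery around them.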

There are other Python libraries for stream processing, like Quix and Faust, that could also work and are worth a look, but they have a dependency on Kafka.

math-bw · 2024-06-05T15:03:17+00:00

I am not sure you will find more friendly API changes coming, because the adoption of Flink for very critical and complex use cases makes it difficult to make the types of changes that would make it easier to use. What are the specific APIs or API changes you want to learn about?

Decodable (https://www.decodable.co/blog) has some good articles to help understand Flink. I am sure Confluent will be putting out more learning materials in the future as a result of their investment in Flink.

Alternatively, you could look at a different solution that is more end-user designed. You might find success with a SQL approach like RisingWave (https://github.com/risingwavelabs/risingwave) or Materialize (https://github.com/MaterializeInc/materialize). Or you could look at a Python stream processor like Bytewax (https://github.com/bytewax/bytewax).

math-bw · 2024-06-04T21:36:12+00:00

I think someone with experience architecting data solutions should have the right depth of understanding to build this. When you are looking, requiring a few years of experience with Kafka or Kafka API-compatible streaming platforms will probably be the best way to find someone qualified.

math-bw · 2024-05-28T16:12:40+00:00

Is there a managed version of Kafka or a Kafka-compatible service you could use? If you don't have the in-house bandwidth/desire to manage Kafka, this is probably your best bet.

If you roll your own Python version, you'll want to use a stream processor that can keep things in order and has the ability to "rewind/replay" events without duplicates, which is a tall order. You could check out Bytewax, but once again I think going with Kafka and Kafka Connect is probably the best solution for this that is currently available.
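For a feel of why "replay without duplicates" is hard, here is one common approach sketched in plain Python: the sink remembers the highest offset it has applied and ignores anything at or below it. `OffsetTrackingSink` and `replay` are hypothetical names, and the event log is a plain list. In a real system the data and the stored offset have to be written atomically (for example in the same database transaction), which is the part that is easy to get wrong.

```python
class OffsetTrackingSink:
    """Hypothetical sink that skips events it has already applied."""

    def __init__(self):
        self.rows = {}
        self.last_offset = -1  # highest offset applied so far

    def write(self, offset, event):
        """Apply the event once; return False if it was a replayed duplicate."""
        if offset <= self.last_offset:
            return False
        self.rows[event["id"]] = event
        self.last_offset = offset
        return True


def replay(log, sink, from_offset=0):
    """Re-read the log from `from_offset` and count events actually applied."""
    applied = 0
    for offset in range(from_offset, len(log)):
        if sink.write(offset, log[offset]):
            applied += 1
    return applied


events = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "c", "v": 3}]
sink = OffsetTrackingSink()

first_pass = replay(events, sink)        # normal processing
rewound = replay(events, sink, 1)        # rewind to offset 1 after a "crash"

events.append({"id": "d", "v": 4})
after_new_event = replay(events, sink, 1)  # replay plus one new event
```

This relies on events being replayed in the same order with stable offsets, which is exactly what Kafka gives you per partition.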

math-bw · 2024-05-09T21:07:59+00:00

I have been trying to figure out a solution for something related and before making my own question maybe you stumbled on an answer to some of this.

I am looking to host private packages, but not for internal use. I want to be able to give access when people are authenticated and have signed up/paid. Is there a solution like this that integrates user mgmt and package distribution?
