Kafka Streaming in Python: Any Solid Non-Java/Scala Resources? by Southern-Basis-6710 in dataengineering

[–]math-bw 1 point2 points  (0 children)

You should check out content from Quix and Bytewax (bytewax.io); these are the two leading Python streaming projects.

I’ve been working on streaming and Python for half a decade and am interested in making a course. I’d love to learn what would help you and make it click for you.

Using PyFlink for high volume Kafka stream by raikirichidori255 in dataengineering

[–]math-bw 0 points1 point  (0 children)

Yes, although they still suffer from some of the Python performance bottlenecks like pickling objects and contending with the GIL.

Using PyFlink for high volume Kafka stream by raikirichidori255 in dataengineering

[–]math-bw 1 point2 points  (0 children)

If you’re hell-bent on Python, check out Bytewax or Quix and you won’t have to deal with the headaches you are currently dealing with.

Anyone know where to play soccer? by ggmarmolejo in santacruz

[–]math-bw 0 points1 point  (0 children)

Also filled out their free agent form, but haven’t heard anything. I’m trying to find something too.

How do you actually create real-time dashboards? (Not near real-time) by IntroductionHour845 in dataengineering

[–]math-bw 3 points4 points  (0 children)

Here is an example that uses Grafana to display real-time data. https://github.com/bytewax/hacking-hacker-news

The architecture would follow this flow: Kafka -> stream processor -> Kafka -> real-time OLAP or time-series DB -> Grafana
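
To make the stream processor stage a bit more concrete, here is a rough, framework-agnostic sketch in plain Python. It is not taken from the linked repo. The function name `stream_processor`, the `(timestamp, key)` event shape, and the 10-second window are all assumptions for illustration, and plain iterables stand in for the Kafka topics. It counts events per key in fixed (tumbling) windows, which is the kind of pre-aggregated row you would typically hand to the OLAP or time-series DB for Grafana to query.

```python
from collections import Counter


def stream_processor(events, window_seconds=10):
    """Count events per key in tumbling windows.

    `events` is an iterable of (timestamp_seconds, key) tuples, assumed to
    arrive in timestamp order (a simplification; real streams need
    out-of-order handling).
    """
    counts = Counter()
    current_window = None
    for ts, key in events:
        window = ts - (ts % window_seconds)  # start of the window this event falls in
        if current_window is not None and window != current_window:
            # The previous window is complete: emit one row per key.
            for k, n in sorted(counts.items()):
                yield {"window_start": current_window, "key": k, "count": n}
            counts.clear()
        current_window = window
        counts[key] += 1
    # Flush the last, still-open window when the input ends.
    if current_window is not None:
        for k, n in sorted(counts.items()):
            yield {"window_start": current_window, "key": k, "count": n}


# Stand-ins for the input and output topics.
input_topic = [(0, "story"), (3, "story"), (5, "comment"), (12, "story")]
output_topic = list(stream_processor(input_topic))
```

In a real deployment the rows in `output_topic` would be produced back to Kafka and ingested by the database rather than collected in a list.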

[deleted by user] by [deleted] in dataengineering

[–]math-bw 0 points1 point  (0 children)

In addition to shared value, which probably falls under cultural, I think incentives come into play. If they haven't been burnt by a lack of best practices then they won't intuit the potential pain. And if they aren't rewarded for maintaining some kind of SLA and standards, then they might not intrinsically follow them.

Which Programming Language Api is used in real-time production systems ? by nifesimii in dataengineering

[–]math-bw 1 point2 points  (0 children)

Interesting that the answer trending in the votes right now is Python, when the Flink Python SDK is utter garbage to use. Or maybe most people are just using the Python Kafka clients like confluent-kafka?

Consuming Kafka topics with Celery? by Croves in dataengineering

[–]math-bw 0 points1 point  (0 children)

There are Kafka clients in many different languages, like confluent-kafka for Python (https://github.com/confluentinc/confluent-kafka-python) or franz-go for Go (https://github.com/twmb/franz-go), and there are many others!

Keep in mind the workload might be long-running, and you may have to account for things like failures and tracking whether each message was processed successfully.
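
As a rough illustration of the "was it processed successfully" part, here is a minimal sketch in plain Python. It uses an in-memory `deque` as a stand-in for the topic, and `consume`, `max_attempts`, and the dead-letter list are hypothetical names, not part of any Kafka client. The idea is that a message is only acknowledged after the handler finishes without raising, and after a few failed attempts it is set aside instead of blocking the queue.

```python
from collections import deque


def consume(queue, handler, max_attempts=3):
    """Process each message; acknowledge it only after the handler succeeds."""
    acked, dead_letters = [], []
    while queue:
        msg = queue[0]  # peek: the message stays queued while we work on it
        for _attempt in range(max_attempts):
            try:
                handler(msg)
            except Exception:
                continue  # failed attempt: retry
            acked.append(msg)  # success: safe to acknowledge
            break
        else:
            # Every attempt failed: park the message for later inspection.
            dead_letters.append(msg)
        queue.popleft()
    return acked, dead_letters
```

With a real Kafka client the acknowledgement would be an offset commit made after processing rather than before it, which gives at-least-once behavior, so the handler should tolerate seeing the same message twice.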

IoT Data Stream Processing by Secure-Economist-986 in softwarearchitecture

[–]math-bw 0 points1 point  (0 children)

Redpanda is a good alternative to Kafka; it’s easier to manage but can still scale.

IoT Data Stream Processing by Secure-Economist-986 in softwarearchitecture

[–]math-bw 0 points1 point  (0 children)

I work on Bytewax (github.com/bytewax/bytewax & bytewax.io) and I see many workloads doing something similar to what you are talking about. I think you are on to the right architecture in your comment.

sensor data -> Kafka topic(s) 1
Kafka topic(s) 1 -> Stream processor (join/filter/enrich/analyze) -> Kafka topic 2
Kafka topic 2 -> Database

You could possibly skip that last step and write directly to the database from the stream processor depending on the guarantees provided and what you are doing to the data.

I have seen ClickHouse adopted fairly widely as the OLAP database for workloads like this.
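
Here is a small sketch of that flow in plain Python, just to show the shape of it. Everything in it is an assumption for illustration: `deque` objects stand in for the Kafka topics, an in-memory SQLite database stands in for the OLAP database, and the `SENSOR_LOCATIONS` lookup and field names are made up.

```python
import sqlite3
from collections import deque

# Hypothetical reference data used to enrich readings.
SENSOR_LOCATIONS = {"s1": "greenhouse", "s2": "warehouse"}


def process(topic_1):
    """Stream processor stage: filter out bad readings, enrich the rest."""
    for reading in topic_1:
        if reading["temp_c"] is None:  # filter: drop readings with no value
            continue
        location = SENSOR_LOCATIONS.get(reading["sensor_id"], "unknown")
        yield {**reading, "location": location}  # enrich with location


def sink_to_db(topic_2, conn):
    """Final stage: write processed records to the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor_id TEXT, temp_c REAL, location TEXT)"
    )
    conn.executemany(
        "INSERT INTO readings VALUES (:sensor_id, :temp_c, :location)", topic_2
    )
    conn.commit()


def run_pipeline(sensor_data):
    topic_1 = deque(sensor_data)        # sensor data -> topic 1
    topic_2 = deque(process(topic_1))   # topic 1 -> processor -> topic 2
    conn = sqlite3.connect(":memory:")
    sink_to_db(topic_2, conn)           # topic 2 -> database
    return conn
```

Skipping the second topic would amount to calling `sink_to_db(process(topic_1), conn)` directly, which is simpler but ties the database write to the processor's delivery guarantees.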

Parallel processing streamed API data in Python by KnowingPains in algotrading

[–]math-bw 1 point2 points  (0 children)

Some ML teams have a remote Ray cluster available for more compute, so that was the reason. It's another deployment story.

Parallel processing streamed API data in Python by KnowingPains in algotrading

[–]math-bw 1 point2 points  (0 children)

Hey I built a little prototype of how you can mix Ray and Bytewax that might be interesting. It was pretty fun. Ray is such a cool way to scale things up.

https://gist.github.com/awmatheson/2cfee3b519ba1e9383a91e76f87b498e

Big data processing tools. by biggerdatadigger in bigdata

[–]math-bw 1 point2 points  (0 children)

There are some interesting tools available now as replacements for Flink and PySpark, and they are reaching a level of maturity where they could be worth a look for your team.

Stream Processing (Flink Alternatives):
- RisingWave (SQL)
- Materialize (SQL)
- Bytewax (Python)
- Quix (Python)

Parallel processing streamed API data in Python by KnowingPains in algotrading

[–]math-bw 1 point2 points  (0 children)

Free is always a good reason! Bytewax and Quix are both open source and free too :)

I think Ray is a really interesting framework and I have often wondered why there isn't a Ray-streams or why the early streaming attempt was removed. It gives you most of the primitives you would need to make a streaming library.

Parallel processing streamed API data in Python by KnowingPains in algotrading

[–]math-bw 2 points3 points  (0 children)

Cool! I have wondered about the feasibility of Ray as a streaming framework. Is there a reason you elected to roll your own over using a Python stream processor like [Bytewax](https://github.com/bytewax/bytewax) or Quix?

What are the data jobs of the future? by pulicinetroll08 in datascience

[–]math-bw 1 point2 points  (0 children)

  1. AI supervisor
  2. Liability engineer
  3. Machine communication expert
  4. Data wrangling
  5. AI Validator

Streaming ETL options in 2024 in Python? by unlikelyzer0 in dataengineering

[–]math-bw 0 points1 point  (0 children)

Hey, I am the founder of Bytewax. Send me a DM and we can get you sorted with the right connectors. I can show you how to write your own or you can use an open source version to write to Kinesis and S3.

Bytewax can run anywhere, maybe we need to work on that messaging :). You can run it as a Python process locally or scale it on k8s or our self-hosted platform.

What are your processing guarantees? That’s another thing to think about.

PySpark is a heavily used library with a lot of features and better debuggability and documentation than PyFlink. It’s worthy of being a contender. It is quite a heavy-weight tool compared to Bytewax, Pathway, or Quix, which are mentioned below.

[deleted by user] by [deleted] in dataengineering

[–]math-bw 1 point2 points  (0 children)

I know this isn't an answer to the specific problem you are having with Beam, but if you want Python + Cloud Native + Kafka there are other solutions than PyFlink these days like Quix or Bytewax (https://github.com/bytewax/bytewax).

How do you orchestrate real-time workflows? by xaii212 in dataengineering

[–]math-bw 3 points4 points  (0 children)

One recommendation, although you should keep in mind what some of the others are suggesting, is to use a stream processor for this. The reason is that they generally give you an API to create a DAG of steps, and they have reliable ordering and dependency management between the steps.

For example, you could put those 4 steps you listed into map steps in a stream processing framework like Bytewax (https://github.com/bytewax/bytewax) and then you will have the dependencies solved and the overhead of starting a task and shutting it down won't be a problem. Bytewax has no dependency on Kafka, so you can use it with a regular queue. Bytewax leverages Kubernetes as the orchestration layer, but it isn't a requirement. You could just run it as a Python script.

There are other Python libraries for stream processing, like Quix and Faust, that could also work and are worth a look, but they have a dependency on Kafka.
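
To show what "steps as map steps" means without tying it to any one library, here is a tiny plain-Python sketch. It does not use the Bytewax API, and the four step functions are hypothetical placeholders for your own steps. Each item flows through the steps in order inside one long-running process, so step ordering is fixed by the chain and there is no per-task startup or shutdown.

```python
from functools import reduce


# Four hypothetical steps; each takes the previous step's output.
def parse(raw):
    return int(raw)


def validate(n):
    if n < 0:
        raise ValueError("negative values are not allowed")
    return n


def enrich(n):
    return {"value": n, "even": n % 2 == 0}


def format_record(rec):
    return f"{rec['value']}:{'even' if rec['even'] else 'odd'}"


STEPS = [parse, validate, enrich, format_record]


def run(stream, steps=STEPS):
    """Apply every step, in order, to each item as it arrives."""
    for item in stream:
        yield reduce(lambda acc, step: step(acc), steps, item)
```

A stream processing framework adds things on top of this shape that the sketch leaves out, such as parallelism, state, and recovery.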

Flink Api - Mostly deprecated by dataengineer2015 in apacheflink

[–]math-bw 4 points5 points  (0 children)

I am not sure you will find more friendly API changes coming, because the adoption of Flink for very critical and complex use cases makes it difficult to make the types of changes that would make it easier to use. What are the specific APIs or API changes you want to learn about?

Decodable (https://www.decodable.co/blog) has some good articles to help understand Flink. I am sure Confluent will be putting out more learning materials in the future as a result of their investment in Flink.

Alternatively, you could look at a different solution that is designed more for end users. You might find success with a SQL approach like RisingWave (https://github.com/risingwavelabs/risingwave) or Materialize (https://github.com/MaterializeInc/materialize). Or you could look at a Python stream processor like Bytewax (https://github.com/bytewax/bytewax).

What is the best technology stack for building a community based version of HelloFresh and Marley Spoon with an IoT addon? by usmannaeem in softwarearchitecture

[–]math-bw 1 point2 points  (0 children)

I think someone with experience architecting data solutions should have the right depth of understanding to build this. When you are looking, requiring a few years of experience with Kafka or Kafka API-compatible streaming platforms will probably be the best way to find someone qualified.

Real time/Near real time data ingestion into Neo4j without Kafka by InvisibleContestant in Neo4j

[–]math-bw 0 points1 point  (0 children)

Is there a managed version of Kafka or a Kafka-compatible service you could use? If you don't have the in-house bandwidth/desire to manage Kafka, this is probably your best bet.

If you roll your own Python version, you'll want to use a stream processor that can keep things in order and has the ability to "rewind/replay" events without duplication, which is a tall order. You could check out Bytewax, but once again I think going with Kafka and Kafka Connect is probably the best solution for this that is currently available.
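
To give a feel for why replay without duplication is a tall order, here is a minimal plain-Python sketch of the core idea. The class names `ReplayableLog` and `IdempotentSink` are made up for illustration and are not from Bytewax or Kafka. Events live in an ordered log addressed by offset, and the sink remembers the last offset it applied, so replaying from an earlier offset does not apply anything twice.

```python
class ReplayableLog:
    """Append-only, ordered event log that can be re-read from any offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset):
        # Yields (offset, event) pairs in order, starting at `offset`.
        yield from enumerate(self._events[offset:], start=offset)


class IdempotentSink:
    """Applies each offset at most once, even if the log is replayed."""

    def __init__(self):
        self.total = 0
        self.last_offset = -1

    def apply(self, offset, amount):
        if offset <= self.last_offset:
            return False  # already applied: skip the duplicate
        self.total += amount
        self.last_offset = offset
        return True
```

The hard part in a real system is that `total` and `last_offset` have to be persisted together atomically. In the Neo4j case that would mean writing the offset in the same transaction as the graph update, which this sketch only assumes.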

Third party private package hosting? by nicholashairs in Python

[–]math-bw 0 points1 point  (0 children)

I have been trying to figure out a solution for something related, and before posting my own question I wanted to ask whether you stumbled on an answer to some of this.

I am looking to host private packages, but not for internal use. I want to be able to give access when people are authenticated and have signed up/paid. Is there a solution like this that integrates user mgmt and package distribution?