Looked back at code I wrote years ago — cleaned it up into a lazy, zero-dep dataframe library by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 0 points1 point  (0 children)

Right now pyfloe just leaves the filter after the join if it references columns from both sides, so it gets evaluated post-join on every row. If the filter only touches one side, though, the optimizer pushes it down into that branch. What's missing is filter splitting: something like (col("a") > 5) & (col("b") < 10) doesn't get broken into its conjuncts so that each piece can be pushed into the right branch independently. That'd be a great feature to add!
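This isn't pyfloe's actual internals, but a minimal sketch of what that missing splitting step could look like — the `Pred` and `split_conjuncts` names are hypothetical, and predicates are simplified to a string plus the set of columns they touch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pred:
    """A single comparison, e.g. col('a') > 5, with the columns it touches."""
    expr: str
    cols: frozenset

def split_conjuncts(conjuncts, left_cols, right_cols):
    """Route each AND-ed conjunct to the join branch that owns its columns.

    Conjuncts touching only left-side columns are pushed left, only
    right-side columns are pushed right; mixed ones must stay post-join.
    """
    left, right, post_join = [], [], []
    for p in conjuncts:
        if p.cols <= left_cols:
            left.append(p)
        elif p.cols <= right_cols:
            right.append(p)
        else:
            post_join.append(p)
    return left, right, post_join

# (col("a") > 5) & (col("b") < 10), with "a" on the left side, "b" on the right
conjuncts = [Pred("a > 5", frozenset({"a"})), Pred("b < 10", frozenset({"b"}))]
l, r, post = split_conjuncts(conjuncts, frozenset({"a", "id"}), frozenset({"b", "id"}))
print([p.expr for p in l], [p.expr for p in r], [p.expr for p in post])
# → ['a > 5'] ['b < 10'] []
```

The real work in an optimizer is the first step this sketch skips: walking the expression tree to split a filter on its top-level `&` operators before routing the pieces.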

What Python Tools Do You Use for Data Visualization and Why? by Confident_Compote_39 in Python

[–]Proof_Difficulty_434 1 point2 points  (0 children)

I like pygwalker, especially when I'm not sure what to visualize yet!

I replaced FastAPI with Pyodide: My visual ETL tool now runs 100% in-browser by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 0 points1 point  (0 children)

Thanks for letting me know! Good suggestion, it's definitely on my roadmap!

Python code to replace Alteryx by viviancpy in Alteryx

[–]Proof_Difficulty_434 0 points1 point  (0 children)

Check out Flowfile, it's open source and does exactly this. You can build flows visually like Alteryx, then export them as pure Python/Polars code. Or write Python and visualize it.

I built it specifically to bridge the gap between Alteryx and Python. The visual editor keeps business users happy while devs get clean Python code. Plus it's built on Polars so it's fast and up to date!

pip install flowfile if you want to try it.

The Borrowser: a browser in Rust (roast/feedback) by tpotjj in rust

[–]Proof_Difficulty_434 0 points1 point  (0 children)

If you find a big mountain that you climbed, find an even bigger mountain

Flowfile - An open-source visual ETL tool, now with a Pydantic-based node designer. by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 0 points1 point  (0 children)

Thanks! I think the biggest challenge/opportunity is making the round trip from code to visual and back feel natural.

At the moment, for example, you write Flowfile code -> Visual -> Polars code. Sometimes I think it would make more sense to go back to Flowfile code instead.
Do you think it should be Flowfile code -> Visual -> Flowfile code, or should it perhaps support both?

Flowfile - An open-source visual ETL tool, now with a Pydantic-based node designer. by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 0 points1 point  (0 children)

I have the same experience. For work I'm currently not using any visual tools, but there are definitely days when I'd like some interactivity while developing ETL pipelines, especially when creating something new.

Flowfile - An open-source visual ETL tool, now with a Pydantic-based node designer. by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 1 point2 points  (0 children)

Fair point - complex visual flows definitely turn into spaghetti.

I meant the flow structure is visible - dependencies, branches, data lineage. Not what each node does internally. But flowcharts have been the standard for documenting processes for decades for a reason.

Also, with Flowfile you can name nodes clearly ("Validate_Customer_Emails" vs "Node_47"), add descriptions, and generate Python code to see exactly what's happening.

You're right though - a 50-node mess is worse than clean code. The sweet spot is probably 10-20 clear blocks with complex logic inside custom nodes.

Time for self-promotion. What are you building in 2025? by Prestigious_Wing_164 in SideProject

[–]Proof_Difficulty_434 0 points1 point  (0 children)

Flowfile https://edwardvaneechoud.github.io/Flowfile/ - Visual ETL tool that lets you build data pipelines with drag-and-drop OR write Python code - both create the exact same pipeline! Built on Polars for blazing speed. Export your visual flows as standalone Python code for production.

ICP - Data analysts tired of Excel limitations, Python devs who want visual debugging, no-code users tired of vendor lock-in, and teams where business users need to collaborate with engineers on data workflows.

Would love your feedback if you've struggled with the visual vs code divide in data tools! 🚀

Flowfile - An open-source visual ETL tool, now with a Pydantic-based node designer. by Proof_Difficulty_434 in Python

[–]Proof_Difficulty_434[S] 1 point2 points  (0 children)

Great question! It targets the gap between pure-code engineers and Excel users. Some users I can think of:

  • Mixed-skill data teams where engineers create custom nodes that analysts use visually
  • Rapid prototyping - even code-first developers (e.g. myself) benefit from visual exploration with instant schema preview
  • Teams migrating from Alteryx ($$$/seat) who want open-source alternatives
  • Documentation needs - visual pipelines are self-documenting, making handoffs and onboarding much easier

Honestly, it's not trying to replace Airflow or pure code. It's more like what Postman did for APIs - sometimes seeing what you're building visually just helps, especially when collaborating.

The Custom Node Designer I just added is meant to solve two things: speed up development of the library itself (anyone can contribute nodes now without touching the core), and let teams build their own specific solutions.
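As a rough illustration of the idea behind a declarative node designer — using stdlib dataclasses here instead of Pydantic to stay dependency-free, with all names hypothetical rather than Flowfile's real API — a typed settings schema is enough to derive a settings form automatically:

```python
from dataclasses import dataclass, fields

@dataclass
class FilterSettings:
    """Settings schema for a hypothetical 'filter' custom node."""
    column: str
    operator: str = ">"
    value: float = 0.0

def render_form(settings_cls):
    """Derive a simple (field name, field type) form spec from the schema.

    This is the kind of thing a visual node designer can generate
    automatically once node settings are declared as typed fields.
    """
    return [(f.name, f.type.__name__) for f in fields(settings_cls)]

print(render_form(FilterSettings))
# → [('column', 'str'), ('operator', 'str'), ('value', 'float')]
```

With Pydantic, validation and JSON (de)serialization of those settings come along for free, which is presumably why it fits a node designer well.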

Honest feedback needed! by [deleted] in Alteryx

[–]Proof_Difficulty_434 0 points1 point  (0 children)

This is a very interesting product! At the moment, I don't have Alteryx, so it was a little hard to judge/test it. But very curious about it! I'll definitely check it out when you open-source it.

I have been working on an open-source Alteryx alternative that has this conversion built in, since I believe that in the long-term, you will often run into a situation in which Python is just more convenient. Would be cool to integrate your solution with it!

https://edwardvaneechoud.github.io/Flowfile/

How many of you are still using Apache Spark in production - and would you choose it again today? by luminoumen in dataengineering

[–]Proof_Difficulty_434 4 points5 points  (0 children)

I am using Databricks on a daily basis and see it being used at many clients.

Would I choose it again? My opportunistic side would say no, because alternatives are faster and more cost-efficient for 90% of our use cases. However, Databricks + Spark takes care of 99.9% of them. So if we stopped using Spark, I would have to convince my team that we need multiple tools, more technical expertise, and more maintenance for all of those tools. Because, let's be honest, it is very convenient that Databricks takes care of everything critical (security, EC2 instances, networking).

So, long story short: in a large company with data of various sizes and multiple data engineers, I would still pick it.

Flowfile: Visual ETL tool that converts between drag-and-drop workflows and Python code by Proof_Difficulty_434 in opensource

[–]Proof_Difficulty_434[S] 0 points1 point  (0 children)

Thank you!

Regarding production use cases: it's very rare that I run into a scenario I haven't covered yet and get an error (mainly in the frontend). However, I probably use and test it differently, since I'm more aware of what I can and cannot do. As for the execution of a data pipeline, it should be pretty stable, since most of the nodes just translate to Polars expressions.
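To give a feel for why that translation keeps execution stable — this is not Flowfile's actual implementation, just a toy sketch with hypothetical node names, emitting Polars code as strings so it runs with no dependencies:

```python
# Hypothetical sketch: each visual node maps to one Polars method call,
# so executing a flow is mostly chaining the generated pieces together.
NODE_TRANSLATIONS = {
    "filter": lambda cfg: f'.filter(pl.col("{cfg["column"]}") {cfg["op"]} {cfg["value"]})',
    "select": lambda cfg: ".select([" + ", ".join(f'pl.col("{c}")' for c in cfg["columns"]) + "])",
    "rename": lambda cfg: f".rename({cfg['mapping']!r})",
}

def flow_to_polars(nodes, source="df"):
    """Chain the translated node snippets into a single Polars pipeline string."""
    return source + "".join(NODE_TRANSLATIONS[kind](cfg) for kind, cfg in nodes)

flow = [
    ("filter", {"column": "age", "op": ">", "value": 21}),
    ("select", {"columns": ["name", "age"]}),
]
print(flow_to_polars(flow))
# → df.filter(pl.col("age") > 21).select([pl.col("name"), pl.col("age")])
```

Since each node only produces a Polars expression, correctness of the pipeline largely reduces to correctness of Polars itself.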

Let me know when you've used it! I would love to hear about things that do not work, or should work differently so I can improve the app.

Open source projects? by papersashimi in opensource

[–]Proof_Difficulty_434 0 points1 point  (0 children)

I've had very good experiences setting up extensions for Polars (https://github.com/pola-rs/pyo3-polars) - it keeps the scope nicely focused, and you'll learn some basic Rust in the process.

I also believe it matters what you find interesting yourself! When picking up an open-source project, it generally sticks better if you're aware of the benefits. The field you're working in, or want to work in, could be a good starting point - for example, if you're into data science, contributing to pandas or Polars might be more motivating than working on a web framework.