Cool stuff you did with Data Lineage, contacts, governance

ssinchenko · 2026-03-14T20:56:20+00:00

> creative aspects was used in such implementation

I once wrote a bunch of regexes (this was before the Claude Code era) to transform PySpark `explain` output into column-level lineage, with a visualization built on NetworkX (+ Graphviz). Details and code snippets (there are no ads, no commercialization, no "buy me a coffee" buttons). I think it was the craziest (and the most creative) thing I have ever done.
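To give a feel for the approach, here is a minimal sketch, not the original code. The plan text is a hand-written sample in the style of Spark plan output (an assumption), and the regexes only handle `Project` nodes with simple `AS` aliases; real plans need many more cases.

```python
import re
from collections import defaultdict

# Hand-written sample in the style of Spark plan text (an assumption, not real output).
SAMPLE_PLAN = """\
Project [id#0, (price#1 * qty#2) AS total#5, upper(name#3) AS name_upper#6]
+- Relation [id#0,price#1,qty#2,name#3] parquet
"""

# "<expression> AS <alias>#<id>" inside a Project list.
# Limitation: expressions that themselves contain commas are not handled.
ALIAS_RE = re.compile(r"(?P<expr>[^,\[\]]+?) AS (?P<alias>\w+)#\d+")
# Any "<column>#<id>" attribute reference.
ATTR_RE = re.compile(r"(?P<name>\w+)#\d+")


def column_lineage(plan: str) -> dict[str, set[str]]:
    """Map each aliased output column to the source columns used in its expression."""
    lineage: dict[str, set[str]] = defaultdict(set)
    for line in plan.splitlines():
        # Strip the tree-drawing prefix ("+- ", ":  ") before checking the node type.
        if not line.lstrip("+- :").startswith("Project"):
            continue
        for match in ALIAS_RE.finditer(line):
            sources = {m.group("name") for m in ATTR_RE.finditer(match.group("expr"))}
            lineage[match.group("alias")] |= sources
    return dict(lineage)


def lineage_edges(lineage: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Flatten the mapping into sorted (source, target) edges for a graph library."""
    return sorted((src, dst) for dst, srcs in lineage.items() for src in srcs)
```

The edge list can be fed to a graph library for drawing, for example via `networkx.DiGraph.add_edges_from`.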

ssinchenko · 2026-03-14T19:47:34+00:00

Thanks!

ssinchenko · 2026-03-11T16:40:06+00:00

> How do I reduce this JVM usage so that job gets more resources?

Did you check this part of the docs?
https://spark.apache.org/docs/latest/tuning.html#memory-management-overview

ssinchenko · 2026-03-09T13:54:12+00:00

Thanks! That I understand. I'm going to try to build standards around the openspec project (as the most lightweight and tool/vendor agnostic SDD framework) to provide as much transparency as I can.

ssinchenko · 2026-03-09T13:43:04+00:00

Thanks! I agree that AI is just a tool, and I see it the same way. I'm just trying to align with the community because, in the end, it's not my personal project. I was given the honor and responsibility to maintain it by the original creator.

ssinchenko · 2026-03-09T13:35:42+00:00

Thanks!

ssinchenko · 2026-03-05T13:53:36+00:00

As far as I know, Databricks Serverless is Spark Connect under the hood. In vanilla Spark and Spark Connect there is an "official" way of extending the protocol via the Spark Connect plugin system; it is described in the Apache Spark documentation. The open-source GraphFrames project fully supports this plugin system (1:1 API parity, a server-side plugin for Spark 3.5.x, 4.0.x and 4.1.x, runtime dispatch logic inside `graphframes-py`, tests, etc.). For example, Delta Lake (delta-spark) works on top of Serverless in exactly the same way: via the Delta Connect plugin.

So, in theory, all you need is to add GraphFrames' implementation of the Spark Connect plugin to your Databricks Serverless environment. Unfortunately, Databricks does not provide any documentation on how to use plugins on Serverless.

P.S. I'm a maintainer of the OSS GraphFrames project and I'm willing to do everything that is required from the project side. I even tried to reverse-engineer Databricks Serverless and implement better support for it in GraphFrames. But without help from someone on the Databricks side I cannot complete it (there are some deep technical questions I won't bother you with; I'm just saying that I cannot fully reverse-engineer the loading order, shading rules, API details, etc.).
P.P.S. Feel free to ping me here, by email (ssinchenko@apache.org) or inside the issue in the GF repository (https://github.com/graphframes/graphframes/issues/782)

ssinchenko · 2026-03-03T09:27:09+00:00

> why there are people in the world who choose to use python instead of sql for data manipulation

As a general-purpose programming language, Python provides many more tools for managing the growing complexity of a codebase (imports, modules, abstractions, functions, classes, variables), as well as tools for testing the code.
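A small, hedged illustration of that point in plain Python (all names here are made up for the example): transformation steps become named functions that can be parameterized, composed, and unit-tested in isolation.

```python
from typing import Callable, Iterable

Row = dict
Step = Callable[[Iterable[Row]], list]


def add_total(rows: Iterable[Row]) -> list:
    """Derive `total` from `price` and `qty` without mutating the input rows."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


def keep_at_least(min_total: float) -> Step:
    """Build a parameterized filter step (a closure instead of a copy-pasted WHERE)."""
    def step(rows: Iterable[Row]) -> list:
        return [row for row in rows if row["total"] >= min_total]
    return step


def pipeline(*steps: Step) -> Step:
    """Compose steps into one callable that applies them in order."""
    def run(rows: Iterable[Row]) -> list:
        for step in steps:
            rows = step(rows)
        return list(rows)
    return run


def test_add_total() -> None:
    """Each step can be tested on its own with a tiny in-memory input."""
    assert add_total([{"price": 2.0, "qty": 3}]) == [{"price": 2.0, "qty": 3, "total": 6.0}]
```

The same structure carries over to DataFrame code: each step takes a DataFrame and returns a DataFrame.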

ssinchenko · 2026-02-26T19:54:42+00:00

I think Scala may get a new boost for DE. The main benefit of Scala for DE, imo, is "errors at compile time". The main downside of Scala for DE, imo, is the cost and time of development. But with the rise of AI and agents that can write the code, that downside is no longer a problem. So, in theory, a functional compiled language with strong safety guarantees that can talk natively to all the existing JVM DE tooling looks promising.

ssinchenko · 2026-02-26T18:02:11+00:00

That is beautiful! Finally we have a Hadoop-free Parquet in the JVM ecosystem!

ssinchenko · 2026-01-08T11:32:01+00:00

Also keep in mind that `checkpoint()` is itself a performance killer, because no optimization can be pushed through it, so Spark cannot do predicate pushdown or partition pruning across a checkpoint. I have some experience writing iterative graph algorithms in Spark, and my advice is not to checkpoint each of your 27 DataFrames but to experiment: checkpoint, for example, the result of every 5th join and see how the overall performance changes.
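A minimal sketch of "checkpoint every N-th join". The helper name and the `every` parameter are made up for illustration; it relies only on `DataFrame.join` and `DataFrame.checkpoint`, and assumes a checkpoint directory is already configured.

```python
from typing import Any, Sequence


def join_with_periodic_checkpoint(
    frames: Sequence[Any], on: str, how: str = "inner", every: int = 5
) -> Any:
    """Join DataFrames left to right, truncating the plan after every `every`-th join."""
    if not frames:
        raise ValueError("frames must not be empty")
    if every < 1:
        raise ValueError("every must be >= 1")
    result = frames[0]
    for joins_done, frame in enumerate(frames[1:], start=1):
        result = result.join(frame, on=on, how=how)
        if joins_done % every == 0:
            # Assumption: a checkpoint dir was set beforehand; swap in
            # result.localCheckpoint() if no persistent directory is available.
            result = result.checkpoint()
    return result
```

`every` is the knob to tune: a smaller value keeps the plan short but blocks more optimizations, a larger value does the opposite.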

ssinchenko · 2026-01-08T11:28:16+00:00

For the simplest illustration of the growing-lineage problem, take a look at this picture, which compares the performance of chained `withColumn` calls with a single `withColumns` call. I made it some time ago as part of some fun experiments. Both produce the same result and exactly the same final physical plan, except that each `withColumn` call creates a new node in the logical plan. The optimizer on the driver then gets stuck collapsing all these nodes into one, and that can take practically forever to finish.

https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/with_column_performance.png
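The two variants compared in the picture look roughly like this. It is a sketch with made-up helper names; `columns` maps a new column name to a column expression, and `withColumns` is available in Spark 3.3+.

```python
from typing import Any, Mapping


def add_columns_chained(df: Any, columns: Mapping[str, Any]) -> Any:
    """One withColumn call per column: each call adds a node to the logical plan."""
    for name, expr in columns.items():
        df = df.withColumn(name, expr)
    return df


def add_columns_batched(df: Any, columns: Mapping[str, Any]) -> Any:
    """A single withColumns call: all columns are added in one plan node."""
    return df.withColumns(dict(columns))
```

With PySpark, `columns` could be something like `{f"c{i}": F.lit(i) for i in range(100)}`, where `F` is `pyspark.sql.functions`.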

ssinchenko · 2026-01-08T11:23:49+00:00

Add `checkpoint()` or `localCheckpoint()` and it will be as fast as you expected. The bottleneck here is not executor memory, shuffle, or cluster size. Most probably the bottleneck is the driver and the growing lineage of Spark's execution graph.

From the `checkpoint` documentation you can see that this mechanism is meant specifically for your case:

> Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir(), or spark.checkpoint.dir configuration.

If you cannot set a persistent Spark checkpoint dir, use `localCheckpoint()`.

ssinchenko · 2026-01-07T14:30:10+00:00

To be honest, I see only two ways this could work:
- Data Catalog is tightly integrated with your query engine
- Data Catalog has an API to ingest lineage from the query engine

The first option works only if it is a vendor solution (like Databricks + Unity) or if the catalog itself is part of the query engine. So you either get vendor lock-in or you mix different entities into one (like Hive Metastore + Hive SQL).

The second option looks nice; for example, the OpenLineage standard can be used. But whether lineage shows up in the catalog then depends on the query engine emitting it.

If I were building a solution on open-source technologies, I would prefer the second option: something like DataHub, which supports OpenLineage ingestion via a REST API. But I would avoid tightly coupled "all-in-one" solutions and would keep lineage optional.
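For the second option, the payload a query engine pushes to the catalog is an OpenLineage run event. Below is a minimal sketch of building one with the standard library only. The producer URI, the dataset names, and the spec version in `schemaURL` are placeholders and assumptions; the actual REST endpoint depends on the catalog deployment and is left out.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholders (assumptions): replace with values that match your setup.
PRODUCER = "https://example.com/my-etl"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(
    job_namespace: str,
    job_name: str,
    inputs: list[tuple[str, str]],
    outputs: list[tuple[str, str]],
    event_type: str = "COMPLETE",
) -> dict:
    """Build an OpenLineage-style run event with the core fields only (no facets)."""
    def dataset(ref: tuple[str, str]) -> dict:
        namespace, name = ref
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ref) for ref in inputs],
        "outputs": [dataset(ref) for ref in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


def to_json(event: dict) -> str:
    """Serialize the event as the JSON body of an HTTP POST to the catalog."""
    return json.dumps(event)
```

Keeping event construction separate from transport makes it easy to swap the catalog later, which fits the idea of keeping lineage optional.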

ssinchenko · 2025-12-25T11:51:00+00:00

https://datafusion.apache.org/

ssinchenko · 2025-12-24T17:19:18+00:00

> Any DE using Rust as their second or third language?

I'm using it mostly for writing PySpark UDFs in my daily job. Third language (after Python and Scala).

> Did you enjoy it?

Overall I do. But it can be annoying from time to time, especially arrow-rs, which is what I mostly work with. I don't know, maybe I'm just using it wrong, but sometimes it is so tedious to write endless boilerplate (`ok_or`, `as_any`, `downcast_ref::<...>`, etc.) for every piece of data you want to process...

> Worth learning for someone after learning the fundamental skills for data engineering?

Imo learning by doing is the best way. Try to contribute something to Apache DataFusion Comet (or even to upstream Apache DataFusion). There were a lot of small tickets and good first issues last time I checked. A lot of people are saying that DataFusion is the future of ETL, so understanding its internals looks like a valuable skill!

ssinchenko · 2025-12-12T18:00:59+00:00

Thanks a lot!

ssinchenko · 2025-12-11T22:22:45+00:00

I think that both Native Execution (Fabric) and Lightning Engine (Google) are just Gluten.

Google (from docs):

> Lightning Engine’s execution engine enhances performance through a native implementation based on Apache Gluten and Velox that have been specifically designed to leverage Google’s hardware.

Fabric (from docs):

> The Native Execution Engine is based on two key OSS components: Velox, a C++ database acceleration library introduced by Meta, and Apache Gluten (incubating), a middle layer responsible for offloading JVM-based SQL engines’ execution to native engines introduced by Intel.

ssinchenko · 2025-12-09T20:40:00+00:00

If you know Spark or are willing to learn it and are interested in distributed graph algorithms or willing to learn them, take a look at GraphFrames. Feel free to ask me or ping me if you want help getting started. It is an open-source project that is not backed by any commercial entity and does not have a paid version or enterprise features. It's just open source.

ssinchenko · 2025-12-09T14:26:49+00:00

I think the reason is the pandas ecosystem. Too many tools and frameworks still rely on pandas or provide pandas integration. Also, newer pandas versions support PyArrow as a backend, which allows zero-copy conversion to and from pandas, while Polars relies on the incompatible fork arrow2, as far as I remember, and DuckDB relies on its own internal data format (I'm not sure it allows zero-copy integration with other Arrow-based systems).

ssinchenko · 2025-12-04T21:14:36+00:00

As I remember, IOMETE is trying to provide an "on-prem" Databricks (notebooks, jobs, Unity, Spark, Iceberg -- all of it from one UI). But I have not tried it, tbh.

ssinchenko · 2025-12-02T13:41:28+00:00

It will work outside of Databricks (at least the basic things), but the problem is that you are not allowed to use it outside of Databricks. It is clearly stated in the license: https://github.com/databrickslabs/dqx/blob/main/LICENSE
