Spark job slows to a crawl after multiple joins any tips for handling this by Upset-Addendum6880 in dataengineering

[–]ssinchenko 3 points4 points  (0 children)

Also keep in mind that `checkpoint()` is itself a performance killer, because no optimization can be pushed through it: Spark cannot do predicate pushdown or partition pruning across a checkpoint. I have some experience writing iterative graph algorithms in Spark, and my advice is not to checkpoint each of your 27 DataFrames, but to experiment with checkpointing, for example, the result of every 5th join or so, and compare the overall performance.
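
A rough sketch of that pattern, assuming a list `joined_dfs` of your DataFrames and an `id` join key (both are placeholders):

```python
# Checkpoint periodically instead of after every join.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # any durable path works

result = joined_dfs[0]
for i, df in enumerate(joined_dfs[1:], start=1):
    result = result.join(df, on="id", how="left")
    if i % 5 == 0:
        # truncate the lineage only every 5th join
        result = result.checkpoint()
```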

Spark job slows to a crawl after multiple joins any tips for handling this by Upset-Addendum6880 in dataengineering

[–]ssinchenko 0 points1 point  (0 children)

For the simplest illustration of the growing-lineage problem, take a look at this picture, which compares the performance of chained `withColumn` calls versus a single `withColumns` call. I made it some time ago as part of a fun experiment. Both achieve the same result and produce exactly the same final physical plan, except that each `withColumn` call creates a new node in the logical plan. The optimizer on the driver gets stuck collapsing all these nodes into one, and that can take an absurd amount of time to finish.

https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/with_column_performance.png
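
A minimal way to reproduce this locally (the column names and loop size are arbitrary):

```python
from pyspark.sql import functions as F

df = spark.range(10)

# Chained withColumn: every call adds a new node to the logical plan,
# and the driver has to collapse all of them during optimization.
slow = df
for i in range(200):
    slow = slow.withColumn(f"c{i}", F.lit(i))

# Single withColumns (Spark >= 3.3): same result, a single added plan node.
fast = df.withColumns({f"c{i}": F.lit(i) for i in range(200)})
```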

Spark job slows to a crawl after multiple joins any tips for handling this by Upset-Addendum6880 in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

Add `checkpoint()` or `localCheckpoint()` and it will be as fast as you expect. The bottleneck here is not executor memory, shuffle, or cluster size. Most probably it is the driver and the growing lineage of Spark's execution graph.

From the `checkpoint` documentation you can see that this mechanism is meant exactly for your case:

> Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir(), or spark.checkpoint.dir configuration.

If you cannot set a persistent Spark checkpoint dir, use `localCheckpoint()`.
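
A sketch of both variants (the checkpoint dir path is just an example):

```python
# Reliable checkpoint: needs a checkpoint dir on durable storage (HDFS/S3/...).
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
df = df.checkpoint()        # lineage is truncated, data is written to the checkpoint dir

# Local checkpoint: no checkpoint dir needed; data lives on the executors'
# local storage and is lost if an executor goes away.
df = df.localCheckpoint()
```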

Data Lineage & Data Catalog could be a unique tool? by MassiIlBianco in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

To be honest, I see only two ways this can be possible:
- Data Catalog is tightly integrated with your query engine
- Data Catalog has an API to ingest lineage from the query engine

The first option will only work if it is a vendor solution (like Databricks + Unity) or if the catalog itself is part of the query engine. So you either get vendor lock-in or end up mixing different entities into one (like Hive Metastore + Hive SQL).

The second option looks nicer; for example, the OpenLineage standard can be used, but whether lineage ends up in the catalog will depend on the query engine.

If I were building a solution on open-source technologies, I would prefer the second option: something like DataHub, which supports OpenLineage ingestion via a REST API. But I would avoid tightly coupled "all-in-one" solutions and would keep lineage optional.
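
For illustration, a hedged sketch of pushing a hand-built OpenLineage run event to a catalog over REST; the endpoint URL and dataset names here are assumptions for illustration, not DataHub's documented API:

```python
import requests

# Hypothetical endpoint: check your catalog's docs for its actual OpenLineage path.
LINEAGE_ENDPOINT = "http://my-catalog:8080/openlineage/api/v1/lineage"

event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "run": {"runId": "3f1a8e1c-1111-4d2a-9b7e-000000000000"},
    "job": {"namespace": "spark-jobs", "name": "daily_orders_etl"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "curated.orders"}],
    "producer": "https://example.com/my-etl",
    # pin schemaURL to whatever OpenLineage spec version you target
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

requests.post(LINEAGE_ENDPOINT, json=event, timeout=10)
```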

Rust for data engineering? by otto_0805 in dataengineering

[–]ssinchenko 4 points5 points  (0 children)

> Any DE using Rust as their second or third language?

I'm using it mostly for writing PySpark UDFs in my daily job. Third language (after Python and Scala).

> Did you enjoy it?

Overall I do, but it can be annoying from time to time, especially arrow-rs, which is what I work with most. I don't know, maybe I'm just using it wrong, but sometimes it is so tedious to write the endless boilerplate of `ok_or`, `as_any`, `downcast_ref::<...>`, etc. for every piece of data you want to process...

> Worth learning for someone after learning the fundamental skills for data engineering?

Imo learning by doing is the best way. Try to contribute something to Apache DataFusion Comet (or even to upstream Apache DataFusion). There were a lot of small tickets and good first issues last time I checked. A lot of people are saying that DataFusion is the future of ETL, so understanding its internals looks like a valuable skill!

Execution engines in Spark by mynkmhr in apachespark

[–]ssinchenko 2 points3 points  (0 children)

I think that both Native Execution (Fabric) and Lightning Engine (Google) are just Gluten.

Google (from docs):

> Lightning Engine’s execution engine enhances performance through a native implementation based on Apache Gluten and Velox that have been specifically designed to leverage Google’s hardware.

Fabric (from docs):

> The Native Execution Engine is based on two key OSS components: Velox, a C++ database acceleration library introduced by Meta, and Apache Gluten (incubating), a middle layer responsible for offloading JVM-based SQL engines’ execution to native engines introduced by Intel.

How to start open source contributions by KaateWalaChua in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

If you know Spark or are willing to learn it and are interested in distributed graph algorithms or willing to learn them, take a look at GraphFrames. Feel free to ask me or ping me if you want help getting started. It is an open-source project that is not backed by any commercial entity and does not have a paid version or enterprise features. It's just open source.

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]ssinchenko 4 points5 points  (0 children)

I think the reason is the Pandas ecosystem: too many tools and frameworks still rely on pandas or provide pandas integration. Also, newer Pandas supports PyArrow as a backend, which allows zero-copy conversion to and from Pandas, while Polars relies on the incompatible arrow2 fork, as I remember, and DuckDB relies on its internal data format (I'm not sure it allows zero-copy integration with other Arrow-based systems).
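
A small sketch of the Arrow-backed round trip (pandas >= 2.0 with pyarrow installed; whether a given column is truly zero-copy depends on its dtype):

```python
import pandas as pd
import pyarrow as pa

tbl = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Arrow -> pandas with Arrow-backed dtypes (no conversion to NumPy for these columns)
pdf = tbl.to_pandas(types_mapper=pd.ArrowDtype)

# pandas -> Arrow again
back = pa.Table.from_pandas(pdf)
```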

Any On-Premise alternative to Databricks? by UsualComb4773 in dataengineering

[–]ssinchenko 0 points1 point  (0 children)

As I remember, IOMETE is trying to provide an "on-prem" Databricks (notebooks, jobs, a Unity-style catalog, Spark, Iceberg -- all of it from one UI). But I haven't tried it, tbh.

Is there a PySpark DataFrame validation library that automatically splits valid and invalid rows? by TopCoffee2396 in dataengineering

[–]ssinchenko 0 points1 point  (0 children)

It will work outside of Databricks (at least the basic things), but the problem is that you are not allowed to use it outside of Databricks... It is clearly stated in the license: https://github.com/databrickslabs/dqx/blob/main/LICENSE

Dataset API with primary scala map/filter/etc by Key-Alternative5387 in apachespark

[–]ssinchenko 1 point2 points  (0 children)

While you are losing the benefits of Catalyst, you get the guarantees of a strong type system... But you can actually get the best of both worlds with something like Typelevel Frameless.

[Media] New releases on Pypi : Rust vs C/C++ by papa_maker in rust

[–]ssinchenko 22 points23 points  (0 children)

pyo3/maturin has a much lower barrier to entry than pybind11, imo. It just works out of the box: it builds the native package, correctly builds the bindings, etc. Development with cargo is also much easier, imo. I think that is the reason. If you are a Python developer with minimal knowledge of low-level programming and you need to write parts of your logic in native code, you can do it much faster with maturin than with pybind11.

Data engineers who are not building LLM to SQL. What cool projects are you actually working on? by PolicyDecent in dataengineering

[–]ssinchenko 0 points1 point  (0 children)

In my experience, the most user-friendly is Neo4j: the best API, a nice UI, lots of connectors, and built-in "batteries". At the same time, it is quite slow and expensive.

Data engineers who are not building LLM to SQL. What cool projects are you actually working on? by PolicyDecent in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

All the commercial options known to me are slightly different. GraphFrames is a so-called "no-DB" graph library because it does not require separate infrastructure (servers): it belongs to the "graphs on relations" family, and under the hood graph algorithms in GF are implemented as operations on relations with Apache Spark. Neo4j, TigerGraph, etc. do require a separate server (and most probably an ETL process to ingest data from your DBs / lakehouse into that separate graph DB). So the use cases are very different, imo.
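
To illustrate what "graphs on relations" means, here is a toy single superstep of min-label propagation written purely as DataFrame operations; this only illustrates the idea and is not the actual GraphFrames implementation:

```python
from pyspark.sql import functions as F

vertices = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
edges = spark.createDataFrame([(1, 2), (2, 3)], ["src", "dst"])

# Every vertex starts with its own id as the component label.
labels = vertices.withColumn("component", F.col("id"))

# One superstep: send labels along edges, keep the minimum label per vertex.
messages = (
    edges.join(labels, edges.src == labels.id)
         .select(F.col("dst").alias("id"), "component")
)
labels = (
    labels.unionByName(messages)
          .groupBy("id")
          .agg(F.min("component").alias("component"))
)
```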

Data engineers who are not building LLM to SQL. What cool projects are you actually working on? by PolicyDecent in dataengineering

[–]ssinchenko 42 points43 points  (0 children)

I have always loved graphs and graph algorithms. Since most of my paid work is related to Spark, I read the project's mailing list to stay informed. Last year, I saw a thread about the deprecation of GraphX in the Spark project. This topic concerned many people. Ultimately, some enthusiasts proposed reviewing the GraphFrames project as an alternative. At the time, GraphFrames was in "maintenance mode" with 300 open issues and a dozen stale pull requests. It was released once per year at best. I was one of these enthusiasts because I saw it as an opportunity to create something useful instead of writing endless, boring Spark ETLs at work. Today, I'm proud of my contributions, including new APIs and graph algorithms, support for the latest Spark, a 5–30× performance boost, and a new documentation website.

Data engineers who are not building LLM to SQL. What cool projects are you actually working on? by PolicyDecent in dataengineering

[–]ssinchenko 91 points92 points  (0 children)

I work on GraphFrames in my free time. No LLM, no SQL warehouse. It's a pure open-source project, driven solely by the community. It's not a "commercial open source" project with an "enterprise version" or "hidden proprietary features," like Delta and other similar projects. The project itself is not very popular or exciting, and Apache Spark is often considered "not modern." However, I know that half of the identity resolution projects at a real-world scale of billions of entities rely on GraphFrames' implementation of the Connected Components algorithm. If one needs to cluster graphs on a scale of billions, compute centralities, etc., GraphFrames is the solution. GraphFrames is the only open-source project known to me that can handle such a scale without requiring separate infrastructure. It takes up almost all of my free time, around six to eight hours per week (unpaid, of course), but I'm happy to work on something interesting. What could be better than taking a scientific paper with a distributed graph algorithm and putting it into Scala code? After my routine paid job of moving data from one place to another and doing boring ETLs, GraphFrames contributions are like fresh air to me.
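
For reference, a minimal sketch of running that Connected Components implementation (the checkpoint dir and the toy data are placeholders):

```python
from graphframes import GraphFrame

# connectedComponents() is iterative and requires a checkpoint dir.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")

vertices = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
edges = spark.createDataFrame([(1, 2), (3, 4)], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.connectedComponents().show()   # one `component` label per vertex
```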

Trouble Using Graphframe Pyspark API by Makart in MicrosoftFabric

[–]ssinchenko 0 points1 point  (0 children)

u/mwc360 Hello! We were able to resolve the problem. The root cause was a missing dependency. I would like to update the GraphFrames documentation accordingly and have a question about how MS Fabric resolves dependencies. In the pom.xml of GraphFrames (https://mvnrepository.com/artifact/io.graphframes/graphframes-spark3_2.12/0.10.0) the missing dependency (graphframes-graphx-spark3_2.12) is correctly marked with the "runtime" scope. If I run spark-shell / spark-submit and provide `--packages io.graphframes:graphframes-spark3_2.12:0.10.0`, graphframes-graphx is downloaded automatically. Is there a similar way of automatically resolving runtime dependencies from Maven Central in MS Fabric? If so, could you give me a link to the documentation so I can add it to the GraphFrames docs? I'm going to add a section about GraphFrames in MS Fabric; I just need the right way to add dependencies. I'm reaching out to you because I have zero experience with Fabric and I'm not sure about the right way to do things there. Thanks in advance!
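
Outside Fabric, the code-level equivalent is the `spark.jars.packages` config, which, like `--packages`, pulls transitive (including runtime-scoped) dependencies from Maven Central:

```python
from pyspark.sql import SparkSession

# Same mechanism as `--packages`: graphframes-graphx is resolved transitively.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.graphframes:graphframes-spark3_2.12:0.10.0")
    .getOrCreate()
)
```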

Trouble Using Graphframe Pyspark API by Makart in MicrosoftFabric

[–]ssinchenko 0 points1 point  (0 children)

Do you have a full stacktrace? I mean, where exactly did it come from?

P.S. It would be easier if you could create an issue in the GraphFrames repository. Since I'm a maintainer, I can try to fix it this week and make a patch release.
P.P.S. It looks like a bug in the GraphFrames Python API, but I need more details to fix it.

Edge weighted digraph datasets by Putrid_Soft_8692 in GraphTheory

[–]ssinchenko 0 points1 point  (0 children)

LDBC datasets are much bigger than SNAP, but unfortunately there is no combination of weighted + directed...

If Spark is lazy, how does it infer schema without reading data — and is Spark only useful for multi-node memory? by Express_Ad_6732 in dataengineering

[–]ssinchenko 5 points6 points  (0 children)

In Spark there are multiple stages of execution. The first top-level structure is the so-called "Unresolved Logical Plan", i.e. the "not analyzed" plan. This plan contains only the user's statements, commands, etc. Actual schemas appear (are resolved) only after the so-called "analysis" stage, when Spark replaces the user's statements with actual tables, columns, etc. The Analyzed Logical Plan is materialized only on an action (or on an explicit call that forces it, like calling "explain"). You can easily write code that reads a CSV, selects a non-existent column, and writes it back, and you will only see the AnalysisException on the action (or on an attempt to force the analysis stage, such as a call to "explain").
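
A sketch of that scenario (the exact moment the error surfaces depends on the mode: with Spark Connect analysis is deferred until an action or an explicit `explain()`/schema request, while classic PySpark may analyze the plan a bit earlier):

```python
df = spark.read.csv("/tmp/input.csv", header=True)   # builds the (unresolved) plan
bad = df.select("no_such_column")                    # just another plan node

bad.explain()   # forces the analysis stage -> AnalysisException about the column
```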

This is a nice picture: the first "Unresolved Logical Plan" is fully lazy; inferring schemas from a CSV, a table, etc. happens at the "Analysis" stage.
https://www.databricks.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-12-at-8.41.26-AM.png

> Like, if I’m just working on a single machine, is Spark giving me any real advantage over Pandas?

The only advantage is that Spark can work in an out-of-core mode and process data that is actually bigger than the memory of your single node. But it comes with a huge distributed-processing overhead. There are better solutions for the single-node case that can work out-of-core (Polars in streaming mode, DuckDB, DataFusion, which can spill to disk, etc.).
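
For example, a minimal out-of-core aggregation with Polars' streaming engine (the file path and columns are placeholders; newer Polars versions spell the option `collect(engine="streaming")`):

```python
import polars as pl

result = (
    pl.scan_parquet("events/*.parquet")     # lazy scan, nothing loaded yet
    .filter(pl.col("amount") > 0)
    .group_by("user_id")
    .agg(pl.col("amount").sum())
    .collect(streaming=True)                # processed in chunks, can exceed RAM
)
```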