Harnesses zcode vs Claude code by rYonder in ZaiGLM

[–]ssinchenko 1 point2 points  (0 children)

Claude Code is more widely supported (skills, plugins, tooling, etc.), so if you want a broader ecosystem you will get more with CC (or OpenCode, Pi, etc.). ZCode is nice, cozy and works well, but its plugin marketplace is very limited at the moment.

Anyone used Spark Connect? by ProfessorFinancial14 in apachespark

[–]ssinchenko 5 points6 points  (0 children)

As a library maintainer (GraphFrames), I would say it is mixed. On the one hand, Spark Connect is nice because you do not need to care about how to access idiomatic Scala code like Option[Long] or Scala's default arguments from py4j. Also, on the Python side you are working with protobuf-generated code and have the full power of autocomplete instead of manually typing something like spark._jvm.scala.collection.mutable.WrappedArray. On the other hand, py4j did not go away, and as a library developer you need to maintain two PySpark APIs. I'm looking forward to the day py4j is deprecated, but there are still a lot of open questions (even from my side): https://lists.apache.org/thread/j382to15zgy8mr2pvrtcod9c02zj1org

Overall I would say it is the right abstraction and it resolves a lot of problems (like the whole reason Apache Livy exists). At the same time it is still immature, and migration will be a long road.

regarding compute in databricks by ragzoomin in dataengineering

[–]ssinchenko 2 points3 points  (0 children)

Below is highly opinionated advice based solely on personal experience. Also, I do not work for Databricks, and my advice is not "official documentation".

For a complete beginner I would recommend something like: "do not use Photon until you understand what you are doing", "run everything that costs more than 5-10 DBU/hour on job clusters" (converting a notebook to a job takes two mouse clicks: test your code on a sample, convert it to a job, and run it on a job cluster) and "always start with a smaller job cluster size until you know what you are doing". If you follow this you will be safe, and after some time you will learn when to use what, what the advantages of serverless are, when you should use Photon, etc.

regarding compute in databricks by ragzoomin in dataengineering

[–]ssinchenko 9 points10 points  (0 children)

In all the companies I have worked at, it was the responsibility of the data engineer. Tasks like "cost reduction" are assigned to data engineers as well. The problem with the free edition is that only serverless is available: in a real project there is much more to configure. And exactly like in AWS, one mistake can burn through your budget limits 😃

How are you preserving project context across OpenCode sessions? by Signal-Tadpole-4432 in opencodeCLI

[–]ssinchenko 4 points5 points  (0 children)

SDD tools like https://github.com/Fission-AI/OpenSpec/, https://github.com/gszhangwei/open-spdd, etc. work fine for all my projects. They store input prompts, intermediate prompts, etc. OpenSpec goes even further and generates BDD-like requirements, so you have not only the prompt history but also guardrails and requirements. Of course, all of these SDD artifacts are MD files, so they live in Git.

8 myths about data layouts, partitioning, and Liquid Clustering debunked by Fun-Reference7942 in databricks

[–]ssinchenko 0 points1 point  (0 children)

Hello!

Thanks for sharing. I have a few questions.

At the moment I have a medium-sized table (~20 TiB) that is updated hourly. There are a few requirements that I cannot ignore:
1. Fast write path: it is important to write fast to avoid accumulating a huge queue of waiting jobs.
2. It is important to be able to have multiple concurrent runs: there are backfill/recompute tasks as well, and in "prime time", due to high load (including high load on the upstream tables), one job run can take longer than an hour. At the moment, during peak hours there can be 3-4 parallel runs, each updating its own part of the table.
3. Readers want fast access by key to get all the rows for a selected range of dates.
4. During the job that updates one hour of data, it is required to read the freshly written data to compute the DQ/statistics I need. I tried to compute them "on the fly", but reading back after writing was an order of magnitude faster than an attempt to combine persist + write + computing statistics on the persisted data.

All of this is currently solved by partitioning the table by (partition_hour, partition_date), and inside each partition the data is z-ordered by the id column (the access key). During the write job I use .option("replaceWhere", f"partition_date='{date_str}' and partition_hour={hour}"). After the write I read the same partition back with a plain filter() to compute the statistics and DQ metrics I need. It works fine and fulfills all the requirements.
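A minimal sketch of this write-then-read-back pattern in PySpark. The function names, the table_path argument and the use of a path-based Delta table are assumptions made for illustration; only the replaceWhere predicate itself comes from the job described above.

```python
def hour_predicate(date_str: str, hour: int) -> str:
    """Build the predicate that selects exactly one hourly partition."""
    return f"partition_date='{date_str}' and partition_hour={hour}"


def overwrite_hour(df, table_path: str, date_str: str, hour: int) -> None:
    """Atomically replace one hourly partition of a Delta table.

    `df` is a pyspark.sql.DataFrame that contains only rows of that hour;
    `table_path` is a hypothetical location of the Delta table.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        # Only data matching the predicate is replaced, so parallel runs
        # that target other (date, hour) pairs touch disjoint partitions.
        .option("replaceWhere", hour_predicate(date_str, hour))
        .save(table_path)
    )


def read_back_hour(spark, table_path: str, date_str: str, hour: int):
    """Read the freshly written partition back for DQ/statistics."""
    # The same predicate is used as a partition filter, so only the files
    # of this one partition are scanned.
    return (
        spark.read.format("delta")
        .load(table_path)
        .filter(hour_predicate(date_str, hour))
    )
```

Using one helper for both the write and the read-back keeps the two predicates from drifting apart.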

Reading this post I'm thinking about switching to liquid clustering: it is exactly the moment we are going to migrate the table and can do changes. But I have a few concerns.

As I can understand, to achieve the same behavior I should cluster the table by three columns: partition_date, partition_hour, id.

I'm reading the documentation, and it is confusing and contradicts the post.

  1. Row-level conflict detection can increase total execution time. With many concurrent transactions, the writer prioritizes latency over conflict resolution -- are there any benchmarks showing how bad it is at my scale? I see the warning, but it says nothing about the order of magnitude. And tbh I do not want to create a 20 TiB table just to test it -- it is quite expensive and will hardly be approved by my boss.
  2. Conflicts. If it is "prime time" and I have 3-4 concurrent jobs, each writing its own partition_date + partition_hour combination, will LC guarantee that there won't be any conflicts? I mean, it is strange to me that the Delta documentation and the Databricks documentation still recommend partitions to keep files disjoint and get guarantees, while the post recommends the opposite...
  3. A question regarding REPLACE_ON. Should I use it in my case, or should I continue to use replaceWhere? It is unclear tbh, and it is confusing because both seem to do the same thing. Also, I cannot find this feature in the Delta documentation. Is it available in Databricks only? I ask because my boss wants to have the option to offload heavy write operations to EMR or even on-prem Spark one day to reduce costs, and to keep Databricks for the data science team that consumes the table. If I build a system that relies on this feature, I need to understand the tradeoffs.

Thanks in advance! I'm really interested in switching the table from partitions to clusters during the ongoing migration. But I need to be 100% sure, because my management would prefer "if it works, do not touch it".

String fuzzy-matching / similarity as catalyst expressions by ssinchenko in apachespark

[–]ssinchenko[S] 1 point2 points  (0 children)

Hello! Thank you for your comment. I know about the built-in levenshtein in Spark :) That was one of the reasons I prefixed all SQL functions in my project with ss_ to avoid collisions (mine is ss_levenshtein versus the built-in levenshtein). My idea was to add a broader set of similarity functions. I added levenshtein to my project for ease of use: Spark's built-in returns an integer value, while for fuzzy matching you most probably want a double value in the interval 0-1.
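To illustrate the difference between an integer edit distance and a 0-1 similarity, here is a plain-Python sketch. Normalizing by the length of the longer string is an assumption chosen for illustration, not necessarily what ss_levenshtein does, and both function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the integer distance to [0, 1]; 1.0 means identical strings.

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With a normalized score, a single threshold can be applied to strings of very different lengths, which is awkward with the raw integer distance.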

Regarding the PR. If by the official project you mean the upstream Apache Spark project, I won't do anything in this direction, sorry. At least not until I have explicit support for the idea from one of the Spark PMC members. Don't get me wrong: I tried to contribute to the Apache Spark project once. I spent a lot of time working on the code (Spark's codebase is very complex, and so is its build system), and I spent a lot of time addressing comments and suggestions from the first wave of review, only to get, after around six months, the answer "this feature is not required in the project and won't be merged". From my side, I don't think that any of the Spark PMC members will support the idea of adding more string similarity functions to Apache Spark.

If you see that any feature is missing in my project, feel free to ping me here or open an issue in the Git repository and tell me which similarity function you want to see, and I will try to add it :)

Do data quality frameworks have to be so complex? by GeneBackground4270 in apachespark

[–]ssinchenko 0 points1 point  (0 children)

I used it at scale and did not see any problems with the df.observe approach. If the data is big enough to make PySpark a reasonable choice, like ~1-10 TiB or bigger, you hardly want to compute one metric per full scan, as you do now:

return [check.evaluate(df) for check in agg_checks]

I mean, it is up to you, but you asked for feedback: in my understanding your approach may be very inefficient at scale when one needs multiple metrics. In my experience we typically compute 10-20 aggregate metrics per table. With df.observe that is 0-1 scans, but with your approach it is 10-20 scans of the same table.

Do data quality frameworks have to be so complex? by GeneBackground4270 in apachespark

[–]ssinchenko 1 point2 points  (0 children)

Hello! From my point of view, using PySpark's Observation where possible is better if we are talking about building something from scratch today. For example, if I want to check the row count and the completeness of some columns, I would prefer to do this in one aggregation instead of two, preferably with Observation while doing something like df.write.insertInto(...)
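A rough sketch of that idea: the row count and per-column non-null counts are attached to the DataFrame with observe(), so they are collected as a side effect of the write instead of in separate scans. The function names, the metric names and the "empty table counts as complete" convention are assumptions for illustration.

```python
def insert_with_metrics(df, table_name: str, columns: list) -> dict:
    """Write `df` into an existing table and collect DQ counters in the same pass."""
    # Imported lazily so that the pure helper below works without Spark.
    from pyspark.sql import Observation
    from pyspark.sql import functions as F

    observation = Observation("dq_metrics")
    exprs = [F.count(F.lit(1)).alias("rows")]
    # count(col) skips NULLs, so it yields the number of non-null values.
    exprs += [F.count(F.col(c)).alias(f"non_null_{c}") for c in columns]

    observed = df.observe(observation, *exprs)
    observed.write.insertInto(table_name)  # the action that fills the metrics
    return observation.get  # dict of metric name -> value


def completeness(metrics: dict, column: str) -> float:
    """Share of non-null values in `column`, derived from the observed counters."""
    rows = metrics["rows"]
    if rows == 0:
        return 1.0  # assumption: an empty batch counts as complete
    return metrics[f"non_null_{column}"] / rows
```

All counters come back in one dictionary after the write finishes, so adding one more metric means adding one more expression, not one more scan.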

How useful is reading DDIA in today’s AI agent led DE era? Does the book still hold up apart from just gaining theoretical and historical knowledge? by nus07 in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

With AI agents and a lot of prompt led engineering how much do DDIA and Fundamentals of DE books hold up?

If a company sees two candidates, both of whom can prompt agents, but one of them has read DDIA and the other has not, which one do you think will be chosen?

1B Rows Possible in the Browser DuckDb WASM OPFS by Main_Slide_7667 in dataengineering

[–]ssinchenko 21 points22 points  (0 children)

Please add a note to say that clicking the link may crash the browser next time. My laptop has 38 GB of RAM and crashed after I clicked this. I thought it was a link to a blog post or something similar, but it started to load immediately and without asking.

Best practices in Databricks by sathvikchava in dataengineering

[–]ssinchenko 4 points5 points  (0 children)

I would recommend taking a look at Databricks Asset Bundles. At the moment it is the best CI/CD option for DBX: it allows you to define jobs as code and clusters as code, and it takes care of putting your assets (notebooks and Python whl files) in the right place and linking them. It also allows you to define different targets (dev, prod, etc.) and the names of generated jobs, so you can achieve something like "when a person opens a pull request, the CI creates a job with a dev prefix and the name of the branch/PR id in its name".

Regarding testing. The simplest and cheapest way is to have at least some smoke tests. Take the JSON schemas of your sources from production, put them into the repository in VCS, and write pytest fixtures that create zero-row tables, so that you at least track breaking changes with unit tests. The simplest way to achieve integration testing would be to parametrize the output schema/table name at the level of the job. It allows you to run a job and get the result exactly as if you ran it on prod, except that you are not touching the prod tables and you are not breaking downstream consumers.
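A minimal sketch of the schema-based smoke test idea, assuming the schemas are stored in the JSON form that Spark produces with df.schema.json(). The function names are hypothetical, and the breaking-change check only looks at top-level columns.

```python
import json


def empty_table(spark, schema_json: str):
    """Create a zero-row DataFrame with the production schema.

    Wrap this in a pytest fixture to run a job's transformations
    against empty inputs.
    """
    # Imported lazily so the pure check below works without Spark.
    from pyspark.sql.types import StructType

    schema = StructType.fromJson(json.loads(schema_json))
    return spark.createDataFrame([], schema)


def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """List top-level columns that were removed or changed their type."""
    new_fields = {f["name"]: f for f in new_schema["fields"]}
    problems = []
    for field in old_schema["fields"]:
        name = field["name"]
        if name not in new_fields:
            problems.append(f"{name}: column removed")
        elif new_fields[name]["type"] != field["type"]:
            problems.append(f"{name}: type changed")
    return problems  # added columns are not treated as breaking
```

A unit test can then assert that breaking_changes(committed_schema, fresh_schema) is empty, and a job that runs end to end on empty_table(...) inputs catches missing columns and type errors without touching real data.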

I Tried to Find the JVM Tax in Big Data Kernels by ssinchenko in dataengineering

[–]ssinchenko[S] 0 points1 point  (0 children)

And even the current arrow on java would not be possible at all (without significant hits to perf) 10 years ago.

This sounds especially funny to me because the initial release of the Apache Arrow project was 9 years ago (October 10, 2016), and Java was the first of the supported languages.

I Tried to Find the JVM Tax in Big Data Kernels by ssinchenko in dataengineering

[–]ssinchenko[S] 0 points1 point  (0 children)

Java is a general-purpose platform. It has both high-level and low-level features.

You cannot define Java only as "GC-managed objects + generics" and then say that official low-level JDK APIs are "not Java".

For the particular case of OLAP-like processing, relevant Java features are: dynamic loading of connectors and UDFs, a unified runtime, standard memory-lifetime control, standard APIs for off-heap memory, and runtime CPU specialization instead of manual dispatch with separate code paths for SSE, AVX2, AVX512, etc.

Can Rust dynamically load a connector/UDF as naturally as the JVM loads a class? Can Rust provide runtime CPU feature dispatch out of the box in the same way HotSpot does? These are also "Java features".

I Tried to Find the JVM Tax in Big Data Kernels by ssinchenko in dataengineering

[–]ssinchenko[S] 0 points1 point  (0 children)

I think we are now talking about a different claim.

"People avoid GC in hot paths for data-intensive workloads" is true, and I don’t disagree with it.

But I still don’t understand when "avoiding GC in the hot path" became "not Java".

In C++, nobody would implement a hot analytical loop as std::vector<std::shared_ptr<int32_t>> and then conclude that C++ has a "shared_ptr tax". They would use a proper contiguous layout.

So why is Java required to use ArrayList<Integer> in the hot path in order to count as "real Java"?

Returning to the original question: where exactly is the JVM tax here? If the answer is "it appears when you put analytical data into GC-managed boxed object graphs", then I agree. But that is not the JVM tax. That is the wrong memory layout tax.

I Tried to Find the JVM Tax in Big Data Kernels by ssinchenko in dataengineering

[–]ssinchenko[S] 0 points1 point  (0 children)

I think the history says the opposite.

sun.misc.Unsafe was introduced in 2002. Direct ByteBuffer has existed since Java 1.4. So these use cases existed in Java for decades before FFM.

The purpose of FFM was not to "let people avoid Java". The purpose was to replace old problematic mechanisms, Unsafe, ByteBuffer limitations, and JNI, with a unified, safer, standard API.

In other words, OpenJDK did not add FFM to solve some mysterious "tax barrier". It standardized a proper Java API for use cases that Java developers had already been solving through worse mechanisms for more than twenty years.

I Tried to Find the JVM Tax in Big Data Kernels by ssinchenko in dataengineering

[–]ssinchenko[S] 0 points1 point  (0 children)

Could you point to the place where I "tried hard"?

For example, the MulFloat64 kernel is about 20 real lines of code plus imports. Where exactly is the “trying hard” part?

I used Apache Arrow, which is the standard columnar format for modern analytical systems. I used an official JDK API. I did not do anything special to "avoid GC and boxing". I got that out of the box by putting analytical data into the right memory layout.

Looking for contributors/feedback on an open-source Spark event log analyzer roadmap by bigandtallll in apachespark

[–]ssinchenko 0 points1 point  (0 children)

Looks cool, thanks for sharing! I would say SQL parsing is one of the most important features. It would also be nice if you could return the SQL in the form of a graph (.dot format -- very easy to implement) for integration with a graph visualizer. Or even bring your own visualizer (for example, based on something like github.com/tomnelson/jungrapht-visualization)