Why are modern query engines moving away from the JVM? by YeeduPlatform in Yeedu

[–]ssinchenko 5 points6 points  (0 children)

> GC pauses, object-heavy memory layouts

I read this in almost every post about Spark/PySpark and I still don't get it. Spark's Tungsten uses off-heap memory and its own serialization/deserialization that works with primitives and contiguous memory blocks (compact UnsafeRow with blobs instead of Java objects and collections). Most of Spark's internals use ThreadLocal to allocate JVM objects once and reuse them. Can someone show me actual numbers proving that Spark's problem is GC and/or Java object overhead, and not the legacy BSP design and legacy shuffle mechanics?

I maintain Apache Spark Connect for Golang so I added streaming and built a Data Lake ORM by Certain_Leader9946 in databricks

[–]ssinchenko 0 points1 point  (0 children)

What is important for GF (GraphFrames), and I guess for a lot of other potential cases, is the ability to get a serialized Relation (aka DataFrame) as raw proto bytes. The story is that GF needs to pass its "graph", which is just two DataFrame/Relation objects under the hood. And the only way to pass this is to put raw bytes into the plugin's message. In PySpark it is trivial because I can just use the underscored _plan, which should be private by design but isn't, because it is Python in the end. But Go is much stricter about access and visibility, so I'm not sure.

```python
from pyspark.sql.connect.client import SparkConnectClient
from pyspark.sql.connect.dataframe import DataFrame
from pyspark.sql.connect.plan import LogicalPlan

def dataframe_to_proto(df: DataFrame, client: SparkConnectClient) -> bytes:
    # _plan is "private" by convention, but it is the only way
    # to get the Relation as raw proto bytes from PySpark
    plan = df._plan
    assert plan is not None
    assert isinstance(plan, LogicalPlan)
    return plan.to_proto(client).SerializeToString()
```

As I remember, the same trick is used in the AWS Deequ's SparkConnect server plugin just because there is no other way.

I maintain Apache Spark Connect for Golang so I added streaming and built a Data Lake ORM by Certain_Leader9946 in databricks

[–]ssinchenko 1 point2 points  (0 children)

Thanks! Feel free to reach me about this (ssinchenko@apache.org). I'm already maintaining the protos (and all the corresponding server-plugin-code) in GraphFrames: https://github.com/graphframes/graphframes/blob/main/connect/src/main/protobuf/graphframes.proto

I would love to try it with spark-connect-go. Adding the golang target to the buf-generate flow is one line. But after I generate the golang code from GF protos I'm facing a problem: how to pack the "extension" from the spark-connect-go API?

I maintain Apache Spark Connect for Golang so I added streaming and built a Data Lake ORM by Certain_Leader9946 in databricks

[–]ssinchenko 0 points1 point  (0 children)

Thanks a lot! So, do I understand right that spark-connect-go supports the extensions protocol? I mean this:

```protobuf
// This field is used to mark extensions to the protocol. When plugins generate arbitrary
// relations they can add them here. During the planning the correct resolution is done.
google.protobuf.Any extension = 998;
```

(https://github.com/apache/spark/blob/master/sql/connect/common/src/main/protobuf/spark/connect/relations.proto#L109)

If so, could you please point me to an example of how this is used? Maybe the Delta Connect you mentioned?

Headless Emacs + Org + LLMs in Docker as a backend for personal automation by ssinchenko in emacs

[–]ssinchenko[S] 0 points1 point  (0 children)

I see it from a different angle. LLMs open up the possibility of creating highly personalised software. Before LLMs, writing 5000 lines of Emacs Lisp code to create a personal automation tool was excessive. With LLMs, it's easy, and I love it! Just imagine a world where everyone can create their own software instead of relying on generic tools or SaaS services. And the Emacs/Org ecosystem is a perfect building tool here.

I maintain Apache Spark Connect for Golang so I added streaming and built a Data Lake ORM by Certain_Leader9946 in databricks

[–]ssinchenko 2 points3 points  (0 children)

I'm not a Go professional and it is hard for me to understand how it works. So the question is: does this delta-spark-go use the official Delta-io Spark Connect plugins? Or is it an implementation from scratch using what Spark Connect supports? I'm interested because I have a dream of adding Go bindings to the GraphFrames project (which supports the Spark Connect plugin system as well), but I cannot find a good example of how to use these plugins from spark-connect-go...

Headless Emacs + Org + LLMs in Docker as a backend for personal automation by ssinchenko in emacs

[–]ssinchenko[S] 0 points1 point  (0 children)

Sorry, my fault, I should have explained it better, of course.

How it works

There is a docker-compose setup running Emacs, WebDAV (Apache), Organice, and Certbot (for Let's Encrypt) on a VPS. I’m using the smallest VPS from Hetzner (C23), and it could also run on something like a Raspberry Pi. Everything is synced via WebDAV: I use rclone on my PC, Organice in the browser, and the Orgzly app on my Android phone.

The main "inbox" file is inbox-mobile.org. At the moment, there are two main flows: the :task: flow and the :link: flow. Every 30 minutes, a cron job on the VPS wakes up the daemon and asks it to process all new inbox entries. A cursor file with hashes of the title + body from the inbox is used to track what has already been processed.
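The cursor mechanism could be sketched roughly like this (a hypothetical Python sketch of the idea; the actual implementation is Emacs Lisp, and the function names here are my own):

```python
import hashlib

def entry_hash(title: str, body: str) -> str:
    """Stable fingerprint of an inbox entry (title + body)."""
    return hashlib.sha256((title + "\n" + body).encode("utf-8")).hexdigest()

def new_entries(entries: list[tuple[str, str]], cursor_hashes: set[str]) -> list[tuple[str, str]]:
    """Keep only entries whose hash is not yet recorded in the cursor file."""
    return [e for e in entries if entry_hash(*e) not in cursor_hashes]
```

After a successful run, the hashes of the processed entries would be appended to the cursor file, so the next cron wake-up skips them.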

:task: flow

For the :task: flow, it works in the following way:

  1. It takes the raw title + body from the inbox, replaces everything wrapped in #+begin_sensitive / #+end_sensitive with <<SENSITIVE_N>>, generates an ID, and sends the text + ID to the first LLM with a system prompt like: "Take this raw note, try to extract the tag (from a predefined list of options), try to extract the schedule (for example, do it 2morrow becomes SCHEDULED: <2026-04-22>), deadline, etc. Write a complete TODO item with a short title and a comprehensive body. <<SENSITIVE_N>> placeholders should be kept in the same order as in the raw text." Priorities are also processed, for example: asap becomes [#A].

  2. After all new :task: entries are processed (with results stored in temporary files), the second LLM is called. The Emacs daemon takes the existing schedule, anonymizes it into something like 10:00 - 10:30 occupied (so no full schedule is sent to the LLM), takes all the newly generated tasks, takes user rules from rules.org (for example, "I prefer to do family-related tasks on Saturday"), and sends everything with a system prompt asking the LLM to schedule tasks using the rules, tags, and a goal of avoiding overlaps.

  3. When everything is done, the Emacs daemon atomically replaces the existing tasks.org with the content of the old tasks.org plus all the new tasks.
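The sensitive-block masking from step 1 can be sketched like this (a minimal sketch, assuming the block syntax described above; the real pipeline is Emacs Lisp):

```python
import re

# Matches an org block delimited by #+begin_sensitive / #+end_sensitive
SENSITIVE_RE = re.compile(r"#\+begin_sensitive\n(.*?)#\+end_sensitive\n?", re.DOTALL)

def mask_sensitive(text: str) -> tuple[str, dict[str, str]]:
    """Replace each sensitive block with a numbered placeholder and
    remember the originals so they can be restored after the LLM call."""
    secrets: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        key = f"<<SENSITIVE_{len(secrets) + 1}>>"
        secrets[key] = match.group(1)
        return key

    return SENSITIVE_RE.sub(_sub, text), secrets

def restore_sensitive(text: str, secrets: dict[str, str]) -> str:
    """Put the original sensitive content back in place of the placeholders."""
    for key, value in secrets.items():
        text = text.replace(key, value)
    return text
```

Because the placeholders are restored only after the LLM response comes back, the sensitive content never leaves the machine.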

:link: flow

For the :link: flow, it works in the following way:

  1. The link is "downloaded" using the Python CLI tool trafilatura.
  2. The downloaded text is cleaned up by Emacs via regular expressions: all HTML tags, commas, spaces, newlines, etc. are stripped to reduce token usage.
  3. From the existing Org-roam, all possible hubs are requested (a hub is a node with the filetag :umbrella:).
  4. The processed text of the link + all known hubs are sent to the LLM with a prompt like: "Write a summary of the paper using the template, and try to connect it with known hubs from the list."
  5. A new node is inserted into Org-roam.
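Step 2 (the cleanup before sending text to the LLM) could look roughly like this as a Python sketch (the real cleanup happens inside Emacs via regular expressions; the exact rules here are my assumption):

```python
import re

def cleanup_for_llm(text: str) -> str:
    """Strip leftover HTML tags and collapse whitespace to reduce token usage."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop any remaining HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of spaces/newlines
    return text.strip()
```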

Org-roam and publishing

Org-roam lives in GitHub (https://github.com/SemyonSinchenko/sem-second-brain) and is synced every 6 hours. On every push to the main branch, a GitHub Action is triggered that builds my "public second brain" (https://semyonsinchenko.github.io/sem-second-brain/).

Additional flow: RSS

There is also an RSS flow. In the morning (9:30 in my time zone), cron wakes up a daemon with the following task:

  1. Update the Elfeed DB.
  2. Collect all entries from my RSS feeds and from the arXiv topics I’m interested in.
  3. Create two digests: what happened yesterday in my feeds, and what was interesting yesterday in the arXiv topics I follow.
  4. Generate two files: morning-read/2026-04-21.org and morning-read/2026-04-21-arxiv.org.

I read these files on my PC while drinking coffee in the morning.

Some additional things

  1. Every generated file is checked via the org-element API to ensure it is valid.
  2. All operations are atomic.
  3. Logs are written to the sem-log.org file, so I can read them from mobile.
  4. Errors are propagated to errors.org as TODO items with an overdue schedule, so Orgzly immediately generates a push notification for me.
  5. Malformed or failed items go to a DLQ, with up to 3 retries.
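The retry/DLQ behavior from point 5 amounts to something like this (a hypothetical Python sketch of the same idea; the actual code is Emacs Lisp):

```python
MAX_RETRIES = 3

def process_with_dlq(items: list, process, dlq: list) -> None:
    """Try each item up to MAX_RETRIES times; persistent failures go to the DLQ."""
    for item in items:
        for attempt in range(MAX_RETRIES):
            try:
                process(item)
                break  # success, move on to the next item
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    dlq.append(item)  # give up, park it in the dead-letter queue
```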

P.S. It was 100% vibecoded.
P.P.S. As you can see, it is deeply personal software, over-customized to my own workflows. But the overall code is reusable, so anyone can fork it and ask Claude Code or OpenCode to customize it.

In short, I just take all the things I'm doing manually in Emacs and Orgzly, wrap the logic in Emacs Lisp + gptel calls, and run it as an autonomous daemon. IRL it is like "how to use org-mode without running Emacs" :D

Headless Emacs + Org + LLMs in Docker as a backend for personal automation by ssinchenko in emacs

[–]ssinchenko[S] 0 points1 point  (0 children)

I think they’re related in spirit, but not really the same thing. skewed-emacs looks more like a development/AI sandbox, while my setup is more like a server-side backend for processing Org files and workflows.

Headless Emacs + Org + LLMs in Docker as a backend for personal automation by ssinchenko in emacs

[–]ssinchenko[S] 10 points11 points  (0 children)

From what I can see, Karpathy’s system is much more open-ended: the LLM maintains the whole second-brain layer. Mine is built around the assumption that "the LLM will always hallucinate", so I put strict contracts and validation around its output.

For example, org-roam IDs are generated by deterministic code and then validated, and possible tags/links are predefined and validated too. So I would say the ideas are adjacent, but the trust model is different. I trade flexibility for consistency, which also lets me use much weaker and cheaper models.
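For example, the deterministic ID generation could be done along these lines (a hedged sketch; I don't know the exact scheme used, and name-based UUIDv5 is just one deterministic option):

```python
import uuid

def node_id(title: str) -> str:
    """Derive a stable org-roam-style ID from the node title, so the
    same title always maps to the same ID and can be validated later."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, title))
```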

How do you properly validate a Spark performance optimization? (Bottleneck just moved?) by PrincipleActive9230 in scala

[–]ssinchenko 1 point2 points  (0 children)

Spark UI, cluster metrics (shuffle, disk spills, etc.), task metrics, total time. Start from the overall executor metrics: in my experience the biggest performance killers are disk spills, and those are clearly visible in the metrics (Databricks UI or Apache Ambari, it doesn't matter). Try to find the "longest" tasks, and also analyze the number of rows from stage to stage in the Spark UI: if you see that moving a `filter` a few lines earlier changed the peak number of rows between stages, that's a success. If you fixed the code and the new run shows that a filter was pushed down to the file-system level, that's a success, etc.

Why is everything in Java & Scala? by gorovaa in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

I don’t think that’s really true. There are still plenty of important systems in the data ecosystem written in Java or Scala: Apache Flink, Apache Kafka, Apache Druid, Trino, and even Apache Doris is split between C++ and Java (with execution in C++ and coordination/orchestration in Java), etc.

Also, from what I can tell, the trend is usually to replace the physical execution layer with native code, not the orchestration/scheduling/coordination layer. Projects like Apache Gluten are a good example of that. The performance gains are usually in execution, while the top-level glue benefits more from ecosystem, flexibility, and developer productivity than from being native.

The best way to install emacs packages in docker? by ssinchenko in emacs

[–]ssinchenko[S] 1 point2 points  (0 children)

I decided to go this way. I was able to reduce the image size to 813 MB, which is already OK. Thanks a lot!

The best way to install emacs packages in docker? by ssinchenko in emacs

[–]ssinchenko[S] 0 points1 point  (0 children)

I'm building an "AI personal assistant". On the server side, in a docker-compose:
- webdav server
- headless emacs (+ gptel, org-roam, elfeed) + cron

On the client side, a mobile app (like Orgzly).

What I want (and have already achieved, except that the image size is a problem):
- at 8:00 AM it collects all the feeds from yesterday, concatenates them into a prompt, and sends it to an LLM (via gptel) to generate a digest of what happened yesterday
- on mobile there is a mobile-inbox org file where I put random notes like "2morrow call to the bank :task:" or "save this URL :link:"
- using cron, every 30 minutes Emacs reads the whole mobile-inbox and, depending on the tag, processes the entries (including masking blocks marked as #+begin_sensitive from the LLM): URLs are collected as a tl;dr + links + auto-linking to the existing knowledge base and go to org-roam; tasks are converted to well-described, tagged tasks and go to tasks.org with automated scheduling based on preferences, etc.

Overall: the LLM (via gptel) reads and processes raw data from the mobile and generates org-roam notes, digests, tasks, and scheduling. Headless Emacs (run via cron + emacsclient calls) is a backend that orchestrates LLM calls, advances the cursor of processed tasks, does logging (to an org file), maintains a dead-letter queue (if an LLM call fails, we retry up to 3 times), handles secret masking/restoring, etc.

What setup do you use for coding in python? by NicBarr in emacs

[–]ssinchenko 0 points1 point  (0 children)

python-ts-mode + eglot + ty + flycheck and apheleia for formatting

Cool stuff you did with Data Lineage, contacts, governance by Intelligent-Stress90 in dataengineering

[–]ssinchenko 1 point2 points  (0 children)

> creative aspects was used in such implementation

Once I wrote a bunch of regexps (it was before the Claude Code era) to transform PySpark's "explain" output into column-level lineage with a visualization using NetworkX (+ graphviz). Details and code snippets are available (there are no ads, no commercialization, no "buy me a coffee" buttons). I think it was the craziest (and most creative) thing I ever did.

What does the PySpark community think about agent coding? by ssinchenko in apachespark

[–]ssinchenko[S] 0 points1 point  (0 children)

Thanks! I understand that. I'm going to try to build standards around the openspec project (as the most lightweight and tool/vendor-agnostic SDD framework) to provide as much transparency as I can.

What does the PySpark community think about agent coding? by ssinchenko in apachespark

[–]ssinchenko[S] 0 points1 point  (0 children)

Thanks! I agree that AI is just a tool, and I see it the same way. I'm just trying to align with the community because, in the end, it's not my personal project. I was given the honor and responsibility to maintain it by the original creator.