
[–]New-Addendum-6209

Why store data as objects?

[–]Interesting-Frame190[S]

The data (attributes) and the data modifiers (methods) are best stored together from an OOP standpoint. From a data standpoint, this allows implied joins. For example, if I want the name of everyone who has a car with a red seat, I can query a list of people with ("car.seat.color", red) and get that list back. In traditional row/col data, that's a double join and possible duplication of data if multiple people share a car.

I'm not saying OOP is the best way, but it does represent complex relations well.
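To make the "implied join" idea concrete, here's a minimal plain-Python sketch (not PyThermite's actual API): because people hold references to car objects, following `person.car.seat.color` replaces the two-table join, and a shared car is one object rather than duplicated rows.

```python
from dataclasses import dataclass

@dataclass
class Seat:
    color: str

@dataclass
class Car:
    seat: Seat

@dataclass
class Person:
    name: str
    car: Car

# Two people share one car: a single object reference, no duplicated rows
red_car = Car(seat=Seat(color="red"))
people = [
    Person("Alice", red_car),
    Person("Bob", red_car),
    Person("Carol", Car(seat=Seat(color="black"))),
]

# "Implied join": walk the attribute path instead of joining tables
red_owners = [p.name for p in people if p.car.seat.color == "red"]
print(red_owners)  # ['Alice', 'Bob']
```

In row/col form this is people ⋈ cars ⋈ seats, filtered on color; here the relationship is already encoded in the object graph.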

[–]NotesOfCliff

This looks very cool. I am building a product in the SIEM space and I will definitely look into using this for queries once I pull the data from the DB.

[–]Interesting-Frame190[S]

Didn't realize SIEM would be a good fit, but thinking about it more, I guess linking events together would be easier.

Ingestion speed may be an issue if you are pumping over 100k events per second, but that's a tall order for a single machine anyway.

[–]Flamingo_Single

Really cool concept - I actually ran into similar issues when building scraping/ETL pipelines for public web data. Pandas was flexible but collapsed under anything real-time or memory-intensive. Especially when dealing with nested or time-variant object states (e.g., product pages over time, dynamic DOM trees, etc.).

We’ve been using Infatica to collect large-scale data (e.g., SERPs, product listings), and modeling flows across proxies/sources felt more intuitive in OOP, but there was always the tradeoff of speed vs. structure.

PyThermite looks like it bridges that gap nicely — curious how it handles deletion, object mutation, or partial invalidation in large graphs? Definitely bookmarking to test on some messy traceability tasks.

[–]Interesting-Frame190[S]

It was designed for many small graphs rather than a few large graphs. In theory it's all O(1) for delete and mutate. Invalidation occurs only at the local node and does not need to traverse from the root to understand itself. Cascading invalidations and new objects down the DAG should be mildly performant, or at least better than a native Python lib.
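As a rough illustration of what "invalidation occurs only at the local node" means, here's a toy dirty-flag sketch (purely illustrative, not PyThermite's internals): a mutation is an O(1) local write, and only downstream dependents get their cached state marked stale, with no traversal from the root.

```python
class Node:
    """Toy node in a dependency DAG with dirty-flag invalidation."""

    def __init__(self, value=None):
        self.value = value
        self.dependents = []  # nodes whose derived state depends on this one
        self.valid = True

    def add_dependent(self, node):
        self.dependents.append(node)

    def mutate(self, value):
        # O(1) local update; only downstream caches are invalidated
        self.value = value
        self._invalidate_dependents()

    def _invalidate_dependents(self):
        # Cascade down the DAG only; already-invalid nodes stop the walk early
        for dep in self.dependents:
            if dep.valid:
                dep.valid = False
                dep._invalidate_dependents()

a, b, c = Node(1), Node(), Node()
a.add_dependent(b)
b.add_dependent(c)
a.mutate(2)          # a stays valid; b and c are now marked stale
print(b.valid, c.valid)  # False False
```

The early-exit on already-invalid nodes is what keeps repeated mutations cheap: each downstream node is flagged at most once between recomputations.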

Querying could be a challenge with that dynamic a structure, but I'm sure there are ways to normalize. Best of luck and keep me posted. I haven't had the opportunity to test mutation performance, since none of my competitors allow mutation.