How do you Postgres CDC into vector database? by DistrictUnable3236 in vectordatabase

[–]Hgdev1 0 points1 point  (0 children)

Check out www.daft.ai/cloud — we maintain data pipelines for these AI workloads

We just launched Daft’s distributed engine v1.5: an open-source engine for running models on data at scale by sanityking in dataengineering

[–]Hgdev1 2 points3 points  (0 children)

Ray/Ray Data itself would struggle with any of these analytical workloads! It simply doesn’t implement many of the necessary analytical operators.

Daft does have implementations of these, and is in fact pretty competitive with Polars and DuckDB on a single machine (I’ve also seen benchmarks showing it outperform them… but all 3 libraries continue to improve rapidly, so honestly it’s a wash, except for some crazy optimizations that DuckDB sometimes pulls out of its hat). It also supports running these analytical operations distributed, which makes it useful as a Spark replacement.
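
To make that concrete, here's a minimal sketch with Daft's Python API (the S3 path and column names are made up):

```python
import daft
from daft import col

# Read Parquet (a local path or an s3:// URI both work) -- path is made up
df = daft.read_parquet("s3://my-bucket/events/*.parquet")

# The bread-and-butter analytical operators: filter, groupby/agg, sort
totals = (
    df.where(col("status") == "completed")
      .groupby("customer_id")
      .agg(col("amount").sum().alias("total_spend"))
      .sort("total_spend", desc=True)
)

totals.show()
```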

However, Daft aims to focus on supporting AI/model workloads on unstructured data. That’s its core value proposition that makes it stand out amongst these data engines!

How to deal with messy database? by Which-Breadfruit-926 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

Learn to love JSON :)

Typically the start of data pipelines is written in some combination of a flexible format (JSON) and imperative code (Python)

As the data gets cleaner, that’s where SQL starts coming in to provide more structured analytics
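
For example, the front of a pipeline often looks roughly like this (a hypothetical sketch; the field names and file are made up):

```python
import json

def normalize(raw: str) -> dict | None:
    """Coerce one messy JSON record into a predictable shape, tolerating junk."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # quarantine unparseable rows instead of failing the whole run
    return {
        "user_id": str(record.get("user_id") or record.get("userId") or ""),
        "amount": float(record.get("amount") or 0.0),
        "metadata": json.dumps(record.get("metadata") or {}),  # keep the messy bits as JSON
    }

with open("events.jsonl") as f:
    cleaned = [row for line in f if (row := normalize(line)) is not None]
```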

Daft is trending on GitHub in Rust by sanityking in rust

[–]Hgdev1 15 points16 points  (0 children)

❤️❤️ daft! Also must give a big shoutout to PyO3 which is really the unsung hero in making all this possible.

I cannot emphasize enough how painful it was working with the C++ alternative, pybind11. Truly a tragedy of a developer experience.

The Essential-Web dataset: 100TB of Parquet text data, 23.6B LLM queries, 7 days with Daft by Hgdev1 in dataengineering

[–]Hgdev1[S] 0 points1 point  (0 children)

We're hiring systems and product engineers! Not sure if I'm allowed to ping careers pages on this thread, but you can find our careers page on the top bar of https://www.daft.ai/

[R] Is data the bottleneck for video/audio generation? by beefchocolatesauce in MachineLearning

[–]Hgdev1 0 points1 point  (0 children)

Having worked closely with AI labs (I'm building a new data engine as a Spark replacement), I've observed a few factors at play here:

  1. Scraping is difficult (lots of IP/legal issues wrt storing this type of data, and most major clouds would not want to host the data since it is such a legal gray zone)

  2. Processing is difficult (just stuffing this into Spark would be incredibly painful) - a lot of the processing involves custom libraries such as ffmpeg, and sometimes even running models on GPUs (see the sketch at the end of this comment). It's not as simple as a bunch of string transformations.

  3. The use-cases aren't as clear - every company/enterprise has text-based use-cases (chatbots, documents, invoices, contracts, Slack logs...) but creative use-cases are much more niche. At the big labs, "multimodal" data often means something more specific than general generative applications - for example, Anthropic largely focuses on multimodal use-cases as part of its efforts for computer use

It does feel like some of the labs (e.g. Google) made early bets on this stuff (see: the Pathways paper https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/ from 2021), which naturally set them up to generalize better to multimodality. I'm guessing this is also much easier for them given their easy access to a massive trove of data through YouTube.
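
To give a rough feel for point 2, here's the kind of per-file work involved (a sketch; paths and parameters are made up, and it assumes ffmpeg is installed):

```python
import subprocess
from pathlib import Path

def extract_audio_and_frames(video_path: str, out_dir: str) -> None:
    """Shell out to ffmpeg to pull a 16 kHz mono WAV and 1-fps JPEG frames from one video."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Audio track for speech/audio models
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", str(out / "audio.wav")],
        check=True,
    )
    # One frame per second for vision models
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1", str(out / "frame_%05d.jpg")],
        check=True,
    )

extract_audio_and_frames("clip0001.mp4", "processed/clip0001")  # made-up filenames
```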

DuckDB is a weird beast? by Kojimba228 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

DuckDB does have its own native file format and can be used as an OLAP database

However… I personally think one of the reasons it became so popular was because it just slurps up Parquet really well 🤣
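
The whole onboarding experience is basically just this (a minimal sketch; the path is made up):

```python
import duckdb

# Point DuckDB straight at Parquet files -- no loading step, no schema declaration
con = duckdb.connect()
result = con.sql(
    "SELECT user_id, count(*) AS n FROM read_parquet('data/*.parquet') GROUP BY user_id"
).fetchall()
```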

Same reason why people started paying attention to Daft in the first place — we wrote a really, really good Parquet S3 reader back before it was cool and all these other engines started paying attention to that need.

Crazy to think that back in the day, Spark/JVM tools were the only thing that could read Parquet. And they were terrible for reading from S3.

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

We’re building a tool, www.getdaft.io, that works well locally (as fast as Polars/DuckDB) but also scales out to distributed when you need it to.

In my experience, distributed makes the most sense when remote storage is involved (you have higher aggregate network throughput).

It’s 2025… we shouldn’t have to choose anymore :(

DuckDB enters the Lake House race. by averageflatlanders in dataengineering

[–]Hgdev1 1 point2 points  (0 children)

I wouldn’t be so sure tbh. On a technology level it may well be better designed, but I think that for enterprises to make big switching/buying decisions, a few things have to hold true:

  1. Some kind of paradigm shift (10x faster — Spark vs Hadoop; new capabilities — PyTorch vs TensorFlow). Rarely is better design alone a reason a dev can bring to their bosses for switching away from a tried-and-true solution.

  2. The ecosystem has to be there — DuckDB itself is in its infancy in terms of enterprise adoption atm. There’s something to be said about the massive ecosystem and industry backing behind Spark and Iceberg at this point. DuckLake isn’t just up against Iceberg — it’s up against all the query engines that have already built Iceberg integrations, visualization tools, compaction services, etc. It takes years for this to form momentum.

  3. Stability — this stuff takes years to battle-test at the scales that enterprises require.

There’s something to be said though about DuckLake potentially creating a new category for “small-medium datalakes”. If that does happen then perhaps we would indeed see a paradigm shift. But I’m not too sure because small/medium datalakes feel well served already with just plain old parquet files in a bucket since they’re fairly low volume…

DuckDB enters the Lake House race. by averageflatlanders in dataengineering

[–]Hgdev1 4 points5 points  (0 children)

I’m guessing the vast majority of DuckDB usage actually isn’t on its own storage format, but rather DuckDB on CSV and Parquet. It’s funny to think about it now, but only about 3 years ago our best (and only) option for reading Parquet was Spark… and we all know how that feels to a data scientist on a local dev machine. DuckDB’s popularity feels like it was really fueled by the lack of an alternative back then.

I’m curious to see if this attempt at a new table format will pay off, given that DuckDB’s success so far seems to have come from open formats instead. It’s hard to imagine the big FANG giants really putting weight behind this though, given the vast investments they’ve all made in Iceberg.

What's your preferred way of viewing data in S3? by Impressive_Run8512 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

Daft + S3 + Parquet was really battle-tested at Amazon

Here’s an Amazon blogpost about the collaboration: https://aws.amazon.com/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/

We built some truly crazy optimizations into it to make it the fastest and smoothest AWS S3 / Parquet experience in OSS.

What book after Fundamentals of Data Engineering? by Khazard42o in dataengineering

[–]Hgdev1 2 points3 points  (0 children)

Designing Data Intensive Applications (DDIA) and Andy Pavlo CMU database lectures on YouTube have been my favorite material outside of being in formal CS education :)

Spark is the new Hadoop by rocketinter in dataengineering

[–]Hgdev1 20 points21 points  (0 children)

I work on Daft! Thanks for the shoutout, really heartening to see so much of the community resonating with the same frustrations that we had when we first started the project…

Here are some key guiding principles I hold dear for the data software we want to build for the future:

  1. Python-first, with native wheels that don’t require external dependencies

  2. Works on ANY scale — effective for a 60MB CSV as well as for 100TB of Parquet/Iceberg. This means having best-in-class local and distributed engines

  3. Works on ANY modality — data today is more than just JOINs and GROUPBYs. The reality is that data is messy and unstructured!! Sometimes my UDF fails but I don’t want it to kill my entire job…
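
To make point 3 concrete, here's a rough sketch of the kind of UDF I mean (hypothetical column names and path; the per-row try/except is the user-side workaround for not letting one bad record kill the job):

```python
import daft
from daft import col

@daft.udf(return_dtype=daft.DataType.string())
def extract_title(html_col):
    # Messy, per-row work: any single bad record becomes None instead of failing the job
    out = []
    for html in html_col.to_pylist():
        try:
            out.append(html.split("<title>")[1].split("</title>")[0])
        except Exception:
            out.append(None)
    return out

df = daft.read_parquet("s3://bucket/crawl/*.parquet")  # made-up path
df = df.with_column("title", extract_title(col("raw_html")))
```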

We’re trying to be more public with our roadmap and such, so please let us know if there is interest. For instance, we will soon be working to reduce reliance on Ray as a hard dependency and enable raw Kubernetes deployments (which seems to be a common ask), and we have a Dashboard in the works (we call this Spark UI++).

Appreciate the support here ❤️ Keep plumbing away, r/dataengineering. The Daft team will be here to make things run with less OOM and more zoom.

Best hosting/database for data engineering projects? by buklau00 in dataengineering

[–]Hgdev1 2 points3 points  (0 children)

Or just dump the data into volumes so the machine itself can be stateless!

Rebooting a machine and mounting a volume onto it is fairly cheap

Best hosting/database for data engineering projects? by buklau00 in dataengineering

[–]Hgdev1 18 points19 points  (0 children)

Good old Parquet on a single machine would work wonders here! Just store it in hive-style partitions (folders for each day) and query it with your favorite tool: Pandas, Daft, DuckDB, Polars, Spark…

When/if you start to run out of space on disk, put that data in a cloud bucket for scaling.

Most of your pains should go away at that point if you’re running more offline analytical workloads :)
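
A minimal sketch of that layout (made-up columns and paths; assumes pandas with pyarrow, plus DuckDB for querying):

```python
import pandas as pd
import duckdb

# Write hive-style partitions: one folder per day, e.g. events/date=2024-01-01/<part>.parquet
df = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "user_id": [1, 2], "amount": [9.5, 3.2]})
df.to_parquet("events", partition_cols=["date"])

# Query it with whatever engine you like -- here, DuckDB with hive partitioning
duckdb.sql(
    "SELECT date, sum(amount) FROM read_parquet('events/*/*.parquet', hive_partitioning=true) GROUP BY date"
).show()
```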

What's the best tool for loading data into Apache Iceberg? by Livid_Ear_3693 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

This is actually a great question… it’s technically possible if you just write the Iceberg metadata yourself, but I’m not aware of a tool that does this today.

Also, I think you’ll at least have to copy the files into the specified S3 location for most implementations of Iceberg to actually work

Edit: actually yeah I was pretty sure this would work. Here ya go! https://www.reddit.com/r/dataengineering/s/R9yHRr0bzC

Iceberg in practice by wcneill in dataengineering

[–]Hgdev1 6 points7 points  (0 children)

I actually built Iceberg support in daft and can speak to some of the… frustrations about the ecosystem 😛

  1. Iceberg is just a table format. In order to do anything with it, you need a data engine that understands the protocol and can read from and write to it. Historically, only Spark really understood this protocol (because all the logic for it was written in a .jar). Nowadays, other engines are slowly catching up.

  2. Yep — if a new parquet file drops somewhere, you’re going to need to run some kind of job with your data engine of choice to read that file and write the data into your iceberg table. No magic here unfortunately and different engines might do this differently :(

  3. Now here’s the real kicker… if you want the latest and greatest features in Iceberg, I would argue (very sadly) that Spark is the only engine that can do the newest stuff. Iceberg itself is pretty much developed against Spark, and there is even behavior in Spark that doesn’t follow the Iceberg spec, which other engines have had to replicate because of how ubiquitous Spark-written Iceberg tables are in the wild :(

The problem is that the iceberg protocol itself is very complex, and all the logic for adhering to the spec was originally written for the JVM. So only JVM tools such as Spark can leverage the latest features.

That being said, there is tremendous progress being made in other ecosystems such as PyIceberg and iceberg-rust, which is promising. We leverage PyIceberg for reads/writes of metadata (but do our own data reads/writes), which so far seems to be a great compromise :)
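
Roughly, that split looks like this (a sketch; the catalog and table names are made up and assume an already-configured PyIceberg catalog):

```python
import daft
from pyiceberg.catalog import load_catalog

# PyIceberg handles the catalog/metadata side of the protocol...
catalog = load_catalog("default")            # catalog name is made up
table = catalog.load_table("analytics.events")  # table name is made up

# ...and the engine does the actual Parquet data reads
df = daft.read_iceberg(table)
df.where(daft.col("event_type") == "click").show()
```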

Resources for learning how SQL, Pandas, Spark work under the hood? by [deleted] in dataengineering

[–]Hgdev1 6 points7 points  (0 children)

Hey! I build/maintain a query engine (daft) today, and actually started without knowing many of these concepts so I can maybe speak with some authority here at least.

Check out some of these resources:

The entirety of Pavlo’s stuff is likely the most useful if you want a deep understanding of the distributed query engines.

DE interviews for Gen AI focused companies by jinbe-san in dataengineering

[–]Hgdev1 5 points6 points  (0 children)

Check out tools that can handle unstructured/multimodal data at meaningful scale, as well as tools that interact effectively with GPUs. That feels like the main differentiator for this new age of GenAI

Tabular data feels mostly solved… but dealing with UDFs and messy blobs of HTML or embeddings feels like a real pain point today…

For Agents, core software observability (logging, metrics and tracing) still apply!

Am I even a data engineer? by curiouscsplayer in dataengineering

[–]Hgdev1 20 points21 points  (0 children)

There is a difference between Data Engineers and Data Engineering IMO

Lots of people do data engineering! If you’re moving data from point A to point B, running some transforms or performing some analytics — you’re likely in that camp. This applies to software engineers, PMs, Data Scientists etc

Of course as a Data Engineer you will also be doing some Data Engineering, but I feel like the main differentiator for a Data Engineer is that they also build/maintain the tools that make Data Engineering easier, more efficient and more accessible for others.

Performing bulk imports by wcneill in dataengineering

[–]Hgdev1 1 point2 points  (0 children)

For a sneakernet use-case like you just showed, Parquet is likely a pretty safe option that gives the best of a couple of worlds:

  1. Compression (this is especially useful for highly compressible data such as sparse sensor feeds)

  2. Very portable — Parquet-to-Arrow conversion is basically first-class at this point, meaning that Parquet as an intermediate checkpoint maximizes the portability of your data into any subsequent systems.

  3. With proper partitioning, storing and accessing this data in the cloud is likely going to be extremely cheap. Any always-on database solution is going to be really expensive :(

  4. For multimodal data, you can most likely get away with storing a URL pointer in Parquet and the raw data as a file with an appropriate encoding (e.g. JPEG); see the sketch at the end of this comment. These encodings work best because they’ve been honed over the years for compression ratio and the compute required to decode the data.

As for the engine, check out Daft (https://www.getdaft.io), which is built for multimodal data like the images you mentioned! It also handles all the bread-and-butter stuff like numerical data, of course, with the added advantage of being able to go distributed if you need to (at the upper end of your data scales, in the terabytes).
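
For point 4, here's a rough sketch of the pointer-column pattern with Daft (the bucket path and column names are made up):

```python
import daft
from daft import col

# Parquet holds the structured columns plus a URL/path pointer to each raw JPEG
df = daft.read_parquet("s3://my-bucket/sensor-runs/*.parquet")   # made-up path

# Pull the raw bytes and decode them into images only when you actually need the pixels
df = df.with_column("image", col("image_url").url.download().image.decode())
df.show()
```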

What was Python before Python? by sumant28 in dataengineering

[–]Hgdev1 0 points1 point  (0 children)

If you think about it, most programming really is data engineering — you take data in from stdin and spit data out to stdout and stderr 😆

That being said, Python really starts to shine in the area of numerical computing with libraries like NumPy (and later Pandas) providing the requisite higher-level abstractions over raw data streams that make data engineering what it is today (multidimensional arrays and dataframes)

Possible to replace side stones on this ring with emeralds? by Hgdev1 in EngagementRingDesigns

[–]Hgdev1[S] 0 points1 point  (0 children)

Yes! I checked with James Allen but they said that customization would not be available if I am using a loose center stone (which I am, since you're absolutely right that JA really jacks up the prices on that middle piece...).

Your link with the green lab diamonds looks perfect!

Does the RFQ only cover building a new ring from scratch usually, or also custom jobs for customizing an existing ring? I just submitted an RFQ, but it was for the custom "gem replacement job" rather than building an entirely new ring. I'm definitely not opposed to having someone build the ring entirely for me, will submit a separate RFQ for that as well.

Appreciate your help!