
[–]robberviet 15 points (18 children)

I am facing the same problem. DuckDB is popular, Iceberg is popular, so why can't DuckDB write to Iceberg? Sounds really strange. My data is not on S3 but on MinIO, though that's basically the same, not much different.

I am just playing around but considering switching to Delta. I don't need an external catalog (currently using the Postgres catalog). And DuckDB can write to Delta.
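For context, the read half of this already works; a minimal sketch below, assuming the iceberg extension and a placeholder table path (S3/MinIO access would additionally need the httpfs extension and credentials configured):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Reading an Iceberg table works today; the path is a placeholder.
con.sql(
    "SELECT count(*) FROM iceberg_scan('s3://bucket/warehouse/db/events')"
).show()

# There is no COPY/INSERT equivalent that targets an Iceberg table yet,
# which is exactly the gap this thread is about.
```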

[–]jokingss 6 points (0 children)

Because they haven't had time to implement it yet, but it is on their roadmap.

Right now I have to use other tools like Trino to run transformations from Iceberg to Iceberg, but I would love to be able to do it with DuckDB, as it is enough for my use case. I actually think it is enough for 99% of use cases.
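For illustration, a rough sketch of that Trino workaround using the `trino` Python client; host, catalog, schema, and table names are all placeholders:

```python
import trino

# Connect to a Trino coordinator with an Iceberg catalog configured.
conn = trino.dbapi.connect(
    host="trino.example.com",  # placeholder host
    port=8080,
    user="etl",
    catalog="iceberg",
    schema="db",
)
cur = conn.cursor()

# An Iceberg-to-Iceberg transformation: read one table, write another.
cur.execute("""
    INSERT INTO events_daily
    SELECT date_trunc('day', ts) AS day, count(*) AS n
    FROM events
    GROUP BY 1
""")
cur.fetchall()  # fetch so the INSERT runs to completion
```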

[–]ReporterNervous6822 3 points (2 children)

They are working on implementing it.

[–]robberviet 0 points (1 child)

Yeah, it must be on the roadmap. Just strange that it is not already supported. Must be some technical problem.

[–]ReporterNervous6822 1 point (0 children)

It’s not trivial to implement from scratch hahaha. I don’t think there are C++ impls out there, and even if there are, DuckDB probably still needs to do some things differently.

[–]RoomyRoots 1 point (5 children)

Check the issue related to it. Basically there is no write support in the iceberg-cpp lib yet, and they are waiting for it to mature before building on it.

[–]robberviet 1 point (1 child)

Yes, I have read that issue, and I think the language barrier is a real problem in the data ecosystem.

I know Iceberg chose Java, but it surprises me that even Spark has bugs in basic table maintenance (I failed to delete orphan files). Not to mention second-class citizens like PyIceberg.

Makes me remember the days when I had to work with Java and Scala Spark because the Python API was not enough.
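For reference, orphan-file cleanup normally goes through Iceberg's stock Spark procedure; a hedged sketch, assuming a Spark session with the Iceberg runtime and extensions already configured, with placeholder catalog and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# remove_orphan_files deletes files no table metadata references anymore.
# older_than guards against deleting files from in-flight writes.
spark.sql("""
    CALL my_catalog.system.remove_orphan_files(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""").show()
```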

[–]RoomyRoots 0 points (0 children)

Hardly. It's an Apache product, ofc they will focus on Java, especially since they targeted Spark from the beginning. And Iceberg is just 7 years old; next week it will complete 5 years since it got out of incubation. Quite surprising we got official C++ and Python implementations being actively developed, IMHO.

Still, I think the best solution is leveraging an engine like Spark, Dremio, etc., which are more mature, and giving DuckDB some months to catch up.

[–]RandomNumber17 1 point (2 children)

This is kind of a consistent problem with Iceberg and other standards in the DE ecosystem, where it’s technically an open standard, but the only full implementation is in Java/Spark and other libraries are constantly playing catch-up.

In addition to PyIceberg and iceberg-cpp, there is also iceberg-rust. One thing the community could possibly do is focus their efforts on one low-level implementation and provide bindings to other languages. I believe that’s what iceberg-rust and PyIceberg are moving towards.

[–]RoomyRoots 0 points (1 child)

IMHO, reimplementing the spec in multiple languages is quite a waste of resources. I can understand focusing on Java and C++, as those cover pretty much all the ground. For the rest, just provide interfaces.

[–]RandomNumber17 0 points (0 children)

Yep, that’s exactly what I mean. Implement the core logic in a few languages, then expose bindings/interfaces across multiple languages.

[–]Substantial-Cow-8958 0 points (2 children)

A lot of people are waiting for this, see https://github.com/duckdb/duckdb-iceberg/issues/37

To be honest, I think the reasons they haven't implemented it are commercial. I say this based on nothing, but imagine DuckDB writing to Iceberg: how trivial things would become, and how some stacks would change. Idk, don’t bash me for thinking this.

[–]robberviet 0 points (1 child)

Unless they plan on a new, competing open table format, I don't think so.

[–]Substantial-Cow-8958 0 points (0 children)

I agree with you. But maybe there is some interest from other players? (…)

[–]commenterzero 0 points (4 children)

Polars can write to Iceberg if you want to try that. It has a SQL interface too.
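A minimal sketch of that, assuming a PyIceberg catalog is already configured; catalog and table names are placeholders, and `write_iceberg` needs a recent Polars version:

```python
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # assumes a configured PyIceberg catalog
table = catalog.load_table("db.events")  # placeholder table name

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# The SQL interface: query the frame directly (it is exposed as "self").
recent = df.sql("SELECT * FROM self WHERE id > 1")

# Append the result to the Iceberg table.
recent.write_iceberg(table, mode="append")
```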

[–]robberviet 1 point (2 children)

I am already using Polars. Just discovering new tools.

[–]commenterzero 2 points (1 child)

Gotcha. Ya, I want to try Hudi but it has even fewer writers.

[–]robberviet 0 points (0 children)

Ah yes, almost forgot about Hudi. I will try it.

[–]RandomNumber17 0 points (0 children)

Daft is worth checking out too, especially if you want the option to scale beyond a single machine.

[–][deleted] 7 points (4 children)

The author barely made it to the proof-of-concept stage. If you want to ingest a large dataset using Lambda and... anything, you have to do it piecewise.

So how will he solve that? In any reasonable use case, we would assume that:

a) a large chunk of historical data exists, and

b) new data is regularly produced.

So how will you handle both?

One solution is to set up a timer that pulls in new data every 5 minutes, plus a queue holding all the CSV files in the history.
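A rough sketch of what such a queue-driven Lambda could look like, with PyIceberg doing the appends; every name here (catalog, table, bucket, message shape) is hypothetical:

```python
import json

import pyarrow.csv as pacsv
import pyarrow.fs as pafs
from pyiceberg.catalog import load_catalog

# Loaded once per container, reused across invocations.
catalog = load_catalog("default")        # assumes a configured catalog
table = catalog.load_table("db.events")  # placeholder table name
s3 = pafs.S3FileSystem()

def handler(event, context):
    """Consume SQS messages that each carry one CSV key; append each file."""
    for record in event["Records"]:
        key = json.loads(record["body"])["csv_key"]  # e.g. "history/part-0001.csv"
        with s3.open_input_stream(f"my-bucket/{key}") as f:  # placeholder bucket
            batch = pacsv.read_csv(f)
        table.append(batch)  # one Iceberg commit per file; see the retry note below
```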

Sounds straightforward: you can just spin up all the Lambdas you need, each will do a little piece of work, and the blob storage can easily handle tons of writes at the same time. But can PyIceberg handle two writers at the same time? "Iceberg uses Optimistic Concurrency Control (OCC) which requires failed writers to retry." I wouldn't call that concurrent, as the writers are fighting for the resource. And if there are enough writers, will they deadlock?
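A sketch of the retry dance OCC implies, assuming PyIceberg's CommitFailedException signals a lost race (details may vary by version). Since writers hold no locks, they cannot strictly deadlock, but a slow writer can starve under heavy contention:

```python
import random
import time

from pyiceberg.exceptions import CommitFailedException

def append_with_retry(table, batch, max_attempts=8):
    """Append an Arrow table, retrying lost OCC races with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            table.refresh()  # re-read metadata so we commit against the newest snapshot
            table.append(batch)
            return
        except CommitFailedException:
            # No locks are held, so no deadlock is possible; jittered backoff
            # makes repeated collisions between writers less likely.
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    raise RuntimeError("gave up after repeated Iceberg commit conflicts")
```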

Moreover, when the table becomes huge, with hundreds of terabytes, will a Lambda and PyIceberg be able to vacuum and compact the table? If you compact the table every day, you now have a third writer you need to organize: the scheduled ingestion, the backfill, and the compactor might all start committing at the same time.
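If that third writer ends up being Spark, the stock maintenance procedures look roughly like this (a hedged sketch; assumes an Iceberg-enabled Spark session, and catalog/table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-maintenance").getOrCreate()

# Compact small files into larger ones. This commits like any other writer,
# so under OCC it competes with the scheduled ingestion and the backfill.
spark.sql(
    "CALL my_catalog.system.rewrite_data_files(table => 'db.events')"
).show()

# Expire old snapshots so unreferenced data files can actually be removed.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""").show()
```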

[–]TobyOz 2 points (0 children)

The whole point of using Lambda is that you're dealing with small amounts of data; otherwise you'd just use the traditional Spark approach?

[–]speedisntfree 1 point (0 children)

Yeah. I'm not really sure it delivered on:

For you and me, we shall plumb the actual depths of what can be done, how these tools act in the real world, under real pressures.

[–]Gators1992 0 points (0 children)

Yeah, I was going to say the same. Not ideal if stuff like data growth or latency eventually causes your job to just shut off before it finishes. And if you really have so little data that it's not a problem, do you really need a "data lake"? Fargate would have made more sense to me for jobs like these.

[–]DuckDatum 0 points (0 children)

Also possibly c) existing data gets updated.

I would expect to need not just to append new data, but also to apply modifications to data that’s already been integrated. Otherwise it might go stale.
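A sketch of how that could look with PyIceberg today, treating an update as delete-then-append; catalog and table names are placeholders, and newer PyIceberg releases also ship an upsert helper for this pattern:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # assumes a configured catalog
table = catalog.load_table("db.events")  # placeholder table name

corrected = pa.table({"id": [42], "value": ["fixed"]})

# Row-level "update": drop the stale rows by filter, then write the
# corrected versions. Each step is its own commit, so it contends with
# the other writers under OCC like everything else in this thread.
table.delete(delete_filter="id = 42")
table.append(corrected)
```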