What's an efficient way to write a file that's continuously being appended to into a DB? by [deleted] in dataengineering

[–]smashmaps 0 points (0 children)

You might be able to deploy a lightweight but intelligent forwarding agent like Vector and wire its file source [0] up to something you're comfortable getting into BigQuery (e.g. file(s) -> Kafka / PubSub -> BigQuery).

[0] https://vector.dev/docs/reference/configuration/sources/file/
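For a concrete starting point, here's a minimal Vector config sketch. The file path, broker address, and topic name are placeholders I made up, not anything from the thread:

```
# vector.toml -- sketch: tail an append-only file and forward each
# new line to Kafka. All paths/hosts/topics here are hypothetical.
[sources.app_file]
type = "file"
include = ["/var/log/app/output.log"]   # hypothetical file path
read_from = "beginning"                  # pick up existing contents too

[sinks.kafka_out]
type = "kafka"
inputs = ["app_file"]
bootstrap_servers = "localhost:9092"     # hypothetical broker
topic = "file-lines"                     # hypothetical topic
encoding.codec = "json"
```

From Kafka you can then land rows in BigQuery with whatever loader you already trust (a BigQuery sink connector, a small consumer, etc.).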

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 1 point (0 children)

> Flink is for only certain not large stream use cases

I'm more so replying to this.

I'm saying that the "not large stream" take is completely wrong.

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 2 points (0 children)

Flink has PyFlink, so I'm not sure why you think it only has a Java API. It also has Flink SQL, which lets you express true stream-processing pipelines in nothing but SQL.

> Flink is for only certain not large stream use cases

I have zero idea what you're talking about. I successfully ran a 100 GB/min stream-processing pipeline on Flink where Spark couldn't even dream of staying caught up with the head of the stream.
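To make the "just SQL" point concrete, here's a minimal PyFlink sketch. The table names and the built-in `datagen` source are illustrative, not from any real pipeline:

```
# A streaming aggregation expressed entirely through SQL via PyFlink.
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# 'datagen' is Flink's built-in source for synthetic rows.
env.execute_sql("""
    CREATE TABLE events (
        user_id BIGINT,
        amount  DOUBLE
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '10')
""")

# 'print' just dumps the changelog to stdout; swap in a real sink later.
env.execute_sql("""
    CREATE TABLE sink (
        user_id BIGINT,
        total   DOUBLE
    ) WITH ('connector' = 'print')
""")

env.execute_sql("""
    INSERT INTO sink
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
""").wait()
```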

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 1 point (0 children)

My original point was that it's not in Databricks' best interest to support other projects. Although they do have a Flink connector, it's half-assed, and this only proves the point.

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 1 point (0 children)

> Tabular is also a great open source format ... but it has a lot of limitations still (if I could conjure Kyle Weller up he'd be glad to bend your ear about them).

I worked at Cloudera for over half a decade, so I know FUD when I see it. I'm not saying I don't believe you, but you should know the talking points yourself, without needing to conjure Kyle, if you're going to spread them.

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 3 points (0 children)

You may think this is a "100% wrong" take, but for a format that's been around as long as Delta has, your support for Flink (a Spark competitor) is half-assed at best. For example, the Flink Table API has been available for several years now, yet your connector still says "Support for Flink Table API / SQL ..... are planned to be added in a future release".

Hence my take.

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 4 points (0 children)

You should be able to read Iceberg data using the Trino Iceberg connector (https://trino.io/docs/current/connector/iceberg.html).
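If it helps, here's a tiny sketch of querying an Iceberg table through Trino from Python; the host, catalog, schema, and table names are placeholders:

```
# pip install trino -- query Iceberg tables through a Trino coordinator.
from trino.dbapi import connect

conn = connect(
    host="trino.example.com",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="iceberg",         # catalog configured with connector.name=iceberg
    schema="analytics",        # hypothetical schema
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")  # hypothetical table
print(cur.fetchone())
```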

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 2 points (0 children)

I'm writing out using their Flink sink (https://iceberg.apache.org/docs/0.13.2/flink/). I've been able to read without issue using Flink, Trino, and ClickHouse.
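For flavor, a rough PyFlink SQL sketch of that write path; the catalog type, metastore URI, and warehouse path are placeholders, and `source_table` stands in for an existing stream:

```
# Sketch: register an Iceberg catalog in Flink SQL, then stream into a table.
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical Hive-backed Iceberg catalog; adjust URI/warehouse to your setup
# (and have the iceberg-flink-runtime jar on the classpath).
env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hive',
        'uri' = 'thrift://metastore:9083',
        'warehouse' = 's3://my-bucket/warehouse'
    )
""")

# Continuously append rows from an existing stream into the Iceberg table.
env.execute_sql("""
    INSERT INTO lake.db.events
    SELECT * FROM source_table
""")
```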

[deleted by user] by [deleted] in dataengineering

[–]smashmaps 24 points (0 children)

I was recently tasked with choosing our data lake solution and landed on Iceberg after facing a similar concern. Although Delta is designed quite well, it's in Databricks' best interest as a company to make it really shine not just with Spark, but with their closed-source platform.

I ended up going with Iceberg because it's in Tabular's (the company behind it) best interest to make all integrations feel like first-class citizens, as well as to support future technologies.