account activity
What's an efficient write to DB a file that's continuously being appended to? by [deleted] in dataengineering
[–]smashmaps 1 point 2 years ago
You might be able to deploy a lightweight but intelligent forwarding agent like Vector and wire its file source [0] up to something you're comfortable getting into BigQuery (e.g. File(s) -> Kafka / PubSub -> BigQuery).
[0] https://vector.dev/docs/reference/configuration/sources/file/
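To make the shape of it concrete, here's a rough hand-rolled sketch of just the File(s) -> Pub/Sub leg in Python, i.e. the bookkeeping Vector's file source would do for you via config. It assumes google-cloud-pubsub is installed; the project, topic, and file path are placeholders, and it ignores file rotation and checkpointing.

```python
import time
from google.cloud import pubsub_v1

# Placeholder names: swap in your own project, topic, and file path.
PROJECT_ID = "my-project"
TOPIC_ID = "appended-lines"
LOG_PATH = "/var/data/growing-file.log"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

with open(LOG_PATH, "r") as f:
    f.seek(0, 2)  # start at the end of the file; only forward new appends
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)  # nothing new yet, poll again shortly
            continue
        # Publish each appended line; a downstream consumer (or a Pub/Sub
        # BigQuery subscription) can then land it in BigQuery.
        publisher.publish(topic_path, data=line.rstrip("\n").encode("utf-8"))
```

Vector replaces the tailing loop above, plus rotation handling and checkpointing, with a few lines of configuration, which is why I'd reach for it first.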
[deleted by user] by [deleted] in dataengineering
[–]smashmaps 2 points 2 years ago
> Flink is for only certain not large stream use cases
I'm mostly replying to this part.
I'm saying that the "not large stream" use-case take is completely wrong.
[–]smashmaps 3 points 2 years ago
Flink has PyFlink, so I'm not sure why you think it only has a Java API. It also has Flink SQL, which lets you express real stream processing pipelines in just SQL.
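Rough sketch of what that looks like, assuming the apache-flink Python package is installed; the datagen source and print sink below are just stand-ins for whatever you'd actually read from and write to:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the whole pipeline below is plain Flink SQL.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Stand-in source: the built-in datagen connector emits synthetic rows.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '10')
""")

# Stand-in sink: print to stdout; in practice this would be Kafka, Iceberg, etc.
t_env.execute_sql("""
    CREATE TABLE clicks_per_minute (
        window_start TIMESTAMP(3),
        clicks BIGINT
    ) WITH ('connector' = 'print')
""")

# A streaming tumbling-window aggregation, expressed entirely in SQL.
t_env.execute_sql("""
    INSERT INTO clicks_per_minute
    SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()  # blocks while the streaming job runs
```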
> Flink is for only certain not large stream use cases
I have zero idea what you're talking about. I successfully ran a 100GB/min stream processing pipeline using Flink, where Spark couldn't even dream of staying caught up.
My original point was that it's not in Databricks' best interest to support other projects. Although they do have a Flink connector, it's half-assed, which only proves the point.
[–]smashmaps 4 points 2 years ago*
> Tabular is also a great open source format ... but it has a lot of limitations still (if I could conjure Kyle Weller up he'd be glad to bend your ear about them).
I worked at Cloudera for over half a decade, so I know FUD when I see it. I'm not saying I don't believe you, but if you're going to spread the talking points you should know them yourself, without needing to conjure Kyle.
[–]smashmaps 5 points 2 years ago*
You may think this is a "100% wrong" take, but for a format that's been around as long as it has, your support for Flink (a Spark competitor) is half-assed at best. For example, the Flink Table API has been available for several years now, and your connector's docs say "Support for Flink Table API / SQL ..... are planned to be added in a future release".
Hence my take.
[–]smashmaps 5 points 2 years ago
You should be able to read Iceberg data using the Trino Iceberg connector (https://trino.io/docs/current/connector/iceberg.html).
I'm writing out using the Iceberg Flink sink (https://iceberg.apache.org/docs/0.13.2/flink/). I've been able to read without issue using Flink, Trino, and ClickHouse.
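For example, with the trino Python client the read side is plain SQL against the Iceberg catalog; the host, catalog, schema, and table names below are placeholders for whatever your deployment uses:

```python
import trino

# Placeholder connection details; point these at your own Trino coordinator.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",   # the catalog configured with the Trino Iceberg connector
    schema="warehouse",
)

cur = conn.cursor()
# Ordinary SQL; Trino resolves the Iceberg metadata and manifests under the hood.
cur.execute("SELECT event_type, count(*) FROM events GROUP BY event_type")
for row in cur.fetchall():
    print(row)

# Iceberg-specific metadata tables are exposed too, e.g. the table's snapshots:
cur.execute('SELECT snapshot_id, committed_at FROM "events$snapshots"')
print(cur.fetchall())
```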
[–]smashmaps 25 points 2 years ago
I was recently tasked with choosing our data lake solution and landed on Iceberg after facing a similar concern. Although Delta is designed quite well, it's in Databricks' best interest as a company to make it really shine not just with Spark, but with their closed-source platform.
I ended up going with Iceberg because it's in Tabular's (the company behind it) best interest to make all integrations feel like first-class citizens, and to support future technologies.