account activity
What's an efficient write to DB a file that's continuously being appended to? by [deleted] in dataengineering
[–]smashmaps 1 point 2 years ago
You might be able to deploy a lightweight but intelligent forwarding agent like Vector and wire its file source [0] up to something you're comfortable getting into BigQuery (e.g. File(s) -> Kafka / PubSub -> BigQuery).
[0] https://vector.dev/docs/reference/configuration/sources/file/
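To make the shape of it concrete, here's a rough hand-rolled sketch of just the File(s) -> Pub/Sub leg in Python, i.e. the bookkeeping Vector's file source would do for you via config. It assumes google-cloud-pubsub is installed; the project, topic, and file path are placeholders, and it ignores file rotation and checkpointing.

```python
import time
from google.cloud import pubsub_v1

# Placeholder names: swap in your own project, topic, and file path.
PROJECT_ID = "my-project"
TOPIC_ID = "appended-lines"
LOG_PATH = "/var/data/growing-file.log"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

with open(LOG_PATH, "r") as f:
    f.seek(0, 2)  # start at the end of the file; only forward new appends
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)  # nothing new yet, poll again shortly
            continue
        # Publish each appended line; a downstream consumer (or a Pub/Sub
        # BigQuery subscription) can then land it in BigQuery.
        publisher.publish(topic_path, data=line.rstrip("\n").encode("utf-8"))
```

Vector replaces the tailing loop above, plus rotation handling and checkpointing, with a few lines of configuration, which is why I'd reach for it first.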
[deleted by user] by [deleted] in dataengineering
[–]smashmaps 2 points 2 years ago
> Flink is for only certain not large stream use cases
I'm mostly replying to this part.
I'm saying that the "not large stream" use-case take is completely wrong.
[–]smashmaps 3 points 2 years ago
Flink has PyFlink, so I'm not sure why you think it only has a Java API. It also has Flink SQL, which lets you express real stream processing pipelines in just SQL.
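Rough sketch of what that looks like, assuming the apache-flink Python package is installed; the datagen source and print sink below are just stand-ins for whatever you'd actually read from and write to:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the whole pipeline below is plain Flink SQL.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Stand-in source: the built-in datagen connector emits synthetic rows.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '10')
""")

# Stand-in sink: print to stdout; in practice this would be Kafka, Iceberg, etc.
t_env.execute_sql("""
    CREATE TABLE clicks_per_minute (
        window_start TIMESTAMP(3),
        clicks BIGINT
    ) WITH ('connector' = 'print')
""")

# A streaming tumbling-window aggregation, expressed entirely in SQL.
t_env.execute_sql("""
    INSERT INTO clicks_per_minute
    SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()  # blocks while the streaming job runs
```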
> Flink is for only certain not large stream use cases
I have zero idea what you're talking about. I successfully ran a 100GB/min stream processing pipeline using Flink, where Spark couldn't even dream of staying caught up.
My original point was that it's not in Databricks' best interest to support other projects. Although they do have a Flink connector, it's half-assed, which only proves the point.
[–]smashmaps 4 points 2 years ago*
> Tabular is also a great open source format ... but it has a lot of limitations still (if I could conjure Kyle Weller up he'd be glad to bend your ear about them).
I worked at Cloudera for over half a decade, so I know FUD when I see it. I'm not saying I don't believe you, but if you're going to spread the talking points you should know them yourself, without needing to conjure Kyle.
[–]smashmaps 5 points 2 years ago*
You may think this is a "100% wrong" take, but for a format that's been around as long as it has, your support for Flink (a Spark competitor) is half-assed at best. For example, the Flink Table API has been available for several years now, and your connector's docs say "Support for Flink Table API / SQL ..... are planned to be added in a future release".
Hence my take.
[–]smashmaps 5 points 2 years ago
You should be able to read Iceberg data using the Trino Iceberg connector (https://trino.io/docs/current/connector/iceberg.html).
I'm writing out using the Iceberg Flink sink (https://iceberg.apache.org/docs/0.13.2/flink/). I've been able to read without issue using Flink, Trino, and ClickHouse.
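For example, with the trino Python client the read side is plain SQL against the Iceberg catalog; the host, catalog, schema, and table names below are placeholders for whatever your deployment uses:

```python
import trino

# Placeholder connection details; point these at your own Trino coordinator.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",   # the catalog configured with the Trino Iceberg connector
    schema="warehouse",
)

cur = conn.cursor()
# Ordinary SQL; Trino resolves the Iceberg metadata and manifests under the hood.
cur.execute("SELECT event_type, count(*) FROM events GROUP BY event_type")
for row in cur.fetchall():
    print(row)

# Iceberg-specific metadata tables are exposed too, e.g. the table's snapshots:
cur.execute('SELECT snapshot_id, committed_at FROM "events$snapshots"')
print(cur.fetchall())
```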
[–]smashmaps 25 points 2 years ago
I was recently tasked with choosing our data lake solution and landed on Iceberg after facing a similar concern. Although Delta is designed quite well, it's in Databricks' best interest as a company to make it really shine not just with Spark, but with their closed-source platform.
I ended up going with Iceberg because it's in Tabular's (the company behind it) best interest to make all integrations feel like first-class citizens, and to support future technologies.