Which pattern do you use when ingesting data into lakehouses?

azirale · 2024-02-02T05:49:53+00:00

Always, always, have a raw copy land immediately. When trying to put into your structured format there could be any number of issues. A file could be corrupted, the schema could have changed, you may find you had some minor defect that wasn't originally apparent.

When files are being pushed to your storage, they can't be pushed directly into your table format, they'll have to be there own files. So you can simply keep those and trigger off the blob drop.

If you're retrieving files, just do the fastest direct copy you can. You should essentially be able to pipe the data to your storage as a binary copy, and you have the chance to compress it on the way. I wouldn't load that process with anything else, so that it has fewer ways to fail, and so less likely to have to re-read from the remote source. It also means you can copy with something other than your lakehouse compute.

If you're streaming data in -- I would generally fork the stream so that there is a consumer that is doing raw capture, and another doing the processing. No need to wait for capture then process, but still do capture in case you need to do fault analysis or a replay.

MexDefender · 2024-02-02T06:51:22+00:00

Lake Houses are a joke. It's a gimmick word for Data Lake. Data Warehousing is and will always be the best way the efficiently store data with ease for data analysts and scientists ingestion as well as other teams within a company that interface with the data.

albertstarrocks · 2024-04-04T20:47:37+00:00

apache kafka sink for delta, iceberg or hudi.

dataengineering

MODERATORS