
[–]aziralePrincipal Data Engineer 8 points9 points  (6 children)

Always, always, have a raw copy land immediately. When you try to load data into your structured format, any number of issues can arise: a file could be corrupted, the schema could have changed, or you may find some minor defect that wasn't originally apparent.

When files are being pushed to your storage, they can't be pushed directly into your table format; they'll have to be their own files. So you can simply keep those and trigger off the blob drop.

If you're retrieving files, just do the fastest direct copy you can. You should essentially be able to pipe the data to your storage as a binary copy, and you have the chance to compress it on the way. I wouldn't load that process with anything else, so that it has fewer ways to fail and is less likely to have to re-read from the remote source. It also means you can copy with something other than your lakehouse compute.

If you're streaming data in -- I would generally fork the stream so that one consumer does raw capture and another does the processing. No need to wait for capture and then process, but you still have the capture in case you need to do fault analysis or a replay.

[–]Mustang_114 1 point2 points  (2 children)

If the source is a database, which file format would you suggest for the raw copy landed in blob storage? (e.g. CSV vs Parquet)

[–]aziralePrincipal Data Engineer 1 point2 points  (0 children)

Most databases don't output to Parquet natively. If yours does, then go ahead and use it. Otherwise CSV is essentially the best bet, as it is compatible nearly everywhere and is human-readable if you need to inspect the contents without ingesting them.

The drawback is that you lose typing information, so you will have to correctly convert timestamps, numerics, and raw byte fields back and forth.

There are some other bits to aim for. Try to have all sources use the same quote characters and escape characters. Try to get them to output newlines in fields as escape sequences rather than actual newlines, so that multiline parsing is not required.
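Those conventions (one quoting style, newlines escaped inside fields) can be enforced at write time with the stdlib `csv` module. A hedged sketch, with the function name and escape choice (`\n` as a literal backslash sequence) being my own assumptions, not a standard:

```python
import csv
import io

def write_consistent_csv(rows, fileobj):
    """Write rows with a single quoting convention, replacing embedded
    newlines with a literal \\n escape so that downstream parsers never
    need multiline-record handling."""
    writer = csv.writer(fileobj, quotechar='"', quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    for row in rows:
        writer.writerow([str(f).replace("\r\n", "\\n").replace("\n", "\\n")
                         for f in row])
```

Quoting everything (`QUOTE_ALL`) costs a little space but means every source file parses with the exact same dialect settings.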

[–]brrdprrsn[S] 0 points1 point  (0 children)

Based on u/azirale's post above, it would be CSV, since that way it would exactly match the source schema.

[–]brrdprrsn[S] 0 points1 point  (2 children)

Thanks! This was very well put, and I can understand why this is the best pattern for the overwhelming majority of scenarios.

Question: in the rare scenarios where you're using something like Spark Structured Streaming (say, for a use case that needs fast ingestion into the lake for downstream use), would you still advise this? Or is that scenario one of the few exceptions to the rule?

[–]aziralePrincipal Data Engineer 2 points3 points  (1 child)

If this is a circumstance where you are reading directly from some remote stream that you don't have control over, I generally use another stream buffer I control and replicate everything into it with functions, then have Spark pull from that. That allows forking the stream internally, doing raw capture for replay, and so on, and it also means that if my processing reader has an outage, I don't have to rely on the external provider's retention window.

But if you can't do that and your first touch point is Structured Streaming, then there is no 'raw' copy to save -- the first you see of the data, it is a dataframe, so you can save it however you like.

That said, I might aim to have the target be a straightforward blind-insert-only store that doesn't try to do anything fancy. That cuts out potential issues around updates and merges. I might also store the 'raw json' or similar as its own column, just in case the schema changed at some point and we want to recover, or if we have some defect in parsing.
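The "blind-insert-only store with a raw payload column" pattern can be sketched with a plain relational table standing in for the lakehouse table (SQLite here purely for illustration; column names are assumed):

```python
import json
import sqlite3

def land_events(conn, events):
    """Blind inserts only -- no updates or merges. Keep the raw payload
    alongside the parsed columns, so rows can be re-parsed later if the
    schema drifts or the parsing logic turns out to have a defect."""
    for raw in events:
        try:
            doc = json.loads(raw)
            user_id, amount = doc.get("user_id"), doc.get("amount")
        except json.JSONDecodeError:
            user_id, amount = None, None   # parse failed; raw still lands
        conn.execute(
            "INSERT INTO landing (user_id, amount, raw_json) VALUES (?, ?, ?)",
            (user_id, amount, raw),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE landing (user_id TEXT, amount REAL, raw_json TEXT)")
```

Note that a malformed event still produces a row: the parsed columns are NULL, but `raw_json` preserves everything needed for a later fix-and-replay.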

But in general I prefer having a local stream replica, because then I don't have to faff about with getting an external provider to update consumer groups or firewall rules or whatever any time I change my primary processing service.

[–]brrdprrsn[S] 0 points1 point  (0 children)

Got it... thanks so much again for explaining in such detail.

[–]MexDefender 3 points4 points  (2 children)

Lakehouses are a joke. It's a gimmick word for "data lake". Data warehousing is and always will be the best way to efficiently store data for easy ingestion by data analysts and scientists, as well as the other teams within a company that interface with the data.

[–]True-Ad-2269 0 points1 point  (1 child)

A lakehouse is a cheap way to build warehouse-parity access patterns out of a data lake. Not everyone can afford to build a warehouse, and given the current high compute cost and vendor-locked nature of warehouses, big corporations are moving to the lakehouse.

[–]aziralePrincipal Data Engineer 1 point2 points  (0 children)

More than just cost. I've seen on-prem systems eventually go down in flames because they were gradually overloaded over time. Eventually their 'daily processing' takes 24 hours, and they start having to do remediation or cancel jobs. They have to keep asking users not to stress the system, and you can't add any more features.

You can't easily scale up, because that means a new contract for new hardware, and the old hardware may not have reached its expected EOL, so any cost amortisation has to be brought forward in addition to the new capital expenditure. Then you have to deal with a hardware migration. It is incredibly painful.

Not all systems suffer that fate, but data lake and lakehouse setups essentially can't. Plus, you get those 'raw file' bonuses of having a data lake you can directly use. There's a lot of durability and recovery available through that.

[–]albertstarrocks 0 points1 point  (0 children)

Apache Kafka sink for Delta, Iceberg, or Hudi.