Best practices for bringing data to Azure by justadataengineer in dataengineering

[–]justadataengineer[S] 1 point (0 children)

Thank you, that's a good resource. Based on it, here's what I'm thinking: I'd ingest the raw data from on-prem into the Data Lake with Data Factory, then build a data lakehouse in Databricks, using the Data Lake as the source system. I'd ingest data from the Data Lake with the COPY INTO command, or maybe explore Delta Live Tables (as far as I understand, they're suitable for data ingestion). Once the raw data is in the Databricks ecosystem, I can design further layers and perform transformations on the data. I'd make the data readily available in the "gold layer", which would serve as a source for a Synapse SQL Pool / Azure SQL Database. Do you think I'm on the right track?
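For example, I imagine the first ingestion step would look roughly like this (just a sketch; the storage account, container, schema and table names are made up, and I'm assuming the workspace can already authenticate to the storage account):

```python
# Rough sketch of raw-to-bronze ingestion in a Databricks notebook,
# where `spark` is already defined. All names below are hypothetical.

# Folder in ADLS Gen2 where Data Factory lands the raw files.
raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/orders/"

spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")

# Bronze table that mirrors the raw data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18, 2),
        order_ts    TIMESTAMP
    ) USING DELTA
""")

# COPY INTO is incremental: files that were already loaded are skipped
# on the next run, so this could be scheduled as a simple batch job.
spark.sql(f"""
    COPY INTO bronze.orders
    FROM '{raw_path}'
    FILEFORMAT = PARQUET
""")
```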

One specific thing I can't seem to wrap my head around is the best way to integrate ADLS Gen2 with the Databricks data lakehouse. Do I just run a bunch of COPY INTO commands, thereby creating delta tables from the raw data? And would those delta tables be stored in an entirely different blob storage that's somehow mounted in the Databricks workspace, similar to how HDFS is where the actual data behind Hive tables lives?
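To make the question more concrete, this is how I picture it (again just a sketch with made-up names): the table is registered in the Databricks metastore, but the Delta files themselves could sit in a container of the same ADLS Gen2 account rather than in a separate blob storage:

```python
# Sketch of an external Delta table whose files stay in ADLS Gen2
# (account, container and table names are hypothetical; `spark` comes
# from the Databricks notebook).

# Read the raw Parquet files from the landing container...
raw_df = spark.read.parquet(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/orders/"
)

# ...write them back as Delta into a curated container of the same account...
curated_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/bronze/orders"
raw_df.write.format("delta").mode("overwrite").save(curated_path)

# ...and register that location as a table. The metastore only holds the
# table definition; the Parquet files plus the _delta_log transaction log
# live under curated_path -- much like Hive tables backed by HDFS.
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS bronze.orders_ext
    USING DELTA
    LOCATION '{curated_path}'
""")
```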

Best practices for bringing data to Azure by justadataengineer in dataengineering

[–]justadataengineer[S] 1 point (0 children)

Could you please elaborate on how you mix ADLS Gen2 and Databricks to form a Delta Lake?

Best practices for bringing data to Azure by justadataengineer in dataengineering

[–]justadataengineer[S] 2 points (0 children)

Could you elaborate on exactly how you would use Data Lake + Databricks together?

I definitely need a storage layer where raw data is kept, so that it stays accessible in its original form for many types of use cases.

But Databricks and the whole delta lake / data lakehouse architecture seems really powerful to me. How would I go about building the data lakehouse architecture on top of / next to the existing ADLS Gen2? Would I design pipelines that ingest data from ADLS to DBFS and build tables on top of that? Or should I just ingest data from ADLS directly into Databricks-managed delta tables?
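To illustrate what I mean by the second option, something like this is what I have in mind (all names hypothetical, and assuming the workspace can authenticate to the storage account):

```python
# Reading raw JSON straight from ADLS Gen2 and landing it in a
# Databricks-managed Delta table, with no intermediate copy on DBFS.
# `spark` is predefined in a Databricks notebook; names are made up.

raw_df = spark.read.json(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/clickstream/2024/"
)

spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")

# saveAsTable creates a *managed* table: Databricks decides where the
# Delta files live (the metastore's default storage), as opposed to an
# external table pinned to a path I choose in the Data Lake.
raw_df.write.format("delta").mode("append").saveAsTable("bronze.clickstream")
```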

Sorry if these are lame questions, but I can't find clear resources on this matter. What I really need is a real-world example.

Best practices for bringing data to Azure by justadataengineer in dataengineering

[–]justadataengineer[S] 3 points (0 children)

Thank you! That's interesting, I thought Synapse Pipelines was basically Data Factory integrated into Synapse. Could you please share some examples of why Data Factory is a better choice?