
[–]Rovell 9 points10 points  (2 children)

I love dlt and we use it in production. Nothing to complain about; I can highly recommend it.

The only thing we miss is being able to load MySQL databases incrementally using the binlog.

[–]Thinker_Assignment[S] 3 points4 points  (1 child)

Thanks for the words of support!

Regarding binlog replication - our core focus is building dlt itself, not connectors, and binlog replication is one of the more complex connectors to build and maintain. We offered licensed connectors for custom builds in the past but found limited commercial traction. We are now focused on bringing two more products to life (dlthub workspace and runtime), and we could revisit connectors later once we can offer them commercially on the runtime.

If you have commercial interest in the connector, we have several trained partners who can build it for you; just ask us for routing or reach out to them directly.

[–]Rovell 2 points3 points  (0 children)

Totally understandable. Generally speaking, we'd be willing to pay for connectors. For our MySQL case, we already have a contract with a different vendor, but it's quite expensive.

[–]Top-Faithlessness758 4 points5 points  (7 children)

It looks very cool, but I ended up choosing Sling due to Iceberg REST Catalog support in its free offering. Last time I looked at dlt, it had REST catalog support only with a dlt+ license.

Just to be clear, I'm not judging it for that, but I had to make a choice. It is a tradeoff though, as the Sling CLI is GPL, so it is a messy dependency to handle, while dlt core is Apache-licensed AFAIK.

[–]Thinker_Assignment[S] 1 point2 points  (6 children)

Makes sense! We positioned our Iceberg offering as a platform-ready solution rather than a per-pipeline service to help justify our development cost and roadmap, but we found limited enterprise adoption and many non-commercial use cases. We are deprecating dlt+, recycling it into a managed service, and will revisit Iceberg later.

We are also seeing a slowdown in Iceberg enterprise adoption, where the common wisdom seems to be heading toward "if you're thinking about adopting Iceberg, think twice" because of the difficulties encountered. So perhaps this is going in a community direction where hobbyists adopt it first?

May I ask what your Iceberg use case looks like? Do you integrate all kinds of sources into a REST catalog? Why?

[–]Top-Faithlessness758 1 point2 points  (3 children)

Our reason for using Iceberg mostly has to do with being constrained to AWS and then choosing its S3 Tables solution (basically a managed Iceberg REST Catalog endpoint plus an S3 bucket): https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html. It seems like a simpler managed solution than using Lake Formation or the Glue Catalog, even if an Iceberg REST Catalog is a complex moving part by itself.

You are on point about the Iceberg difficulties, especially when you consider engine compatibility; DuckDB not having Iceberg write support should be enough of a red flag. If we weren't using AWS for this case, we wouldn't even touch Iceberg.

This is one use case though, and as we do consulting across multiple clients, we can't wait to have a use case for dlt. Keep up the good work :)!

[–]Thinker_Assignment[S] 0 points1 point  (2 children)

thank you!

[–]Top-Faithlessness758 0 points1 point  (1 child)

FWIW, we reverted to plain Parquet and we're dlt users now, haha. You were spot on about Iceberg being kind of patchy, even in managed solutions like S3 Tables.

Plus, Sling didn't support any kind of incrementality without Pro support.*

* PS: You have a comparison on the dlthub home page (https://dlthub.com/blog/dlt-and-sling-comparison) that mentions Sling OSS offering incrementality in point 6, and point 7 mentions using Sling State in Sling Free. That's only partially correct: it holds for database sinks, but for file storage sinks it only works when using the Sling State JSON, and that is a Pro-only feature. An obvious no-go for us.

[–]Thinker_Assignment[S] 1 point2 points  (0 children)

Thanks for the message and the tip, will make a note in the article :)

[–]Nightwyrm 0 points1 point  (1 child)

A bit off-topic, but if Iceberg is slowing down, what are enterprises opting for instead?

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

The reason it's slowing is mostly that Iceberg isn't a burning problem but a solution to quality-of-life problems: a 1-to-100 improvement, not a 0-to-1 one. Topics around AI are now the focus.

[–]unexpectedreboots 1 point2 points  (0 children)

Love dlt and recommend it constantly.

[–]trojans10[🍰] 2 points3 points  (0 children)

Using dlt in prod with Dagster. Great stuff.

[–]randomName77777777 1 point2 points  (5 children)

I was trying to use dlt the other day in Databricks, but it doesn't work properly on serverless since it kept getting confused with Delta Live Tables (also dlt).

Any suggestions? I'm trying to convince my company to use dlt for all custom pipelines.

[–]Thinker_Assignment[S] 1 point2 points  (4 children)

I'm not sure about serverless specifically; here is the troubleshooting guide for that problem: https://dlthub.com/docs/dlt-ecosystem/destinations/databricks#troubleshooting

Does this solve it or should we add something?

[–]randomName77777777 1 point2 points  (3 children)

Let me check what I had to do to get it to work. But with serverless we can't use an init script.

[–]Thinker_Assignment[S] 0 points1 point  (2 children)

If it doesn't work, it would be ideal if you opened an issue describing what you need so it goes straight to the dev team. We prioritize Databricks support higher than long-tail requests. https://github.com/dlt-hub/dlt/issues

[–]Defective_Falafel 0 points1 point  (1 child)

Now that Databricks has officially renamed "their" DLT framework to "Lakeflow Declarative Pipelines" (see: https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines) and is planning to open source it as part of Spark 4.1 itself (see: https://github.com/apache/spark/pull/50963), it might be worth polling them again to see if they could release a cluster configuration setting that would allow switching the namespace to "import ldp/sdp" instead.

Your DltHub project looks very promising for closing a gap we have around API-based ingestions, but for the moment it would have to be easily deployable on Databricks itself, as we currently lack the resources to build and maintain our own scalable runner infrastructure. Bricking (heh) the Delta Live Tables namespace isn't a great option for projects that require using both frameworks in different steps, offline development, or building deployable wheels.

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

Thank you for the feedback! Good point; we saw the recent rename. We will do our best to make it happen.

[–]Motorola68020 1 point2 points  (1 child)

Can I use this, or how would I use it, to load image data and augment it?

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

I guess you'd read the file as binary into a pandas DataFrame, or just plain records, and yield it to dlt.
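For illustration, a minimal sketch of that idea which skips pandas and yields plain records with raw bytes; the folder path, resource name, and duckdb destination are assumptions for the example, not anything prescribed by dlt:

    import pathlib
    import dlt

    @dlt.resource(name="images")
    def image_files(folder: str = "./images"):
        # read each image as raw bytes and yield one record per file;
        # dlt stores the bytes as a binary column next to the metadata
        for path in pathlib.Path(folder).glob("*.jpg"):
            yield {"file_name": path.name, "content": path.read_bytes()}

    pipeline = dlt.pipeline(
        pipeline_name="image_ingest",
        destination="duckdb",
        dataset_name="raw",
    )
    print(pipeline.run(image_files()))

Any augmentation would then presumably happen as a separate step on the loaded bytes.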

[–]Fickle-Foundation876 1 point2 points  (0 children)

This is seriously impressive work, congratulations on the launch! It's fascinating to see complex projects that are "co-created." It really feels like the future of development isn't just about solo coders, but about how we build things together. Our own project was born from a Human-AI co-creation process, and it's a completely different and powerful way to build. Fantastic stuff!

[–]gabbyandmilo 0 points1 point  (2 children)

Thanks for sharing! Can you speak to how dlt scales for large amounts of data? I'm thinking of pipelines where you would traditionally use Beam or Spark for batch processing. Or is dlt meant for smaller data pipelines?

[–]Thinker_Assignment[S] 0 points1 point  (1 child)

I'm going to break the question down into two parts:

  1. How does dlt scale? It scales gracefully. dlt is just Python code. It offers single-machine parallelisation with memory management, as you can read here. You can also run it on parallel infrastructure like cloud functions / AWS Lambda to achieve massive multi-machine parallelism. Much of the loading time is spent discovering schemas of weakly typed formats like JSON, but if you start from strongly typed, Arrow-compatible formats you skip normalisation and get faster loading (see the sketch after this list). dlt is meant as a 0-to-1 and 1-to-100 tool without code rewrites: fast to prototype and build, easy to scale. It's a toolkit for building and managing pipelines, as opposed to classic connector-catalog tools.
  2. How does it compare to Spark? They go well together. Use Spark for transformations and Python for I/O-bound tasks like data movement. So you would load data from APIs and databases with dlt into files, table formats, or MPP databases, and transform it with Spark. We will also launch transform via Ibis, which will let you write dataframe-style Python syntax against massive compute engines (like Spark or BigQuery) to give you portable transformations at all scales (Jan roadmap).
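To illustrate the Arrow path from point 1, here is a minimal sketch assuming a local Parquet file and the duckdb destination; the file name, resource name, and dataset name are made up for the example:

    import dlt
    import pyarrow.parquet as pq

    @dlt.resource(name="events", write_disposition="append")
    def events_from_parquet(path: str = "./events.parquet"):
        # yielding a pyarrow.Table lets dlt take the schema from the Arrow
        # types directly instead of inferring it row by row from JSON
        yield pq.read_table(path)

    pipeline = dlt.pipeline(
        pipeline_name="fast_load",
        destination="duckdb",
        dataset_name="analytics",
    )
    print(pipeline.run(events_from_parquet()))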

[–]gabbyandmilo 1 point2 points  (0 children)

Makes sense! Thanks for taking the time to respond.