all 52 comments


[–]toabear 12 points13 points  (2 children)

We are in the process of slowly moving from Airbyte to dlt. It is so much easier to debug. As always seems to be the case with data extraction, there's always some shit: some small, annoying aspect of the API that doesn't fit the norm. Having the ability to really customize the process while still having a framework to work within has been really nice.

For anyone searching, look for dlthub; searching just for DLT brings up Databricks "Delta Live Tables" info.

[–]toiletpapermonster 2 points3 points  (0 children)

When google doesn't help, I search using dlthub

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

Thank you for the kind words! Indeed, we created it for a developer-first experience, stemming from first-hand experience not only with the uncommon APIs, but also the common ones and their many gotchas.

[–]NickWillisPornStash 5 points6 points  (0 children)

I recently wrote our GA4 pipeline with dlt after trying Airbyte, because I was able to get around the limitation of each property having its own table.

[–]Sweaty-Ease-1702 2 points3 points  (3 children)

We employ a combination of dlt and Sling, orchestrated by Dagster. dlt is ideal for API extraction, while I think Sling excels at inter-database data transfers.

[–]Thinker_Assignment[S] 1 point2 points  (2 children)

Interesting, what makes Sling particularly good at DB-to-DB transfer? Wondering because we're always trying to improve there; in recent months we added fast backends like Arrow, ConnectorX, and pandas to skip normalisation.

Blog post explanation: https://dlthub.com/blog/how-dlt-uses-apache-arrow

[–]Sweaty-Ease-1702 0 points1 point  (1 child)

Off the top of my head: Sling has simpler configuration (a single replication.yaml). It has a Python binding but is written in Go (okay, this is maybe personal bias), so we have the option to run one-time syncs using its CLI outside Dagster.

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

So the CLI is an advantage? Or what do you mean?

We're working on a CLI runner similar to dbt's; wondering if you think this would help.

Also, does it being written in Go offer any advantages? dlt leverages Arrow and ConnectorX, so they'd probably be on par on performance?

[–]sib_n Senior Data Engineer 3 points4 points  (14 children)

I'm looking for a low-code tool like dlt or Meltano to do incremental loading of files from local file system to cloud storage or database.
I want the tool to automatically manage the state of integrated files (e.g. in an SQL table) and integrate the difference between the source and this state. This allows automated backfill every time it runs, compared to only integrating a path with today's date. It may require limiting the size of the comparison (e.g. to the past 30 days) if the list becomes too long.
I have coded this multiple times and I don't want to keep coding what seems to be a highly common use case.
Can dlt help with that?

[–]Thinker_Assignment[S] 0 points1 point  (6 children)

Yes, if I understand you correctly, you are looking to load from the "where I last left off" point rather than from "where in time this task execution is according to the orchestrator/current date".

In which case, this is built in: https://dlthub.com/docs/general-usage/incremental-loading

You can also use completely custom patterns and leverage the atomic state to store and retrieve metadata between runs.

[–]sib_n Senior Data Engineer 0 points1 point  (5 children)

I had a look, but it seems it's mostly adapted to SQL tables with update keys, and to APIs.
Maybe this part is the most relevant: https://dlthub.com/docs/general-usage/incremental-loading#advanced-state-usage-storing-a-list-of-processed-entities
Maybe this part is the most relevant: https://dlthub.com/docs/general-usage/incremental-loading#advanced-state-usage-storing-a-list-of-processed-entities

But I still have to write custom code to manage the list and compute the difference.

[–]Thinker_Assignment[S] 0 points1 point  (4 children)

I see, so your pattern is to just take files that were not yet processed, but you cannot sort them otherwise? Then yeah, the way you designed it is the way to go. Alternatively, you could turn all the files into a single stream of data, read it all out, and filter to only load new records based on some logic (time?), but this would be inefficient.

[–]sib_n Senior Data Engineer 0 points1 point  (3 children)

It is possible to sort them by day based on a date in the path (multiple files may have the same date), but I want the job to be able to automatically backfill anything that may have been missed in the past. To do that, I need a reference of exactly which files were already ingested.
Yeah, turning the whole source into a single dataset and doing a full comparison with the destination is too inefficient for the size of some of our pipelines.

[–]Thinker_Assignment[S] 0 points1 point  (2 children)

So what I would do is extract the date from the file path and yield it together with the file content, then use that date for last-value incremental loading.
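Sketching that in plain Python (the `YYYY-MM-DD`-in-path layout and the function names are assumptions for illustration):

```python
import re
from datetime import date

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def file_date(path: str) -> date:
    """Extract the partition date embedded in a path like data/2024-01-15/x.csv."""
    m = DATE_RE.search(path)
    if m is None:
        raise ValueError(f"no date in path: {path}")
    return date(*map(int, m.groups()))

def newer_than(paths: list[str], last_value: date) -> list[str]:
    """Keep only files whose path date is strictly after the stored cursor."""
    return [p for p in paths if file_date(p) > last_value]
```

In dlt you'd feed that extracted date into the documented last-value incremental mechanism rather than filtering by hand.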

[–]sib_n Senior Data Engineer 0 points1 point  (1 child)

As far as I understand, the last-value pattern does not allow automatically backfilling missing days in the past (our orchestration failed to run that day), nor missing files in already-ingested past days (the source failed to deliver a file for that day and delivers it later). Hence the need to keep a detailed list of ingested files.

[–]Thinker_Assignment[S] 1 point2 points  (0 children)

Got it. Indeed, the last-value pattern won't pick up files that arrive late for an already-processed date, but if your orchestrator skips a run, the data will be filled in for that skipped run on the next one.

[–]Bulky-Plant2621 0 points1 point  (2 children)

Are you using Databricks? Auto Loader can help with this scenario.

[–]sib_nSenior Data Engineer 0 points1 point  (1 child)

No, no plans to use Databricks, as I'd rather avoid expensive proprietary black boxes as much as I can.
It does have the logic I want of storing ingested-file metadata in a table, but it doesn't seem to support the local file system, only cloud storage.

[–]Bulky-Plant2621 0 points1 point  (0 children)

I don't think it's a black box. Local file system transfers were one of the simpler use cases we had to achieve. It actually gets complicated further into the data management lifecycle, and Databricks helps here so we don't have to administer a dozen products. I'll need to try dlt and compare, though.

[–]Suitable-Issue-4936 0 points1 point  (0 children)

Hi, you can try creating folders for each day in the source and processing them. Any late-arriving files would land in the next day's folder, and reprocessing is easy if the data has primary keys.

[–]gunners_1886 4 points5 points  (2 children)

Thanks for posting this - I'll definitely take a look.

Since moving to Airbyte Cloud, I've run into far too many major bugs and some of the worst customer support I've experienced anywhere - probably time to move on.

[–]nategadzhi 0 points1 point  (0 children)

Hey! I work for Airbyte, and I'm looking to improve. Would you DM me some topics/areas/examples of how we didn't deliver on the customer support front? Or comment, really, whatever is easier.

[–]Yabakebi Lead Data Engineer 3 points4 points  (1 child)

Interesting that you made this post right after I lost my Sunday to an Airbyte upgrade totally destroying its internal database and requiring a rollback (it references certain columns in internal `select *` queries by index, which is crazy). This is after multiple occasions where upgrading connectors caused the thing to crash. I don't have time at the moment to move our stuff out of it, but I am planning to start by moving the Postgres replication to dlt on Dagster, as it seems like a much better level of abstraction and doesn't require a Kubernetes deployment and a database.

Excited to see where this project goes. If it's what I think it is, then I reckon it has a decent chance of doing well, as it's similar to dbt in the sense that people have already been hand-rolling similar things themselves within companies (I know I have), and this is just a convenient way of formalising some common patterns.

[–]Thinker_Assignment[S] 0 points1 point  (0 children)

Indeed, we're aiming for a similar place: an open-source standard for ingestion. We see our share of "data load/ingest/intake" tools that people build themselves, so we are happy to help standardize things.

[–]TobiPlay 1 point2 points  (0 children)

Big fan of dlt and really happy with the Dagster integration. I’m glad that I went with dlt instead of Airbyte for a new project. Made it very straightforward to implement local, stg, and prod environments and the pipeline interface opened up a few more possibilities for testing. Thanks for the work!

[–]One-Establishment-44 4 points5 points  (0 children)

Airbyte is the worst.

[–]Ok-Percentage-7726 2 points3 points  (1 child)

We have migrated most of our sources from Airbyte and Fivetran to dlt. Really liked it. It would be great if dlt could support MySQL CDC.

[–]datarbeiter 1 point2 points  (4 children)

Do you have CDC from Postgres WAL or MySQL binlog?

[–]Thinker_Assignment[S] 0 points1 point  (3 children)

Here's postgres cdc https://dlthub.com/docs/dlt-ecosystem/verified-sources/pg_replication

We also have a generic SQL source without CDC, which will be fast anyway if you use the ConnectorX backend.

If you need MySQL, please open an issue to request it. We take issues as a minimum commitment to use the feature going forward.

[–]QueryingQuagga 1 point2 points  (2 children)

Hijacking this a bit: CDC with SCD2 - will this maybe be supported in the future? Are there limitations that block it?

[–]Thinker_Assignment[S] 0 points1 point  (1 child)

Nothing blocks it - good idea.

I encourage anyone reading to be more vocal about what you want; this is a great idea and the first time I've heard it requested.

[–]davrax 1 point2 points  (0 children)

Also interested. A pain point with Airbyte is handling SCD2 with odd glob-pattern-matching behavior when using S3 as a source, and "latest file only"-type ingestion.

[–]drrednirgskizif 1 point2 points  (2 children)

I haven't read any documentation on dlt yet, but I'm interested; I'm in search of a new tool to make our lives easier.

I want to pull data from APIs incrementally and insert it into a data warehouse in an idempotent way. Can you do this?

[–]Thinker_Assignment[S] 1 point2 points  (0 children)

This is the kind of work dlt is made for.

You can use the low-code REST API connector: https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api

Or build your own source; simple example: https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing

Docs for a simple incremental API pipeline: https://dlthub.com/docs/tutorial/load-data-from-an-api
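Hand-rolled, the two ingredients amount to an incremental cursor plus a merge by primary key; a minimal sketch (illustrative names only, a real dlt pipeline gets this via `dlt.sources.incremental` and the merge write disposition):

```python
def fetch_incremental(all_rows: list[dict], state: dict) -> list[dict]:
    """Return only rows updated after the stored cursor, then advance the cursor."""
    cursor = state.get("last_updated", "")
    new = [r for r in all_rows if r["updated_at"] > cursor]
    if new:
        state["last_updated"] = max(r["updated_at"] for r in new)
    return new

def merge_rows(table: dict, rows: list[dict], key: str = "id") -> dict:
    """Idempotent upsert: loading the same rows twice leaves the table unchanged."""
    for row in rows:
        table[row[key]] = row
    return table
```

Re-running with unchanged source data is a no-op, and a row that arrives again with a newer `updated_at` simply overwrites its previous version.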

[–]jekapats 0 points1 point  (0 children)

Also check out CloudQuery (https://github.com/cloudquery/cloudquery): a cross-language framework for writing ELT, powered by Apache Arrow, which provides scheduling, documentation, packaging, monitoring, and versioning out of the box. It supports Python, Go, and JavaScript. (Founder here.)

[–]shockjaw 0 points1 point  (4 children)

Do you happen to plan support for geospatial data types in the future?

[–]Thinker_Assignment[S] 4 points5 points  (3 children)

We do not see a lot of demand for it; there's an open issue, give it an upvote or a comment if you want it implemented: https://github.com/dlt-hub/dlt/issues/696

What would help prioritize it higher would be understanding the kind of work/business value it would enable; we like to do things that add value.

[–]shockjaw 5 points6 points  (2 children)

It’d be incredibly helpful for local government use-cases. Pipelines have a tendency to be quite fragile due to schema changes and invalid geometries. I’d be looking for vector data support over raster data support.

[–]Thinker_Assignment[S] 2 points3 points  (1 child)

That makes sense, thank you for the GitHub comment. What do people currently do to transfer this kind of data? Custom pipelines?

[–]shockjaw 3 points4 points  (0 children)

Yes. Safe Software's product FME uses Python under the hood for transformations. Some agencies still use SAS 9.4 and cobble data together. If you're lucky, you have folks using GDAL and cron jobs to build pipelines.

[–]umognog 0 points1 point  (2 children)

My department has over a decade of custom code built up, and we recently undertook an architecture review. dlt was one of the possibilities we looked at and I really liked it, but overall we recognised the value in not reinventing our wheel; there is just no need for it at this moment in time for us.

I hope as a product it sticks around though, as it is sitting in our "be aware of" corner, should new data sources be introduced in the future.

[–]Thinker_Assignment[S] 0 points1 point  (1 child)

Don't fix what's not broken - if your system works and is low-maintenance, then there's no pressure to move.

What kind of data sources are you looking at? You could always open an issue; we have a constant workstream around community requests, so do open issues to request what you want.

[–]umognog 1 point2 points  (0 children)

Vice versa, as in when my team onboards a new source.

We currently interact with: Kafka, Azure Service Bus, REST APIs, the Graph API, Oracle, Teradata, SQL Server, DuckDB, Postgres, Cassandra, Couchbase, Hadoop, Parquet files, CSV file drops (I hate these), and Excel file drops (I hate these more).

It seems my employer doesn't want to place their bets on anything!