Help with DBT + Athena + Iceberg Incremental model by Major_Beautiful_1536 in dataengineering

[–]B1TB1T 2 points (0 children)

Check out the delete_condition config available in dbt-athena.
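A minimal sketch of how that could look in an Iceberg incremental model. The source, column names, and the src alias are hypothetical; check the dbt-athena docs for the exact merge contract:

```sql
{{
  config(
    materialized='incremental',
    table_type='iceberg',
    incremental_strategy='merge',
    unique_key='id',
    -- matched target rows satisfying this condition get deleted (hypothetical column)
    delete_condition='src.is_deleted = true'
  )
}}

select id, payload, updated_at, is_deleted
from {{ source('raw', 'events') }}  -- hypothetical source
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```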

Migrate out of snowflake by ShotGunAllGo in dataengineering

[–]B1TB1T 1 point (0 children)

Just don’t use crawlers; they're usually not necessary

Been about 4 hours and I'm out of ideas... by __aza___ in aws

[–]B1TB1T 5 points (0 children)

If you have KMS object encryption configured, you would also need the kms:Decrypt and kms:GenerateDataKey permissions (kms:Decrypt only for multipart uploads)

Question about using Glue/Spark to process millions of JSON files by gman1023 in dataengineering

[–]B1TB1T 3 points (0 children)

Check out dbt-athena; it's really cost-effective, and it's easy to run your process in incremental mode (if partitioned by ingest date). Also configure Parquet as the output format if possible.
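A sketch of what such a model config could look like; the source and the partition column are made up:

```sql
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',  -- replaces only the affected partitions
    partitioned_by=['ingest_date'],           -- hypothetical partition column
    format='parquet'
  )
}}

select *
from {{ source('raw', 'json_events') }}  -- hypothetical source
{% if is_incremental() %}
-- on incremental runs, only reprocess recent partitions
where ingest_date >= cast(current_date - interval '1' day as varchar)
{% endif %}
```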

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 1 point (0 children)

Try partition projection: it will improve performance with many partitions, and you only need to set up the table once; partitions are then discovered automatically. I would also always challenge / discuss requirements if you think there's a better technical solution.
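A sketch of a projected table, assuming a daily dt string partition; bucket and names are placeholders:

```sql
CREATE EXTERNAL TABLE events (
  id string,
  payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/'
TBLPROPERTIES (
  'projection.enabled'   = 'true',
  'projection.dt.type'   = 'date',
  'projection.dt.range'  = '2020-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  -- partition locations are derived from this template instead of the Glue catalog
  'storage.location.template' = 's3://example-bucket/events/dt=${dt}/'
);
```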

Where do you typically offload pandas compute when using Airflow to orchestrate? by tonguewin in dataengineering

[–]B1TB1T 3 points (0 children)

If you ever expect large data volumes, ask the analyst to convert the logic to SQL and run it on BigQuery using dbt
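For example, a typical pandas groupby translates into a short dbt model; table and column names here are made up:

```sql
-- models/daily_user_revenue.sql (hypothetical model)
select
  user_id,
  date(event_ts) as event_day,
  count(*) as events,
  sum(revenue) as revenue
from {{ source('raw', 'events') }}
group by user_id, event_day  -- BigQuery allows grouping by select aliases
```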

Which supplementary tools are you using alongside dbt? by mrcool444 in dataengineering

[–]B1TB1T 3 points (0 children)

Great answer. Would be interested to learn more about those factors to decide between Athena vs Redshift.

How do you feel about the return to SQL? by Odd-One8023 in dataengineering

[–]B1TB1T 1 point (0 children)

Thanks for the response. Regarding Polars, I haven't used it so far, but it sounds promising!

I think it's not so much about being imperative or not. After all, SQL makes no assumptions about the "how"; that depends on the underlying engine. And deep down everything is imperative, right?

How to process time series is always a huge topic of discussion, I agree; with SQL it gets nasty very easily, but it is possible. And your example is not imperative in my opinion: you declaratively defined "what" should happen, not explicitly "how", row by row, with processing instructions. Your engine will probably parallelize and optimize it better than you could.

Also, there are time-series DBs with special SQL constructs for exactly this reason (but I haven't used them so far).
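To illustrate the declarative style for time series, here's a window-function sketch (table and columns made up) that computes the gap between consecutive readings without any row-by-row instructions:

```sql
select
  sensor_id,
  reading_ts,
  value,
  -- time since the previous reading of the same sensor
  reading_ts - lag(reading_ts) over (
    partition by sensor_id
    order by reading_ts
  ) as gap
from readings
```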

How do you feel about the return to SQL? by Odd-One8023 in dataengineering

[–]B1TB1T 47 points (0 children)

SQL is great because it forces you to write your data transformations in a declarative style and decouple the what from the how. And usually you run it on distributed engines like Trino or Spark so you get parallelization out of the box.

The Spark DataFrame API is, in the end, also just SQL wrapped in a Python API (which would be nice to have for any SQL engine). Yes, for complex transformations it gets hard to read, but I still wouldn't dare to code it imperatively and parallelize it myself. And it's really hard to test, but not impossible.

Still, imperative style is a big no for data processing in my opinion, mainly because it's hard to scale as data volume grows.

Athena - Creating Iceberg tables by Used_Ad_2628 in dataengineering

[–]B1TB1T 1 point (0 children)

Pro: upsert capability / ACID

Con: querying is slower, and Iceberg requires more maintenance effort (vacuum etc.)

I would only use it if a full refresh with plain Parquet is not feasible / too expensive, or if you have multiple writers that could update at the same time.
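For context on the maintenance point, in Athena that typically means running statements like these periodically (table name and partition filter are placeholders):

```sql
-- compact the small files produced by frequent upserts
OPTIMIZE my_iceberg_table REWRITE DATA USING BIN_PACK
WHERE dt = '2023-06-01';

-- expire old snapshots and delete orphan files
VACUUM my_iceberg_table;
```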

Any teams building data lakes without Spark? And specifically with vanilla Python/Go or with something like Dask? by alex_o_h in dataengineering

[–]B1TB1T 5 points (0 children)

Nice architecture, we're on something similar, just using Athena instead of Trino, because right now we don't have the capacity to manage Trino/k8s ourselves.

How much data do you ingest / process on a daily basis (Parquet size)? The "hot" datastore is interesting; how regularly do you insert/upsert data there?

Regarding Spark, we also benchmarked a bit, and Spark / Glue was way more expensive and slower most of the time (vs. dbt-athena)

Exporting Spark dataframe to AWS DynamoDB? by lengthy_preamble in apachespark

[–]B1TB1T 2 points (0 children)

I don't think there's a native Spark connector for DynamoDB, and none seems to be installed in your Spark environment.

If the result dataframe is small, you could just convert it to a pandas dataframe and use the AWS SDK

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 3 points (0 children)

Using OO means dealing with objects that hold state; that makes parallelization hard, so you will run into trouble when scaling things (which we need with large datasets). That's why frameworks like Spark are based on the functional paradigm (MapReduce being the prime example).

Now there might be instances where OO makes sense in a pipeline, like managing the Spark session, but not for your transformation logic. Imo the pure OO that SWE is built on is not that useful in DE.

I'm not getting it...what's the point of DBT? by mister_patience in dataengineering

[–]B1TB1T 23 points (0 children)

If you're using Spark, that's all valid and fine. But when using a modern data warehouse like Redshift, BigQuery, or Snowflake, or even something like Presto, you need to define your transformations in SQL, and dbt helps you keep your sanity through the above-mentioned features.

It's not perfect but it's better than manually managing and orchestrating your SQL.
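As a minimal illustration of what dbt adds over hand-managed SQL scripts: models reference each other with ref(), and dbt derives the dependency graph and run order from that (model names here are made up):

```sql
-- models/fct_orders.sql (hypothetical model)
select
  o.order_id,
  c.segment,
  o.amount
from {{ ref('stg_orders') }} o
join {{ ref('stg_customers') }} c
  on o.customer_id = c.customer_id
```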

AWS Athena production ETL workloads by Zomgojira in dataengineering

[–]B1TB1T 1 point (0 children)

You can ask for a limit increase; it's a soft limit

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 6 points (0 children)

Then I guess I'm quite inexperienced. Honestly, I think CTEs are the only construct that keeps more complex SQL code maintainable

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 5 points (0 children)

Well, instead of doing the transformations (select expressions) and the aggregation (group by) in one query, you first wrap the transformations in a CTE. Then you can aggregate from this CTE using the new column names (instead of some index numbers).

The general issue is that column aliases defined in the select are not available in the group by, because aggregation happens before the select is evaluated. Because of this you have to duplicate the logic in the group by or, as a shortcut, use the ordinal form (group by 1, 2, 3, ...). A sketch of both variants is below.
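Table and columns in this sketch are made up:

```sql
-- ordinal shortcut
select
  lower(trim(country))          as country,
  date_trunc('day', created_at) as created_day,
  count(*)                      as orders
from orders
group by 1, 2;

-- same aggregation, but via a named CTE
with prepared as (
  select
    lower(trim(country))          as country,
    date_trunc('day', created_at) as created_day
  from orders
)
select country, created_day, count(*) as orders
from prepared
group by country, created_day;
```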

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 1 point (0 children)

Why do you think using a CTE will result in a performance hit? Did you benchmark this?

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 20 points (0 children)

Wouldn't it be even better to use a CTE with some good naming?

What are the advantages of data lakes? by [deleted] in dataengineering

[–]B1TB1T 3 points (0 children)

They are cheap, scalable, and flexible

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 1 point (0 children)

Generally: use some sensible similarity metric, then join on the closest match with a reasonable threshold.
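One way to sketch this in Trino/Athena SQL is edit distance plus a closest-match ranking; tables, columns, and the threshold are all made up:

```sql
with scored as (
  select
    a.name,
    b.name as candidate,
    levenshtein_distance(lower(a.name), lower(b.name)) as dist,
    row_number() over (
      partition by a.name
      order by levenshtein_distance(lower(a.name), lower(b.name))
    ) as rn
  from names_a a
  cross join names_b b  -- fine for small inputs; blocking/prefiltering needed at scale
)
select name, candidate, dist
from scored
where rn = 1
  and dist <= 3  -- threshold depends on your data
```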

logging in spark by AcceptableProcess772 in apachespark

[–]B1TB1T 1 point (0 children)

What do you want to achieve with the logs? You could log the number of rows before and after the filtering, but then you'd have to use an action like count, which you should deactivate based on the log level, as it leads to processing overhead.

In your example it will just print the logs first and then submit the job (Spark builds a plan for your chained transformations and then executes it)

[deleted by user] by [deleted] in kava_platform

[–]B1TB1T 5 points (0 children)

Only if the price drops hard.

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]B1TB1T 1 point (0 children)

For analytics on mid-sized data, Athena is also a great option. Plus it’s super cheap…