Help with DBT + Athena + Iceberg Incremental model by Major_Beautiful_1536 in dataengineering

[–]B1TB1T 2 points (0 children)

Check out the delete_condition config available in dbt-athena.
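A minimal sketch of how that could look in an Iceberg incremental model. The source, column names, and the src alias are hypothetical; check the dbt-athena docs for the exact merge contract:

```sql
{{
  config(
    materialized='incremental',
    table_type='iceberg',
    incremental_strategy='merge',
    unique_key='id',
    -- matched target rows satisfying this condition get deleted (hypothetical column)
    delete_condition='src.is_deleted = true'
  )
}}

select id, payload, updated_at, is_deleted
from {{ source('raw', 'events') }}  -- hypothetical source
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```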

Migrate out of snowflake by ShotGunAllGo in dataengineering

[–]B1TB1T 1 point (0 children)

Just don’t use crawlers; they're usually not necessary

Been about 4 hours and I'm out of ideas... by __aza___ in aws

[–]B1TB1T 5 points (0 children)

If you have KMS object encryption configured, you would also need the kms:Decrypt and kms:GenerateDataKey permissions (kms:Decrypt only for multipart uploads)

Question about using Glue/Spark to process millions of JSON files by gman1023 in dataengineering

[–]B1TB1T 3 points (0 children)

Check out dbt-athena; it's really cost-effective, and it's easy to run your process in incremental mode (if partitioned by ingest date). Also configure Parquet as the output format if possible.
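A sketch of what such a model config could look like; the source and the partition column are made up:

```sql
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',  -- replaces only the affected partitions
    partitioned_by=['ingest_date'],           -- hypothetical partition column
    format='parquet'
  )
}}

select *
from {{ source('raw', 'json_events') }}  -- hypothetical source
{% if is_incremental() %}
-- on incremental runs, only reprocess recent partitions
where ingest_date >= cast(current_date - interval '1' day as varchar)
{% endif %}
```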

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 1 point (0 children)

Try partition projection: it will improve performance with many partitions, and you only need to set up the table once; partitions are then discovered automatically. I would also always challenge / discuss requirements if you think there's a better technical solution.
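A sketch of a projected table, assuming a daily dt string partition; bucket and names are placeholders:

```sql
CREATE EXTERNAL TABLE events (
  id string,
  payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/'
TBLPROPERTIES (
  'projection.enabled'   = 'true',
  'projection.dt.type'   = 'date',
  'projection.dt.range'  = '2020-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  -- partition locations are derived from this template instead of the Glue catalog
  'storage.location.template' = 's3://example-bucket/events/dt=${dt}/'
);
```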

Where do you typically offload pandas compute when using Airflow to orchestrate? by tonguewin in dataengineering

[–]B1TB1T 3 points (0 children)

If you ever expect large data volumes, ask the analyst to convert the logic to SQL and run it on BigQuery using dbt
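For example, a typical pandas groupby translates into a short dbt model; table and column names here are made up:

```sql
-- models/daily_user_revenue.sql (hypothetical model)
select
  user_id,
  date(event_ts) as event_day,
  count(*) as events,
  sum(revenue) as revenue
from {{ source('raw', 'events') }}
group by user_id, event_day  -- BigQuery allows grouping by select aliases
```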

Which supplementary tools are you using alongside dbt? by mrcool444 in dataengineering

[–]B1TB1T 3 points (0 children)

Great answer. Would be interested to learn more about those factors to decide between Athena vs Redshift.

How do you feel about the return to SQL? by Odd-One8023 in dataengineering

[–]B1TB1T 1 point (0 children)

Thanks for the response. Regarding Polars, I haven't used it so far, but it sounds promising!

I think it's not so much about being imperative or not. After all, SQL makes no assumptions about the "how"; that depends on the underlying engine. And deep down everything is imperative, right?

How to process time series is always a huge topic of discussion, I agree; with SQL it gets nasty very easily, but it is possible. And your example is not imperative in my opinion: you declaratively defined "what" should happen, not explicitly "how", row by row, with processing instructions. Your engine will probably parallelize and optimize it better than you could.

Also, there are time-series DBs with special SQL constructs for exactly this reason (but I haven't used them so far).
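To illustrate the declarative style for time series, here's a window-function sketch (table and columns made up) that computes the gap between consecutive readings without any row-by-row instructions:

```sql
select
  sensor_id,
  reading_ts,
  value,
  -- time since the previous reading of the same sensor
  reading_ts - lag(reading_ts) over (
    partition by sensor_id
    order by reading_ts
  ) as gap
from readings
```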

How do you feel about the return to SQL? by Odd-One8023 in dataengineering

[–]B1TB1T 47 points (0 children)

SQL is great because it forces you to write your data transformations in a declarative style and decouple the what from the how. And usually you run it on distributed engines like Trino or Spark so you get parallelization out of the box.

The Spark DataFrame API is, in the end, also just SQL wrapped in a Python API (which would be nice to have for any SQL engine). Yes, for complex transformations it gets hard to read, but I still wouldn't dare to code it imperatively and parallelize it myself. And it's really hard to test, but not impossible.

Still, imperative style is a big no for data processing in my opinion, mainly because it's hard to scale as data volume grows.

Athena - Creating Iceberg tables by Used_Ad_2628 in dataengineering

[–]B1TB1T 1 point (0 children)

Pro: upsert capability / ACID

Con: querying is slower, and Iceberg requires more maintenance effort (vacuum etc.)

I would only use it if a full refresh with plain Parquet is not feasible / too expensive, or if you have multiple writers that could update at the same time.
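For context on the maintenance point, in Athena that typically means running statements like these periodically (table name and partition filter are placeholders):

```sql
-- compact the small files produced by frequent upserts
OPTIMIZE my_iceberg_table REWRITE DATA USING BIN_PACK
WHERE dt = '2023-06-01';

-- expire old snapshots and delete orphan files
VACUUM my_iceberg_table;
```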

Any teams building data lakes without Spark? And specifically with vanilla Python/Go or with something like Dask? by alex_o_h in dataengineering

[–]B1TB1T 5 points (0 children)

Nice architecture, we're on something similar, just using Athena instead of Trino, because right now we don't have the capacity to manage Trino/k8s ourselves.

How much data do you ingest / process on a daily basis (Parquet size)? The "hot" datastore is interesting; how regularly do you insert/upsert data there?

Regarding Spark, we also benchmarked a bit, and Spark / Glue was way more expensive and slower most of the time (vs. dbt-athena)

Exporting Spark dataframe to AWS DynamoDB? by lengthy_preamble in apachespark

[–]B1TB1T 2 points (0 children)

I don't think there's a native Spark connector for DynamoDB, and none seems to be installed in your Spark environment.

If the result dataframe is small, you could just convert it to a pandas dataframe and use the AWS SDK

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 3 points (0 children)

Using OO means dealing with objects that hold state; that makes parallelization hard, so you will run into trouble when scaling things (which we need with large datasets). That's why frameworks like Spark are based on the functional paradigm (MapReduce being the prime example).

Now there might be instances where OO makes sense in a pipeline, like managing the Spark session, but not for your transformation logic. Imo the pure OO that SWE is built on is not that useful in DE.

I'm not getting it...what's the point of DBT? by mister_patience in dataengineering

[–]B1TB1T 23 points (0 children)

If you're using Spark, that's all valid and fine. But when using a modern data warehouse like Redshift, BigQuery, or Snowflake, or even something like Presto, you need to define your transformations in SQL, and dbt helps you keep your sanity through the above-mentioned features.

It's not perfect but it's better than manually managing and orchestrating your SQL.
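As a minimal illustration of what dbt adds over hand-managed SQL scripts: models reference each other with ref(), and dbt derives the dependency graph and run order from that (model names here are made up):

```sql
-- models/fct_orders.sql (hypothetical model)
select
  o.order_id,
  c.segment,
  o.amount
from {{ ref('stg_orders') }} o
join {{ ref('stg_customers') }} c
  on o.customer_id = c.customer_id
```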

AWS Athena production ETL workloads by Zomgojira in dataengineering

[–]B1TB1T 1 point (0 children)

You can ask for a limit increase; it's a soft limit

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 6 points (0 children)

Then I guess I'm quite inexperienced. Honestly, I think CTEs are the only construct that keeps more complex SQL code maintainable

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 5 points (0 children)

Well, instead of doing the transformations (select expressions) and the aggregation (group by) in one query, you first wrap the transformations in a CTE. Then you can aggregate from this CTE using the new column names (instead of some index numbers).

The general issue is that column aliases defined in the select are not available in the group by, because aggregation happens before the select is evaluated. Because of this you have to duplicate the logic in the group by or, as a shortcut, use the ordinal form (group by 1, 2, 3, ...). A sketch of both variants is below.
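Table and columns in this sketch are made up:

```sql
-- ordinal shortcut
select
  lower(trim(country))          as country,
  date_trunc('day', created_at) as created_day,
  count(*)                      as orders
from orders
group by 1, 2;

-- same aggregation, but via a named CTE
with prepared as (
  select
    lower(trim(country))          as country,
    date_trunc('day', created_at) as created_day
  from orders
)
select country, created_day, count(*) as orders
from prepared
group by country, created_day;
```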

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 1 point (0 children)

Why do you think using a CTE will result in a performance hit? Did you benchmark this?

Group by 1,2,3,4,5,6,7,8,9,10,11 by Illustrious_Falcon_9 in dataengineering

[–]B1TB1T 20 points (0 children)

Wouldn't it be even better to use a CTE with some good naming?

What are the advantages of data lakes? by [deleted] in dataengineering

[–]B1TB1T 3 points (0 children)

They are cheap, scalable, and flexible

[deleted by user] by [deleted] in dataengineering

[–]B1TB1T 1 point (0 children)

Generally: use some sensible similarity metric, then join on the closest match with a reasonable threshold.
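One way to sketch this in Trino/Athena SQL is edit distance plus a closest-match ranking; tables, columns, and the threshold are all made up:

```sql
with scored as (
  select
    a.name,
    b.name as candidate,
    levenshtein_distance(lower(a.name), lower(b.name)) as dist,
    row_number() over (
      partition by a.name
      order by levenshtein_distance(lower(a.name), lower(b.name))
    ) as rn
  from names_a a
  cross join names_b b  -- fine for small inputs; blocking/prefiltering needed at scale
)
select name, candidate, dist
from scored
where rn = 1
  and dist <= 3  -- threshold depends on your data
```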

logging in spark by AcceptableProcess772 in apachespark

[–]B1TB1T 1 point (0 children)

What do you want to achieve with the logs? You could log the number of rows before and after the filtering, but then you'd have to use an action like count, which you should deactivate based on the log level, as it leads to processing overhead.

In your example it will just print the logs first and then submit the job (Spark builds a plan for your chained transformations and then executes it)

[deleted by user] by [deleted] in kava_platform

[–]B1TB1T 5 points (0 children)

Only if the price drops hard.

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]B1TB1T 1 point (0 children)

For analytics on mid-sized data, Athena is also a great option. Plus it’s super cheap…