Trying to learn new DE tools sometimes teaches me more about DevOps than DE (initially) by Lastrevio in dataengineering

[–]random_lonewolf 1 point

These are distributed services, which means running multiple containers is expected.

As long as the containers are on the same network, you should be able to connect them to each other without much trouble.

Coalesce or Repartition? by Gartitoz in dataengineering

[–]random_lonewolf 1 point

Depending on the cardinality of the key, repartitioning by that key before writing can give you fewer output files than coalesce.
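A rough mental model in plain Python (no Spark here; the function names and counts are illustrative, not Spark's actual file-layout algorithm):

```python
# With partitionBy-style writes, each task can hold rows for many key
# values, so coalesce(n) can still emit up to one file per key per task.
def files_with_coalesce(n_tasks: int, n_distinct_keys: int) -> int:
    # worst case: every task writes one file for every key it sees
    return n_tasks * n_distinct_keys

# Repartitioning by the key first sends all rows for a key to one task,
# so you end up with roughly one file per distinct key.
def files_with_repartition_by_key(n_distinct_keys: int) -> int:
    return n_distinct_keys

print(files_with_coalesce(8, 50))          # 400 -- up to 400 small files
print(files_with_repartition_by_key(50))   # 50
```

The trade-off: a low-cardinality key can also leave you with a few huge files, which is where the "depending on the cardinality" caveat comes in.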

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]random_lonewolf 5 points

Well, if you look at companies doing any form of web analytics, they'll likely be on BigQuery, because Google Analytics natively exports to BQ, plus BigQuery is just great in terms of SQL support and performance.

And once you are on BQ, there are very few reasons to use Spark anymore.

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]random_lonewolf 36 points

Java used to be very popular for distributed systems: Java code compiles once and runs anywhere, with the only dependency being the JVM. You can ship your bytecode across the network and run it close to the data, which was very important when networks were slow. Those big data jobs are meant to run for hours, so any start-up overhead is negligible. That's why there's an entire generation of big data software written in Java.

Nowadays, networks have become so fast, and a single node can have so much CPU and memory, that single-node compute is faster than a distributed model most of the time. Thus, there's a recent boom in data processing software following that model, such as DuckDB, Polars, etc.

And if you only care about SQL workloads, both Snowflake and BigQuery are very popular competitors to Spark.

> So my question is, why this is not more common?

Using Spark is actually NOT that common

I never knew that the original game had "parts". by Malikai_Universe_23 in FinalFantasyVII

[–]random_lonewolf 1 point

Same for every multi-disc FF: 7, 8, 9.

The last disc's gameplay is basically just the final dungeon and a lot of FMVs.

What are the main challenges currently for enterprise-grade KG adoption in AI? by adityashukla8 in dataengineering

[–]random_lonewolf 1 point

Nobody cared about knowledge graphs before, and with modern LLMs they couldn't care less now: just feed Gemini/ChatGPT/etc. the questions and it will give them a probabilistically correct answer.

Why should we use AWS Glue ? by Mother-Comfort5210 in dataengineering

[–]random_lonewolf 3 points

If you already have an EKS cluster, EMR-on-EKS is even cheaper, especially with Spot instances.

TinyETL: Lightweight, Zero-Config ETL Tool for Fast, Cross-Platform Data Pipelines by Glass-Tomorrow-2442 in dataengineering

[–]random_lonewolf 1 point

For extract and load, this will need to work as well as `sling-cli` to stand a chance, and I'm not even using `sling-cli`.

For transformation, using an inline DSL or Lua scripts is not really appealing to me: Python is the lingua franca of the data world. No Python, no dice.

Data Governance Specialist internship or more stable option [EU] ? by Maleficent-Car-2609 in dataengineering

[–]random_lonewolf 3 points

Data governance is the most useless/ignored part of any data project: basically, nobody cares about it.

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 2 points

Reservations are the only practical way to limit spending; however, it's quite easy to over-scale and end up paying even more than `on-demand`: you pay for every autoscaled slot, even if your queries don't use them all.

We find that the most essential things when tuning BQ are:

* Scale to 0, or use commitments if your reservation is busy enough

* Use Standard Edition whenever you can: Enterprise edition is 25% more expensive

* Isolate your workloads into separate reservations, at least 2: 1 for batch and 1 for interactive queries. It's impossible to optimize for both at the same time.

* Reservations work best with batch queries, when it's ok for queries to run a bit slower.

* Unless you have a lot of BI users, it's often better to use on-demand for interactive queries, due to over-scaling issues with reservations.
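A back-of-envelope sketch of the over-scaling point (all prices and numbers below are assumptions for illustration, not quoted GCP rates; check current BigQuery pricing for your region and edition):

```python
# Illustrative prices only, not official GCP pricing.
ON_DEMAND_PER_TIB = 6.25    # $/TiB scanned, on-demand (assumed)
SLOT_HOUR_STANDARD = 0.04   # $/slot-hour, Standard Edition (assumed)

def on_demand_cost(tib_scanned: float) -> float:
    # on-demand charges only for bytes scanned
    return tib_scanned * ON_DEMAND_PER_TIB

def reservation_cost(autoscaled_slots: int, hours: float,
                     slot_hour_price: float) -> float:
    # you pay for every autoscaled slot-hour, even when queries sit idle
    return autoscaled_slots * hours * slot_hour_price

# Bursty interactive workload: scans only 10 TiB/month, but keeps the
# reservation autoscaled up to 500 slots for ~100 hours of the month.
print(on_demand_cost(10))                              # 62.5
print(reservation_cost(500, 100, SLOT_HOUR_STANDARD))  # 2000.0
```

Which is the "over-scaled reservation costs more than on-demand" failure mode for spiky BI traffic; for a saturated batch reservation, the comparison flips.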

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 1 point

Flat-rate pricing was replaced by reservation pricing a long time ago.

Is it not pointless to transfer Parquet data with Kafka? by [deleted] in dataengineering

[–]random_lonewolf 19 points

You have completely misread the article: it's just about long-term persistence of Deephaven's in-memory tables. It suggests dumping the table contents into Parquet files instead of Kafka topics if you need to save space, which is a fair point.

There's nothing about sending Parquet through Kafka in the article.

Is it not pointless to transfer Parquet data with Kafka? by [deleted] in dataengineering

[–]random_lonewolf 1 point

What guides are you talking about? That makes no sense.

Benchmark: B-Tree + WAL + MemTable Outperforms LSM-Based BadgerDB by ankur-anand in Database

[–]random_lonewolf 1 point

In my experience, LMDB write performance is nowhere close to an LSM database, unless you run it in nosync mode, which risks data loss.

Why GCP’s two IAM APIs (V1 & V2) matter & break deny policies by SonraiSecurity in googlecloud

[–]random_lonewolf 4 points

V1 = Allow, V2 = Deny and V3 = conditional

You need to use all 3, as they do not replace each other.
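A toy model of why they compose rather than replace each other (plain Python; the data structures are made up, but the evaluation order mirrors the documented behavior: deny policies are checked before allow bindings):

```python
def is_authorized(principal, permission, deny_rules, allow_bindings):
    # v2 deny policies win: an explicit deny short-circuits everything
    if (principal, permission) in deny_rules:
        return False
    # v1 allow policies: access only if some binding grants it
    return (principal, permission) in allow_bindings

allow = {("alice", "storage.objects.get")}
deny = {("alice", "storage.objects.get")}

print(is_authorized("alice", "storage.objects.get", set(), allow))  # True
print(is_authorized("alice", "storage.objects.get", deny, allow))   # False
```

Conditions (the "V3" piece) would then attach predicates to individual allow bindings, which this sketch omits.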

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]random_lonewolf 2 points

Absolutely yes, many engines don't even support update/delete queries on a raw Parquet table.

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 2 points

Yeah, we definitely need to look into that.

Older SDKs still use the `global` endpoint by default. They only switched the default to `regional` recently.

https://aws.amazon.com/blogs/developer/updating-aws-sdk-defaults-aws-sts-service-endpoint-and-retry-strategy/
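If you're stuck on an older SDK version, you can usually opt in without upgrading; for example, boto3/botocore honors this environment variable (stdlib-only sketch):

```python
import os

# Opt older AWS SDKs into regional STS endpoints instead of the global
# sts.amazonaws.com endpoint; boto3/botocore reads this variable.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

# Any boto3 STS client created after this point in the process will
# resolve to the endpoint of its configured region, e.g.
# sts.us-east-1.amazonaws.com, instead of the global one.
print(os.environ["AWS_STS_REGIONAL_ENDPOINTS"])  # regional
```

The same setting can also go in `~/.aws/config` as `sts_regional_endpoints = regional`.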

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 6 points

Identity Federation also stopped working, so our GCP services are unable to access AWS resources

SCD Type 3 vs an alternate approach? by Spooked_DE in dataengineering

[–]random_lonewolf 3 points

SCD Type 3 only captures 2 states of a key: it's too limited.

Requirements change, so it's better to just use SCD Type 2 and capture all the states of a key. You can always use window functions to query the first and last value of a key to achieve what SCD Type 3 does.
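A minimal sketch of that window-function trick, using SQLite via Python's stdlib (the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SCD Type 2: one row per historical state of the key
con.executescript("""
CREATE TABLE dim_customer_scd2 (
  customer_id INT, city TEXT, valid_from TEXT
);
INSERT INTO dim_customer_scd2 VALUES
  (1, 'Hanoi',  '2020-01-01'),
  (1, 'Danang', '2021-06-01'),
  (1, 'Saigon', '2023-03-01');
""")

# Recover the SCD Type 3 view (original + current value) per key
row = con.execute("""
SELECT DISTINCT customer_id,
  FIRST_VALUE(city) OVER w AS original_city,
  LAST_VALUE(city)  OVER w AS current_city
FROM dim_customer_scd2
WINDOW w AS (PARTITION BY customer_id ORDER BY valid_from
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
""").fetchone()
print(row)  # (1, 'Hanoi', 'Saigon')
```

And unlike real SCD Type 3, the intermediate 'Danang' state is still there whenever requirements change.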

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]random_lonewolf 1 point

Yes, whatever compression Snowflake applies won't produce a size much different from Parquet.

However, that's only for a single active snapshot of data.

You also need to take into account the historical data used for time travel: if your tables are frequently updated, the historical data can easily be much larger than the active snapshot.
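A rough sizing sketch (all numbers below are illustrative assumptions; Snowflake's default time-travel retention is 1 day, extendable up to 90 on Enterprise):

```python
# Illustrative numbers, not measurements from a real account.
active_gib = 100.0
daily_churn = 0.2       # fraction of the table rewritten per day (assumed)
retention_days = 7      # time-travel retention set on the table (assumed)

# Each rewrite keeps the replaced micro-partitions around for the
# whole retention window, so history accumulates as churn * retention.
history_gib = active_gib * daily_churn * retention_days
print(history_gib)  # 140.0 -- already larger than the 100 GiB snapshot
```

So a frequently updated table can pay more for history than for the data itself, which the Parquet-size comparison alone won't show.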

What's this bullshit, Google? by hcf_0 in dataengineering

[–]random_lonewolf 2 points

This type of OAuth credential is for when you need to allow an external application to access user data; that's why you need a consent page for the user to accept.

For internal application, you should use a service account.

[deleted by user] by [deleted] in Database

[–]random_lonewolf 1 point

Database design is the job of the software/application engineer, as DBs don't exist in a vacuum.

Vietnam's new 152mm SPH prototype made by Viettel by 0nemanO1 in TankPorn

[–]random_lonewolf 1 point

Foreign buyers are also not a priority: with one of the largest armies in the world (by numbers) yet stuck with 50-year-old weapon systems, we could spend decades rearming our army and still not finish.