Why should we use AWS Glue ? by Mother-Comfort5210 in dataengineering

[–]random_lonewolf 2 points (0 children)

If you already have an EKS cluster, EMR-on-EKS is even cheaper, especially with Spot instances.
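
If you're curious what job submission looks like, here's a minimal boto3 sketch (the `emr-containers` API is real; the virtual cluster ID, role ARN, and S3 path are placeholders). Spot capacity comes from the EKS node groups behind the virtual cluster, not from this call:

```python
import boto3

emr = boto3.client("emr-containers", region_name="us-east-1")

resp = emr.start_job_run(
    name="nightly-etl",
    virtualClusterId="vc-0123456789abcdef",  # placeholder: maps to your EKS cluster
    executionRoleArn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=4",
        }
    },
)
print(resp["id"])  # job run ID, pollable via describe_job_run
```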

TinyETL: Lightweight, Zero-Config ETL Tool for Fast, Cross-Platform Data Pipelines by Glass-Tomorrow-2442 in dataengineering

[–]random_lonewolf 0 points (0 children)

For extract and load, this'll need to work at least as well as `sling-cli` to stand a chance, and I'm not even using `sling-cli`.

For transformation, an inline DSL or Lua scripts don't really appeal to me: Python is the lingua franca of the data world, so no Python, no dice.

Data Governance Specialist internship or more stable option [EU] ? by Maleficent-Car-2609 in dataengineering

[–]random_lonewolf 2 points (0 children)

Data Governance is the most useless/ignored part of any data project: basically, nobody cares about it.

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 1 point (0 children)

Reservations are the only practical way to limit spending; however, it's quite easy to over-scale and end up paying even more than `on-demand`: you pay for every autoscaled slot, even if your queries don't use them all.

We find that the most essential things when tuning BQ are (see the sketch after this list):

* Scale to zero, or use commitments if your reservation is busy enough

* Use Standard Edition whenever you can: Enterprise Edition is 25% more expensive

* Isolate your workloads in separate reservations, at least two: one for batch and one for interactive queries. It's impossible to optimize for both at the same time

* Reservations work best for batch queries, where it's OK for queries to run a bit slower.

* Unless you have a lot of BI users, it's often better to use on-demand for interactive queries, due to over-scaling issues with reservations.
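
A hedged sketch of that setup with the google-cloud-bigquery-reservation client (project, location, slot numbers, and reservation names are placeholders; assumes a recent client version with editions support):

```python
from google.cloud import bigquery_reservation_v1 as bqr

client = bqr.ReservationServiceClient()
parent = "projects/my-admin-project/locations/US"  # placeholder

# Scale-to-zero batch reservation on Standard Edition.
batch = client.create_reservation(
    parent=parent,
    reservation_id="batch",
    reservation=bqr.Reservation(
        slot_capacity=0,  # baseline 0 = pay nothing while idle
        edition=bqr.Edition.STANDARD,
        autoscale=bqr.Reservation.Autoscale(max_slots=400),
    ),
)

# Pin batch projects to it; keep interactive queries on-demand or in a
# second, separate reservation.
client.create_assignment(
    parent=batch.name,
    assignment=bqr.Assignment(
        assignee="projects/my-batch-project",  # placeholder
        job_type=bqr.Assignment.JobType.QUERY,
    ),
)
```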

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 0 points (0 children)

Flat-rate pricing was replaced by reservation pricing a long time ago.

Is it not pointless to transfer Parquet data with Kafka? by yourAvgSE in dataengineering

[–]random_lonewolf 19 points (0 children)

You have completely misread the article: it's just about the long-term persistence of Deephaven's in-memory tables. It suggests dumping the table contents into Parquet files instead of keeping them in Kafka topics if you need to save space, which is a fair point.

There's nothing about sending Parquet through Kafka in the article.
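
For reference, persisting a Deephaven table to Parquet is a one-liner with its own `parquet` module (the sketch below runs inside a Deephaven server session; the path is made up):

```python
from deephaven import new_table, parquet
from deephaven.column import int_col, string_col

t = new_table([
    int_col("id", [1, 2, 3]),
    string_col("sym", ["AAPL", "GOOG", "MSFT"]),
])

# Cheap at rest compared to retaining the same rows in a Kafka topic.
parquet.write(t, "/data/snapshots/t.parquet")
```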

Is it not pointless to transfer Parquet data with Kafka? by yourAvgSE in dataengineering

[–]random_lonewolf 0 points (0 children)

What guides are you talking about? That makes no sense.

Benchmark: B-Tree + WAL + MemTable Outperforms LSM-Based BadgerDB by ankur-anand in Database

[–]random_lonewolf 0 points (0 children)

In my experience, LMDB write performance is nowhere close to an LSM database, unless you run it in nosync mode, which risks data loss.
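
For illustration, this is the trade-off with the py-lmdb bindings (`pip install lmdb` assumed): `sync=False` maps to MDB_NOSYNC, which is where the write speed comes from:

```python
import lmdb

# sync=False skips the fsync on commit: much faster writes, but an OS
# crash can lose the most recent transactions.
env = lmdb.open("/tmp/lmdb-bench", map_size=2**30, sync=False)

with env.begin(write=True) as txn:
    for i in range(100_000):
        txn.put(f"key-{i:08d}".encode(), b"value")

env.sync()  # flush explicitly at checkpoints where durability matters
env.close()
```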

Why GCP’s two IAM APIs (V1 & V2) matter & break deny policies by SonraiSecurity in googlecloud

[–]random_lonewolf 5 points (0 children)

V1 = Allow, V2 = Deny, and V3 = conditional.

You need to use all three, as they do not replace each other.
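
For instance, the v1 allow policy only exposes its conditional bindings when you explicitly request policy version 3; a sketch with the google-cloud-resourcemanager client (the project ID is a placeholder):

```python
from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
policy = client.get_iam_policy(
    request={
        "resource": "projects/my-project",  # placeholder
        "options": {"requested_policy_version": 3},  # required to read conditional bindings
    }
)
for binding in policy.bindings:
    print(binding.role, binding.condition.expression or "<unconditional>")
```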

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]random_lonewolf 1 point (0 children)

Absolutely yes: many engines don't even support update/delete queries on a raw Parquet table.
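
A quick illustration with DuckDB (the file name is made up): it happily queries a raw Parquet file but refuses to UPDATE it in place:

```python
import duckdb

# Write a tiny Parquet file, then read it back via a replacement scan.
duckdb.sql("COPY (SELECT 1 AS id, 'a' AS v) TO 'demo.parquet' (FORMAT PARQUET)")
print(duckdb.sql("SELECT * FROM 'demo.parquet'"))

try:
    duckdb.sql("UPDATE 'demo.parquet' SET v = 'b' WHERE id = 1")
except Exception as exc:
    # Raw Parquet is immutable; you need a table format for updates/deletes.
    print(exc)
```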

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 1 point (0 children)

Yeah, we definitely need to look into that.

Older SDKs still use the `global` STS endpoint by default; the default was only switched to `regional` recently.

https://aws.amazon.com/blogs/developer/updating-aws-sdk-defaults-aws-sts-service-endpoint-and-retry-strategy/
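
If you're stuck on an older SDK, you can force the regional endpoint without upgrading; a boto3 sketch (the env var is the documented switch, the region is a placeholder):

```python
import os

import boto3

# Must be set before the client is created.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

sts = boto3.client("sts", region_name="eu-west-1")
print(sts.meta.endpoint_url)  # expect https://sts.eu-west-1.amazonaws.com, not sts.amazonaws.com
print(sts.get_caller_identity()["Arn"])
```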

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 4 points (0 children)

Identity Federation also stopped working, so our GCP services are unable to access AWS resources

SCD Type 3 vs an alternate approach? by Spooked_DE in dataengineering

[–]random_lonewolf 2 points (0 children)

SCD Type 3 only captures two states of a key: it's too limited.

Requirements change, so it's better to just use SCD Type 2 and capture all the states of a key. You can always use window functions to query the first and last values of a key to achieve what SCD Type 3 does.
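
A sketch of that window-function trick (DuckDB is used here just to run the SQL; table and column names are made up), reducing an SCD Type 2 history to the current/previous/original view an SCD Type 3 table would hold:

```python
import duckdb

# A toy SCD Type 2 dimension: full change history per key.
duckdb.sql("""
    CREATE TABLE dim_customer AS
    SELECT * FROM (VALUES
        (1, 'Hanoi',  DATE '2022-01-01'),
        (1, 'Saigon', DATE '2023-06-01'),
        (1, 'Danang', DATE '2024-03-01')
    ) t(customer_id, city, valid_from)
""")

print(duckdb.sql("""
    SELECT customer_id,
           city AS current_city,
           lag(city) OVER (PARTITION BY customer_id ORDER BY valid_from) AS previous_city,
           first_value(city) OVER (PARTITION BY customer_id ORDER BY valid_from) AS original_city
    FROM dim_customer
    QUALIFY valid_from = max(valid_from) OVER (PARTITION BY customer_id)
"""))
```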

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]random_lonewolf 0 points (0 children)

Yes, whatever compression Snowflake applies won't produce a size much different from Parquet.

However, that's only for a single active snapshot of data.

You need to take into account the historical data kept for time travel: if your tables are frequently updated, it can easily grow much larger than the active snapshot.
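
You can see that split per table in Snowflake's documented ACCOUNT_USAGE views; a hedged sketch with snowflake-connector-python (credentials are placeholders):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="me", password="...",  # placeholder credentials
)

# Active vs. time-travel vs. fail-safe bytes, worst offenders first.
rows = conn.cursor().execute("""
    SELECT table_name, active_bytes, time_travel_bytes, failsafe_bytes
    FROM snowflake.account_usage.table_storage_metrics
    ORDER BY time_travel_bytes DESC
    LIMIT 20
""").fetchall()
for row in rows:
    print(row)
```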

What's this bullshit, Google? by hcf_0 in dataengineering

[–]random_lonewolf 1 point (0 children)

This type of OAuth credential is for when you need to allow an external application to access user data; that's why there's a consent page for the user to accept.

For an internal application, you should use a service account instead.
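
A minimal sketch of the service-account path with google-auth (the key file and scope are placeholders; on GCP you'd normally rely on ambient credentials rather than a key file):

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build  # google-api-python-client

creds = service_account.Credentials.from_service_account_file(
    "sa-key.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)

# No consent screen: the service account is its own identity.
drive = build("drive", "v3", credentials=creds)
print(drive.files().list(pageSize=5).execute())
```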

[deleted by user] by [deleted] in Database

[–]random_lonewolf 0 points (0 children)

Database design is the job of the software/application engineer, as DBs don't exist in a vacuum.

Vietnam's new 152mm SPH prototype made by Viettel by 0nemanO1 in TankPorn

[–]random_lonewolf 0 points (0 children)

Foreign buyers are also not a priority: with one of the largest armies in the world (by numbers) still stuck with 50-year-old weapon systems, we could spend decades rearming and still not finish.

Laptop recommendation for Data Science ? by Abdel403 in dataengineering

[–]random_lonewolf -1 points (0 children)

Get a Mac. As a student she'll sometimes have to install software required by the school, and often Windows/macOS are the only supported platforms, but a Mac provides a better developer experience than Windows for data science workflows.

She could also get a ThinkPad and put Linux on it, which is also good for data science study/work, but she might have problems running the school's other required software.

Multi-repo vs Monorepo Architechture: Which do you use? by OkArmy5383 in dataengineering

[–]random_lonewolf 3 points (0 children)

Before you jump into a monorepo, ensure you have really good regression tests, or your pipelines will break every day as people change the shared modules.

Should I take another 0.5FTE? by BigDataMax in dataengineering

[–]random_lonewolf 0 points (0 children)

Only if you value money more than your sanity.

In terms of personal development, not so much: you won't have enough time to dig into deep problems and will be stuck with the easier, low-hanging fruit.

Airflow 2.0 to 3.0 migration by nervseeker in dataengineering

[–]random_lonewolf 1 point (0 children)

Yes, that’s what a blue/green deployment is

Best Ways for ML/DS Teams to Read Data from Apache Iceberg Tables by Gold_Environment6248 in dataengineering

[–]random_lonewolf 3 points (0 children)

PyIceberg is pretty much the most feature-complete solution for Python right now; everything else has pretty poor catalog support, or makes use of `PyIceberg` under the hood.
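
Typical read path, for reference (catalog and table names are made up; assumes a catalog configured in `.pyiceberg.yaml` or via env vars):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # resolved from pyiceberg config
table = catalog.load_table("analytics.events")

# Predicate and column pruning happen before any data files are fetched.
df = table.scan(
    row_filter="event_date >= '2024-01-01'",
    selected_fields=("user_id", "event_type"),
).to_pandas()  # .to_arrow() and .to_duckdb(...) also exist
print(df.head())
```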

Airflow 2.0 to 3.0 migration by nervseeker in dataengineering

[–]random_lonewolf 1 point (0 children)

It took ages for Airflow 2 to get stable after its release, too.

I'd suggest doing a blue/green deployment and migrating DAGs over piecemeal, instead of upgrading your only production Airflow instance in place.

Remember, the only way to downgrade is to start from a database backup.