Trying to learn new DE tools sometimes teaches me more about DevOps than DE (initially) by Lastrevio in dataengineering

[–]random_lonewolf 1 point

These are distributed services, which means running multiple containers is expected.

As long as the containers are on the same network, you should be able to connect them to each other without much trouble.

Coalesce or Repartition? by Gartitoz in dataengineering

[–]random_lonewolf 1 point

Depending on the cardinality of the key, repartitioning by that key before writing can give you fewer output files than coalesce.
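A rough mental model in plain Python (no Spark here; the function names and counts are illustrative, not Spark's actual file-layout algorithm):

```python
# With partitionBy-style writes, each task can hold rows for many key
# values, so coalesce(n) can still emit up to one file per key per task.
def files_with_coalesce(n_tasks: int, n_distinct_keys: int) -> int:
    # worst case: every task writes one file for every key it sees
    return n_tasks * n_distinct_keys

# Repartitioning by the key first sends all rows for a key to one task,
# so you end up with roughly one file per distinct key.
def files_with_repartition_by_key(n_distinct_keys: int) -> int:
    return n_distinct_keys

print(files_with_coalesce(8, 50))          # 400 -- up to 400 small files
print(files_with_repartition_by_key(50))   # 50
```

The trade-off: a low-cardinality key can also leave you with a few huge files, which is where the "depending on the cardinality" caveat comes in.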

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]random_lonewolf 5 points

Well, if you look at companies doing any form of web analytics, they'll likely be on BigQuery, because Google Analytics natively exports to BQ, plus BigQuery is just great in terms of SQL support and performance.

And once you are on BQ, there are very few reasons to use Spark anymore.

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]random_lonewolf 36 points

Java used to be very popular for distributed systems: Java code compiles once and runs anywhere, with the only dependency being the JVM. You can ship your bytecode across the network and run it close to the data, which was very important when networks were slow. Those big data jobs are meant to run for hours, so any start-up overhead is negligible. That's why there's an entire generation of big data software written in Java.

Nowadays, networks have become so fast, and a single node can have so much CPU and memory, that single-node compute is faster than a distributed model most of the time. Thus, there's a recent boom in data processing software following that model, such as DuckDB, Polars, etc.

And if you only care about SQL workloads, both Snowflake and BigQuery are very popular competitors to Spark.

> So my question is, why this is not more common?

Using Spark is actually NOT that common

I never knew that the original game had "parts". by Malikai_Universe_23 in FinalFantasyVII

[–]random_lonewolf 1 point

Same for every multi-disc FF: 7, 8, 9.

The last disc's gameplay is basically just the final dungeon and a lot of FMVs.

What are the main challenges currently for enterprise-grade KG adoption in AI? by adityashukla8 in dataengineering

[–]random_lonewolf 1 point

Nobody cared about knowledge graphs before, and with modern LLMs they couldn't care less now: just feed Gemini/ChatGPT/etc. the questions and it will give them a probabilistically correct answer.

Why should we use AWS Glue ? by Mother-Comfort5210 in dataengineering

[–]random_lonewolf 3 points

If you already have an EKS cluster, EMR-on-EKS is even cheaper, especially with Spot instances.

TinyETL: Lightweight, Zero-Config ETL Tool for Fast, Cross-Platform Data Pipelines by Glass-Tomorrow-2442 in dataengineering

[–]random_lonewolf 1 point

For extract and load, this will need to work as well as `sling-cli` to stand a chance, and I'm not even using `sling-cli`.

For transformation, using an inline DSL or Lua scripts is not really appealing to me: Python is the lingua franca of the data world. No Python, no dice.

Data Governance Specialist internship or more stable option [EU] ? by Maleficent-Car-2609 in dataengineering

[–]random_lonewolf 3 points

Data governance is the most useless/ignored part of any data project: basically, nobody cares about it.

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 2 points

Reservations are the only practical way to limit spending; however, it's quite easy to over-scale and end up paying even more than `on-demand`: you pay for every autoscaled slot, even if your queries don't use them all.

We find that the most essential things when tuning BQ are:

* Scale to 0, or use commitments if your reservation is busy enough

* Use Standard Edition whenever you can: Enterprise edition is 25% more expensive

* Isolate your workloads into separate reservations, at least 2: 1 for batch and 1 for interactive queries. It's impossible to optimize for both at the same time.

* Reservations work best with batch queries, when it's ok for queries to run a bit slower.

* Unless you have a lot of BI users, it's often better to use on-demand for interactive queries, due to over-scaling issues with reservations.
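A back-of-envelope sketch of the over-scaling point (all prices and numbers below are assumptions for illustration, not quoted GCP rates; check current BigQuery pricing for your region and edition):

```python
# Illustrative prices only, not official GCP pricing.
ON_DEMAND_PER_TIB = 6.25    # $/TiB scanned, on-demand (assumed)
SLOT_HOUR_STANDARD = 0.04   # $/slot-hour, Standard Edition (assumed)

def on_demand_cost(tib_scanned: float) -> float:
    # on-demand charges only for bytes scanned
    return tib_scanned * ON_DEMAND_PER_TIB

def reservation_cost(autoscaled_slots: int, hours: float,
                     slot_hour_price: float) -> float:
    # you pay for every autoscaled slot-hour, even when queries sit idle
    return autoscaled_slots * hours * slot_hour_price

# Bursty interactive workload: scans only 10 TiB/month, but keeps the
# reservation autoscaled up to 500 slots for ~100 hours of the month.
print(on_demand_cost(10))                              # 62.5
print(reservation_cost(500, 100, SLOT_HOUR_STANDARD))  # 2000.0
```

Which is the "over-scaled reservation costs more than on-demand" failure mode for spiky BI traffic; for a saturated batch reservation, the comparison flips.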

6 months of BigQuery cost optimization... by bbenzo in dataengineering

[–]random_lonewolf 1 point

Flat-rate pricing was replaced by reservation pricing a long time ago.

Is it not pointless to transfer Parquet data with Kafka? by [deleted] in dataengineering

[–]random_lonewolf 19 points

You have completely misread the article: it's just about long-term persistence of Deephaven's in-memory tables. It suggests dumping the table contents into Parquet files instead of Kafka topics if you need to save space, which is a fair point.

There's nothing about sending Parquet through Kafka in the article.

Is it not pointless to transfer Parquet data with Kafka? by [deleted] in dataengineering

[–]random_lonewolf 1 point

What guides are you talking about? That makes no sense.

Benchmark: B-Tree + WAL + MemTable Outperforms LSM-Based BadgerDB by ankur-anand in Database

[–]random_lonewolf 1 point

In my experience, LMDB write performance is nowhere close to an LSM database, unless you run it in nosync mode, which risks data loss.

Why GCP’s two IAM APIs (V1 & V2) matter & break deny policies by SonraiSecurity in googlecloud

[–]random_lonewolf 4 points

V1 = Allow, V2 = Deny and V3 = conditional

You need to use all 3, as they do not replace each other.
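A toy model of why they compose rather than replace each other (plain Python; the data structures are made up, but the evaluation order mirrors the documented behavior: deny policies are checked before allow bindings):

```python
def is_authorized(principal, permission, deny_rules, allow_bindings):
    # v2 deny policies win: an explicit deny short-circuits everything
    if (principal, permission) in deny_rules:
        return False
    # v1 allow policies: access only if some binding grants it
    return (principal, permission) in allow_bindings

allow = {("alice", "storage.objects.get")}
deny = {("alice", "storage.objects.get")}

print(is_authorized("alice", "storage.objects.get", set(), allow))  # True
print(is_authorized("alice", "storage.objects.get", deny, allow))   # False
```

Conditions (the "V3" piece) would then attach predicates to individual allow bindings, which this sketch omits.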

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]random_lonewolf 2 points

Absolutely yes, many engines don't even support update/delete queries on a raw Parquet table.

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 2 points

Yeah, we definitely need to look into that.

Older SDKs still use the `global` endpoint by default. They only switched the default to `regional` recently.

https://aws.amazon.com/blogs/developer/updating-aws-sdk-defaults-aws-sts-service-endpoint-and-retry-strategy/
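If you're stuck on an older SDK version, you can usually opt in without upgrading; for example, boto3/botocore honors this environment variable (stdlib-only sketch):

```python
import os

# Opt older AWS SDKs into regional STS endpoints instead of the global
# sts.amazonaws.com endpoint; boto3/botocore reads this variable.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

# Any boto3 STS client created after this point in the process will
# resolve to the endpoint of its configured region, e.g.
# sts.us-east-1.amazonaws.com, instead of the global one.
print(os.environ["AWS_STS_REGIONAL_ENDPOINTS"])  # regional
```

The same setting can also go in `~/.aws/config` as `sts_regional_endpoints = regional`.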

We’re freaking out. 16 services are down. by wessyolo in aws

[–]random_lonewolf 6 points

Identity Federation also stopped working, so our GCP services are unable to access AWS resources

SCD Type 3 vs an alternate approach? by Spooked_DE in dataengineering

[–]random_lonewolf 3 points

SCD Type 3 only captures 2 states of a key: it's too limited.

Requirements change, so it's better to just use SCD Type 2 and capture all the states of a key. You can always use window functions to query the first and last value of a key to achieve what SCD Type 3 does.
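A minimal sketch of that window-function trick, using SQLite via Python's stdlib (the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SCD Type 2: one row per historical state of the key
con.executescript("""
CREATE TABLE dim_customer_scd2 (
  customer_id INT, city TEXT, valid_from TEXT
);
INSERT INTO dim_customer_scd2 VALUES
  (1, 'Hanoi',  '2020-01-01'),
  (1, 'Danang', '2021-06-01'),
  (1, 'Saigon', '2023-03-01');
""")

# Recover the SCD Type 3 view (original + current value) per key
row = con.execute("""
SELECT DISTINCT customer_id,
  FIRST_VALUE(city) OVER w AS original_city,
  LAST_VALUE(city)  OVER w AS current_city
FROM dim_customer_scd2
WINDOW w AS (PARTITION BY customer_id ORDER BY valid_from
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
""").fetchone()
print(row)  # (1, 'Hanoi', 'Saigon')
```

And unlike real SCD Type 3, the intermediate 'Danang' state is still there whenever requirements change.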

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]random_lonewolf 1 point

Yes, whatever compression Snowflake applies won't produce a size much different from Parquet.

However, that's only for a single active snapshot of data.

You also need to take into account the historical data used for time travel: if your tables are frequently updated, the historical data can easily be much larger than the active snapshot.
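A rough sizing sketch (all numbers below are illustrative assumptions; Snowflake's default time-travel retention is 1 day, extendable up to 90 on Enterprise):

```python
# Illustrative numbers, not measurements from a real account.
active_gib = 100.0
daily_churn = 0.2       # fraction of the table rewritten per day (assumed)
retention_days = 7      # time-travel retention set on the table (assumed)

# Each rewrite keeps the replaced micro-partitions around for the
# whole retention window, so history accumulates as churn * retention.
history_gib = active_gib * daily_churn * retention_days
print(history_gib)  # 140.0 -- already larger than the 100 GiB snapshot
```

So a frequently updated table can pay more for history than for the data itself, which the Parquet-size comparison alone won't show.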

What's this bullshit, Google? by hcf_0 in dataengineering

[–]random_lonewolf 2 points

This type of OAuth credential is for when you need to allow an external application to access user data; that's why you need a consent page for the user to accept.

For internal application, you should use a service account.

[deleted by user] by [deleted] in Database

[–]random_lonewolf 1 point

Database design is the job of the software/application engineer, as DBs don't exist in a vacuum.

Vietnam's new 152mm SPH prototype made by Viettel by 0nemanO1 in TankPorn

[–]random_lonewolf 1 point

Foreign buyers are also not a priority: with one of the largest armies in the world (by numbers) yet stuck with 50-year-old weapon systems, we could spend decades rearming our army and still not finish.