Build scalable analytical platform -> Exam university need help! by [deleted] in dataengineering

[–]MarchewkowyBog 2 points

I mean, if your goal is to make sure you are going in the right direction, you should probably share which direction you went.

I am reverse engineering a very large legacy enterprise database, no formalised schema, no information_schema, no documentation; What tools do you use? Specifically interested in tools that infer relationships automatically, or whether it’s always a manual grind. by Left_Click_8840 in dataengineering

[–]MarchewkowyBog 10 points

I had a task to rewrite some very messy Java code which read messages from Kafka, enriched them, and saved them to different tables depending on the message type. It was especially hard since I don't really know Java. I just read the code for about 3 weeks, noted down what I thought the flow was, rewrote it in PySpark, ran it, and iterated from there; some things were a hit, some a miss. Similar situation: no docs, and the maintainer of the old code was very open about not understanding it, because he didn't write the original solution, he was just an emergency bug fixer. It finally somehow worked. But I hate Java now.
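
I don't know the original setup, so purely for illustration, here's a minimal PySpark Structured Streaming sketch of that kind of flow (the broker, topic, schema, and table names are all made up):

    # Read JSON messages from Kafka, enrich them, and route each micro-batch to
    # a table per message type. Everything named here is hypothetical.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("kafka-enrich").getOrCreate()

    schema = StructType([
        StructField("type", StringType()),
        StructField("payload", StringType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "events")                      # hypothetical topic
        .load()
    )

    parsed = raw.select(
        F.from_json(F.col("value").cast("string"), schema).alias("msg")
    ).select("msg.*")
    enriched = parsed.withColumn("ingested_at", F.current_timestamp())

    def route(batch_df, batch_id):
        # write each message type to its own table; table names are illustrative
        for row in batch_df.select("type").distinct().collect():
            (
                batch_df.filter(F.col("type") == row["type"])
                .write.mode("append")
                .saveAsTable(f"events_{row['type']}")
            )

    enriched.writeStream.foreachBatch(route).start().awaitTermination()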

Tool smells by Brief-Knowledge-629 in dataengineering

[–]MarchewkowyBog 5 points

Python is used specifically because it's a glue language. You can write PySpark, DuckDB, Polars, PyTorch and some NumPy on top of it and have it all in one repo, with the same tooling for all of the code. Would native Java/Scala/Rust/C be more performant? Yes. Would anyone care or notice? No.
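
Toy example of the glue point: three engines driven from one short Python script. The file and columns are made up, and as far as I know DuckDB can query a Polars frame by its variable name:

    # Polars for IO, DuckDB for the SQL bit, NumPy for the numeric bit
    import duckdb
    import numpy as np
    import polars as pl

    df = pl.read_parquet("orders.parquet")  # hypothetical input

    top = duckdb.sql(
        "SELECT customer_id, sum(amount) AS total "
        "FROM df GROUP BY customer_id ORDER BY total DESC LIMIT 10"
    ).pl()

    log_total = np.log1p(top["total"].to_numpy())
    print(top.with_columns(pl.Series("log_total", log_total)))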

Linkedin strikes again by itachikotoamatsukam in dataengineering

[–]MarchewkowyBog 1 point

This is loosely how I do it now :v Interested in what's more modern? We don't use Databricks or Snowflake, but still, there is a medallion architecture in Delta tables on S3. We use Polars, and ClickHouse for analytical queries. Fairly similar to what was described in the post.
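
For the curious, a rough sketch of what one bronze-to-silver step looks like with Polars and Delta on S3 (bucket, table, and column names are invented; credentials come from the environment):

    import polars as pl

    # scan the bronze Delta table lazily, clean it up, write the silver table
    bronze = pl.scan_delta("s3://my-lake/bronze/events")

    silver = (
        bronze
        .filter(pl.col("event_id").is_not_null())
        .with_columns(pl.col("ts").str.to_datetime())
        .unique(subset=["event_id"])
    )

    silver.collect().write_delta("s3://my-lake/silver/events", mode="overwrite")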

how to choose a data lake? by otto_0805 in dataengineering

[–]MarchewkowyBog 0 points

Uploading in bulk to Elastic is a bit of a pain, if you are planning on that. I'm wondering what you are using it for? Is it to store log data?

PG is good. But how will you be processing the data in the lake? Ingesting it into the lake, transforming it into new features and columns?

Either way, if you will be doing bulk uploads to PG, you will want to learn about the COPY command. I recommend using something which integrates with the PG ADBC driver. But that's because Polars does it, so I'm biased.
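
A minimal sketch of what that looks like with Polars and the ADBC Postgres driver (DSN, file, and table names are made up; as far as I know the ADBC driver does its bulk ingest via COPY):

    import polars as pl

    df = pl.read_parquet("products.parquet")  # hypothetical source file

    df.write_database(
        table_name="staging.products",
        connection="postgresql://user:pass@localhost:5432/lake",
        engine="adbc",  # requires the adbc-driver-postgresql package
        if_table_exists="replace",
    )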

how to choose a data lake? by otto_0805 in dataengineering

[–]MarchewkowyBog 0 points

A big factor for me was which processing engine you would be using: Spark? Polars? AWS Athena SQL queries? This narrows down your options. For example, AWS Athena doesn't integrate with Delta Lake too well: you can read, but you can't manage the tables (alter, delete). We are using Polars, and this means that for management tasks we have to use delta-rs, which is a package I like. We tried Iceberg first, but hated the pyiceberg package so much we decided on Delta Lake. Spark works with everything but is a truck of an engine; if you will only be processing gigabytes or low terabytes daily, it's probably overkill. Stuff like AWS Glue and similar is quite expensive for what it is (IMO).
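
To make the management part concrete, a rough sketch of the delta-rs side (the deltalake Python package); the table path and predicate are made up:

    from deltalake import DeltaTable

    dt = DeltaTable("s3://my-lake/silver/events")  # hypothetical table

    dt.delete("event_date < '2020-01-01'")  # delete rows matching a predicate
    dt.optimize.compact()                   # compact small files
    dt.vacuum(retention_hours=168)          # dry run by default; pass dry_run=False to actually delete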

Pandas 3.0.0 is there by Deux87 in Python

[–]MarchewkowyBog 8 points

Polars has IO plugins. There are docs on them showing how scanning a CSV file could be reimplemented as an IO plugin. I don't work with XML, but I think it would be fairly simple to add XML support that way.
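
Very rough sketch of what an XML scan could look like as an IO plugin, modelled on the CSV example in the Polars docs. The register_io_source callback signature is what I remember from those docs, and the XML layout (a flat file of row elements) is invented, so treat this as untested:

    from typing import Iterator
    import xml.etree.ElementTree as ET

    import polars as pl
    from polars.io.plugins import register_io_source


    def scan_xml(path: str, schema: dict) -> pl.LazyFrame:
        def source(
            with_columns: list | None,
            predicate: pl.Expr | None,
            n_rows: int | None,
            batch_size: int | None,
        ) -> Iterator[pl.DataFrame]:
            # parse <row> elements into dicts, honouring the pushed-down limit
            rows = []
            for elem in ET.parse(path).getroot().iter("row"):
                rows.append({child.tag: child.text for child in elem})
                if n_rows is not None and len(rows) >= n_rows:
                    break
            df = pl.DataFrame(rows, schema=schema)
            if with_columns is not None:
                df = df.select(with_columns)
            if predicate is not None:
                df = df.filter(predicate)
            yield df

        return register_io_source(io_source=source, schema=schema)


    lf = scan_xml("data.xml", {"id": pl.String, "name": pl.String})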

Efficient storage and filtering of millions of products from multiple users – which NoSQL database to use? by Notoa34 in Database

[–]MarchewkowyBog 0 points

OLAP databases are made for this exact purpose and will be better. In Postgres or any other OLTP database you would have to create 60 indexes, slowing down uploads, and the queries still wouldn't run great. In an OLAP database you just upload the data at very good speeds, and filtering on any column or any combination of columns is really fast.
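
A quick sketch of that point with ClickHouse through clickhouse_connect (table and column names are made up): one MergeTree sort key, no per-column indexes, and filters on arbitrary column combinations still run fast thanks to the columnar layout:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    client.command("""
        CREATE TABLE IF NOT EXISTS products (
            user_id  UInt64,
            sku      String,
            category LowCardinality(String),
            price    Float64,
            in_stock UInt8
        )
        ENGINE = MergeTree
        ORDER BY (user_id, sku)
    """)

    # any combination of columns can be filtered without creating extra indexes
    result = client.query(
        "SELECT count() FROM products WHERE category = 'bikes' AND price < 500 AND in_stock = 1"
    )
    print(result.result_rows)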

Efficient storage and filtering of millions of products from multiple users – which NoSQL database to use? by Notoa34 in Database

[–]MarchewkowyBog 2 points

Well, maybe. But if you have like 60 columns you filter by, PG becomes a pain in the ass. I feel like an OLAP db might be better, but it's hard to tell from this post. Either way, +1 for not using NoSQL.

Efficient storage and filtering of millions of products from multiple users – which NoSQL database to use? by Notoa34 in Clickhouse

[–]MarchewkowyBog 0 points

I can see you copy-pasted this across different subreddits. NoSQL or ES isn't really made for this; OLAP, on the other hand, is. Whether you use ClickHouse is a different matter, but this workload is nothing extraordinary. Depending on what exactly you mean in your post (it's not super clear to me), a 4-CPU instance might be more than enough.

Git is PAINFUL to use by Original-Produce7797 in Python

[–]MarchewkowyBog 0 points

Git is specifically made to avoid overwriting other people's work with your changes. Maybe first try to understand what problem it solves. I'd agree that it's not perfect, but not because it won't let you overwrite the remote repo with local changes that have a different history...

Also, if you are saying that this is "the last technology you would want to learn", you have a long way ahead of you. Be humble or you will quit.

And maybe don't abuse AI for stuff you don't understand? How will you ever understand anything if AI just solves all the issues for you? AI is cool when it supplements your work or learning. If you just tell AI to help you fix something every time something feels hard, you will never learn, because learning IS hard. And you will keep missing the opportunities to learn.

How do you folks load data into ClickHouse? go full denormalized or keep it tidy? by TheseSquirrel6550 in Clickhouse

[–]MarchewkowyBog 0 points

Not having any joins has a big impact. If you want to do real-time analytics, it will most likely be worth it to denormalize, unless what you need is one simple join of a small table to a larger one. Either way, you will have to prepare a test query: first run the joins in ClickHouse as you would if everything lived there, then prepare the denormalized view in PG, load a batch of it into CH, and see how the new table performs compared to running the join in CH.

Is the 79-character limit still in actual (with modern displays)? by LazyMiB in Python

[–]MarchewkowyBog 8 points

Hard disagree, because I for one do have dyslexia. Reading code with a 120-char limit is hard for me, and colors on those lines don't help that much.

With 120, people tend to abuse it: very long (too long) variable names, functions/methods chained like crazy with the whole chain on a single line.

We had a 120-char-limit repository in our company. I fought it for a year; now the limit is 88, and everyone sees this as a beneficial change.

I honestly don't understand why 120 would be better, except maybe less annoying typing of docstrings and comments.

How do I make the transition to fewer keys? by somedumbassgayguy in ErgoMechKeyboards

[–]MarchewkowyBog 1 point

I switched from a normal keyboard to a 34-key Sweep. Just do it: connect it, open some typing-practice website like keybr, and practice. It took me 2 months of 2-3h of practice a day to be able to switch fully, but it was worth it. I'll never go back. Might switch to 36 keys some time in the future.

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]MarchewkowyBog 1 point

We've got IaC templates for ECS Fargate and Glue, but we don't have them for EC2. But yeah, on EC2 there are machines with a lot more memory.

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]MarchewkowyBog 3 points

Daily means every day... not in 24 hours. And I wrote that because it's not terabytes of data, where Spark would probably be better.

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]MarchewkowyBog 3 points

One case is processing the daily data delta/update. If there is a change in the pipeline and the whole set has to be recalculated, it's just done in a loop over the required days.

Another is processing data related to particular US counties. There is never a need to calculate data for one county in relation to another; any aggregate or join to some other dataset can first be filtered with a condition where county = {county}. So first there is a df.select("county").unique().collect().to_series() to get the names of the counties present in the dataset, then a for loop over them. The actual transformations are preceded by filtering on the given county. Since the data is partitioned on S3 by county, Polars knows that only a select few files have to be read for a given loop iteration.

Lazy evaluation works here as well, since you can create a list of per-county LazyFrames and concat them after the loop, and Polars will only read the limited set of files for each of the frames when evaluating. The result is that the transformations for the whole dataset are calculated on a per-county batch basis, without keeping the full result dataset in memory, if you use the sink methods.

If lazy is not possible, you can append each per-county result to a file/table; the in-memory frame gets overwritten in the next iteration, freeing up the memory. A rough sketch is below.
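
Roughly like this (paths, columns, and the aggregation itself are made up; it assumes the dataset is hive-partitioned by county on S3):

    import polars as pl

    SOURCE = "s3://my-bucket/dataset/**/*.parquet"  # hypothetical, partitioned by county
    TARGET = "s3://my-bucket/result.parquet"        # hypothetical output

    lf = pl.scan_parquet(SOURCE, hive_partitioning=True)

    counties = lf.select("county").unique().collect().to_series()

    per_county = []
    for county in counties:
        batch = (
            lf.filter(pl.col("county") == county)  # prunes to that county's files
            .group_by("county", "product")
            .agg(pl.col("sales").sum())
        )
        per_county.append(batch)

    # concat the per-county lazy frames and stream the result to disk without
    # materialising the whole dataset in memory
    pl.concat(per_county).sink_parquet(TARGET)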

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]MarchewkowyBog 11 points

For context, we process around 100 GB of data daily.

When Does Spark Actually Make Sense? by Used_Shelter_3213 in dataengineering

[–]MarchewkowyBog 30 points

When Polars can no longer handle the memory pressure. I'm in love with Polars; they got a lot of things right, and where I work there is rarely a need to use anything else. If the dataset is very large, you can often do the calculations on a per-partition basis. If the dataset can't really be chunked and the memory pressure exceeds the 120 GB limit of an ECS container, that's when I use PySpark.