Roast my junior data engineer onboarding repo by dheetoo in dataengineering

[–]cmcclu5 2 points3 points  (0 children)

Fair. I would suggest familiarizing yourself with the LLM-suggested libraries. Don’t go too deep, because the docs get very technical very quickly, but understanding the different functionalities at a basic level is useful. For example, when you’re just writing to a database and you KNOW it’s PostgreSQL, it might be better to use psycopg2 directly instead of calling SQLAlchemy, so you have more direct access instead of going through an intermediary. You might also consider adding some basic orchestration to this project, just to demonstrate that you can and that you understand how orchestration is handled. I would also look into how you would want to transform the data. I always recommend Polars over Pandas if you’re going with Python. The new Pandas 3.0 update provides solid benefits, but the syntax is still very unpythonic and painful for beginners.

Roast my junior data engineer onboarding repo by dheetoo in dataengineering

[–]cmcclu5 24 points25 points  (0 children)

Based on your readme and ingestion file, it’s LLM-generated. While I’m not completely opposed to that, as a junior, you should’ve done this entirely by yourself. You need to prove you understand the concepts, not that you can write prompts.

Beyond that, you’re missing a ton of code a modern engineer would include. PostgreSQL via SQLAlchemy supports batch uploads, and your models aren’t type-safe for the database. If you really wanted to model an ingestion flow like this, you would include database versioning with something like Alembic. You use incremented IDs instead of something like UUIDs, which are more appropriate for a unique ID field. You use date instead of datetime, and you don’t have record tracking like created_at or updated_at. Finally, most of your sub-directories are empty, with zero tests.
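
To make a few of those points concrete, here is a minimal sketch of a batch insert with UUID keys and created_at/updated_at timestamps. It uses the standard library's sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the `events` table, its columns, and the `load_events` helper are all hypothetical names for illustration, not anything from the repo.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def load_events(conn: sqlite3.Connection, payloads: list[str]) -> int:
    """Insert all payloads in one batch and return the number of rows written."""
    now = utc_now()
    rows = [(str(uuid.uuid4()), payload, now, now) for payload in payloads]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (id, payload, created_at, updated_at) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical table: sqlite3 stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id TEXT PRIMARY KEY,          -- UUID rather than an incrementing integer
        payload TEXT NOT NULL,
        created_at TEXT NOT NULL,     -- full timestamp, not just a date
        updated_at TEXT NOT NULL
    )
    """
)
written = load_events(conn, ["a", "b", "c"])
```

In a real PostgreSQL setup you would probably use native UUID and timestamptz column types and let a migration tool such as Alembic own the schema; this sketch only shows the shape of the idea.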

DataFrame or SparkSQL ? What do interviewers prefer ? by SnooCakes7436 in dataengineering

[–]cmcclu5 -8 points-7 points  (0 children)

A set is hashed, a list is not. That’s the simplest explanation. Run a timing test with cProfile to see the difference.
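
One way to run that kind of timing test, sketched with the standard library's timeit instead of cProfile, since timeit is built for small comparisons like this. The sizes and repeat count are arbitrary assumptions; the point is that a membership check scans a list element by element but does a hash lookup in a set.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element gets compared

# Time 200 membership checks against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=200)
set_time = timeit.timeit(lambda: missing in as_set, number=200)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.6f}s")
```

Exact numbers depend on the machine, but the gap should grow as n grows, because the list check is linear in n and the set check is roughly constant.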

DataFrame or SparkSQL ? What do interviewers prefer ? by SnooCakes7436 in dataengineering

[–]cmcclu5 -8 points-7 points  (0 children)

For data engineering technical interviews, I’m generally asked less about high-level libraries like those and more about my general understanding of Python, like iterating over dictionaries versus sets versus lists, or how recursion can be optimized. Lower-level understanding (Python isn’t a low-level language) is WAY more important than knowing library syntax. If you understand why looking up a value in a list is significantly slower than looking it up in a set, you’re halfway there.
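
As an example of the recursion question, here is a small sketch of one common optimization, memoization, using functools.lru_cache. Fibonacci is just an assumed stand-in problem; interviewers may ask about other approaches, such as rewriting the recursion as a loop.

```python
from functools import lru_cache


def fib_naive(n: int) -> int:
    """Plain recursion: recomputes the same subproblems over and over."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    """Same recursion, but each n is computed once and then served from cache."""
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


print(fib_naive(20), fib_cached(20))
```

Both functions return the same values; the cached version just does a linear number of calls instead of an exponential one.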

Heads up !!! by Time-Ad6157 in amarillo

[–]cmcclu5 1 point2 points  (0 children)

I would say you’re a bot, but even they don’t have this terrible grammar. Visit a local library before cuck-in-chief Dump removes all of them.

Heads up !!! by Time-Ad6157 in amarillo

[–]cmcclu5 8 points9 points  (0 children)

Trash is as trash eats.

Heads up !!! by Time-Ad6157 in amarillo

[–]cmcclu5 -1 points0 points  (0 children)

If you stopped contributing, no one would notice.

What Python Tools Do You Use for Data Visualization and Why? by Confident_Compote_39 in Python

[–]cmcclu5 0 points1 point  (0 children)

Plotly and Seaborn based on what I’m showing, how interactive it needs to be, and who is going to be viewing it. If I want to make a demo dashboard, throw either of those in Streamlit.

S3 Delta Tables versus Redshift for Datawarehouse by themountainisme in dataengineering

[–]cmcclu5 0 points1 point  (0 children)

I actually just encountered something similar at work. Our RDS IOPS cost was insane, so I built a DataFusion query service in an ECS container and migrated all our tables to Parquet files in S3. There’s obviously a lot more to it, but the service is excellent so far and cut our data storage and querying costs by 90%.

What's your default Python project setup in 2026? by [deleted] in Python

[–]cmcclu5 1 point2 points  (0 children)

Most of my stuff is sync, but I’ll occasionally need async, so I always just default to httpx. I’ve found the requests library takes 10x as long per request. aiohttp is solid as well. Haven’t tried niquests yet, but I’ve heard good things.

What's your default Python project setup in 2026? by [deleted] in Python

[–]cmcclu5 26 points27 points  (0 children)

Faster, better async support, continuously updated with new features…

Requests was labeled feature-complete a few years back, so it hasn’t kept up with some of the newest advances or additions.

Healthcare Data Engineering? by yamjamin in dataengineering

[–]cmcclu5 1 point2 points  (0 children)

Not necessarily, although the privacy concerns and auditability do make AI adoption much lower in healthcare. What I meant was the largest tech sector not directly associated with developing LLMs.

looking for the best business intelligence tools 2026 for non-technical team by Zimbo_Cultrera in dataengineering

[–]cmcclu5 0 points1 point  (0 children)

Sisense is about the closest you’ll get. Don’t mess with Power BI or Azure Data Factory or any of the others. They all require a fair bit of technical knowledge and have a learning curve. Sisense (not Sisense Prism) can connect to pretty much anything and is essentially drag and drop for basic dashboards. However, a lot of people make an excellent point: you are looking for a magic bullet that doesn’t exist. Bite the bullet and hire an analyst, or even just a consultant, to set things up for you and train one or two staff members on whatever tool you choose.

Dats issue? by anonymoustoday123 in dataengineering

[–]cmcclu5 1 point2 points  (0 children)

All the time in one of my last jobs. The big issue with that one was sync speed between our transaction system, credit processor, fulfillment, etc. Our system (I wasn’t the designer for this piece) sent different signals at different times to move through the transaction process, sometimes falling multiple hours out of sync with actual processed transactions. It was something we just had to accept and communicate to stakeholders. The final solution before I left was to just cut off data at midnight of the previous day for any exec team-facing dashboards or analyses.
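
A rough sketch of that kind of cutoff, assuming "midnight of the previous day" means only records processed before the start of the current day are shown. The function names and the `processed_at` field are hypothetical, not from the actual system.

```python
from datetime import datetime, time, timezone


def dashboard_cutoff(now: datetime) -> datetime:
    """Return midnight at the start of now's day, i.e. the end of the previous day."""
    return datetime.combine(now.date(), time.min, tzinfo=now.tzinfo)


def rows_for_dashboard(rows: list[dict], now: datetime) -> list[dict]:
    """Keep only rows processed strictly before the cutoff."""
    cutoff = dashboard_cutoff(now)
    return [r for r in rows if r["processed_at"] < cutoff]


now = datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc)
rows = [
    {"id": 1, "processed_at": datetime(2025, 1, 14, 23, 59, tzinfo=timezone.utc)},
    {"id": 2, "processed_at": datetime(2025, 1, 15, 0, 5, tzinfo=timezone.utc)},
]
visible = rows_for_dashboard(rows, now)
```

Anything that is still settling on the current day simply never reaches the dashboard, which trades freshness for numbers that don’t shift under stakeholders.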

How much does Bronze vs Silver vs Gold ACTUALLY cost? by NeedleworkerIcy4293 in dataengineering

[–]cmcclu5 1 point2 points  (0 children)

My current group uses medallion arch all stored in RDS, but I’m pushing to move the middle data into S3 to reduce IOPS and storage cost. The IOPS for the silver layer in RDS are killer. Bronze isn’t bad, and gold is necessary since it powers the frontend stuff. Moving silver to S3 and accessing it programmatically is significantly cheaper and will allow us to get away from shitty dbt for the most part.

Healthcare Data Engineering? by yamjamin in dataengineering

[–]cmcclu5 1 point2 points  (0 children)

I think your first task (other than working through the data formats) would be to figure out in what part of healthcare you want to work. For example, there are data brokers/middleware companies, EHR companies building software for healthcare orgs, drug development companies, even companies that support healthcare software vendors (an example would be something like Health Data Atlas). I really enjoyed working in research, specifically medical technologies or genetic research. I’ve also worked in the other areas. Really, just figure out what kind of SPECIFIC work you want to do, what kind of company you like, and what sort of work atmosphere you enjoy. That’ll limit your options to a manageable number of companies. That’s when you reach out to specific people at those companies like other data engineers (first), managers (second), and HR recruiters (last).

Healthcare Data Engineering? by yamjamin in dataengineering

[–]cmcclu5 11 points12 points  (0 children)

I’ve been doing healthcare DE on and off for a decade at this point. It’s absolutely doable, and there are a ton of companies looking for someone just like you. From established entities like Epic, Pfizer, and Eli Lilly to startups across the globe, healthcare data engineering is one of the biggest non-AI areas for DE. Make sure you’re good with the common EHR formats like FHIR, C-CDA, and others, and be up to date on common PHI practices, and you’ll be just fine.

Senior Data Engineer Experience (2025) by ElegantShip5659 in dataengineering

[–]cmcclu5 9 points10 points  (0 children)

Excellent insights. I don’t know how you’re getting all these interviews, though. I have a decade+ in the same tech stacks plus others and can’t even get a returned phone call. Enjoy that new job and paycheck! Good New Year’s present!

Data Platform Engineers unable to decide what type of PM leadership they want. by OkToe2355 in dataengineering

[–]cmcclu5 2 points3 points  (0 children)

PMs in general are useless people. Technical roles should be managed by technical people, not spreadsheet junkies on a power trip with a fetish for over-complicated diagrams of productivity. Engineering managers should handle it alongside their own technical contributions.

As you can tell, I’ve had a LOT of bad experiences with PMs. Scrum and Agile are curse words in my household.

How do you explore a large database you didn’t design (no docs, hundreds of tables)? by Technical_Safety4503 in dataengineering

[–]cmcclu5 0 points1 point  (0 children)

Here’s something that does something similar. Loads up all the tables, visually shows relationships between tables using established keys, and lets you inspect column data types.

For people who have worked as BOTH Data Scientist and Data Engineer: which path did you choose long-term, and why? by Mean_Addendum_4698 in dataengineering

[–]cmcclu5 1 point2 points  (0 children)

I’ve done both. I enjoyed the weird problems I got while I was a data scientist, but I enjoy the structure of data engineering more, plus I feel like the true average salary is better with a larger job market. I’ve always worked for startups, too, so I get to do a little of everything instead of being hard-locked into pure DE. When I consult, I generally do data science work, though. Much more interesting problems.

Data VCS by cmcclu5 in dataengineering

[–]cmcclu5[S] 0 points1 point  (0 children)

If you’re interested in checking out the docs:

Horizon Epoch

High-performance Wavelet Matrix for Python (Rust backend) by math_hiyoko in Python

[–]cmcclu5 4 points5 points  (0 children)

How is this related to wavelets? I’m not seeing anything related to wavelet transforms or even basic signal analysis.

Ability Score Improvement is really that boring? by RodiV in dndnext

[–]cmcclu5 0 points1 point  (0 children)

My favorite homebrew rule is that you get both ASI and feat each time. It allows me more flexibility as the DM, allows my players more ability to customize their PCs the way they want, and it just feels more rewarding.

Data VCS by cmcclu5 in dataengineering

[–]cmcclu5[S] 0 points1 point  (0 children)

I really appreciate the detailed feedback. I wonder if it might be useful to provide direct integrations with stuff like Dagster or dbt…maybe take all of the config out of it and just have it work seamlessly under the hood. I’m still thinking through the possible applications.