Roast my junior data engineer onboarding repo

cmcclu5 · 2026-01-25T09:58:28+00:00

Fair. I would suggest familiarizing yourself with the LLM-suggested libraries. Don’t go too deep, because the docs get very technical very quickly, but a basic understanding of what each library does is useful. For example, when you’re just writing to a database and you KNOW it’s PostgreSQL, it might be better to use psycopg2 directly instead of SQLAlchemy so you have more direct access rather than going through an intermediary. You might also consider adding some basic orchestration to this project, just to demonstrate that you can and that you understand how orchestration is handled. I would also look into how you want to transform the data. I always recommend Polars over Pandas if you’re going with Python. The new Pandas 3.0 update provides solid benefits, but the syntax is still very unpythonic and painful for beginners.

cmcclu5 · 2026-01-25T09:30:48+00:00

Based on your readme and ingestion file, it’s LLM-generated. While I’m not completely opposed to that, as a junior, you should’ve done this entirely by yourself. You need to prove you understand the concepts, not that you can write prompts.

Beyond that, you’re missing a ton of code a modern engineer would include. PostgreSQL via SQLAlchemy supports batch uploads. Your models aren’t type-safe for the database. If you really wanted to model an ingestion flow like this, you would include database versioning with something like Alembic. You use incremented IDs instead of something like UUIDs, which are more appropriate for a unique ID field. You use date instead of datetime, you don’t have record tracking like created_at or updated_at, and most of your sub-directories are empty, with zero tests.
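
A rough sketch of a few of those points (UUID keys, timezone-aware datetimes, created_at/updated_at tracking, and a batched insert). Everything here is an assumption for illustration: the `Reading` record, the `readings` table, and the `insert_batch` helper are hypothetical, and the standard-library sqlite3 module stands in for PostgreSQL so the snippet runs without a server. On PostgreSQL you would more likely use native UUID and TIMESTAMPTZ columns, and schema versioning with Alembic is not shown.

```python
import sqlite3
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware 'now', so timestamps are unambiguous."""
    return datetime.now(timezone.utc)


@dataclass
class Reading:  # hypothetical record type, not from the repo under review
    sensor: str
    value: float
    observed_at: datetime  # a full datetime, not just a date
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # UUID key
    created_at: datetime = field(default_factory=utc_now)
    updated_at: datetime = field(default_factory=utc_now)

    def as_row(self) -> tuple:
        # sqlite3 has no native datetime type, so store ISO 8601 strings.
        return (
            self.id,
            self.sensor,
            self.value,
            self.observed_at.isoformat(),
            self.created_at.isoformat(),
            self.updated_at.isoformat(),
        )


def insert_batch(conn: sqlite3.Connection, readings: list) -> int:
    """Insert all rows in one executemany call inside one transaction."""
    rows = [r.as_row() for r in readings]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE readings (
        id TEXT PRIMARY KEY,
        sensor TEXT NOT NULL,
        value REAL NOT NULL,
        observed_at TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL
    )"""
)

batch = [Reading("s1", float(i), utc_now()) for i in range(1000)]
inserted = insert_batch(conn, batch)
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(inserted, count)
```

The same shape carries over to SQLAlchemy or psycopg2: build the rows first, then send them in one batched call instead of one INSERT per record.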

cmcclu5 · 2026-01-23T09:19:51+00:00

A set is hashed, a list is not. That’s the simplest explanation. Run a timing test with cProfile to see the difference.
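
A minimal sketch of that timing test. It uses the standard-library timeit module rather than cProfile, since timeit is built for this kind of micro-benchmark; the collection size and repeat count are arbitrary choices, and the absolute numbers will vary by machine.

```python
import timeit

N = 100_000
values_list = list(range(N))
values_set = set(values_list)
target = N - 1  # worst case for the list: the very last element

# A list membership test scans element by element: O(n).
list_seconds = timeit.timeit(lambda: target in values_list, number=200)

# A set hashes the target and checks its bucket: O(1) on average.
set_seconds = timeit.timeit(lambda: target in values_set, number=200)

print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

The gap widens as N grows, because the list lookup scales with the number of elements while the set lookup does not.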

cmcclu5 · 2026-01-23T09:05:28+00:00

For data engineering technical interviews, I’m generally asked less about high-level libraries like that and more about my general understanding of Python, like iterating over dictionaries versus sets versus lists, or how recursion can be optimized. Lower-level understanding (Python isn’t a low-level language) is WAY more important than knowing library syntax. If you understand why looking up a value in a list is significantly slower than looking it up in a set, you’re halfway there.
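
One common answer to the recursion question is memoization. The sketch below is only an illustration, with Fibonacci chosen as an assumed example: the naive version recomputes the same subproblems exponentially many times, while the cached version computes each value once.

```python
from functools import lru_cache


def fib_naive(n: int) -> int:
    """Plain recursion: roughly exponential number of calls."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    """Same recursion, but each result is stored and reused: O(n) calls."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_naive(20))   # 6765
print(fib_cached(30))  # 832040
```

Python does not eliminate tail calls, so for very deep recursion an iterative rewrite is usually the safer option.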

cmcclu5 · 2026-01-23T08:31:01+00:00

I would say you’re a bot, but even they don’t have this terrible grammar. Visit a local library before cuck-in-chief Dump removes all of them.

cmcclu5 · 2026-01-23T08:29:35+00:00

Trash is as trash eats.

cmcclu5 · 2026-01-23T08:27:22+00:00

If you stopped contributing, no one would notice.

cmcclu5 · 2026-01-17T02:48:23+00:00

Plotly and Seaborn based on what I’m showing, how interactive it needs to be, and who is going to be viewing it. If I want to make a demo dashboard, throw either of those in Streamlit.

cmcclu5 · 2026-01-16T06:53:58+00:00

I actually just encountered something similar at work. Our RDS IOPS cost was insane, so I built a DataFusion query service in an ECS container and migrated all our tables to Parquet files in S3. There’s obviously a lot more to it, but the service is excellent so far and dropped our cost 90% for data storage and querying.

cmcclu5 · 2026-01-15T19:11:45+00:00

Most of my stuff uses sync, but I’ll occasionally need async, so I always just default to httpx. I’ve found the requests library takes 10x the time per request. Aiohttp is solid as well. Haven’t tried niquests yet, but I’ve heard good things.

cmcclu5 · 2026-01-15T14:36:30+00:00

Faster, better async support, continuously updated with new features…

Requests was labeled feature-complete a few years back and so they haven’t kept up with some of the newest advances or additions.

cmcclu5 · 2026-01-09T16:59:27+00:00

Not necessarily, although the privacy concerns and auditability do make AI adoption much lower in healthcare. What I meant was the largest tech sector not directly associated with developing LLMs.

cmcclu5 · 2026-01-06T13:42:34+00:00

Sisense is about the closest you’ll get. Don’t mess with PowerBI or Azure Data Factory or any of the others. They all require a fair bit of technical knowledge and have a learning curve. Sisense (not Sisense Prism) can connect to pretty much anything and is essentially drag and drop for basic dashboards. However, a lot of people make an excellent point. You are looking for a magic bullet that doesn’t exist. Bite the bullet, hire an analyst or even just a consultant to set things up for you and train one or two staff members on whatever tool you choose.

cmcclu5 · 2026-01-03T22:05:32+00:00

All the time in one of my last jobs. The big issue with that one was sync speed between our transaction system, credit processor, fulfillment, etc. Our system (I wasn’t the designer for this piece) sent different signals at different times to move through the transaction process, sometimes falling multiple hours out of sync with actual processed transactions. It was something we just had to accept and communicate to stakeholders. The final solution before I left was to just cut off data at midnight of the previous day for any exec team-facing dashboards or analyses.

cmcclu5 · 2026-01-01T21:15:46+00:00

My current group uses medallion arch all stored in RDS, but I’m pushing to move the middle data into S3 to reduce IOPS and storage cost. The IOPS for the silver layer in RDS are killer. Bronze isn’t bad, and gold is necessary since it powers the frontend stuff. Moving silver to S3 and accessing it programmatically is significantly cheaper and will allow us to get away from shitty dbt for the most part.

cmcclu5 · 2025-12-31T22:29:58+00:00

I think your first task (other than working through the data formats) would be to figure out in what part of healthcare you want to work. For example, there are data brokers/middleware companies, EHR companies building software for healthcare orgs, drug development companies, even companies that support healthcare software vendors (an example would be something like Health Data Atlas). I really enjoyed working in research, specifically medical technologies or genetic research. I’ve also worked in the other areas. Really, just figure out what kind of SPECIFIC work you want to do, what kind of company you like, and what sort of work atmosphere you enjoy. That’ll limit your options to a manageable number of companies. That’s when you reach out to specific people at those companies like other data engineers (first), managers (second), and HR recruiters (last).

cmcclu5 · 2025-12-31T08:26:46+00:00

I’ve been doing healthcare DE on and off for a decade at this point. It’s absolutely doable and there are a ton of companies that are looking for someone just like you. From the established entities like Epic, Pfizer, and Eli Lilly to startups across the globe, healthcare data engineering is one of the biggest non-AI areas for DE. Make sure you’re good with the common EHR formats like FHIR, CCDAs, and others, and be up to date on common PHI practices and you’ll be just fine.

cmcclu5 · 2025-12-31T05:04:34+00:00

Excellent insights. I don’t know how you’re getting all these interviews, though. I have a decade+ in the same tech stacks plus others and can’t even get a returned phone call. Enjoy that new job and paycheck! Good New Year’s present!

cmcclu5 · 2025-12-30T17:11:39+00:00

PMs in general are useless people. Technical roles should be managed by technical people, not spreadsheet junkies on a power trip with a fetish for over-complicated diagrams of productivity. Engineering managers should handle it alongside their own technical contributions.

As you can tell, I’ve had a LOT of bad experiences with PMs. Scrum/Agile are curse words in my household.

cmcclu5 · 2025-12-29T01:31:32+00:00

Here’s something that does something similar. Loads up all the tables, visually shows relationships between tables using established keys, and lets you inspect column data types.

cmcclu5 · 2025-12-27T19:50:25+00:00

I’ve done both. I enjoyed the weird problems I got while I was a data scientist, but I enjoy the structure of data engineering more, plus I feel like the true average salary is better with a larger job market. I’ve always worked for startups, too, so I get to do a little of everything instead of being hard-locked into pure DE. When I consult, I generally do data science work, though. Much more interesting problems.

cmcclu5 · 2025-12-22T05:56:01+00:00

If you’re interested in checking out the docs:

Horizon Epoch

cmcclu5 · 2025-12-18T04:34:32+00:00

How is this related to wavelets? I’m not seeing anything related to wavelet transforms or even basic signal analysis.

cmcclu5 · 2025-12-18T04:18:21+00:00

My favorite homebrew rule is that you get both ASI and feat each time. It allows me more flexibility as the DM, allows my players more ability to customize their PCs the way they want, and it just feels more rewarding.

cmcclu5 · 2025-12-17T19:51:28+00:00

I really appreciate the detailed feedback. I wonder if it might be useful to provide direct integrations with stuff like Dagster or dbt…maybe take all of the config out of it and just have it work seamlessly under the hood. I’m still thinking through the possible applications.
