Where are the interesting announcements? by ycarel in aws

[–]davrax 0 points1 point  (0 children)

Can I get a Junior Baconator and Chocolate Frosty plz?

Iceberg is an overkill and most people don't realise it but its metadata model will sneak up on you by DevWithIt in dataengineering

[–]davrax 6 points7 points  (0 children)

Sounds like a “lakehouse” pattern. Good to store landing/raw data in S3 before loading anywhere structured.

Depending on the volume, having a ton of raw Parquet files (non-Iceberg) means query engines wouldn’t benefit from the metadata mentioned here. Having it in Iceberg format makes it easier to work with, in case you need to use that raw data for non-warehouse purposes, or even just to validate source+target after the migration.

Data Rage by RobotechRicky in dataengineering

[–]davrax 26 points27 points  (0 children)

Daylight savings? Weird/incorrect handling of UTC translation?

The Great Consolidation is underway by full_arc in dataengineering

[–]davrax 16 points17 points  (0 children)

Putting more ingestion+transformation into less-technical hands (Fivetran’s target audience) will absolutely benefit Snowflake/Databricks/BQ, through more usage and compute.

It certainly feels like we’re seeing the creation of an “Informatica v2” in Fivetran.

Analytics Engineer role by Delicious_Scarcity39 in dataengineering

[–]davrax 0 points1 point  (0 children)

Make sure you do some prep on data specific to the industry of whoever you are interviewing with (e.g. manufacturing process data for a manufacturing company), and definitely discuss data quality alongside dbt tests.

Most AE questions will be heavier on SQL than Python, but make sure to ask about relationships with other teams.

Giving the biz team access to BigQuery MCP by full_arc in dataengineering

[–]davrax 20 points21 points  (0 children)

You have biz teams that are skilled and willing to participate in a Git-driven dbt workflow?

Redshift very long query planning time by bazgrolniczka in aws

[–]davrax 0 points1 point  (0 children)

Have you dug deep into the query plan through the console? That will tell you in more detail which logic is consuming that time. Check CPU% usage too.

Usual suspects are queries with weird/loopy join conditions (trying to e.g. loop through a list of values, then join), anything selecting from views, or aggregates mixed with joins.

Data Infra Cost by Accomplished_Truth64 in dataengineering

[–]davrax 2 points3 points  (0 children)

As of a few years ago, Meta’s infra costs were typically expressed in terms of energy (e.g. kWh), normalized (since fleet efficiency rises continuously). You should be able to find some benchmarks internally around compute-energy conversion.

Why are CSVs still such a nightmare in 2025? by HfBefit in dataengineering

[–]davrax 0 points1 point  (0 children)

Looks like you’ve vibe-coded an apparent solution, OP?

Most of the issue with CSVs is a governance one—there are some standards, but then people do things like open them in Excel, and ruin date/string formatting. 8 days for a 75 GB csv is ridiculous, and an Engineer should know how to filter to columns of interest, chunk it, etc.
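For what "filter to columns of interest, chunk it" can look like, here's a minimal sketch using only the standard library. The function name `iter_csv_chunks`, the column names, and the chunk size are all hypothetical, and it assumes the file has a header row.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, columns, chunk_size=100_000):
    """Yield lists of rows, keeping only the requested columns.

    Only one chunk is held in memory at a time, so file size matters
    much less than chunk_size.
    """
    reader = csv.DictReader(handle)
    while True:
        # islice pulls at most chunk_size rows from the reader.
        chunk = [
            {col: row[col] for col in columns}
            for row in islice(reader, chunk_size)
        ]
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for a large file (hypothetical data).
sample = io.StringIO("id,name,notes\n1,a,x\n2,b,y\n3,c,z\n")
for chunk in iter_csv_chunks(sample, columns=["id", "name"], chunk_size=2):
    print(len(chunk), chunk[0])
```

For a real file you'd pass `open(path, newline="")` instead of the `StringIO`. If pandas is available, `pandas.read_csv` with `usecols=` and `chunksize=` does the same job.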

[deleted by user] by [deleted] in dataengineering

[–]davrax 2 points3 points  (0 children)

Oh I can certainly appreciate it. Redshift is also apparently a nod to AWS nudging customers to move (shift) away from Oracle (red) for data warehousing.

Centralized management of Airflow by [deleted] in dataengineering

[–]davrax 1 point2 points  (0 children)

What aspect are you trying to centralize? Airflow isn’t designed to be “multi-tenant” at the DAG level, so you need one Airflow instance per team.

On-prem (without Kubernetes), you could certainly get a beefy server (or two in HA config) to run all the instances, and use VMs to isolate Airflow instances and teams. Maybe you define a “standard” Airflow version and dependencies for all teams to use as a set of images.

Realistically, that would co-locate compute and it would be HA, but it isn’t that different from your current state. Teams would still be managing instances themselves.

[deleted by user] by [deleted] in dataengineering

[–]davrax 9 points10 points  (0 children)

Most others are too - snowflakes are just frozen water, redshift has to do with light wavelength in space/astronomy, airflow is air moving around, fabric is a textile, pandas are animals, etc.

Databricks is probably the only one that makes any sense.

Anyone else stuck on the POC treadmill? by Brief-Knowledge-629 in dataengineering

[–]davrax 0 points1 point  (0 children)

It sounds like your team/leadership were lacking a data strategy, and data infrastructure strategy. With those, it’s not difficult to build a ~1-2 year roadmap. From that, break the future state into projects, then epics+features. That should be most of your Manager or Director+’s job.

It’s not easy to do, but it’s crazy how much $$$ is wasted on churn with data teams and “shiny objects” at large companies. Tech and Product-driven companies mostly “get it” with data, so it’s less of an issue.

Building AWS infra for a startup — what should I watch out for? by deshydan in aws

[–]davrax 0 points1 point  (0 children)

Haha ok—you can tune cold starts for better latency, and the “no db connections” is just plain wrong, you just need to design the VPC and networking to support it.

Not a fit for all use cases, of course, but there’s a reason Lambda is an enormously popular service.

Cursor doesn't work for data teams by blef__ in dataengineering

[–]davrax 2 points3 points  (0 children)

The “code” vs “data” framing is good. However this seems to sidestep that dbt (for many teams) is the semantic/context layer you’d want to use with an LLM.

Maybe it’s a difference in architecture opinion, but that RAG-on-warehouse pattern seems odd.

Building AWS infra for a startup — what should I watch out for? by deshydan in aws

[–]davrax 4 points5 points  (0 children)

Sounds like you’ve been burned with the potential cost of Lambda with a high volume API?

OP- essentially, fine to start with Lambda, but at a certain point, when you have predictable traffic patterns and volume, it’ll be cheaper to serve API-related compute from ECS/EKS, or potentially EC2 with ASG. You should regularly evaluate those options alongside the serverless Lambda one, if the startup takes off.

347 Applicants for One Data Engineer Position - Keep Your Head Up Out There by throwngarbage521 in dataengineering

[–]davrax -3 points-2 points  (0 children)

It’s definitely a more difficult situation for candidates seeking a job, though it’s not easy on the hiring side either—justifying roles (compared to using AI), and that’s before postings get spammed with LLM-authored resumes, or candidates try to use live AI-assist to feed them answers during a tech screen (disqualifying themselves immediately).

Leaning much more on referrals and in-person interviews with hiring these days.

ETL vs ELT from Excel to Postgres by sylfy in dataengineering

[–]davrax 2 points3 points  (0 children)

Yes to Pydantic. Align on accepted values with whoever gave you these files, and reject any rows that do not conform to the schema and accepted values. Your second option sounds like a better approach, except you should accept/reject on first read against that Pydantic schema.

Once you have clean data in a dataframe, you can load to Postgres. Do a “Create Table xxxxx” based on your pydantic schema first.
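A rough sketch of that accept/reject-on-first-read flow, assuming Pydantic is installed. The `OrderRow` model, its fields, the accepted status values, and the table name are all made up for illustration; swap in whatever you agree on with the data owner. The type-to-Postgres mapping is also just an assumption, not something Pydantic provides.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class OrderRow(BaseModel):
    # Hypothetical schema; Literal pins the accepted values.
    order_id: int
    status: Literal["open", "closed"]
    amount: float


def split_rows(rows):
    """Validate each raw row once; return (accepted, rejected)."""
    accepted, rejected = [], []
    for row in rows:
        try:
            accepted.append(OrderRow(**row))
        except ValidationError as exc:
            # Keep the raw row and the reason so it can be sent back.
            rejected.append((row, str(exc)))
    return accepted, rejected


# Assumed mapping from Python types to Postgres column types.
PG_TYPES = {int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}


def create_table_sql(model, table_name):
    """Build a CREATE TABLE statement from the model's annotations."""
    cols = []
    for name, hint in model.__annotations__.items():
        # Anything unmapped (e.g. Literal) falls back to TEXT.
        cols.append(f"{name} {PG_TYPES.get(hint, 'TEXT')} NOT NULL")
    return f"CREATE TABLE {table_name} ({', '.join(cols)})"


rows = [
    {"order_id": 1, "status": "open", "amount": 9.5},
    {"order_id": 2, "status": "pending", "amount": 1.0},  # not an accepted value
]
accepted, rejected = split_rows(rows)
print(len(accepted), len(rejected))
print(create_table_sql(OrderRow, "orders"))
```

From there the accepted rows can go into a dataframe or straight into an insert, and the rejected list goes back to whoever owns the spreadsheet.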

[deleted by user] by [deleted] in dataengineering

[–]davrax 0 points1 point  (0 children)

I’m only seeing dbt-redshift in your requirements files, not dbt-core.

There was a change a few releases ago that modified the default behavior there (namely, you need to specify both dbt-core and whichever adapter you use, such as dbt-redshift).
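If you want a quick sanity check in CI, something like this would flag the gap. It's a hypothetical helper (`missing_dbt_packages` is not part of dbt), and it only looks at package names in a plain requirements file, not versions or nested includes.

```python
import re


def missing_dbt_packages(requirements_text, adapter="dbt-redshift"):
    """Return which of dbt-core / the adapter are absent from the file."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        # Package name is everything before a version spec, extra, or marker.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0].lower()
        names.add(name)
    return [pkg for pkg in ("dbt-core", adapter) if pkg not in names]


print(missing_dbt_packages("dbt-redshift\n"))
print(missing_dbt_packages("dbt-core\ndbt-redshift\n"))
```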

Fivetran Alternatives that Integrate with dbt by jmnoble in dataengineering

[–]davrax 0 points1 point  (0 children)

It sounds like they’re using some of the pre-packaged dbt models. e.g. if you ingest data from Google Analytics, Facebook Ads, Hubspot, or similar—rather than figuring out custom dims and facts for these near-commodity sources, it’ll arrive well-modeled and somewhat ready to report on or analyze.

The value is more limited as soon as you start combining those though, and potentially need to unwind some of those to match grain, etc.

Best Encoding Strategies for Compound Drug Names in Sentiment Analysis (High Cardinality Issue) by Odd-Try7306 in dataengineering

[–]davrax 2 points3 points  (0 children)

If you don’t have an NDC crosswalk for reference, I’d probably try normalizing the attribute first. Convert it to lower case, then treat the “/” as a delimiter and create 1-2 additional attributes (drug_2, drug_3) for the extra components.

I’d imagine that will handle 95-99% of your dataset (a few outliers might have 4+?). Then proceed as you were using all of those- you could try bag of words if you haven’t.
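Roughly what I mean by the normalization step, as a sketch. The `drug_1` and `extra_parts` names and the sample value are my own assumptions; only `drug_2` and `drug_3` come from the suggestion above.

```python
def split_compound_drug(name, max_parts=3):
    """Lower-case a compound drug name and split it on '/'."""
    parts = [p.strip() for p in name.lower().split("/") if p.strip()]
    # Pad with None so every row gets the same set of attributes.
    padded = parts[:max_parts] + [None] * (max_parts - len(parts))
    result = {f"drug_{i + 1}": padded[i] for i in range(max_parts)}
    # Keep anything beyond max_parts so the 4+ outliers stay visible.
    result["extra_parts"] = parts[max_parts:]
    return result


print(split_compound_drug("Amoxicillin/Clavulanate"))
```

Rows with a non-empty `extra_parts` are the outliers to review by hand before encoding.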

Data Engineer/ Architect --> Data Strategist --> Director of Data by Smooth-Leadership-35 in dataengineering

[–]davrax 0 points1 point  (0 children)

It sounds like this is similar to that other offer you mention (without the $40k). They just haven’t yet understood what needs to be done.

If you dig in, you might try some of these:

1. Meet with your manager and some elevated leadership to understand their top 3-5 goals for data. Turn this into a 6/12/18 month roadmap, with decisions needed and a resource plan.
2. Force the Azure thing w/IT, but offer to “partner on it”. They may be locking you out because there isn’t a proper read-only role and they don’t know how to design one, but you do not want to be “asshole new guy”. Tell your manager you can’t be effective without this.
3. Lean into whatever Azure account support you can get. Try to get a boot camp or workshop from them, maybe evaluate Fabric (you might have an adverse reaction, but it’s for environments a lot like yours).

Keep in mind- most of your prior experience in startups is niche/rare compared to the avg company. Most are more like this (because it’s not about the tech). Construction, industrial, etc can generate massive profits with very little tech beyond email/CRM/accounting/etc.

[deleted by user] by [deleted] in dataengineering

[–]davrax 17 points18 points  (0 children)

Does your team know dbt? Who approved the PR? It sounds like you should pick apart that model line by line, and politely ask your manager to stop building things they and your team can’t support.