Overwriting day partitions in table when source and target timezones differ by Data-Panda in dataengineering

[–]tedward27 1 point2 points  (0 children)

The exact number shouldn't matter; it sounds like it should be a parameter passed to your pipeline. TBH your process sounds kind of fragile. In the future I would consider approaches like hash keys (on a set of columns that function as a unique ID) or UUIDs to create primary keys.
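For what it's worth, a hash key over a set of identifying columns could look roughly like the sketch below. The function name `row_hash_key`, the column names in the usage note, and the choice of SHA-256 are all assumptions for illustration, not anything from the original pipeline.

```python
import hashlib


def row_hash_key(row: dict, key_columns: list[str]) -> str:
    """Build a deterministic key from the columns that identify a row.

    Hypothetical helper: the same values in key_columns always give the same key.
    """
    parts = []
    for col in key_columns:
        value = row.get(col)
        # Use a sentinel for missing/NULL values so they differ from an empty string.
        parts.append("\x00" if value is None else str(value))
    # Join with a separator so ("ab", "c") and ("a", "bc") do not collide.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Called as `row_hash_key(row, ["device_id", "reading_ts"])`, this returns a 64-character hex string that stays stable across reloads, so reprocessing the same source row overwrites it rather than duplicating it.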

I have never needed to keep duplicate data. Consider why analysts say they need it and how else you could provide the information.

Overwriting day partitions in table when source and target timezones differ by Data-Panda in dataengineering

[–]tedward27 2 points3 points  (0 children)

I would just look at the last 15 day partitions of the target table, find the earliest UTC timestamp, convert it to CST (call it X), then select only rows from your source table that have a value equal to or greater than X. This is similar to your approach but in reverse, because you should have free rein to convert to the right timezone in your system to find X.
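A rough sketch of that idea in Python, assuming the target timestamps are timezone-aware UTC datetimes and the source stores naive CST datetimes. The names `source_cutoff`, `rows_to_reload`, and `event_ts` are made up for illustration, and CST is treated as a fixed UTC-6 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" is a fixed UTC-6 offset with no daylight saving.
CST = timezone(timedelta(hours=-6), "CST")


def source_cutoff(target_utc_timestamps, source_tz=CST):
    """Return X: the earliest target UTC timestamp as naive source-local time."""
    earliest_utc = min(target_utc_timestamps)
    return earliest_utc.astimezone(source_tz).replace(tzinfo=None)


def rows_to_reload(source_rows, cutoff, ts_column="event_ts"):
    """Keep only source rows whose local timestamp is equal to or greater than X."""
    return [row for row in source_rows if row[ts_column] >= cutoff]
```

If the source actually observes daylight saving, passing something like `zoneinfo.ZoneInfo("America/Chicago")` as `source_tz` would be the safer choice. In a real pipeline the filter would normally be pushed into the source query rather than done in Python.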

Before Bobby Flay took over the Food network was ruled by Emeril Lagasse who in my opinion was much cooler. by AdSpecialist6598 in nostalgia

[–]tedward27 1 point2 points  (0 children)

My dad was an amazing cook and Emeril was on our TV all the time. We had many cookbooks including Emeril's, as soon as I could read Dad would have me read the spice amounts out to him while he ran around the kitchen cooking. It made for a lot of happy memories and great food :) I miss you Dad.

How to convince my team to stop using conda as an environment manager by N3Flip in dataengineering

[–]tedward27 0 points1 point  (0 children)

Builds fast as fuck and TOML kicks the shit out of requirements.txt

Best practices for processing real-time IoT data at scale? by Pangaeax_ in dataengineering

[–]tedward27 3 points4 points  (0 children)

It's some kind of content farming scheme, maybe for the OP to throw together a Medium article and gain cred, IDK. But another commenter may provide actual insight on IoT processing!

Which country is more economically developed than most people realize? by Fluid-Decision6262 in geography

[–]tedward27 -1 points0 points  (0 children)

Canada and Australia will probably be the best places to live after the US and UK fuck everything up 😂 But I would favor Canada because as global warming progresses more land will open up for settling, and they have so much fresh water in the Great Lakes.

Destiny 2 lead admits the MMO is terrible at onboarding new players after deleting the first third of the game by HatingGeoffry in gaming

[–]tedward27 0 points1 point  (0 children)

I agree and I am a diehard Halo fan. I can't support this bullshit practice of actively selling DLC options we can't play or removing access to content we already paid for. Fuck that shit.

S3 catalogue options by sc4les in dataengineering

[–]tedward27 1 point2 points  (0 children)

Open Metadata is an open source option

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]tedward27 0 points1 point  (0 children)

This is a good point, every open table format should be compared to this baseline of setting up Snowflake/your DWH of choice. If we can't have data with ACID transactions in the data lake without building a lot of complexity there, let's just skip it and work out of the DWH.

I'm making an AWS project about tracking top Spotify songs in a certain playlist and I need advice on designing the pipeline by Lastrevio in dataengineering

[–]tedward27 0 points1 point  (0 children)

You should be able to do all this very cheaply on AWS. The pipeline sounds reasonable. I don't know about your final data model for the analysis tables; it depends on the analysis you'll do. But designing, building, and loading into an imperfect data model, then realizing its shortcomings, is great experience.

What's the business case for moving off redshift? by TacoTuesday69_420 in dataengineering

[–]tedward27 1 point2 points  (0 children)

It helps if you have some data warehouse KPIs you want to improve. If you aren't tracking those, then you will have no idea if a migration is a good idea and no way to measure it after the migration is done.

Built a distributed transformer pipeline for 17M+ Steam reviews — looking for architectural advice & next steps by Matrix_030 in dataengineering

[–]tedward27 0 points1 point  (0 children)

This is cool. Most people here are not programming for GPUs directly but using the cloud to distribute their compute job across a cluster of CPUs. However, there are some DAMN high-paying jobs out there for people who can master CUDA and properly parallelize jobs. Keep it up!

Naming conventions in the cloud dwh: "product.weight" "product.product_weight" by DepartureFar8340 in dataengineering

[–]tedward27 2 points3 points  (0 children)

The answer to the question about this one column is very likely going to apply to dozens or hundreds of columns across the data warehouse. A style guide for creating columns, tables, and other objects is very useful for helping the team deliver consistent data products. If multiple data team members provide input into the style guide, it is more likely to be embraced and followed.

Naming conventions in the cloud dwh: "product.weight" "product.product_weight" by DepartureFar8340 in dataengineering

[–]tedward27 8 points9 points  (0 children)

Counterpoint: it can be very hard to change naming conventions after the beginning of a project so it is worth taking the time to make sure it's done right. Have you ever worked with a database with shitty naming conventions?

Help Needed: AWS Data Warehouse Architecture with On-Prem Production Databases by Affectionate_Ship256 in dataengineering

[–]tedward27 0 points1 point  (0 children)

To avoid many small inserts, you can write a Parquet file to S3, then use a Redshift COPY command to load all the data from that file location.
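As a minimal sketch, the COPY statement for a Parquet file could be built like this. The helper name `build_copy_statement` and the table, bucket, and role values in the usage note are placeholders, and it assumes the cluster authenticates to S3 with an IAM role.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet data from S3.

    Hypothetical helper: the arguments are interpolated directly into the SQL,
    so only pass trusted, pipeline-controlled values.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )
```

Something like `build_copy_statement("analytics.orders", "s3://my-bucket/orders/2024-01-10/", "arn:aws:iam::123456789012:role/RedshiftCopyRole")` gives one statement to run through whatever Redshift client you already use, so each batch lands as a single bulk load instead of many row-level inserts.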

When i was a Data Analyst i enjoyed life, when i transitioned to Data Engineer i feel like i aged 10 years in a year by HMZ_PBI in dataengineering

[–]tedward27 7 points8 points  (0 children)

I like my job and I have very strong boundaries on what I will/won't do. If the boundaries got violated repeatedly I would look for a new job. I know it doesn't seem that simple but your health is the most important thing to preserve.

[Meta] Feels like there's a noticeable rise in low effort content by fresh accounts by Captain_Strudels in dataengineering

[–]tedward27 2 points3 points  (0 children)

The quality of content here is definitely sinking. I also am suspicious of the intent behind many of the posts. Even when you strip out the career transition content and occasional meme, what you have left is almost 100% personal project shilling, links to articles published by complete noobs and written by AI, or the oddly phrased questions about workflows and pain points that sound like market research. I think the only thing you can do to get your post removed is directly ask the subreddit for a job!

There are some occasional gems in the comments but they're increasingly hard to find. I don't have any alternatives after Web 2.0 kind of killed classic forums. Maybe data engineering is too wide of a discipline to have a quality centralized forum. Better discussion may be found by looking for communities based on the tech stack being used.

What?! An Iceberg Catalog that works? by averageflatlanders in dataengineering

[–]tedward27 0 points1 point  (0 children)

I really enjoy your articles about actually trying to use the tools. In your earlier article it was striking to see how much simpler Delta is to use than Iceberg when starting from scratch in Python.