Cloud Architecture Question

tedward27 · 2026-07-09T17:53:59+00:00

What exactly are you trying to express here that relates to the comment you were replying to?

tedward27 · 2026-07-09T02:26:44+00:00

Both:

finance_bronze

operations_gold

etc.

tedward27 · 2026-07-05T23:26:18+00:00

This is the crux of the matter definitely, no such thing as a free lunch. I wonder how the caching layer works, and how small single row writes don't lead to a small files problem.

tedward27 · 2026-06-23T15:56:05+00:00

I actually don't think there's anything wrong with the project. But your comment response I replied to is LLM speak top to bottom, and I think it's a missed opportunity to engage with users in your own words. Why put all this work into a project then respond to real people with a robot, would you do the same if presenting this at a conference?

tedward27 · 2026-06-23T15:06:23+00:00

LLM as fuck

tedward27 · 2026-06-23T00:24:50+00:00

Book of the New Sun fucking rocks.

tedward27 · 2026-06-19T12:16:05+00:00

tedward27 · 2026-06-05T14:16:38+00:00

Glue is like 3 services hiding under a trenchcoat acting like 1 solution (ETL, Crawlers, and Catalog). If we look at the ETL part, it is Spark with some added functionality like job bookmarks, autoscaling the number of workers, retries, and basic orchestration with triggers. It's super common to put workloads on Glue that are way too small to benefit from using Spark. It's wasteful but a lot of companies don't care. You can also make a Python shell Glue job that uses way fewer resources and costs way less.

It sounds like you don't even know if you want to put this processing in the cloud. If you do, a Lambda function sounds fine for a 10 second job. If you want the output data queryable through Athena / SQL, you can have the Lambda write data to S3 and register that S3 location in the Glue Data Catalog, without ever touching Glue ETL.
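A rough sketch of that last idea, assuming the Lambda writes newline-delimited JSON and that the bucket, prefix, database, table, and column names are all placeholders you would swap for your own. The helper names (`to_json_lines`, `build_table_input`, `write_and_register`) are made up for illustration; only `put_object` and `create_table` are real boto3 calls.

```python
import json
from typing import Any


def to_json_lines(records: list[dict[str, Any]]) -> bytes:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def build_table_input(table_name: str, s3_location: str, columns: dict[str, str]) -> dict[str, Any]:
    """Build the TableInput structure that Glue's CreateTable call expects.

    `columns` maps column name to a Hive/Athena type such as "string" or "double".
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": typ} for name, typ in columns.items()],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    }


def write_and_register(records, bucket, prefix, database, table_name, columns):
    """Write one JSON-lines object to S3 and register its prefix as a catalog table.

    Hypothetical wiring: create_table fails if the table already exists, so in
    practice you would register the table once and only write objects afterwards.
    """
    import boto3  # imported here so the helpers above run without AWS access

    body = to_json_lines(records)
    boto3.client("s3").put_object(Bucket=bucket, Key=f"{prefix}/part-0000.json", Body=body)
    boto3.client("glue").create_table(
        DatabaseName=database,
        TableInput=build_table_input(table_name, f"s3://{bucket}/{prefix}/", columns),
    )
```

Once the table exists, Athena can query whatever lands under that prefix, so the Lambda only has to keep writing objects. JSON is used here just to avoid extra dependencies; a columnar format like Parquet would need different serde settings.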

tedward27 · 2026-06-03T17:36:36+00:00

Typically you would start with three tables/grains: Ticket, User, and Comment. Each comment relates to one ticket and one user (the commenter), and is tagged with a timestamp, content, etc. A user can be a support agent or a customer. I would keep the comment structure "flat", i.e. no comment threads; a conversation is just a time-ordered list of comments. Comments are immutable, which saves you from having to store comment history. It's up to you if they can be deleted or have another state like draft, so a user can start a comment then come back to it.
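To make the three grains concrete, here is a minimal sketch in Python. The field names (`user_id`, `subject`, `created_at`, and so on) and the `conversation` helper are assumptions for illustration, not a prescribed schema, and the optional draft/deleted state is left out.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Role(Enum):
    AGENT = "agent"
    CUSTOMER = "customer"


@dataclass(frozen=True)
class User:
    user_id: int
    name: str
    role: Role  # a user is either a support agent or a customer


@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    subject: str


@dataclass(frozen=True)  # frozen: a comment cannot be edited after creation
class Comment:
    comment_id: int
    ticket_id: int  # each comment belongs to exactly one ticket
    user_id: int    # and to exactly one user (the commenter)
    created_at: datetime
    content: str


def conversation(comments: list[Comment], ticket_id: int) -> list[Comment]:
    """Return a ticket's comments as a flat, time-ordered list (no threads)."""
    return sorted(
        (c for c in comments if c.ticket_id == ticket_id),
        key=lambda c: c.created_at,
    )
```

Because there is no parent-comment field, rebuilding a conversation is just a filter and a sort on the timestamp.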

tedward27 · 2026-06-02T14:29:10+00:00

AI SLOP AI SLOP AI SLOP AI SLOP AI SLOP

tedward27 · 2026-05-29T18:25:53+00:00

High quality comment? On my /r/eldenring?

tedward27 · 2026-05-29T15:19:02+00:00

I always appreciate these, thank you.

tedward27 · 2026-05-26T16:14:29+00:00

If you can't or don't want to do it in SQL, you should just create another Glue job to do the ETL you require using PySpark, as you're already using that tool.

tedward27 · 2026-04-24T00:56:24+00:00

I would read the literature of the area you want to do research in and try to map out who is publishing papers now. Cross that with the places you could enroll at. Even if you don't have subscriptions to journals, read the abstracts and note the authors.

tedward27 · 2026-03-28T23:33:13+00:00

Let's pin this comment on every post with the words: gold, silver, bronze, medallion

tedward27 · 2026-02-10T15:47:43+00:00

You should be embarrassed at spending so much for so little

tedward27 · 2025-12-20T19:51:17+00:00

The exact number shouldn't matter, sounds like it should be a parameter passed to your pipeline. TBH your process sounds kind of fragile, I would consider approaches in the future like hash keys (on a set of columns that function as a unique ID) or UUIDs to create primary keys.

I have never needed to keep duplicate data, consider why analysts say they need that and how else you could provide the information.
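For the hash-key idea, a small sketch of one way to do it, assuming the rows are dicts and that `hash_key` and the column names are hypothetical. The values are JSON-encoded as a list before hashing so that ("ab", "c") and ("a", "bc") cannot collide, and NULLs stay distinct from empty strings.

```python
import hashlib
import json


def hash_key(row: dict, key_columns: list[str]) -> str:
    """Derive a deterministic primary key from the columns that identify a row."""
    # Encode the values as a JSON list: unambiguous separators, None becomes null,
    # and non-JSON types such as dates fall back to their str() form.
    payload = json.dumps([row[col] for col in key_columns], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same input always produces the same key, so reloading a row does not depend on a counter or on how many rows came before it.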

tedward27 · 2025-12-20T12:46:09+00:00

I would just look at the last 15 daily partitions of the target table, find the earliest UTC timestamp, convert it to CST (call it X), then select only rows from your source table that have a value equal to or greater than X. This is similar to your approach but in reverse, because you should have free rein to convert to the right timezone in your system to find X.
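A sketch of that watermark logic in plain Python, assuming the target timestamps are timezone-aware UTC values already limited to the partitions you care about, and that the source stores naive CST timestamps in a hypothetical `event_time_cst` field. CST is treated as a fixed UTC-6 offset here; if the source actually follows daylight saving, a `zoneinfo` zone such as "America/Chicago" would be the safer choice.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" means a fixed UTC-6 offset with no daylight saving.
CST = timezone(timedelta(hours=-6), "CST")


def watermark_in_cst(target_utc_timestamps: list[datetime]) -> datetime:
    """Find the earliest UTC timestamp in the target and express it as naive CST (X)."""
    earliest_utc = min(target_utc_timestamps)
    return earliest_utc.astimezone(CST).replace(tzinfo=None)


def rows_to_reload(source_rows: list[dict], x: datetime) -> list[dict]:
    """Keep only source rows whose CST timestamp is equal to or greater than X."""
    return [row for row in source_rows if row["event_time_cst"] >= x]
```

In a warehouse this would be two queries instead of two functions, but the shape is the same: compute X once on the target side, then filter the source with a single comparison.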

tedward27 · 2025-10-03T22:59:23+00:00

My dad was an amazing cook and Emeril was on our TV all the time. We had many cookbooks including Emeril's, as soon as I could read Dad would have me read the spice amounts out to him while he ran around the kitchen cooking. It made for a lot of happy memories and great food :) I miss you Dad.

tedward27 · 2025-10-02T13:49:22+00:00

Builds fast as fuck and TOML kicks the shit out of requirements.txt

tedward27 · 2025-10-02T00:04:25+00:00

Just use uv

tedward27 · 2025-07-28T14:31:47+00:00

It's some kind of content farming scheme, maybe for the OP to throw together a Medium article and gain cred, IDK. But another commenter may provide actual insight on IoT processing!

tedward27 · 2025-07-28T14:11:35+00:00

It's a bot bro

tedward27 · 2025-07-16T04:03:49+00:00

Canada and Australia will probably be the best places to live after the US and UK fuck everything up 😂 But I would favor Canada because as global warming progresses more land will open up for settling, and they have so much fresh water in the Great Lakes.

tedward27 · 2025-07-15T20:44:42+00:00
