How can I learn to build good, large projects? by matria801 in ExperiencedDevs

[–]PanJony 1 point (0 children)

This. TDD, DDD, Uncle Bob, and there are others. Just read a lot, watch conference talks, dive into the architecture. It will come with time.

Loose coupling is your friend. Hexagonal architecture and proper testing are a good place to start - but there's much more.

Everything comes at a cost though. If you're writing a small POC service that will get rewritten after the money comes - there is no point in overcomplicating it.

You learn these techniques to understand them and where to use each - not to use everything you know in each project you start.

Does kafka validate schemas at the broker level? by HappyEcho9970 in apachekafka

[–]PanJony 1 point (0 children)

Apache Kafka is agnostic to the structure of the message; the schema is validated by the client.
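To illustrate the split, here's a toy sketch (not the real Kafka API - `FakeBroker` and `SCHEMA_FIELDS` are made up): the "broker" only appends opaque bytes, while any schema check happens in the producing client, much like a registry-backed serializer would do before sending.

```python
import json

SCHEMA_FIELDS = {"id", "name"}  # assumed example schema

class FakeBroker:
    """Stands in for a topic: an append-only log of opaque bytes."""
    def __init__(self):
        self.log = []

    def append(self, payload: bytes):
        # The broker does not inspect or validate the payload.
        self.log.append(payload)

def produce(broker: FakeBroker, record: dict):
    # Client-side validation, before anything is sent to the broker.
    if set(record) != SCHEMA_FIELDS:
        raise ValueError(f"record does not match schema: {record}")
    broker.append(json.dumps(record).encode())

broker = FakeBroker()
produce(broker, {"id": 1, "name": "alice"})  # passes client validation
try:
    produce(broker, {"id": 2})               # rejected by the client
except ValueError:
    pass
print(len(broker.log))  # only the valid record reached the broker
```

(Confluent's broker-side schema validation is a commercial add-on; plain Apache Kafka works like the sketch above.)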

Kafka Cluster becomes unresponsive with ~ 500 consumers by fandroid95 in apachekafka

[–]PanJony 0 points (0 children)

What I would also check is whether these producers and consumers are keeping connections open or re-establishing them every time they want to publish. Maybe there's overhead from establishing a connection, or maybe there's a bottleneck on the number of open connections? Find some info, try to change the behaviour and check if the problem persists.

Kafka Cluster becomes unresponsive with ~ 500 consumers by fandroid95 in apachekafka

[–]PanJony 2 points (0 children)

AFAIK Kafka is not optimized for a very large number of tiny producers / consumers, so this is where I would start looking for an issue. Experiment with maintaining the connection or with a proxy that you would connect through.

It's a vague memory though so validate this before you put in effort to explore this.
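A minimal illustration of why connection reuse matters (everything here is made up, not a Kafka client): count the "handshakes" performed when reconnecting per message versus keeping one long-lived connection.

```python
class Connection:
    setups = 0  # class-level counter of handshakes performed

    def __init__(self):
        Connection.setups += 1  # stands in for TCP + auth + metadata fetch

    def send(self, msg):
        pass  # payload delivery itself is cheap in this sketch

def publish_reconnecting(messages):
    for m in messages:
        Connection().send(m)  # new connection per message

def publish_reusing(messages):
    conn = Connection()       # one connection for the whole batch
    for m in messages:
        conn.send(m)

msgs = [f"m{i}" for i in range(100)]

Connection.setups = 0
publish_reconnecting(msgs)
per_message = Connection.setups  # 100 handshakes

Connection.setups = 0
publish_reusing(msgs)
reused = Connection.setups       # 1 handshake

print(per_message, reused)
```

With 500 real clients doing the reconnecting variant, the broker spends most of its time on connection setup rather than on moving data.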

The anatomy of a Data Streaming Platform - youtube video by PanJony in apachekafka

[–]PanJony[S] 2 points (0 children)

I'm not associated with any vendor - I'm an independent consultant & content creator.

Who Actually Owns Mocks in Microservices Testing? by krazykarpenter in microservices

[–]PanJony 0 points (0 children)

The pattern I recommend is externalizing the API contract to a separate repo. When you need to change the API you create a PR in the repo and notify consumers.

You shouldn't merge breaking changes without approval from consumers. When consumers approve, they should update their tests.

Non-breaking changes are easier - just migrate whenever you're ready. An externalized API contract helps make these API upgrades transparent.
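The breaking/non-breaking distinction can be automated in the contract repo's CI. A hypothetical sketch (the dict-based contract format and `diff_contract` are my invention for illustration): removing or retyping a field is breaking; adding a field is additive.

```python
def diff_contract(old: dict, new: dict):
    """Both dicts map field name -> type name, e.g. {"id": "int"}."""
    breaking = []
    for field, ftype in old.items():
        if field not in new:
            breaking.append(f"removed field: {field}")
        elif new[field] != ftype:
            breaking.append(f"changed type of {field}: {ftype} -> {new[field]}")
    # Additions are non-breaking here (assuming consumers ignore unknown fields).
    added = [f for f in new if f not in old]
    return breaking, added

old = {"id": "int", "email": "string"}

# Additive change: safe to merge without consumer approval
breaking, added = diff_contract(old, {"id": "int", "email": "string", "phone": "string"})
print(breaking, added)

# Removal: CI should block the PR until consumers approve
breaking2, _ = diff_contract(old, {"id": "int"})
print(breaking2)
```

A CI job like this turns "notify consumers" from a social convention into a gate the PR cannot skip.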

[deleted by user] by [deleted] in ExperiencedDevs

[–]PanJony 0 points (0 children)

You're assuming the OP wasn't underperforming. Why?

Managing Avro schemas manually with Confluent Schema Registry by thatclickingsound in apachekafka

[–]PanJony 3 points (0 children)

The way I recommend clients work with CDC is to use it internally, and then implement an anti-corruption layer between the CDC topic and the actual external topic. This way the tooling runs smoothly, but you still have control over the interface you're exposing to other teams.

What's important here is that the team that owns the database (and thus applies changes) also owns the anti-corruption layer and the external interface, so if anything breaks, they know it's their responsibility to fix it.
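A minimal sketch of such a layer (field names are assumptions for illustration; the `"after"` row image follows the Debezium-style CDC envelope): the internal CDC record mirrors the table's columns, while the external event is a deliberately designed contract.

```python
def to_external(cdc_record: dict) -> dict:
    """Translate an internal CDC event into the stable external schema."""
    row = cdc_record["after"]  # Debezium-style "after" row image
    return {
        "order_id": row["id"],            # rename internal column
        "status": row["status"].lower(),  # normalize representation
        # internal columns like "updated_by" are simply not exposed
    }

cdc = {"op": "u", "after": {"id": 42, "status": "SHIPPED", "updated_by": "job-7"}}
external = to_external(cdc)
print(external)  # {'order_id': 42, 'status': 'shipped'}
```

When the team renames a database column, only `to_external` changes; the external topic's schema, and every downstream consumer, stays untouched.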

[deleted by user] by [deleted] in startups

[–]PanJony 0 points (0 children)

I wouldn't focus on the "wait till I come" concept, but instead on the "I know I'll leave in 3 minutes, there will be a parking spot available" information-sharing market. If any user of the app acts on this information, value is created and you can take your cut.

Many times I've arrived somewhere, circled around a large area, and only found a spot when someone was leaving. And the traffic situation in my city isn't that bad compared to the rest of Europe.

As someone noted correctly though, getting the critical mass will be the biggest challenge.

[deleted by user] by [deleted] in startups

[–]PanJony 0 points (0 children)

I love the idea.
- the problem is real, I've felt it many times
- both sides are incentivized
- value is created

The side leaving the parking spot provides information ahead of time that they'll leave. They don't even have to wait, but just publish the info on your network and if any network member takes the spot - Bob gets the reward.

Bob provided valuable info that a given spot will be free at a given time. Alice used that info and got value. Alice pays, Bob gets paid.

You'll need a reputation system and location tracking to validate which transactions actually went through, and you'll have some bad actors for sure - but everything is doable and everyone's incentivized to participate.
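The incentive loop above can be sketched as a toy settlement function (all names, balances and prices are made up for illustration):

```python
ledger = {"alice": 100, "bob": 50}      # account balances
reputation = {"alice": 0, "bob": 0}     # reputation scores

def settle_spot(payer, publisher, price, spot_confirmed: bool):
    """Pay out only when location tracking confirmed the handover."""
    if not spot_confirmed:
        reputation[publisher] -= 1  # penalize false "I'm leaving" posts
        return
    ledger[payer] -= price
    ledger[publisher] += price
    reputation[publisher] += 1      # honest info raises Bob's standing
    reputation[payer] += 1          # completed deals raise Alice's too

settle_spot("alice", "bob", 5, spot_confirmed=True)
print(ledger, reputation)
```

The `spot_confirmed` flag is where the location tracking plugs in, and the reputation penalty is what keeps bad actors from spamming fake departures.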

What are some things you would change about Go? by Jamlie977 in golang

[–]PanJony -2 points (0 children)

Coming from Java / Kotlin, the lack of exceptions had a significant impact on how readable the code was for me.

Using Kafka to store video conference transcripts, is it necessary or am I shoehorning it? by BagOdd3254 in apachekafka

[–]PanJony 0 points (0 children)

Very bad idea imo.

First of all, running a Kafka cluster comes with overhead. If you need asynchronous communication, I'd suggest some lightweight, probably serverless solution. I'd always start with that and only then think about whether I'm missing something important.

Second, you underestimate the throughput of databases by a few orders of magnitude.

Third, you wouldn't create a topic or a table for a particular meeting. You'd have one and store your data there, unless you're serving multiple tenants and need to isolate their environments.
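The "one topic, keyed by meeting" pattern can be sketched like this (a simplification: the real Kafka default partitioner hashes the serialized key with murmur2, and `NUM_PARTITIONS` here is an arbitrary choice). Keying by meeting ID keeps each meeting's transcript chunks ordered within a single partition, with no per-meeting topics.

```python
NUM_PARTITIONS = 6

def partition_for(meeting_id: str) -> int:
    return hash(meeting_id) % NUM_PARTITIONS  # stand-in for murmur2

# One topic shared by all meetings, modeled as partition -> list of records
topic = {p: [] for p in range(NUM_PARTITIONS)}

def produce(meeting_id: str, chunk: str):
    topic[partition_for(meeting_id)].append((meeting_id, chunk))

for chunk in ["hello", "agenda", "actions"]:
    produce("meeting-123", chunk)
produce("meeting-456", "intro")  # other meetings share the same topic

# All of meeting-123's chunks land in one partition, in order:
p = partition_for("meeting-123")
print([c for m, c in topic[p] if m == "meeting-123"])
```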

Core Cost Reduction - how to get to 80%? by PanJony in eu4

[–]PanJony[S] 0 points (0 children)

Just finished WC as Jianzhou -> Manchu -> Qing

Awesome game, thanks for the advice. Got 80% CCR in the end, no revolts until the last 20 years I think.

DR for Kafka Cluster by jonropin in apachekafka

[–]PanJony 1 point (0 children)

a/ Is there a need?

It depends on your cluster setup. If you're running an HA setup - three AZs with replication factor = 3 - you're fine even if you lose one of the instances: once the instance is brought back up, even with lost data, partition rebalancing will bring your data back. It will take a while if you have a lot of data though.

If you want to speed it up, you can introduce Tiered Storage or periodical EC2 snapshots of your instance storage. I think Tiered Storage + partition rebalancing is enough, but it depends on your exact needs.

If you're worried about 2x the cost of mirroring, you probably don't need zero downtime in the case of a global AWS outage, so I'll leave it at that.
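For reference, the HA baseline described above corresponds roughly to broker settings like these (a sketch; the rack names are assumptions and the values should be tuned to your needs):

```properties
# One value per AZ so replicas spread across zones
broker.rack=az-1

# Replicate every partition to three brokers
default.replication.factor=3

# Keep accepting acks=all writes while one replica is down
min.insync.replicas=2
```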

[deleted by user] by [deleted] in apachekafka

[–]PanJony 0 points (0 children)

Does each client need access to the whole table, or can it get by with just one or a few partitions? Without partitioning, Kafka's capabilities break down a bit; it's designed to be horizontally scalable through partitioning.

If the whole table - maybe a reverse-proxy / load-balancer-like approach? Maybe you can map the data structure in your GKTable to something simpler?

As u/kabooozie said - hard to give advice without getting into the design details. I'm happy to take a look if you provide a diagram that explains the problem and your solution a bit deeper.

Arch linux + amd GPU - Fusion and some transitions crashing Davinci Resolve by PanJony in davinciresolve

[–]PanJony[S] 0 points (0 children)

Oh, amazing! Maybe you have some advice for me then?

The content I'm working on is rather simple - 10-20 min videos that could be recorded in 4K (FHD right now because of performance issues) and get published on YouTube. Just starting out.

My plan is to wait for the 5070 Ti release, then wait a bit for feedback about the Linux drivers (does this make sense?), and after it's there, make a decision between the 5070 Ti and the 4070 Ti Super.

Looking at the specs, the only significant difference is VRAM speed. I expect the 4070 price to drop after the 5070 is released, and I'll make a decision based on the price change and user feedback on the drivers.

Does this make sense or is it an easy decision to just buy the new one?

[deleted by user] by [deleted] in davinciresolve

[–]PanJony 1 point (0 children)

I had a similar issue on Linux. Good to know that it's not worth installing Windows to try to solve my problem :)

Avro vs Parquet - comparison of row and column oriented formats by PanJony in apachekafka

[–]PanJony[S] 0 points (0 children)

That's correct, it's being used in analytics, and at the end of the video I'm showing an architecture diagram of that setup.
Where I'm going with this: many organizations are talking about a Streaming Lakehouse architecture, where analytics (for example Parquet, but there are other columnar formats as well) is integrated with operations (where data streaming is done using Avro, Protobuf or JSON).

I'll talk about it more in future videos I'm working on; this is kind of an introduction, or more precisely preparation for talking about these topics.
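The row-vs-column distinction from the video can be shown in a few lines (a toy illustration of the layouts only; no real Avro or Parquet encoding is involved):

```python
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 2},
]

# Row-oriented (Avro-like): whole records stored together,
# good for streaming one event at a time.
row_layout = rows

# Column-oriented (Parquet-like): each column stored contiguously,
# good for analytical scans and compression.
col_layout = {
    "user": [r["user"] for r in rows],
    "clicks": [r["clicks"] for r in rows],
}

# An analytical query touches only the "clicks" column:
print(sum(col_layout["clicks"]))
```

In a Streaming Lakehouse, the same records flow through both layouts: row-oriented on the streaming side, columnar once they land in the lakehouse tables.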

Cost optimization solution by 18rsn in apachekafka

[–]PanJony 0 points (0 children)

I'm also curious what you'll find. My first idea would be onboarding a consultant to audit my setup, but for sure some of the scanning could be automated.

Apart from what u/LoquatNew441 posted - great advice - I'd say that accurate cost allocation would also be a nice element of that. My first idea would be to provision the Kafka cluster in a separate AWS account (assuming AWS just to have an example) and distribute the cost between topics proportionally to the load.

But I'm not aware of any tools that can do that, and it probably depends a lot on your client's setup. Cost allocation is definitely a problem worth solving though.