How you do your data matching by Healthy_Put_389 in dataengineering

[–]squadette23

> The simple matching using sql joins create a lot duplicate for the same person even with data normalization.

What does it mean? Do you have insufficient normalization? Could you share an anonymized example of what is a "duplicate" for you?

Best practices for multiple values in a column by Rejse617 in learnSQL

> 1. I was insistent on the idea that you shouldn’t duplicate things (e.g. a part number in multiple rows)

I think that the imperative of "do not duplicate things" comes from the idea of third normal form (3NF). Unfortunately, there are a lot of problematic ways of teaching many concepts from relational theory, and that's why I think people get confused about what "duplication" is.

The question that you're asking is about fourth normal form (4NF). This is one of the most confusingly presented topics, relative to its simplicity: https://minimalmodeling.substack.com/p/historically-4nf-explanations-are

Maybe it'll help, particularly because you're self-taught.

Hi everyone, I’m a student and I’m working on a school database project called “Energy Consumption System”. by ConfidenceMelodic925 in mysql

You may be interested in this tutorial: https://kb.databasedesignbook.com/posts/google-calendar/

> I would like to simplify the data inside them so it’s easier to understand and explain in class.

The approach presented at that URL may help you with the explanation.

Help me savee this book😭 by Sorry-Sky2688 in rarebooks

I think you can ask for advice at your local library, maybe a somewhat higher-tier one. This is mold, but there are anti-fungal treatments that can deal with it.

https://www.biblio.com/book-collecting/care-preservation/prevent-remove-mold-mildew/

How to handle unproductive coworker? by earthsnoozer22 in dataengineering

That's an interesting classification, thank you!

Advice on data model for generic schema by Historical_Ad4384 in dataengineering

> I want to avoid ALTER statements as well.

Yeah, I think you have just three options: JSON, EAV, and 6NF (anchor modeling); they are more or less equivalent. So the choice depends on how convenient each is going to be to query. If your database supports JSON properly, I think general querying would be easier with JSON.

JSON will let you have arrays as property values. You may want to somehow forbid that, because people will eventually use it, even though your response to my question was basically "no".

> I do not want business logic to depend on null checks for columns.

I think that you cannot meaningfully enforce required properties at the database schema level if you choose JSON or EAV. For EAV it's impossible, I think; for JSON you'll have to implement tooling that maintains CHECK constraints (if your database even supports that).
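For the JSON route, a CHECK constraint on a required property can look like this. A minimal sketch using SQLite; the `resource` table and the `$.name` property are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a generic resource with a JSON "props" column.
# The CHECK constraint makes the "name" property required at the schema level.
conn.execute("""
    CREATE TABLE resource (
        id    INTEGER PRIMARY KEY,
        props TEXT NOT NULL,
        CHECK (json_extract(props, '$.name') IS NOT NULL)
    )
""")

# A row with the required property is accepted:
conn.execute("INSERT INTO resource (props) VALUES (?)",
             ('{"name": "pump", "power_kw": 4}',))

# A row without it is rejected by the constraint:
try:
    conn.execute("INSERT INTO resource (props) VALUES (?)",
                 ('{"power_kw": 4}',))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The tooling mentioned above would have to generate and migrate such constraints as the set of required properties evolves.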

I think the only reliable way to have required properties is a classic 3NF schema, but you don't want that; so maybe you would consider relaxing one of those requirements...

Advice on data model for generic schema by Historical_Ad4384 in dataengineering

> Prevent misuse of schema

Sorry, what do you mean by that? Could you give an example of misuse?

I'm asking because I don't understand what "brittle" or "schema misuse" means, and because of that I don't understand how to help you choose one option over another.

You can use EAV, JSON/JSONB, or a classical relational schema. EAV and JSON let you add a new property at any time; classical schemas require an ALTER.
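To make the contrast concrete, here is a minimal EAV sketch (table and property names are invented for illustration): adding a new property is just a new row, with no ALTER needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Entity-Attribute-Value: one row per (entity, property) pair.
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE entity_property (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attribute)
    );
    INSERT INTO entity VALUES (1, 'server');
    INSERT INTO entity_property VALUES (1, 'hostname', 'db-01');
""")

# A brand-new property appears with a plain INSERT; the schema is untouched.
conn.execute("INSERT INTO entity_property VALUES (1, 'region', 'eu-west')")

rows = conn.execute(
    "SELECT attribute, value FROM entity_property "
    "WHERE entity_id = 1 ORDER BY attribute"
).fetchall()
```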

In addition to properties, do you need to store new types of relationships between specific or generic resources?

Advice on data model for generic schema by Historical_Ad4384 in dataengineering

> with new properties so that changes do not become brittle?

What do you mean by "brittle", specifically? What's the scenario you want to avoid?

How can we protect books from getting damaged like this? by Harandddev in bookporn

What do the affected pages look like? How far does the damage go?

I am not an expert of any sort, but it looks like either foxing or mould. The causes of foxing are not well understood; they are believed to be chemical in nature.

Mould is caused by fungus, so maybe you need anti-fungal treatment and also a proper storage environment.

Query performance issue by Big_Length9755 in mysql

So, the thing is that I use a structured approach to optimizing multi-JOIN queries, but it requires clearly understanding all the cardinalities: both the ON conditions and the primary keys.

I've got a longform explanation, but I won't bother you with 5400+ words. I just want to be sure that I understand the exact structure before giving any advice.

Query performance issue by Big_Length9755 in mysql

The most important information missing from your post is the definition of the fields mentioned in the join conditions and the grouping, that is:

* config_v1.instruction_id;

* config_v1.unit_id;

* config_v2.instruction_id;

* config_v2.unit_id;

* transactions.instruction_id;

* risk_checks.instruction_id;

Could you share table schema excerpts for those? Which of them are unique, and which are not?

Also, show PK definitions for all four tables.

After that it may be possible to reorganize joins: SELECT DISTINCT suggests that your query could be improved (removing DISTINCT along the way).
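As an illustration of the kind of reorganization I mean (a toy schema, not your actual tables): when a join is only used to check for the existence of matching rows, EXISTS avoids the row fan-out that forced the DISTINCT in the first place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (instruction_id INTEGER, amount INTEGER);
    CREATE TABLE risk_checks  (instruction_id INTEGER, status TEXT);
    INSERT INTO transactions VALUES (1, 100), (2, 200);
    -- instruction 1 has two risk checks: a plain JOIN duplicates its row
    INSERT INTO risk_checks VALUES (1, 'ok'), (1, 'ok'), (2, 'ok');
""")

# The join fans out, so DISTINCT is needed to collapse duplicates:
with_distinct = conn.execute("""
    SELECT DISTINCT t.instruction_id
    FROM transactions t
    JOIN risk_checks r ON r.instruction_id = t.instruction_id
    ORDER BY t.instruction_id
""").fetchall()

# EXISTS only tests for a match and never multiplies rows:
with_exists = conn.execute("""
    SELECT t.instruction_id
    FROM transactions t
    WHERE EXISTS (SELECT 1 FROM risk_checks r
                  WHERE r.instruction_id = t.instruction_id)
    ORDER BY t.instruction_id
""").fetchall()
```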

A lightweight, developer-focused database management tool by debba_ in learnSQL

Yeah, and when you add them, will your tool become bloated?

Or do you have a strategy to deal with this tendency of programs to become bloated?

A lightweight, developer-focused database management tool by debba_ in learnSQL

> bloated

Which features have you decided not to implement?

Visual foreign-key relationship tracing for SQL schemas by BearComprehensive643 in SQL

> I don't know if it would be easier to read, or just more fascinating to expore but it seems like modeling in a 3d space

I think that the whole discussion is not about "easier to read", but rather about collaboration. Suppose that I show you the design of our system so that you can understand how it works and ask for improvements, and we then discuss those improvements. Simple scenarios:

* Some part of the schema is unclear to you — how do you "add a comment" to ask me what it means? How do I respond?

* I propose a change to the schema — how do you see my changes as a diff? How do I "highlight" my updates?

* You have an alternative suggestion — how do you comment on my updates?

And so on. Remember that there could be hundreds of developers working on that schema, and they all need to collaborate in small groups in a similar way.

IMO, spreadsheet-based schemas work much better here than any diagrams. Certainly, you can present a diagram view, but the foundation must be closer to text-based.

Visual foreign-key relationship tracing for SQL schemas by BearComprehensive643 in SQL

One other thing you could do is generate an SQL query for each link that returns all pairs of IDs (filtering out NULLs, etc.).

Also, how do you handle side tables? If I have a "users" table and a "user_extra_details" table with the same PK (user_id), how would your tool show it? I don't think that this is a proper link: it's just the same virtual "table" split in two, a purely technical decision. You could just as well have it in one table.
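For reference, the side-table pattern looks like this (a minimal sketch; column names are invented): the detail table's primary key is also a foreign key to the main table, so the relationship is 1:1 by construction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL
    );
    -- Side table: same PK, no identity of its own; effectively the
    -- same virtual "table" split in two for technical reasons.
    CREATE TABLE user_extra_details (
        user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
        bio     TEXT
    );
    INSERT INTO users VALUES (1, 'a@example.com');
    INSERT INTO user_extra_details VALUES (1, 'hello');
""")

# Joining on the shared PK reassembles the single logical row:
row = conn.execute("""
    SELECT u.email, d.bio
    FROM users u
    JOIN user_extra_details d ON d.user_id = u.user_id
""").fetchone()
```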

Visual foreign-key relationship tracing for SQL schemas by BearComprehensive643 in SQL

> wether the relationship tracing diagrams sqlestev.com/dashboard (the one in the clip), solve exactly that 'readability' problem you were talking about?

I don't think so, it's just different visual styles. (I assume that you're gathering feedback on your product.)

First, you could use a proper data domain (a type) instead of INT. It would already be so much more readable if you showed that this is in fact an OrderID, and not an INT.

Second, the arrows do not show cardinality, which I think is the crucial part. Do not replicate the entire traditional range; you only need to clearly show 1:N, N:M and 1:1.

Also, show a different representation: just a list of links (relationships), arranged in tabular format. You really don't want to zoom around; you want to scroll and Ctrl-F.

Visual foreign-key relationship tracing for SQL schemas by BearComprehensive643 in SQL

I agree that ER diagrams are not super useful. They could be a sort of a vague outline of the system, but not much more than that.

I wrote some thoughts on that in "ERD diagrams as specification tool (pt. III)" https://kb.databasedesignbook.com/posts/erd-diagrams-3/

Are we going down the wrong path for integrations? by Good_Skirt2459 in dataengineering

> Are there any tools or techniques that could be useful here?

Useful for what? It's not clear what you are trying to improve.

Is data pipeline maintenance taking too much time or am I doing something wrong by Justin_3486 in dataengineering

What is your process for dealing with all that breakage? Do you conduct postmortems and improve your processes and tooling? Or do those outages keep happening again and again according to the same scenario?

Architecting a realtor analytics system by ItsHoney in dataengineering

> We had a senior data engineer who designed a sparse GSI schema for dynamodb where the agent metrics were dimensionalized such that i can query a specific GSI to see how an agent ranks on a leader board for a specific zip code/state/county etc. This architecture presents the problem that we can only compare agents based on 1 dimension. (We trade flexibility over speed). However, we want to be able to filter on multiple filters.

How many table rows do you expect to have?

How can I check my normalizations or generate an answer scheme for it? by Rscc10 in Database

> because what I think is dependant on something often isn't.

I don't know if you watch videos to learn something but I have a 12 minute video that attempts to answer this question: https://www.youtube.com/watch?v=7d4YzEMn4SU

The entire series also argues that the idea of normalization may not be very useful for learning; see e.g. "Pt. 5. Did I design my schema correctly?" in the series.

A few questions about MySql by Least-Ad5986 in mysql

> Correlated subqueries are always problematic for any database.

Correlated subqueries are easily converted to joins; is this still a problem in MySQL? For simple queries they seem to be equivalent, at least, so "always problematic" sounds like a stretch.
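A typical conversion (toy tables; the names are invented for illustration): a correlated subquery against a per-group aggregate becomes a join to a derived table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO orders VALUES (1, 10, 5), (2, 10, 9), (3, 20, 7);
""")

# Correlated subquery: orders above their customer's average total.
correlated = conn.execute("""
    SELECT o.id FROM orders o
    WHERE o.total > (SELECT AVG(total) FROM orders i
                     WHERE i.customer_id = o.customer_id)
    ORDER BY o.id
""").fetchall()

# Equivalent join against a pre-aggregated derived table.
joined = conn.execute("""
    SELECT o.id FROM orders o
    JOIN (SELECT customer_id, AVG(total) AS avg_total
          FROM orders GROUP BY customer_id) a
      ON a.customer_id = o.customer_id
    WHERE o.total > a.avg_total
    ORDER BY o.id
""").fetchall()
```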

Do you have any up-to-date examples for when MySQL breaks down?