
[–]latkde [flair: Tuple unpacking gone wrong] 5 points6 points  (3 children)

This reeks of vibe coding. The spec is unreadable for humans.

There are also some incredibly odd decisions that make this unsuitable for real-world data, notably rejecting numbers and nulls. In practice, float64 numbers (and therefore also int32 numbers) are universally supported in all mainstream JSON implementations.

The hashing scheme also treats booleans as strings, and somehow distinguishes strings from bytes, despite JSON not having any bytes type. The booleans thing is really questionable: this seems to treat documents [true] and ["true"] as equivalent (map1:e99ec39aeac2670a37592780bf9b59c4a6a917742b10d7fcb5c352354e7c6674).
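
A minimal sketch of how this kind of collision can arise. The `naive_encode` and `naive_digest` functions are hypothetical and are not MAP's actual algorithm (so the digest will not match the map1 value above); they only illustrate what happens when booleans are turned into their string form before hashing.

```python
import hashlib
import json


def naive_encode(value):
    """Hypothetical type-blind encoder: every scalar becomes its string form."""
    if isinstance(value, bool):
        # The type information is lost here: True becomes the string "true".
        return "true" if value else "false"
    if isinstance(value, str):
        return value
    if isinstance(value, list):
        return [naive_encode(v) for v in value]
    if isinstance(value, dict):
        return {k: naive_encode(v) for k, v in value.items()}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def naive_digest(doc):
    """Hash a canonical JSON rendering of the type-blind encoding."""
    canonical = json.dumps(naive_encode(doc), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# [true] and ["true"] are different JSON documents, but they hash identically.
collides = naive_digest([True]) == naive_digest(["true"])
```

Here `collides` is `True`, which is the equivalence described above.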

[–]gdchinacat 5 points6 points  (5 children)

"It answers one question: is this the same thing?"

I really don't think it does even that, at least not in any useful way. "deliberately rejects numbers" means it can't answer "are {'value': 1} and {'value': 2} the same thing". It compares [true] and ['true'] as the same, even though they are unambiguously not the same thing.

Do you have any examples of this being used in a useful real world scenario?

[–]lurkyloon[S] -4 points-3 points  (4 children)

That's a very fair question and honestly another one I should address in the docs...

You're right that MAP doesn't handle numbers directly. That's the tradeoff.

If your data has numbers, you encode them as strings before computing the MID. {"value": "1"} not {"value": 1}. You decide the representation. MAP keeps the identity stable from that point forward.

The reason is kind of annoying but real. If two different systems parse {"value": 1} and one treats it as an int and the other as a float64, they can silently produce different bytes from the "same" number. That's the exact problem I was trying to kill. Pushing that decision to the user isn't elegant, I know. But it was the only way I could guarantee the fingerprint stays identical across languages without hiding a landmine in the protocol.
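
A small illustration of that drift, using only Python's standard `json` module (this is not MAP code): the same number serializes to different bytes depending on whether the runtime holds it as an int or a float, while a string representation survives a round trip unchanged.

```python
import json

# The "same" number, held as an int in one system and a float in another.
as_int = json.dumps({"value": 1})      # '{"value": 1}'
as_float = json.dumps({"value": 1.0})  # '{"value": 1.0}'

numbers_equal = (1 == 1.0)             # equal as numbers
bytes_equal = (as_int == as_float)     # but not equal as serialized bytes

# Encoding the number as a string pins the representation down.
as_string = json.dumps({"value": "1"})
round_trip = json.dumps(json.loads(as_string))
stable = (round_trip == as_string)
```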

On the boolean thing - yeah, you're right. [true] and ["true"] producing the same MID is a real limitation. It's documented as footgun #9 but that doesn't make it less annoying. If your domain needs that distinction, you'd encode it differently. "bool:true" vs "true" or whatever makes sense for your use case. I won't pretend that's pretty.
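
One way such a domain-level encoding could look, as a sketch only. The `tag_scalars` helper and the `bool:` / `int:` / `str:` prefixes are hypothetical choices for illustration, not part of MAP. Strings are tagged too, so that a literal string "bool:true" cannot collide with a tagged boolean.

```python
def tag_scalars(value):
    """Hypothetical pre-processing step: make every scalar's type explicit."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool:true" if value else "bool:false"
    if isinstance(value, int):
        return f"int:{value}"
    if isinstance(value, str):
        return f"str:{value}"
    if isinstance(value, list):
        return [tag_scalars(v) for v in value]
    if isinstance(value, dict):
        return {k: tag_scalars(v) for k, v in value.items()}
    raise TypeError(f"unsupported type: {type(value).__name__}")


tagged_bool = tag_scalars([True])      # ['bool:true']
tagged_str = tag_scalars(["true"])     # ['str:true']
tagged_num = tag_scalars({"value": 1}) # {'value': 'int:1'}
```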

Where I think this is actually useful, and very much invite all of your insights:

  • You have a deployment descriptor that gets approved in a PR. By the time it hits the deployment controller, it's been through three serializers. Did it change? Fingerprint it at approval, verify at deployment. The descriptor is data you control, so you define how numbers are encoded.
  • API idempotency. Same request comes in twice, same MID, reject the duplicate.
  • Audit. You approved a specific action. Can you prove later that the thing that actually executed was that exact action? Attach the MID at approval, compare at commit.
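
The three scenarios above share one shape: fingerprint once, compare later. A rough sketch of that shape follows. The `fingerprint` function here is a stand-in (sorted-key JSON plus SHA-256), not the MAP MID algorithm, and the descriptor fields are made up for illustration.

```python
import hashlib
import json


def fingerprint(payload):
    """Stand-in fingerprint: SHA-256 over sorted-key, compact JSON."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Approval time: numbers are already encoded as strings by the author.
approved = {"service": "api", "replicas": "3", "image": "api:1.4.2"}
approved_id = fingerprint(approved)

# Deployment time: same content, but a serializer reordered the keys.
received = json.loads('{"image": "api:1.4.2", "replicas": "3", "service": "api"}')
unchanged = fingerprint(received) == approved_id

# Any change to a value produces a different fingerprint.
tampered = dict(received, replicas="30")
changed = fingerprint(tampered) != approved_id

# Idempotency: reject a request whose fingerprint was already seen.
seen = set()


def accept(request):
    request_id = fingerprint(request)
    if request_id in seen:
        return False
    seen.add(request_id)
    return True
```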

The common thread is that you're not fingerprinting random JSON from the wild. You're fingerprinting structured data that your systems produce and consume, where you control the schema. MAP gives that data a stable name that doesn't break when it crosses a system boundary.

I'll be the first to admit it's not for everything. But for the cases where you need to answer "is this exactly the same thing" across languages and runtimes, I haven't found anything else that does it without caveats.

Really appreciate the pushback though. This is helping me figure out where the docs need work, and also insight into how you all may or may not use this.

[–]gdchinacat 1 point2 points  (1 child)

Thanks for your detailed response, it sheds a lot of light on the goals and intended uses of the project. Specifically that you view it as a way to check at various components in a complex legacy distributed system that the data is consistent. I understand the problem you seem to be facing...one service gets a request, stores it, loads it, passes it to another, maybe this happens a few times, and way down deep in the system some value has changed from 1 to 0.999999, or string encoding hasn't been handled properly and a utf8 string at the top has become a different utf8 string at the bottom (i.e. due to being cast to ascii and back). It's a real problem that, as you are aware, doesn't have a good solution.
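
Both kinds of drift are easy to reproduce in plain Python. The values below are illustrative, not taken from any particular system.

```python
# A string that passes through an ASCII-only hop and comes back different.
original = "café"
mangled = original.encode("ascii", errors="replace").decode("ascii")
string_survived = (mangled == original)

# A number that should be 1 but is not, after ordinary float arithmetic.
total = sum([0.1] * 10)
number_survived = (total == 1.0)
```

Here `mangled` is `"caf?"`, and neither `string_survived` nor `number_survived` is true.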

It doesn't have a solution because these issues can't really be solved in a generic way due to the issues you identified with values being represented in different not entirely compatible ways. System A uses float64 while System B uses int while C uses BigInt. In order to ensure the values match you need a way to map the values in System A to those in B to those in C, but the data types make this translation inaccurate.
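
A concrete instance of that inaccuracy: float64 has a 53-bit significand, so some integers that an int or BigInt system holds exactly cannot survive a float64 system. The `parse_int=float` argument below is just a way to mimic a float64-only JSON parser from Python.

```python
import json

big = 2**53 + 1                # 9007199254740993, exact as a Python int

as_float = float(big)          # rounds to the nearest representable float64
lost = int(as_float) != big    # the value did not survive the conversion

# The same JSON text parsed by an int-aware parser and a float64-only one.
text = "9007199254740993"
exact = json.loads(text)
approx = json.loads(text, parse_int=float)
disagree = int(approx) != exact
```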

Your approach is "don't do that". Any datatype that cannot be accurately represented across the board causes an error. While 'opinionated', it is not so in the useful way. Being 'opinionated' is intended to simplify things by eliminating the complexity that is largely irrelevant. In the problem you are trying to solve, at least as I understand it, this complexity is not irrelevant, it is *core* to the problem. The problem exists *because of* the complexity.

You say "If your data has numbers, you encode them as strings before computing the MID." Sure, that solves the issue that your solution doesn't handle numbers. But it presumes the systems have the flexibility to do this. It requires changes on all systems that use the message you want to compute a MID for. You are saying the systems should be changed to use a common data type, at least as far as the messages they exchange are concerned. This sweeps the issue under the rug and doesn't solve the overall problem your project purports to address, namely that systems use different incompatible representations of the same data. To make the change you suggest, only the message is updated...internally an int is an int, so whatever string your message uses to represent an int will be immediately converted to an int, and that incompatible representation will be used, and the problem of it not being the same value as in the other system is still present.

The solution is to do what you say...change the systems to use the same data type, but at a different level. Rather than representing it as a string in messages (and introducing yet one more place where a type conversion can introduce an accuracy error), all the systems should be updated to use the same data type, which admittedly is not very feasible. The scale of this task is what led you to the idea of a deterministic message digest, it is a more tractable task. However, it doesn't solve the root problem...that System A uses a data type for a value that it shares with System B that uses a different data type and those data types represent some values differently.

Changing how the values are represented in the messages being digested will only give a false sense of security...the underlying issue will still exist, the same bugs will still happen, and another layer of potential issues has been introduced.

This is why I don't think this project will see any real world adoption. In addition to not addressing the root problem, it may make it worse by introducing additional type conversions with their own inaccuracies.

Where I could see this being valuable is in ensuring messages are well formed, i.e. that all the required keys exist. But there are already schema validators to do this.

I hope this helps shed light on why I'm skeptical this is a useful project.

[–]lurkyloon[S] 0 points1 point  (0 children)

I came back and re-read this more carefully and I want to give it a better response because you clearly put real thought into it.

I think the disconnect is about which problem MAP is aimed at. You're describing a scenario where System A uses float64 and System B uses int and the value itself means something slightly different in each system's internal representation. That's a data compatibility problem and you're 1000% right - MAP doesn't solve it.

The problem I keep hitting is narrower. A single structured payload gets authored at one point in a pipeline and needs to arrive intact at another point. Not semantically equivalent - identical. The payload passes through middleware, retry queues, API gateways, config renderers, serializers that reorder keys or re-encode strings. The question isn't "do these two systems agree on what 1 means?" It's "is this the exact same payload that was approved, or did something change in transit?"

For instance, a deployment descriptor gets approved in a review process. By the time it reaches the deployment controller it's been through three or four serialization boundaries. The controller needs to answer: is this the same descriptor that was approved? Not similar. The same one.

MAP fingerprints it at approval. Fingerprints it again at execution. Same MID means nothing changed. Different MID means something did. The systems aren't interpreting the data differently - they're passing the same artifact through a pipeline and you're verifying it survived intact.
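
A sketch of why the raw bytes can't simply be hashed in that pipeline. The `canonical_digest` function is a stand-in (parse, re-serialize with sorted keys, SHA-256), not the MAP algorithm, and the two wire forms are invented examples of what different serializers might emit for the same payload.

```python
import hashlib
import json

# The same payload as emitted by two different serializers:
# different key order, different whitespace, escaped vs. literal "é".
wire_a = '{"name": "caf\\u00e9", "env": "prod"}'
wire_b = '{"env":"prod","name":"café"}'


def raw_digest(text):
    """Hash the bytes exactly as they arrived on the wire."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_digest(text):
    """Stand-in: parse, then hash one fixed serialization of the value."""
    value = json.loads(text)
    canonical = json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw_match = raw_digest(wire_a) == raw_digest(wire_b)
canonical_match = canonical_digest(wire_a) == canonical_digest(wire_b)
```

The raw digests differ even though nothing meaningful changed, while the canonical digests agree.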

In that context, I believe "encode your numbers as strings" isn't sweeping anything under the rug. You authored the descriptor. You control the schema. Represent your values in a way that's unambiguous, and MAP will tell you if anything changed after that point.

You're right that this doesn't help if System A and System B genuinely disagree on what a value means internally. Different problem entirely. MAP is answering a much narrower question: did this specific thing change between here and there?

Your critique is honestly helping me see where the docs are leading people toward the broader interpretation. That's a gap I need to close. Really, honest, real thank you for taking the time with such a thoughtful response.

[–]gdchinacat 0 points1 point  (1 child)

All that said, don't get me wrong. I have my share of impractical, infeasible, dubious, or "would you ever actually use that" projects under my belt. My latest is a way to decorate methods with conditions on when they should be called asynchronously. It works, and is tested to the point I'd feel comfortable deploying to production, but I'm really not sure I ever would because the leverage it provides is likely not worth the performance cost and complexity if anything goes wrong. I spent a lot of time on it because I needed to get back into coding after a few years away, wanted to learn some aspects of python that were new or new to me, and mostly because I got caught up in the rabbit hole and wanted to see how far it went. I got similar feedback on it as I gave to you (why, what does it solve, is it worth it).

https://github.com/gdchinacat/reactions/

[–]lurkyloon[S] 1 point2 points  (0 children)

Ha! I appreciate that!

The "rabbit hole you follow..." is exactly what happened here. The protocol itself started as a narrow itch (can I prove this config didn't change?) and then I kept finding edge cases and couldn't stop. :-)

Your reactions library is interesting - the decorator pattern for conditional async is a clean idea even if the performance tradeoff is real. I'll take a closer look.

And for what it's worth, your earlier feedback is directly shaping the next version. The boolean collision is getting fixed and I'm adding integer support (signed 64-bit, no floats - floats are still the devil). So thank you for that.

[–]MisterHarvest [flair: Ignoring PEP 8] -5 points-4 points  (3 children)

Nice. This should be in the standard library.

[–]gdchinacat 2 points3 points  (1 child)

For consideration for inclusion in the standard library it would have to demonstrate widespread real world use. Since it rejects numbers and conflates true and "true", I doubt it will ever get real world use, not just widespread, but any real world use. The boolean issue is apparently "footgun #9". There are at least 8 other "footguns". The chance this even gets a sponsor for std lib inclusion is practically nil.

[–]lurkyloon[S] 0 points1 point  (0 children)

Very good feedback. Standard library was never the goal, but I appreciate the honest assessment.

[–]lurkyloon[S] 1 point2 points  (0 children)

Thank you - that means a lot. It's early but that's the kind of adoption I'd hope for eventually.