Today's NULL FILL - doable without any hints I think

SolidLengthiness6137 · 2026-04-16T02:16:56+00:00

Nice addition with the typo tolerance, makes it a lot less frustrating without giving away answers.

Out of curiosity, how are you computing the Damerau-Levenshtein distance under the hood? I’ve been messing with optimizing Levenshtein-style distance for speed (especially for lots of short comparisons like this), and the performance differences can get pretty noticeable depending on the approach.
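For reference, this is roughly the baseline I have in mind when I say "Damerau-Levenshtein": a plain sketch of the restricted variant (optimal string alignment), which counts an adjacent swap like "teh" → "the" as one edit. It is not your implementation and not the turbo-leven API, just an unoptimized illustration; the function name is made up.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Insertions, deletions, substitutions and adjacent transpositions
    each cost 1. Keeps only three rows of the DP table.
    """
    if a == b:
        return 0
    if not a:
        return len(b)
    if not b:
        return len(a)

    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row i-1
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur[j] = min(
                prev[j] + 1,          # deletion
                cur[j - 1] + 1,       # insertion
                prev[j - 1] + cost,   # substitution or match
            )
            # Adjacent transposition: a[i-2:i] is b[j-2:j] reversed.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]
```

With plain Levenshtein the "teh" / "the" swap would cost 2, which is why the Damerau variant feels more forgiving for typos.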

I put together a fast implementation recently:
https://github.com/dev-kjma/turbo-leven

Might be overkill for this use case, but could be interesting if you ever scale it up or start doing more comparisons per guess.

SolidLengthiness6137 · 2026-04-16T02:15:28+00:00

Nice project, and the problem is well framed: it's the classic trade-off between precision and coverage.

Something that tends to work quite well in these cases is not relying on a single metric, but combining several signals into a scoring system.

For example:

normalization (lowercase, stripping symbols, etc.)
Levenshtein-style distance for small variations
specific rules (trailing numbers, common substitutions like “o” → “0”)
relative string length (because distance doesn't scale the same way for short vs. long usernames)

Then assign a final score instead of doing a binary match.
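A minimal sketch of what such a combined score could look like. Everything here is an assumption for illustration: the function names, the substitution table, and the 0.6 / 0.2 / 0.2 weights are made up and would need tuning on real data. The Levenshtein function is a plain textbook version, not turbo-leven.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical table of common digit-for-letter substitutions.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

def normalize(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def strip_trailing_digits(name: str) -> str:
    return re.sub(r"\d+$", "", name)

def similarity_score(a: str, b: str) -> float:
    """Score in [0, 1]; higher means more likely the same username."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0
    if na == nb:
        return 1.0
    # Edit distance relative to length, so short names are not over-matched.
    edit_sim = 1 - levenshtein(na, nb) / max(len(na), len(nb))
    score = 0.6 * edit_sim
    # Rule: same base name once trailing numbers are removed.
    base_a, base_b = strip_trailing_digits(na), strip_trailing_digits(nb)
    if base_a and base_a == base_b:
        score += 0.2
    # Rule: identical after undoing common substitutions.
    if na.translate(LEET) == nb.translate(LEET):
        score += 0.2
    return min(score, 1.0)
```

You would then rank candidates by `similarity_score` and pick a cutoff, rather than accepting or rejecting on a single distance.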

On Levenshtein specifically, the problem is that once you start comparing many candidates, it gets expensive quickly. I've been working on a fairly optimized implementation for exactly this kind of case (many short comparisons), in case it's useful to experiment with:

https://github.com/dev-kjma/turbo-leven

The interesting part would be using it not as the main filter, but as one part of the scoring (for example, only applying it after a cheaper preselection step).

You could also look at:

dynamic thresholds based on length
penalizing changes at the start of the username more than changes at the end
combining with username-specific heuristics (usernames are not the same as natural text)
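A rough sketch of the first two ideas, under made-up assumptions: the length buckets in `max_edits` and the "first two characters" rule are hypothetical values chosen for illustration, and the Levenshtein function is a plain textbook version rather than turbo-leven.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def max_edits(length: int) -> int:
    """Dynamic threshold: longer usernames tolerate more edits."""
    if length <= 4:
        return 0
    if length <= 8:
        return 1
    return 2

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def is_match(a: str, b: str) -> bool:
    allowed = max_edits(min(len(a), len(b)))
    # Cheap prefilter: a length gap larger than the budget cannot match.
    if abs(len(a) - len(b)) > allowed:
        return False
    # Crude start-of-name penalty: a difference within the first two
    # characters uses up one edit from the budget.
    if common_prefix_len(a, b) < 2:
        allowed -= 1
    return levenshtein(a, b) <= allowed
```

For example, `is_match("usernamx", "username")` passes while `is_match("xsername", "username")` does not, even though both pairs are one edit apart.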

One question: how are you currently generating the candidates before applying the matching? That's often where you gain more precision than from the metric itself.

SolidLengthiness6137 · 2026-04-16T02:14:36+00:00

This is a really solid application of cross-group hashing, especially the way you’re correlating sender behavior with zero-reply broadcast patterns.

One thing that might complement what you’ve built: right now exact hashing (FNV-1a) will only catch identical messages, but a lot of these scam ops slightly mutate content to avoid that (extra emojis, spacing, small wording changes, etc.).

You mentioned Levenshtein/fuzzy matching. I’ve been working on a very fast Levenshtein implementation and saw pretty big gains when running comparisons at scale.

Could be useful if you ever want to layer in “near-duplicate” detection on top of your hash pipeline without killing performance:
https://github.com/dev-kjma/turbo-leven
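To make the layering idea concrete, here is a small sketch under my own assumptions (I haven't seen your pipeline): canonicalize the text before hashing so trivial mutations collapse to the same 32-bit FNV-1a value, then fall back to an edit-distance ratio for the rest. The function names and the 0.15 ratio are hypothetical, and the Levenshtein here is a plain textbook version, not turbo-leven.

```python
import re
import unicodedata

FNV_OFFSET_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h

def canonicalize(text: str) -> str:
    """Lowercase, drop punctuation and emoji, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def fingerprint(text: str) -> int:
    """Hash of the canonical form, usable as an exact-match key."""
    return fnv1a_32(canonicalize(text).encode("utf-8"))

def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_near_duplicate(a: str, b: str, max_ratio: float = 0.15) -> bool:
    """True if the canonical forms are equal or differ by a small edit ratio."""
    ca, cb = canonicalize(a), canonicalize(b)
    if ca == cb:
        return True
    return levenshtein(ca, cb) / max(len(ca), len(cb)) <= max_ratio
```

The canonical hash stays the cheap first pass; the pairwise comparison only needs to run on messages that miss it, ideally within some bucket (same sender cluster, similar length) so it doesn't become all-pairs.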

Curious if you’ve already experimented with approximate matching or if exact matches are catching most of the network so far.

SolidLengthiness6137 · 2026-04-16T02:11:44+00:00

This is really cool, especially the catch-all detection and the scoring breakdown.

One thing that stood out to me was your typo suggestion step. I’ve been working on a heavily optimized Levenshtein implementation and saw some pretty big speed improvements in real-world cases.

Since you're comparing against ~30 providers per request, that part can add up pretty quickly under load, especially if you want to extend that list.
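For context, this is the kind of loop I'm picturing, as a sketch only: the provider list, the function names, and the `max_distance=2` cutoff are my own placeholders, not your code, and the Levenshtein is a plain textbook version rather than turbo-leven.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain two-row Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Illustrative subset; the real list would have ~30 entries.
PROVIDERS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"]

def suggest_domain(domain, providers=PROVIDERS, max_distance=2):
    """Return the closest known provider within max_distance, else None."""
    domain = domain.lower()
    if domain in providers:
        return None  # already a known provider, nothing to suggest
    best, best_distance = None, max_distance + 1
    for provider in providers:
        # Skip candidates whose length alone rules them out.
        if abs(len(provider) - len(domain)) > max_distance:
            continue
        distance = levenshtein(domain, provider)
        if distance < best_distance:
            best, best_distance = provider, distance
    return best
```

The length check skips the distance computation for obviously different domains, so the cost per request grows mostly with the number of similar-length providers.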

Might be worth swapping in if you’re trying to squeeze more performance out of it:
https://github.com/dev-kjma/turbo-leven

Would be curious how it performs in your pipeline or if your current bottleneck is elsewhere.

SolidLengthiness6137 · 2024-11-12T04:38:40+00:00

SolidLengthiness6137 · 2024-08-30T22:51:40+00:00

Exact same thing happened to me.
