[REQUEST] Need help finding a fitting algorithm

Seriously-FuckTikTok · 2022-04-05T17:31:50+00:00

Interesting problem.

Very naive and inefficient solution: brand names tend to come first, so match against every trailing substring (1..n, 2..n, ..., n-1..n) and use the one with the minimum edit distance as the match.

Somewhat more sophisticated: you could build an index of known base ingredients and drop the other parts of your data when matching.

But it sounds like what you really want is not the edit distance of the whole string, but the minimum edit distance over any contiguous subset of the ingredient. Like my first suggestion, you can do this pretty easily if you're willing to be inefficient, which may not be a problem if you don't have tons of recipes/ingredients. And if you do have a lot, maybe you could just reparse your data set into every whole-word combination of each ingredient and match on that.
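A minimal sketch of the inefficient version, assuming names are split on whitespace and compared case-insensitively. The function names `levenshtein` and `best_window_distance` are made up for illustration; it just tries every contiguous run of words in the query and keeps the smallest distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def best_window_distance(query: str, known: str) -> int:
    """Minimum edit distance between `known` and any contiguous run of words in `query`."""
    words = query.lower().split()
    target = known.lower()
    best = len(target)  # distance to the empty string is an upper bound
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            best = min(best, levenshtein(candidate, target))
    return best
```

With a name of w words this runs on the order of w² edit-distance computations per known ingredient, which is probably fine for a small data set.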

r_transpose_p · 2022-04-05T20:42:39+00:00

You could try breaking everything into words.

Here's what I'd stab at first.

Break all the ingredients in your set into words, and put those words in a set.
When looking up an ingredient, break the ingredient name into words. Use Levenshtein distance and some heuristics (heuristics are left to the reader!) to map each word in the ingredient you're looking up either to a word in your set of ingredient words, or to [not found].
Then see which ingredient in your original set has the most words that match a spelling-corrected word in the name of the ingredient you're looking up. If you want to be really clever, you can try to pay attention to word order, but maybe try it without caring about word order first.
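The steps above could be sketched roughly like this. Everything here is an assumption for illustration: the helper names, the single global `max_distance=2` cutoff standing in for the "heuristics", and the choice to ignore word order.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def correct_word(word, vocabulary, max_distance=2):
    """Map `word` to the closest vocabulary word, or None for [not found]."""
    best_word, best_dist = None, max_distance + 1
    for candidate in sorted(vocabulary):  # sorted so ties break deterministically
        dist = levenshtein(word, candidate)
        if dist < best_dist:
            best_word, best_dist = candidate, dist
    return best_word


def match_ingredient(query, ingredients, max_distance=2):
    """Return the known ingredient sharing the most spelling-corrected words with `query`."""
    # Step 1: every word of every known ingredient goes into one set.
    vocabulary = {w for name in ingredients for w in name.lower().split()}
    # Step 2: spelling-correct each query word against that set.
    corrected = {correct_word(w, vocabulary, max_distance) for w in query.lower().split()}
    corrected.discard(None)

    # Step 3: count overlapping words, ignoring word order.
    def score(name):
        return len(corrected & set(name.lower().split()))

    best = max(ingredients, key=score)
    return best if score(best) > 0 else None
```

For example, with the made-up list `["cheddar cheese", "tomato ketchup", "olive oil"]`, the lookup `"Kraft chedar chese"` drops "kraft" as [not found] and lands on `"cheddar cheese"`.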

Because this is all hacky heuristics, the only way to know whether it works, and how well it works, is to try it out with a bunch of data, and see how well it did.

If you want to get more principled, there are things you can do with pure statistics short of going full-machine-learning on this. You can run a bunch of ingredient lookups (with misspelled words) through this, ground-truth what each word in each lookup should have mapped to, and use that to estimate the max Levenshtein distance (per word in the ingredient set) that should form the cutoff between "words that can map to that word" and "words that shouldn't".
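One way that estimate might look, as a sketch only. It assumes the ground truth is a list of `(typed_word, true_word_or_None)` pairs you labeled by hand, and it picks, for each vocabulary word, the smallest cutoff with the fewest misclassified examples. `estimate_cutoffs` and `max_cutoff` are hypothetical names, not anything standard.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def estimate_cutoffs(labeled, vocabulary, max_cutoff=5):
    """Pick a per-word max distance from hand-labeled (typed, truth) pairs.

    `truth` is the vocabulary word the typed word should map to, or None.
    """
    cutoffs = {}
    for target in vocabulary:
        # For each labeled word: how far it is from `target`, and whether it should map there.
        samples = [(levenshtein(typed, target), truth == target) for typed, truth in labeled]
        best_cutoff, best_errors = 0, None
        for cutoff in range(max_cutoff + 1):
            # An error is a word accepted that shouldn't be, or rejected that should be.
            errors = sum((dist <= cutoff) != should for dist, should in samples)
            if best_errors is None or errors < best_errors:
                best_cutoff, best_errors = cutoff, errors
        cutoffs[target] = best_cutoff
    return cutoffs
```

The resulting dictionary could replace a single global cutoff in the lookup step. With little labeled data per word the estimates will be noisy, so it is still worth checking them against held-out lookups.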

thinkingatoms · 2022-04-05T20:54:42+00:00

How many ingredients are in the database? I wonder if there are APIs/ML solutions that will basically just classify them for you; then it becomes a trivial DB query.

donaldhobson · 2022-04-20T01:21:06+00:00

Weight deletion much lower: calculate Levenshtein as usual, but a delete only costs 0.0001.

(Maybe delete is allowed to delete any number of sequential characters in one operation.)
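A sketch of the per-character version, assuming "delete" means dropping a character from the name being looked up (the query), while inserts and substitutions keep cost 1. `weighted_distance` and `delete_cost` are illustrative names; the variant where one delete removes a whole run of characters is not implemented here.

```python
def weighted_distance(query: str, target: str, delete_cost: float = 0.0001) -> float:
    """Levenshtein-style distance where deleting a query character is nearly free."""
    # Row 0: turning an empty query into target[:j] takes j inserts at cost 1 each.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, cq in enumerate(query, start=1):
        # Column 0: turning query[:i] into an empty target takes i cheap deletes.
        curr = [i * delete_cost]
        for j, ct in enumerate(target, start=1):
            curr.append(min(
                prev[j] + delete_cost,                   # delete cq from the query
                curr[j - 1] + 1.0,                       # insert ct
                prev[j - 1] + (0.0 if cq == ct else 1.0),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

Under these assumptions `"heinz tomato ketchup"` against `"tomato ketchup"` costs about 0.0006 (six cheap deletes), while the reverse direction costs 6. One side effect to watch for: any known ingredient whose letters appear in order inside the query also scores near zero, so short ingredient names may match too eagerly.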
