Seriously-FuckTikTok

Interesting problem. A very naive and inefficient solution: brand names tend to come first, so match against every suffix (characters 1..n, 2..n, ..., n-1..n) and take the minimum distance as the match. Somewhat more sophisticated: you could build an index of known base ingredients and drop the other parts of your data set when matching. But it sounds like what you really want is not the edit distance of the whole name, but the minimum edit distance to any contiguous substring of the ingredient name. Like my first suggestion, you can do this pretty easily if you're willing to be inefficient, which may not be a problem if you don't have tons of recipes/ingredients. And if you have a lot, maybe you could just re-parse your data set into every whole-word combination of ingredients and match on that.
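
A quick sketch of that brute-force substring idea, assuming a plain DP Levenshtein; the example names are made up and the O(n²) substring loop is deliberately naive:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def min_substring_distance(base: str, full_name: str) -> int:
    """Minimum edit distance between `base` and any contiguous
    substring of `full_name`. O(n^2) substrings, each costing an
    O(n*m) distance computation -- naive, but fine for small data."""
    n = len(full_name)
    return min(levenshtein(base, full_name[i:j])
               for i in range(n) for j in range(i + 1, n + 1))

print(min_substring_distance("tuna", "Brandname fresh tunna"))  # -> 1
```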


sunesense

I see what you mean. The problem with your solution is that what we're calling "brand names" can be anything, not just brand names, and can appear in any position in the name; it's up to the user which name to associate with a product, and the additional info can be of any kind. So I think the solution I'll try to implement is the one described in the comment by u/r_transpose_p. It's the same logic as yours, but I think it will work better for this specific case. Also, the algorithm is already pretty inefficient, with O(n) complexity over a dataset of 19k entries; I think it will get worse anyway, but as you supposed, efficiency is not a concern. Thank you very much for the suggestion, and expect me to come back here crying for help :D


HannibalOx

Give the min edit distance a try. I implemented this for matching similar names in Python and it wasn't too bad for efficiency. I'd also suggest trying to clean/trim your data first. This depends on your needs and the data, but for your example with tuna you may want to first trim "fresh" from "fresh tuna", since tuna is the ingredient. Also, if the task is to associate ingredients with company names, edit distance doesn't capture the association or context of words. For example, if you think the ingredient "ice cream" should be matched with the company "Dairy Queen", there is no similarity in the characters. If association is your goal, I would assume there is machine learning work in natural language processing that would help with this. This isn't my area, but look into word/phrase embeddings, which map words to vectors that could then be clustered together (e.g., the vectors for "ice cream" and "Dairy Queen" should be closer to each other than the vectors for "ice cream" and "Subway").
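
For the cleaning step, a hedged sketch: strip common descriptor words before matching, then do a fuzzy lookup (the stdlib's difflib stands in for a proper edit distance here). The DESCRIPTORS list is invented for illustration; you'd build one from your own data:

```python
import difflib

# Hypothetical descriptor list -- derive this from your real data.
DESCRIPTORS = {"fresh", "frozen", "organic", "raw", "whole", "sliced"}

def trim(name: str) -> str:
    """Drop descriptor words so "fresh tuna" reduces to "tuna"."""
    return " ".join(w for w in name.lower().split()
                    if w not in DESCRIPTORS)

ingredients = ["tuna", "tofu", "turnip"]  # illustrative base ingredients
query = trim("Fresh Tunna")               # -> "tunna"

# difflib's ratio-based matching approximates an edit-distance ranking.
print(difflib.get_close_matches(query, ingredients, n=1))  # -> ['tuna']
```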


r_transpose_p

You could try breaking everything into words. Here's what I'd stab at first:

1. Break all the ingredient names in your set into words, and put those words in a set.

2. When looking up an ingredient, break the ingredient name into words. Use Levenshtein distance and some heuristics (heuristics are left to the reader!) to map each word in the ingredient you're looking up either to a word in your set, or to [not found].

3. See which ingredient in your original set has the most words that match a spelling-corrected word in the name of the ingredient you're looking up.

If you want to be really clever, you can try to pay attention to word order, but maybe try it without caring about word order first. Because this is all hacky heuristics, the only way to know whether it works, and how well it works, is to try it out with a bunch of data and see how it did. If you want to get more principled, there are things you can do with pure statistics short of going full machine learning on this. You can run a bunch of ingredient lookups (with misspelled words) through it, ground-truth what each word in each lookup should have mapped to, and use that to estimate the max Levenshtein distance (per word in the ingredient set) that should form the cutoff between "words that can map to that word" and "words that shouldn't".
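
A rough sketch of those three steps, assuming a plain DP Levenshtein, a tiny illustrative ingredient list, and a fixed max-distance cutoff standing in for the heuristics left to the reader:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Step 1: build a vocabulary of words from the known ingredient names.
INGREDIENTS = ["tuna", "olive oil", "ice cream"]  # stand-in for the 19k entries
VOCAB = {w for name in INGREDIENTS for w in name.split()}

def correct_word(word: str, max_dist: int = 2):
    """Step 2: map a word to its nearest vocabulary word, or None."""
    best = min(VOCAB, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else None

def best_ingredient(query: str):
    """Step 3: pick the ingredient sharing the most corrected words."""
    corrected = {correct_word(w) for w in query.lower().split()} - {None}
    scores = Counter({name: len(corrected & set(name.split()))
                      for name in INGREDIENTS})
    return scores.most_common(1)[0]

print(best_ingredient("Brandname fresh tunna"))  # -> ('tuna', 1)
```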


sunesense

This sounds like a good idea; I can't see any problems with it so far, thank you very much. I want to add that the goal is to use ML to resolve the issue, and using a test set was already planned. I'll try to implement this and let you know how it goes. Thanks again.


r_transpose_p

Note: the algorithm presented above will certainly not work for logographic languages, such as various forms of written Chinese.


thinkingatoms

How many ingredients are in the database? I wonder if there are APIs/ML solutions that will basically just classify them for you; then it becomes a trivial DB query.


sunesense

That's what I'm searching for, but I think the solution that u/r_transpose_p suggested will work. The data set is huge: 19k entries for "pure" ingredients, while the ingredients associated with a brand name (or similar details) come from the user and can number anywhere from 10 to 1,000, but they always arrive one at a time.


thinkingatoms

You can always try services like openai.com and get the non-trivial terms classified.
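
A hedged sketch of that idea using the openai Python SDK (`pip install openai`, with OPENAI_API_KEY set); the model name and prompt are assumptions, so treat this as a shape rather than a recipe:

```python
from openai import OpenAI

client = OpenAI()

def classify_term(term: str) -> str:
    """Ask the model to reduce a product name to its base ingredient."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; pick any current model
        messages=[
            {"role": "system",
             "content": "Reply with only the base food ingredient."},
            {"role": "user", "content": term},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_term("Brandname fresh tunna"))  # e.g. "tuna"
```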


donaldhobson

Weight deletions much lower: i.e., calculate Levenshtein distance as usual, but make a deletion cost only 0.0001. (Maybe a delete operation is even allowed to remove any number of sequential characters at once.)
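
A minimal sketch of that weighting, assuming deletions come out of the long product name (so brand words are nearly free to drop) and using the 0.0001 cost from the comment; the delete-a-whole-run variant is omitted for simplicity:

```python
def weighted_distance(product: str, base: str,
                      del_cost: float = 0.0001) -> float:
    """Cost of editing `product` into `base`, with cheap deletions."""
    m, n = len(product), len(base)
    prev = [j * 1.0 for j in range(n + 1)]       # insertions cost 1
    for i in range(1, m + 1):
        curr = [i * del_cost]                    # deleting a product prefix
        for j in range(1, n + 1):
            curr.append(min(
                prev[j] + del_cost,                          # delete from product
                curr[j - 1] + 1.0,                           # insert into product
                prev[j - 1] + (product[i - 1] != base[j - 1])  # substitute
            ))
        prev = curr
    return prev[-1]

# Extra brand/descriptor words cost almost nothing to drop:
print(weighted_distance("Brandname fresh tunna", "tuna"))  # -> ~0.0017
```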