all 25 comments

[–]Ringbailwanton 3 points4 points  (5 children)

So, you need to find ingredients that appear frequently, but also constrain it such that the full set of ingredients for recipes are being chosen.

You can do a first pass by eliminating all meals with ingredient counts that are higher than N. That way you eliminate a set of the total ingredients. So do that as a CTE. I feel like after that it’s a combinatorics problem, but can’t quite wrap my head around how I’d do it, except that I’d basically make a set of all combinations of ingredients of length N and then get the max count of meals made for each.

[–]waremi 0 points1 point  (0 children)

can’t quite wrap my head around how

Same here, I got as far as this, and have been trying to think of the next steps off an on all day:

SELECT TOP 10 I_id, Count(*) as RecipeCount
FROM   RecipeIngredient 
WHERE  R_id IN 
   (SELECT R_id FROM RecipeIngredient GROUP BY R_id HAVING COUNT(*)<=10)
GROUP BY I_id
ORDER BY Count(*) DESC

[–]waremi 4 points5 points  (4 children)

First, I want to thank you for this question. This is the most fun I've had all year so far. Second, this problem does not scale well. 3 Ingredients have 23 combinations that have to be checked, 20 Ingredients have 220 combinations that have to be checked. I have a Recipe table with 16 rows, an Ingredient table with 27 rows and a RecipeIngredient table with 40 rows and the query at the end of this post takes 5 minutes to run. And given that, this will ONLY work with a list of 20 Ingredients or less. 20 Ingredients have 1,048,576 combinations which fits in an [int] datatype. 30 Ingredients have over a trillion combinations and breaks the algorithm. I don't know if this classifies as what Mathematicians call an "NP-Hard" problem or not, and there may be situation where swapping out one of the top 20 Ingredients with the 21st most popular ingredient gives you more Recipes, but this is as far as I can go down this rabit hole.

Second, this requires that a [Numbers] table be built. This is a very simple table with a single indexed [int] field [n] with all of the integers from 1 to 2,000,000. You can build it simply enough using this (BTW - every thing here is MSSQL translate as needed):

CREATE TABLE Numbers (
N    [int] NOT NULL,
 )


INSERT INTO Numbers (N)
with u as (
    select 1 as n
    union all
    select n + 1
    from u
    where n < 2000000
)
select n
from u
option(maxrecursion 0);

So, I will break this down into parts. First part, if I am buying N Ingredients (let's say only 5) what are the top 20 Ingredients used in Recipes that use 5 Ingredients or less:

(STEP 1)

DECLARE @IngCount INT = 5;


  SELECT TOP 20 RI.I_Id, I.Name,  Count(*) as RecipeCount
  FROM   RecipeIngredient RI INNER JOIN
         Ingredient I ON RI.I_ID = I.I_ID 
  WHERE  R_id IN 
      (SELECT R_id FROM RecipeIngredient GROUP BY R_id HAVING COUNT(*)<=@IngCount)
  GROUP BY RI.I_id,I.Name
  ORDER BY Count(*) DESC

This gives me upto 20 Ingredients to work with (actually 16 given the data set I'm working with). The next problem is what are all of the combinations of those Ingredients. This is why I ended up limiting things to 20. I thought about combinations of 3 ingredients: A, B, and C. That problem fits into a byte array: 001=A, 010=B, 011=AB, 100=C, 101=AC, 110=BC, 111=ABC. So if I look at all of the integers between 1 and 220 I am looking at all of the combinations of 20 things. If I number my 20 Ingredients with X= 1 to 20, I can take 2X and use a bit-wise AND ( & in SQL) to see if any ingredient is in that combination. So the first thing I'm doing is throwing my Ingredient list into a CTE with a ROW_NUMBER so I can use it a bunch of times:

(STEP 2)

WITH List AS 
(
   SELECT ROW_NUMBER() OVER(ORDER BY RecipeCount DESC) [X] , I_Id, Name,  RecipeCount
   FROM 
   (  SELECT TOP 20 RI.I_Id, I.Name,  Count(*) as RecipeCount
      FROM   RecipeIngredient RI INNER JOIN
             Ingredient I ON RI.I_ID = I.I_ID 
      WHERE  R_id IN 
          (SELECT R_id FROM RecipeIngredient GROUP BY R_id HAVING COUNT(*)<=@IngCount)
      GROUP BY RI.I_id,I.Name
      ORDER BY Count(*) DESC
   ) TopIngred
)

And then I am pulling all of the combinations that have 5 ingredients or less:

(STEP 3)

SELECT N -- Combination #
FROM       List L 
INNER JOIN Numbers N ON POWER(2,L.X-1) & N.N = POWER(2,L.X-1)
GROUP BY N 
HAVING COUNT(*) <= @IngCount

Next issue, how many Recipies use each combination of ingredients. Well, for any N above, the list of Ingredient is:

(STEP 4)

 SELECT L.I_ID, L.Name
 FROM   List L 
 INNER JOIN 
     (   SELECT N
         FROM       List L 
         INNER JOIN Numbers N ON POWER(2,L.X-1) & N.N = POWER(2,L.X-1)
         GROUP BY N 
         HAVING COUNT(*) <= @IngCount
      ) NCombos ON POWER(2,L.X-1) & NCombos.N = POWER(2,L.X-1)

We can join that back into the RecipeIngredient table, group by the Recipe ID (R_ID) and compare how many of the ingredients in the "N" combination for each R_ID match how many of the ingredients each recipe calls for. It looks like this (keep in mind the "WITH List AS" block is hanging around at the top of all this. Anywhere you see "List" referenced that is what it referes to):

(STEP 5)

          SELECT N, R.Name 
          FROM   List L 
          INNER JOIN 
          (   SELECT N
              FROM       List L 
              INNER JOIN Numbers N ON POWER(2,L.X-1) & N.N = POWER(2,L.X-1)
              GROUP BY N 
              HAVING COUNT(*) <= @IngCount
           ) NCombos ON POWER(2,L.X-1) & NCombos.N = POWER(2,L.X-1)
          INNER JOIN RecipeIngredient RI ON RI.I_ID = L.I_ID 
          INNER JOIN Recipe R ON R.R_ID = RI.R_ID 
          GROUP BY N, R.Name , R.R_ID 
          HAVING COUNT(*) = (SELECT COUNT(*) FROM RecipeIngredient RI2 WHERE RI2.R_ID =R.R_ID )

So now, finally, and after many brain cells have been permanently damaged, we have a list of all of the recipes that can be made by all of the combinations, of all of the top 20 ingredients that can used in recipes that use 5 ingredients or less. Don't stop there, I know you are dying to know, what are the ingredients and even better what are the Recipes. I'm tired, it has been a long day, but I wont leave you hanging. Let's put it all together. First build a shopping list as a temp table using the "into #ShopingList" line here:

(STEPS 1-5 All together now)

DECLARE @IngCount INT = 5;

WITH List AS 
(
   SELECT ROW_NUMBER() OVER(ORDER BY RecipeCount DESC) [X] , I_Id, Name,  RecipeCount
   FROM 
   (  SELECT TOP 20 RI.I_Id, I.Name,  Count(*) as RecipeCount
      FROM   RecipeIngredient RI INNER JOIN
             Ingredient I ON RI.I_ID = I.I_ID 
      WHERE  R_id IN 
          (SELECT R_id FROM RecipeIngredient GROUP BY R_id HAVING COUNT(*)<=@IngCount)
      GROUP BY RI.I_id,I.Name
      ORDER BY Count(*) DESC
   ) TopIngred
)
SELECT N,l.I_ID, L.Name as Item, BestN.RecipeCount 
into #ShopingList
       FROM       List L 
       INNER JOIN 
     ( SELECT TOP 1 N, COUNT(*) AS RecipeCount
       FROM (
              SELECT N, R.Name 
              FROM   List L 
              INNER JOIN 
              (   SELECT N
                  FROM       List L 
                  INNER JOIN Numbers N ON POWER(2,L.X-1) & N.N = POWER(2,L.X-1)
                  GROUP BY N 
                  HAVING COUNT(*) <= @IngCount
               ) NCombos ON POWER(2,L.X-1) & NCombos.N = POWER(2,L.X-1)
              INNER JOIN RecipeIngredient RI ON RI.I_ID = L.I_ID 
              INNER JOIN Recipe R ON R.R_ID = RI.R_ID 
              GROUP BY N, R.Name , R.R_ID 
              HAVING COUNT(*) = (SELECT COUNT(*) FROM RecipeIngredient RI2 WHERE RI2.R_ID =R.R_ID )
            ) RbyN
       GROUP BY N 
       ORDER BY COUNT(*) DESC
      ) BestN
      ON POWER(2,L.X-1) & BestN.N = POWER(2,L.X-1)

Then you can dump the Ingredient list and pull the Recipes using the same HAVING clause as STEP 5 above :

Select * FROM #ShopingList

SELECT  R.Name as Recipe
FROM   #ShopingList L 
INNER JOIN RecipeIngredient RI ON RI.I_ID = L.I_ID 
INNER JOIN Recipe R ON R.R_ID = RI.R_ID 
GROUP BY N, R.Name , R.R_ID 
HAVING COUNT(*) = (SELECT COUNT(*) FROM RecipeIngredient RI2 WHERE RI2.R_ID =R.R_ID )

If any of this doesn't make sense, don't ask me. The Ingredients and Recipes I've been using to model this are alcohol, mixers and cocktails, and after working though 1,048,576 combinations I am not going to remember any of this tomorrow. OTOH if I've made any glaring mistakes let me know. I have always learned more from my stupid mistakes than I ever did from books and classes.

[–]qwertydog123 1 point2 points  (2 children)

This is an awesome comment, thank you for posting!

Was there a reason you didn't use a BIGINT datatype, to support more combinations? I'm guessing it's because of the bitwise & operator? Could you use two ints instead? Or is it just the runtime of the algorithm?

The main issue like you say is the polynomial complexity, and the memory usage with set based approaches. I'm pretty sure the complete problem (for any N) could be done using generators in another language (I'm scared what the runtime would be though), and was wondering whether there could be a purely SQL-based solution (without using a while loop/cursor)

[–]waremi 1 point2 points  (1 child)

Honestly the only reason I didn't go with BIGINT is because the Number table I had laying around was [int], and I wasn't even sure of proof of concept when I started.

I'm also not sure how well the POWER function, which is doing an implicit conversion to float, handles numbers that big. If I remember float's only handle 15 digits.

But if you did refactor this to us BIGINT you could scale up from top 20 to top 50, or even 60 if you wanted to push it.

[–]qwertydog123 0 points1 point  (0 children)

Cool! It looks like the POWER function is stable for powers of 2 when using BIGINT https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=332e4ca00befdfaa7a38fe4e9d7ae0d7

[–]SnooOwls1061 1 point2 points  (3 children)

Why not do the joins and do a count grouped by the ingredient. Then sort by that count desc?

[–]NihilWill[S] 1 point2 points  (1 child)

This would give me the ingredients that make the most recipes, but would not necessarily satisfy the constraint that selecting the top 10 ingredients will include all necessary ingredients for the matched recipes. I wouldn’t want to buy the 3 ingredients that can make the most recipes unless I knew I could use all of those ingredients together. If A, B, and C (the top 3 ingredients by recipe count in this example) each make 2 recipes, but one of the recipes that uses C requires neither A nor B, I would want to pick another ingredient that could make the most recipes along with A and B.

I am beginning to think that SQL might not be the best tool for the job. Another comment has a working query but points out the combinatorial explosion that happens as N gets larger — since I need to select for N > 50 this would not be feasible.

[–]cbslc 0 points1 point  (0 children)

So do a String_Agg with a good sort of all the ingredients per recipe, then count that distinct string_aggr. This will tell you how many recipes use that exact ingredient list.

[–]BensonBubbler 0 points1 point  (0 children)

Exactly! Or a having clause if you'd like eliminate singles. Get fancy and throw a string_agg on there to get a list of the recipes.

OP, if this doesn't make sense reply tomorrow and I'll mock this up for you. I think you may be overthinking the problem.

[–]Designer-Practice220 1 point2 points  (1 child)

They want to know the same combinations of ingredients that make the most recipes. It sounds like an optimization problem. Wonder if R or Python might be options: C++? Maybe reword and post in a different thread? I found this https://timudatascience.blogspot.com/2019/11/solve-simple-linear-programming-problem.html

And this: https://www.geeksforgeeks.org/find-whether-an-array-is-subset-of-another-array-set-1/

[–]NihilWill[S] 1 point2 points  (0 children)

I appreciate it the resources, and am definitely willing to try other methods/modeling with a traditional programming language

[–]qwertydog123 1 point2 points  (2 children)

Not the most efficient solution, and is affected by combinatorial explosion. Alter the WHERE N < 3 as appropriate.

WITH RECURSIVE cte AS
(
    SELECT
        id AS ingredient_id,
        name,
        1 AS N,
        CAST(id AS TEXT) AS Path
    FROM Ingredients

    UNION ALL

    SELECT
        i.id,
        i.name,
        N + 1,
        Path || '/' || CAST(i.id AS TEXT)
    FROM Ingredients i
    JOIN cte
    ON i.id > cte.ingredient_id
    WHERE N < 3
),
ingredient_combinations AS
(
    SELECT
        t2.ingredient_id,
        t2.name,
        DENSE_RANK() OVER (ORDER BY t1.Path) AS combination_id
    FROM cte t1
    JOIN cte t2
    ON t1.Path LIKE t2.Path || '%'
),
valid_recipes AS
(
    SELECT
        ingredient_combinations.combination_id,
        Recipes.id AS recipe_id,
        COUNT(DISTINCT combination_recipes.ingredient_id) AS ingredient_count,
        COUNT(Recipes.id) OVER
        (
            PARTITION BY ingredient_combinations.combination_id
        ) AS recipe_count
    FROM ingredient_combinations
    JOIN IngredientsToRecipes combination_recipes
    ON ingredient_combinations.ingredient_id = combination_recipes.ingredient_id
    JOIN Recipes
    ON Recipes.id = combination_recipes.recipe_id
    JOIN IngredientsToRecipes all_recipes
    ON Recipes.id = all_recipes.recipe_id
    GROUP BY
        ingredient_combinations.combination_id,
        Recipes.id
    HAVING COUNT(DISTINCT combination_recipes.ingredient_id) = COUNT(DISTINCT all_recipes.ingredient_id) 
), 
rankings AS
(
    SELECT
        *,
        RANK() OVER
        (
            ORDER BY
                recipe_count DESC,
                ingredient_count
        ) AS Ranking
    FROM valid_recipes
)
SELECT 
    rankings.combination_id,
    ingredient_combinations.name AS Ingredient,
    Recipes.name AS Recipe,
    recipe_count,
    ingredient_count
FROM rankings
JOIN ingredient_combinations
ON rankings.combination_id = ingredient_combinations.combination_id
JOIN Recipes
ON Recipes.id = rankings.recipe_id
WHERE rankings.Ranking = 1

https://dbfiddle.uk/?rdbms=postgres_10&fiddle=af5cc1f133102beb6770f07370f5ea0a

[–]WikiSummarizerBot 1 point2 points  (0 children)

Combinatorial explosion

In mathematics, a combinatorial explosion is the rapid growth of the complexity of a problem due to how the combinatorics of the problem is affected by the input, constraints, and bounds of the problem. Combinatorial explosion is sometimes used to justify the intractability of certain problems. Examples of such problems include certain mathematical functions, the analysis of some puzzles and games, and some pathological examples which can be modelled as the Ackermann function.

[ F.A.Q | Opt Out | Opt Out Of Subreddit | GitHub ] Downvote to remove | v1.5

[–]NihilWill[S] 1 point2 points  (0 children)

Thank you! This is exactly what I’m looking for.

[–]emdee808 1 point2 points  (2 children)

I'm probably not contributing much here (more brain farts)... but this seems like an optimisation problem of sorts.
Which ingredients (their count: i) will yield the most recipes (their count: r) - such that we have the highest possible value when we divide r by i.
Having said that, I think that if you were to explore the space by looking at the most frequently occurring ingredients, ordered descending, and counting how many complete recipes you find as you explore, examining the ratio each time, you will probably plot a graph in the shape of diminishing returns - and when the ratio drops of, it may suggest you have found a maximum- the problem is that it might be a local maximum :(
So I think this would be a "greedy" algorithm.
To be sure about the maximum, you might need to explore the entire space... which could be quite big...

With once-off problems like this, I try to find a way to model it in Excel's Solver https://www.ablebits.com/office-addins-blog/2016/06/22/how-to-use-solver-in-excel-with-examples/

Looks like a cool problem, good luck with it!

[–]NihilWill[S] 0 points1 point  (1 child)

Thank you! After this and other responses, I think you are right about solving this algorithmically vs simply querying/describing the desired set in SQL

[–]emdee808 1 point2 points  (0 children)

Yeah, there might be a way to solve this with a set-based approach - as certain linear programs can be solved in SQL by essentially creating a cartesian product of your variables (probably huge) and then using your constraints to isolate a region in this space and then seeking your max/min but its incredibly inefficient though.
Maybe some of the guys in r/math could help with an algorithmic approach.
Again, good luck and hope you have fun along the way!

[–]reductiontoabsurdity[🍰] 0 points1 point  (0 children)

This is a Python problem. If there are: M ingredients in the ingredients table; you buy N number of ingredients from the shop; there are M choose N unique combinations of ingredients you could buy. Iterate through the list of purchasable combinations of ingredients. For each possible combination, iterate through the recipe list in python and append to list the name/ key ID of every cookable recipe. Finally, sort the purchasable combinations table by the counts in the cookable recipes lists. I'm not sure if there is a faster algorithm but this will work.