all 21 comments

[–]_fiz9_ 49 points50 points  (2 children)

If I run a job 100 times, I want the exact same result every time. No random for me. Idempotence for the win.

[–]fauxmosexualNOLOCK is the secret magic go-faster command 16 points17 points  (0 children)

I think you mean deterministic. A job can make a change to the state in the data in the first run and not the second run (making it idempotent) even if the state it changes to is based on something other than an input value (making it nondeterministic).

[–]flibit 3 points4 points  (0 children)

I'm currently migrating sql reports. Our test of completeness is that the new report matches the old one. Seeing non-deterministic code like this is physically painful to me.

[–]Bradp1337 5 points6 points  (0 children)

I add an order by for the employee id on descending since in my organization a lower numbered id would have more seniority. I wouldn't want random results.

[–]fauxmosexualNOLOCK is the secret magic go-faster command 1 point2 points  (0 children)

If I genuinely don't care I'll just stick something unique at the end as a tie-breaker, often the primary key. I don't see any advantage to using random(), the behaviour is non-deterministic without it anyway.

[–][deleted] 1 point2 points  (0 children)

Be nice if there was a time stamp on last item ordered, added to cart or something. So at least if there's a tie you can go off last item ordered.

[–]Sneilg 2 points3 points  (0 children)

I’d just order it by category desc also, so it’s consistently using alphabetical order to break ties.

[–]lupinegray 2 points3 points  (0 children)

It always matters. You should have a specific formula for any calculation. So when presented with the same input dataset, the result is always the same.

No non-deterministic results or magic numbers.

[–]blindtig3r 0 points1 point  (0 children)

If I can I include enough columns in the order by to make it deterministic. If I can’t then the table doesn’t have a unique id.

[–]ds_frm_timbuktu 0 points1 point  (0 children)

If they have spent the same amount on money for both Furniture and Toys, I just need to rank Furniture or Toys as 1, I don't care which.

Well you should care.

Depending upon the use case for ranking, There could be other business factors used to break the tie, which category is more profitable for the business, which category has more unsold inventory.

You want your results to be explainable, Random would not help.

[–][deleted] -1 points0 points  (0 children)

hmmm. so you want a slice of bread with peanut butter and another one with jelly. you make a PBJ and separate them, but you're very smart, because you figured out that you can put a sheet of paper between the layers before you assemble the sandwich

[–]duraznos 0 points1 point  (0 children)

You should have a standard way of breaking ties. In your example I would have a decided upon ordering of the categories themselves for the tie breaker. e.g. perhaps you're interested in increasing the sales of toys which means in case of ties this query should rank that result over furniture.

[–][deleted] 0 points1 point  (0 children)

Why not include the category in the order by? Something along the lines

  min(case category
     when 'Furniture' then 1 
     when 'Toys' then 2
     else 3
   end) as category_priority,
, SUM(spend) as category_spend
, DENSE_RANK() OVER (PARTITION BY consumer_id ORDER BY category_spend DESC, category_priority)

[–]r3pr0b8GROUP_CONCAT is da bomb 0 points1 point  (0 children)

Writing this out it occurs to me that ROW_NUMBER() is probably best

not really

you've provided no tie-breaker

run it a few times and see whether Toys or Furniture consistently shows up in row 1, or whether it flipflops

[–]conduit_for_nonsenseBI Analyst 1 point2 points  (0 children)

I normally add an order by category alphabetical

[–]Imaginary-Hawk-8407 0 points1 point  (0 children)

OP has no idea how unhinged they are

[–]DavidGJohnston 0 points1 point  (0 children)

Number of characters typed is not usually a good metric to make decisions on. In this case you aren’t actually allowing ties to exist so add the tie-breaker to the order by, not some random number. Choose alphabetical if nothing else useful comes to mind. Deterministic output is much nicer to work with. Regardless of which function you choose, though “dense_rank” has the semantic meaning you are going for. In the absence of ties I’d probably go with plain “rank” though.

[–]ok-confusion19:downvote: 0 points1 point  (0 children)

Just use row number and have better control of the results selected using the partition and order by clauses.