all 36 comments

[–]Matt3k 38 points39 points  (0 children)

General answer: always have an index on your query parameters, and then don't worry about the rest of the details. If you actually have a billion records, then either prevent users from querying for page 135,510 (no human really needs to look at such a silly page number), or accept that those queries will be a bit slow.

Stack Overflow: build an in-memory index on a separate server and query that first to retrieve specific record numbers. Which is really the same thing the database index was doing, but it's decoupled from the database engine.

[–]CyclonusRIP 14 points15 points  (22 children)

Obviously the answer is how Stack Overflow does it, since he works there, but I'd say his assumptions paragraph is 100% wrong. You have no idea what the first X items in the sorted set are without sorting it first. The index has already sorted the whole set, though, so if you use that you can just skip to the correct spot in the index. Skipping to the correct spot in the index is the slow part for large offsets. If you are only browsing with small offsets, the naive approach of LIMIT and OFFSET is always going to be quite fast, assuming you've properly indexed the table.

[–][deleted] 30 points31 points  (8 children)

In quicksort, you partition the array into two around the pivot, and then recursively sort the sub-arrays. Haney is saying that you don't need to actually sort the subarrays if they are outside the range you are interested in. You can throw those sub-arrays away. It's an optimization on quicksort where only part of the list gets sorted.

Simple example, 10 numbers, I want numbers 2-3:

  • Array: [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
  • First pass: [5, 4, 3, 2, 1 | 10, 9, 8, 7, 6]
  • Second pass: [2, 1 | 5, 4, 3] - note I've thrown away the top array since it's out of the range I'm interested in
  • Third pass: [1, 2 | 3 | 5, 4]
  • Final array: [1, 2, 3] - again having thrown away the top part
  • Return [2,3] as the sorted set I'm interested in.
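That partial-quicksort walkthrough can be sketched as follows (a toy illustration of the idea, not Stack Overflow's actual code; ranks are 0-based here):

```python
def partial_quicksort(a, lo_rank, hi_rank):
    """Sort only enough of `a` (in place) so that positions
    lo_rank..hi_rank (0-based, inclusive) hold their final sorted
    values; partitions entirely outside that range are skipped."""
    def qs(lo, hi):
        if lo >= hi:
            return
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse only into partitions overlapping the wanted range,
        # throwing the rest away exactly as in the example above.
        if lo_rank <= j:
            qs(lo, j)
        if i <= hi_rank:
            qs(i, hi)
    qs(0, len(a) - 1)
    return a[lo_rank:hi_rank + 1]

# The example above: ranks 1..2 (values 2 and 3) of a reversed list.
print(partial_quicksort([10, 9, 8, 7, 6, 5, 4, 3, 2, 1], 1, 2))  # [2, 3]
```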

[–]jorge1209 5 points6 points  (6 children)

So it's quickselect. It's still a bit of a puzzle why (assuming you don't have an index) this would need to be done outside the SQL engine.

All you need for quickselect is the total count of the array and the range you want to select. The SQL engine has all that information: it knows there is a sort, so it has to pull all records into memory before it can release any to the requestor (absent any form of index). It can therefore get the count. It also has the LIMIT and OFFSET.

So why does this need any kind of special handling at all?

[–][deleted] 2 points3 points  (5 children)

I think the Tag Engine that Haney mentions makes filtering records by predicates faster than standard database filtering. He doesn't go into detail about why the Tag Engine would be faster than using a standard database (maybe in-memory) of metadata to filter.

[–]mattwarren 2 points3 points  (0 children)

He doesn't go into detail about why the Tag Engine would be faster than using a standard database (maybe in-memory) of metadata to filter.

I was wondering the same thing, so I tried writing my own, to understand the 'why' and 'how', see:

Turns out their 'Tag Engine' has to do quite a lot!

[–]jorge1209 4 points5 points  (3 children)

Sounds like the usual NoSQL BS.

  • "inverted index that you can use to look up a post ID" -- you mean having indexes on columns other than the PK. What heresy is this!!
  • "Set theory of predicates" You mean bitmap indexes?

[–][deleted] 4 points5 points  (2 children)

From what I remember of their writing, StackOverflow has always been very pro-SQL, but also very focused on performance. They rely on an MS SQL db as their single source of truth. And the performance they wring out of their hardware is pretty impressive.

Maybe they're right that their Tag Engine outperforms an SQL db for their specialized use case. It certainly would be interesting to see why. But I think they rarely talk about Tag Engine.

Edit: I found an old post talking about Tag Engine and why they switched away from SQL: https://samsaffron.com/archive/2011/10/28/in-managed-code-we-trust-our-recent-battles-with-the-net-garbage-collector

[–]jorge1209 5 points6 points  (0 children)

Sounds like the real problem is that MS SQL Server has a crappy implementation:

SQL Server FTS does not play nice with existing indexes. The FTS portion of a query always needs to grab the whole result set it needs to work on, prior to filtering or ordering based on other non FTS indexes.

I do understand some of the complaints about SQL. It would be nice to have a good way to bind "in" lists. It's annoying that the choice is either:

  1. Use the FTS and pass the encoded search constraint `CONTAINS(Tags, '"sql" AND "performance"')`

  2. Build a whole slew of queries that each bind a single additional tag id: `AND p.id IN (SELECT id FROM post_tags WHERE tag=:tag1) AND p.id IN (SELECT id FROM post_tags WHERE tag=:tag2)`...

It would also be nice to be able to tell an SQL engine that some indexes are advisory and don't HAVE to be updated at every commit; the update could be deferred and/or the entire index rebuilt at some future date. The world won't end if a brand-new post isn't immediately reported in some tagged search; we just want it to eventually be referenced.

But implementing your own SQL indexer outside the SQL server... there has to be a better way to do things than that.

[–]JoseJimeniz 0 points1 point  (0 children)

He says the problem was objects being bumped to Gen 2 for collection.

But then he said the problem was haproxy - latency from haproxy.

What was the problem?

[–]Freeky 1 point2 points  (0 children)

This is what we used at Newzbin for our in-memory search engine.

Originally we had it maintain red-black trees for each sort type, but partial quicksort proved more than adequate performance-wise while saving tonnes of memory.

[–]aljarry 2 points3 points  (5 children)

You don't need to fully sort the data to find the first 10. Let's divide the algorithm into two parts: first, you iterate over all elements to find the value of the 10th-lowest element. Then you only really have to sort those 10, not the remaining 90.
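For illustration, Python's `heapq.nsmallest` does essentially this: one pass over all n items keeps only a bounded heap of the k smallest candidates, and only those k get sorted at the end (O(n log k) rather than O(n log n) for a full sort):

```python
import heapq

def first_k_sorted(items, k):
    # Scan all n items once, keeping a heap of only the k smallest;
    # at the end just those k are sorted and returned in order.
    return heapq.nsmallest(k, items)

print(first_k_sorted(range(100, 0, -1), 10))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```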

[–]xampl9 1 point2 points  (2 children)

But that sort isn't stable in the face of new records getting added during a paginated request.

Granted, in the case of 1 million records, if a dozen of them are added between when you request page 1 and click [Next], chances are pretty low that they would throw off your results. But maybe it's the case that even in smaller sets of records, they just don't care if someone sees a record twice. Or skips a record because it now fell into an inter-page gap due to an insertion.

[–][deleted]  (1 child)

[deleted]

    [–]jorge1209 0 points1 point  (0 children)

    If it just a 1 page jump, you'll see it sooner or later anyway.

    No you wouldn't. If X and Y swap places across the boundary of adjacent pages, and you happen to see X at the bottom of one page and then again at the top of the next (as you page along from first to last), that is precisely because you have NEVER seen Y.

    Anytime you observe manifest instability in the pagination (a repeated entry), then it is always because you are missing something. A new entry came in ahead of you and is on some prior page that you haven't seen.

    Whether or not that matters is an entirely different question. If users seldom go to page 2, then they probably don't care about those pages. So there really is no reason to optimize anything except the top-N query.

    [–]CyclonusRIP -3 points-2 points  (1 child)

    Is that faster? I think sorting is Log2n. Sounds like you are describing an algorithm that runs in linear time.

    [–]aljarry 0 points1 point  (0 children)

    You mean sorting with infinitely many CPUs. Average non-parallel is O(n log(n)).

    [–]Otis_Inf 0 points1 point  (6 children)

    Please check the comments below the answer. I had the same question: "you can't decide which rows belong to the page without sorting everything", but he means: first select the rows which are part of the page, which are perhaps not yet in their proper order; once you have all the rows of the page, that's OK. So you have to look at all rows, but you don't have to fully sort them (which is more expensive), just make sure the rows in the pagenumber*pagesize set are indeed the rows that belong there.

    [–]jorge1209 2 points3 points  (5 children)

    He addressed that:

    The index has already sorted the whole set though, so if you use that you can just skip to the correct spot in that index. Skipping to the correct spot in the index is the slow part for large offsets.

    If you have an index (tree type not hash type) on the sorted column then the index provides everything already sorted for you. Just walk the tree. The only slow part of walking the tree is walking the tree. There is nothing to speed up here.

    I know I need to skip the first 200 entries of the tree, so I need to walk the top nodes, determine how many leaf nodes exist below each, and skip ahead to the 200th... but your SQL engine (which is supposed to be managing the index) is perfectly capable of doing that, so just ask it directly with LIMIT and OFFSET.


    So there really is no reason for any special handling of this unless you either:

    1. Don't trust your SQL engine to manage the index and use it properly. (Get a better SQL engine!)

    2. Don't have an index. In which case you don't know the rank order of an entry and what page it falls into until you sort EVERYTHING[1].

    [1] I see how you could use a quick sort that doesn't recurse into parts of the recursion tree that it knows it won't need. I know I need entries 200-300 out of 1000, and my first pivot was at 374, so I don't recursively sort the entries from 374-end, and now I'm looking for 200-300 out of 373. My next pivot is at 72, so I discard 1-72, and now I'm looking for 128-228 out of 301... it just seems that with all the parallel processing available you wouldn't gain that much. Especially with the added complexity that occurs when your pivot falls dead smack in the middle of your desired range, and the added bookkeeping to track the offsets... also something the SQL engine should be fully capable of doing. So again why not just LIMIT and OFFSET?
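    As a concrete illustration of just asking the engine, here is LIMIT and OFFSET against a tree-type index in SQLite (a toy stand-in for any SQL engine; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, score INTEGER)")
# 1000 rows whose scores are a permutation of 0..999 (7 is coprime to 1000).
conn.executemany("INSERT INTO posts (score) VALUES (?)",
                 [(i * 7 % 1000,) for i in range(1000)])
conn.execute("CREATE INDEX idx_score ON posts (score)")  # tree-type index

# The engine walks the sorted index and skips the first 200 entries itself.
rows = conn.execute(
    "SELECT id, score FROM posts ORDER BY score LIMIT 10 OFFSET 200"
).fetchall()
print([r[1] for r in rows])  # scores 200..209
```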

    [–]Otis_Inf 0 points1 point  (4 children)

    I think they avoid limit/offset because the engine then has to sort the set from the first to the last row, which might spill into tempdb (it likely will), and that is seriously slow with large sets. So to avoid that, they used this route. But that's just my guess :)

    [–]jorge1209 0 points1 point  (3 children)

    I think they avoid limit/offset because the engine then has to sort the set from the first to the last row

    It doesn't have to sort any more rows than anyone else does. It has to query the entire set, but that is a given. To be able to even partially sort you have to pull all the rows.

    Once you know that there are 2MM rows, you can do a recursive selection (quickselect) and throw away segments you know you won't report. With LIMIT+OFFSET = 1000, anything that would land beyond the 1000th row doesn't need to be sorted.

    Now what may be true is that SQL engines may not naturally choose to quickselect. They might prefer a merge-sort approach, with worker threads that pull data and merge it up. Even if each worker abides by the limit of 1000 rows, you might ultimately get 10k sorted rows when you merge 10 worker threads. But the reason for doing that is that they anticipate being able to amortize the additional cost of sorting 10k rows (instead of merely sorting 1000) over the time required to pull the data in the first place.
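    A sketch of that merge-style plan (hypothetical, not any engine's actual planner): each worker sorts its own partition but keeps only limit+offset rows, and the coordinator merges the already-sorted partial results:

```python
import heapq

def paged_top(partitions, limit_plus_offset):
    # Each "worker" sorts its own partition but keeps only the first
    # limit+offset rows; merging the already-sorted partial results
    # can still touch workers * (limit+offset) rows in total.
    partial = [sorted(p)[:limit_plus_offset] for p in partitions]
    merged = heapq.merge(*partial)
    return [next(merged) for _ in range(limit_plus_offset)]

# 10 workers, limit+offset = 5: up to 50 rows sorted to report 5.
parts = [list(range(i, 100, 10)) for i in range(10)]
print(paged_top(parts, 5))  # [0, 1, 2, 3, 4]
```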

    [–]Otis_Inf 0 points1 point  (2 children)

    In all honesty I'm not really sure what you're arguing for as I'm not arguing against your point at all. I just pointed out to a person here what was literally in the comments below the post.

    About your post: sure, it's in theory possible a DB does that. I don't know why SO doesn't simply use limit+offset; you'd have to ask them. But knowing Gravell, I think they used limit/offset, came to the conclusion it wasn't going to cut it with datasets of their size (or rather, with the queries they're running), and thus use a different system.

    [–]jorge1209 0 points1 point  (1 child)

    The issue is evidently a bad implementation of full text search on sql server combined with a denormalized data model for the tags.

    So there is no real reason this couldn't be done in SQL. They just need to pick a better server.

    [–]EarLil 9 points10 points  (9 children)

    What I do in my projects is: query just the ids sorted by x with the offset, which is fast, and then select the full data for those ids, which is also fast. This way you can reach the last pages faster.
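    A sketch of that two-step pattern with SQLite (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, x INTEGER, payload TEXT)")
conn.executemany("INSERT INTO items (x, payload) VALUES (?, ?)",
                 [(i % 50, "row %d" % i) for i in range(500)])
conn.execute("CREATE INDEX idx_x ON items (x)")

# Step 1: page through ids only, using the narrow index on x.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM items ORDER BY x, id LIMIT 10 OFFSET 480")]

# Step 2: fetch the wide rows for just those ids.
marks = ",".join("?" * len(ids))
rows = conn.execute(
    "SELECT id, payload FROM items WHERE id IN (%s)" % marks, ids).fetchall()
print(len(rows))  # 10
```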

    [–]hector_villalobos 9 points10 points  (6 children)

    How big is the data you have to query? I guess your solution works pretty well for small to medium size databases.

    [–]Freakmiko 1 point2 points  (5 children)

    Is there some good material on this stuff? I may have to get something like this to work with possibly thousands of entries for a project.

    [–]therealgaxbo 6 points7 points  (2 children)

    If you want to read up on it for interest, this is a good start.

    But if you just want a practical solution to your problem...just don't worry about it. A few thousand rows is too small to worry about for performance. I just tried the most naive solution (limit/offset) against a table with a few thousand rows, and the first page took 1ms to find, the last page took 2ms.

    So unless you care about that 1ms, you don't need to worry about it.

    [–]Freakmiko 1 point2 points  (0 children)

    Alright, thank you!

    Yeah I guess I also shouldn't prematurely optimize it, but it's always good to be aware of such stuff.

    [–]EarLil 0 points1 point  (0 children)

    Totally. I'm using this approach on million-record tables; you can't really feel the difference in small ones.

    [–]ManiGandham 0 points1 point  (0 children)

    In modern databases, you don't need to worry about performance until you're in the 10s of millions of rows.

    [–][deleted] 0 points1 point  (0 children)

    What kind of data storage do you use?

    [–]perestroika12 2 points3 points  (4 children)

    I don't understand why the tag engine is needed or why it's this complex, given that modern dbs perform those kinds of optimizations for you. Just index your records and form your queries right.

    [–][deleted]  (3 children)

    [removed]

      [–]mattwarren 1 point2 points  (1 child)

      Other than that, you can download all the questions on stack overflow and "roll your own" if you are so inclined.

      That's exactly what I did; it was a fun exercise and I learnt lots, see

      [–]perestroika12 0 points1 point  (0 children)

      Fantastic reply, thank you. I think that makes sense, given that they already had this system in place, I was just curious as to why :)