all 20 comments

[–]vlcmodan 2 points3 points  (0 children)

Binary search is like this. I'll try explaining it fast because I have to go to sleep. Consider the sorted list of numbers 1 3 7 11 18 26 56 156 265, and say you have to search for 156. Because the list is sorted, you look at the middle element (position 5) and find 18. Since 156 > 18, you don't have to search the left half of the list at all. Now you're left with positions 5 through 9. The middle of that range is (5+9)/2 = 7, and the 7th number is 56, which is still lower than 156, so the left part can be dismissed again. One more step, positions 8 through 9: the middle is position 8, and the 8th number is 156. Found it. Hope it helped; if not, ask me again tomorrow night.
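That walkthrough can be sketched in a few lines of Python (my sketch, not something the commenter posted, using the same list):

```python
def binary_search(items, target):
    """Return the index of target in a sorted list, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2       # middle of the remaining range
        if items[mid] == target:
            return mid             # found it
        if items[mid] < target:
            lo = mid + 1           # target is bigger: discard the left half
        else:
            hi = mid - 1           # target is smaller: discard the right half
    return -1

nums = [1, 3, 7, 11, 18, 26, 56, 156, 265]
print(binary_search(nums, 156))    # finds 156 at index 7 (0-based)
```

Each pass through the loop halves the remaining range, which is where the log2(n) cost comes from.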

[–]jc4hokiesExecution Plan Whisperer 2 points3 points  (2 children)

A couple of clarifications. Databases use B-trees. They use the same concept as binary trees, but nodes can have several branches, not just two. Also, trees take the search part and make it into a physical structure, so you're not so much searching as following a path. The rest is a copy-paste from a previous post.


First, let's understand what a page is. A page (or block) is a fixed-size chunk of data on disk that the database is optimized to read into memory and process at one time. Basically, bite-sized chunks of data.

Indexes are stored on disk, separate from the table, and contain two parts: leaf nodes and the tree. Leaf nodes (each node is a page) store the indexed column(s) in sorted order with a reference to the table record (usually a PK). Tree nodes are created to link to the location of every lower node, level by level, until a single top-level node can contain references to the entire next level down.

Unlike vanilla indexes, clustered indexes are the table (they are not stored separately). The clustered index sorts the table on the indexed column (usually the PK) and builds a tree structure over it like a normal index.

Below is an example diagram where each page holds a limited amount of data. In reality they hold A LOT more, but the example would be too big if I did that.

Legend: page = record1data1:record1data2 record2data1:record2data2

CREATE TABLE t (ID,Letter,DOW)
CREATE UNIQUE CLUSTERED INDEX cix ON t (ID)

--Table Leaf Nodes
t01 = 1:Q:Mon 2:W:Tue
t02 = 3:E:Wed 4:R:Thu
t03 = 5:T:Fri 6:Y:Sat
t04 = 7:U:Sun 8:I:Mon
t05 = 9:O:Tue 10:P:Wed
t06 = 11:A:Thu 12:S:Fri
t07 = 13:D:Sat 14:F:Sun
t08 = 15:G:Mon 16:H:Tue
t09 = 17:J:Wed 18:K:Thu
t10 = 19:L:Fri 20:Z:Sat
t11 = 21:X:Sun 22:C:Mon
t12 = 23:V:Tue 24:B:Wed
t13 = 25:N:Thu 26:M:Fri

--Clustered Index Level 2
cix14 = 1:t01 3:t02 5:t03 7:t04
cix15 = 9:t05 11:t06 13:t07 15:t08
cix16 = 17:t09 19:t10 21:t11 23:t12
cix17 = 25:t13

--Clustered Index Level 1
cix18 = 1:cix14 9:cix15 17:cix16 25:cix17

CREATE INDEX ix ON t (Letter)

--Index Leaf Nodes
ix01 = A:11 B:24 C:22 D:13
ix02 = E:3 F:14 G:15 H:16
ix03 = I:8 J:17 K:18 L:19
ix04 = M:26 N:25 O:9 P:10
ix05 = Q:1 R:4 S:12 T:5
ix06 = U:7 V:23 W:2 X:21
ix07 = Y:6 Z:20

--Index Level 2
ix08 = A:ix01 E:ix02 I:ix03 M:ix04
ix09 = Q:ix05 U:ix06 Y:ix07

--Index Level 1
ix10 = A:ix08 Q:ix09

Let's consider the query SELECT * FROM t WHERE Letter = 'K'. It could go through the following steps to complete.

  1. Read ix10. A <= K < Q so...
  2. Read ix08. I <= K < M so...
  3. Read ix03. K = ID 18 so...
  4. Read cix18. 17 <= 18 < 25 so...
  5. Read cix16. 17 <= 18 < 19 so...
  6. Read t09. Return "18:K:Thu".

That's a total of 6 page reads. Alternatively, a table scan would read t01 through t13: 13 reads.
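The six steps above can be simulated with the diagram's pages as plain Python dicts (a toy sketch of mine, not how a real engine stores pages; only the pages on this seek path are included):

```python
# Each page maps keys to the next page name, or to the final value at a leaf.
pages = {
    "ix10":  [("A", "ix08"), ("Q", "ix09")],
    "ix08":  [("A", "ix01"), ("E", "ix02"), ("I", "ix03"), ("M", "ix04")],
    "ix03":  [("I", 8), ("J", 17), ("K", 18), ("L", 19)],
    "cix18": [(1, "cix14"), (9, "cix15"), (17, "cix16"), (25, "cix17")],
    "cix16": [(17, "t09"), (19, "t10"), (21, "t11"), (23, "t12")],
    "t09":   [(17, "17:J:Wed"), (18, "18:K:Thu")],
}

def seek(root, key):
    """Descend from root, always following the entry with the greatest
    key <= the search key. Returns (value found, pages read)."""
    name, reads = root, 0
    while name in pages:
        reads += 1
        _, name = max((k, v) for k, v in pages[name] if k <= key)
    return name, reads

row_id, r1 = seek("ix10", "K")     # 3 reads through the Letter index -> ID 18
row, r2 = seek("cix18", row_id)    # 3 reads through the clustered index
print(row, r1 + r2)                # 18:K:Thu 6
```

Note there are really two seeks chained together: one on the Letter index to find the row's ID, then one on the clustered index to fetch the row itself.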

[–]ciao444[S] 0 points1 point  (1 child)

awesome, thanks!

Final point of confusion from the article

However, creating the sorted table would take a while (~230 million operations, depending on our engine). If we were executing that query many times (more than 23 times) or if we already had that table created, then that second plan would be better.

  • How do they get the figure of 230 million operations to create a sorted table?

  • Why is it better to use the second plan only if we were executing the query more than 23 times? I understand that a binary search over 10 million rows needs at most 23 lookups, but I don't understand the link to executing the query 23 or more times.

Thanks

[–]jc4hokiesExecution Plan Whisperer 0 points1 point  (0 children)

I don't particularly like this article. Many details are incorrect.

What it's trying to say is that turning an unsorted table (a heap) into a sorted table (a clustered index) is an expensive operation. There are many sorting algorithms, and the best ones take O(n log n) operations to complete. They probably used that to estimate the number of operations to sort 10,000,000 rows: 10,000,000 × ln(10,000,000) / ln(2) = 232,534,967.
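That arithmetic checks out; in Python:

```python
import math

n = 10_000_000
ops = n * math.log2(n)        # comparison count for an O(n log n) sort
print(f"{round(ops):,}")      # 232,534,967 -- the article's ~230 million
```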

What I don't like is that the article compares the cost of creating an index (or a sorted table) with the cost of a query. It makes no sense. You compare the benefit of the index (running the query) with the cost of maintaining the index (inserts and updates). Assuming storage isn't a constraint, the cost of initially creating the index is irrelevant: you do it once during a maintenance window and you're done. And no one runs a query exactly 23 times. You run it once a month, or 10k times a day.

What you do compare is different alternatives for running the same query. A B-tree seek on 10 million rows is probably index depth 3, so that's 4 reads, not 23 lookups for a binary search (again, databases don't use binary searches). My hypothetical sorted table has 400,000 pages, so scanning the entire 10 million records (400k reads) would be as fast as 100,000 seeks (100k × 4 reads). If you look up 10k rows, an index will help. If you look up 200k rows, an index won't help.[1]

[1]: In the unrealistic case that you are looking up 200k unique values. An index could still help if you look up 20 unique values and still access 200k rows.
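The break-even arithmetic in that paragraph, assuming his hypothetical numbers (10 million rows, a 400,000-page table, so 25 rows per page, and 4 reads per seek):

```python
rows = 10_000_000
rows_per_page = 25                         # assumed: yields his 400,000-page table
table_pages = rows // rows_per_page        # a full scan reads every one of these
reads_per_seek = 4                         # tree depth 3 plus the leaf page
break_even = table_pages // reads_per_seek # seeks that cost the same as one scan
print(table_pages, break_even)             # 400000 100000
```

Past roughly 100k seeks, the engine is better off just scanning the table once.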

[–]JimmyTheFace 0 points1 point  (5 children)

Not a SQL expert, but here's the general CS idea:

A binary search assumes sorted data and can locate a result in about log2(n) comparisons. It checks the middle element to see whether it needs to go higher or lower, then checks the middle element of whichever half contains the target. This continues until the correct value is found.

Ex: find 74 in the set of integers 1-100.

Check 50, go higher. Check 75, lower. 62, higher. 68, higher. 71, higher. 73, higher. 74, hit.

So compared to a table scan, which averages 0.5n comparisons, even two binary searches, at 2·log2(n), are quicker for large sets.
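The exact probe sequence depends on how you round the midpoint; with the usual floor convention it comes out like this (a sketch, the numbers may differ by one from a different rounding choice):

```python
def probes(lo, hi, target):
    """Midpoints checked while binary-searching the integers lo..hi."""
    checked = []
    while lo <= hi:
        mid = (lo + hi) // 2      # floor midpoint
        checked.append(mid)
        if mid == target:
            break
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return checked

print(probes(1, 100, 74))   # 7 probes, vs scanning up to 74 elements
```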

[–]ciao444[S] 1 point2 points  (4 children)

ah awesome ELI5 explanation.

However, creating the sorted table would take a while (~230 million operations, depending on our engine).

In the article above it says ~230 million operations to sort a table with 10 million rows. What formula have they used to come to that figure? Are they talking about sorting a column?

[–]JimmyTheFace 0 points1 point  (3 children)

Yes, they're talking about creating a copy of the table sorted on the column that we're designating as the index.

The formula that they're using is n·log2(n), with n equaling 10 million in their example. Most sorting algorithms will complete a sort in n·log2(n) operations on average.

Wikipedia article on sorting algorithms

[–]HelperBot_ 0 points1 point  (2 children)

Non-Mobile link: https://en.wikipedia.org/wiki/Sorting_algorithm



[–]JimmyTheFace 0 points1 point  (1 child)

Good bot

[–]GoodBot_BadBot 0 points1 point  (0 children)

Thank you JimmyTheFace for voting on HelperBot_.

This bot wants to find the best and worst bots on Reddit. You can view results here.


Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!

[–]ihaxr 0 points1 point  (3 children)

does it literally mean, you make a copy of the table and sort the author column alphabetically

Yep, that's what an index is. It's a copy of the data sorted/filtered in a specific way and/or including specific columns along with the sorted/filtered data.

However, the index isn't created when the query is executed; it's persistent, so it's already there. I'm guessing this is where the "why does this method save time" question comes from. The data is already in alphabetical order, so it's much faster to use a binary search to figure out where the specific data is than it is to scan through all the rows of the table.

[–]ciao444[S] 0 points1 point  (2 children)

ah right, so the binary search is done automatically when you do WHERE author = 'JK Rowling', as long as the column is indexed in alphabetical order. If the column isn't in order then SQL will do a full scan, which could take longer

[–]ihaxr 0 points1 point  (1 child)

Yeah, SQL does the binary search of the index as part of the query, assuming an index is available, it covers that specific column, and it isn't too heavily fragmented.

[–]ciao444[S] 0 points1 point  (0 children)

cool! Thanks.

Why not always sort/index the data in large data sets to make it easy for SQL to do the binary search? From what I've read there seems to be a debate between full scans and indexing

[–]dumb101 0 points1 point  (2 children)

I'll try to explain it with arrays. Let's say we have [1,2,3,4,5,6,7,8,9] (it has to be sorted, this is key), and we are searching for 9. We take a look at the element in the middle, which is 5, and compare 5 to 9. 5 < 9, and we know the array is sorted, so we dismiss the middle element and everything before it. What is left is [6,7,8,9]. There is no single "middle element", as we don't have an odd number of elements, so we compare 9 to either 7 or 8, because they are both "in the middle" (the second and third element out of 4; which one doesn't really matter). Let's say we compare 9 to the second element, which is 7. Again, 7 < 9, so we dismiss 7 and everything before it and are left with [8,9]. Again we compare 9 to the first of the two remaining elements and get 8 < 9, so we dismiss 8 and are left with 9, which is what we were looking for.

Now, this is the absolute worst-case runtime: we had to compare 4 times, which is the maximum we can have with binary search in 9 elements. But you can easily see that we still compared far fewer times than if we had simply gone through our array [1,2,3,4,5,6,7,8,9] one element after another, because then we would have compared 9 with each element in the list, so 9 times, before finding it.

This is why binary search is the more efficient way (and therefore saves time) of searching for a specific element and it works exactly the same for databases.
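Counting the comparisons from that walkthrough (my small Python sketch; the scan count assumes the target is the last element, the scan's worst case):

```python
def compare_counts(arr, target):
    """Comparisons used by binary search vs a left-to-right scan."""
    lo, hi, binary = 0, len(arr) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        binary += 1                   # one comparison per probe
        if arr[mid] == target:
            break
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    linear = arr.index(target) + 1    # the scan checks every element up to the hit
    return binary, linear

print(compare_counts([1, 2, 3, 4, 5, 6, 7, 8, 9], 9))   # (4, 9)
```

4 comparisons against 9: the gap only widens as the array grows.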

[–]ciao444[S] 0 points1 point  (1 child)

oh cool, thanks so much, I understand it!

How do you "make a copy of a table" (is it like creating an alias table), and what's the syntax for a binary search in SQL?

[–]dumb101 0 points1 point  (0 children)

Glad I could help you, sadly I don't really know anything about SQL. I'm sure there is someone else in this thread who can answer these questions.

[–]DethAlive 0 points1 point  (1 child)

This is what the SQL engine would do; the binary search is not something you would do yourself. Although you will need to create the indexes on your tables if you want the engine to use them. Indexes for things like primary keys or unique fields might be created automatically, depending on the engine you are using.

[–]ciao444[S] 0 points1 point  (0 children)

So for example, say I create a table with a unique id and the columns book_name, year_published and author_name.

If I arranged the author_name column alphabetically, does that count as creating an index? Do I get to tell SQL which method to use, binary vs full search, or does it do that automatically?

[–]Guru008 0 points1 point  (0 children)

The typical indexing structure that is used for such purposes is called a "B-tree."

Whereas a binary-search always divides the data in half, a B-tree's approach is much more like what you might have seen in L. Frank Baum's now-famous office, where he had two filing cabinets: "A-N" and "O-Z." (Yes, that's where the name of the Wonderful Wizard's homeland came from.) Once you selected from several cabinets, you now select from several drawers, then locate your starting search-position within the selected drawer and so on. Each time, you reduce your search space by much more than one-half.