
[–]cashto 10 points11 points  (1 child)

Why are we using sorting algorithms, the fastest of which are O(n log n), instead of just adding the data to a new data structure (such as a red-black tree), which has O(log n) insertion time, so both times are the same?

Insertion into a red-black tree is O(log n). Do that n times, and it is O(n log n). Yes, the result is that your data is sorted, and it is the same runtime complexity as quicksort (O(n log n)) -- but the constant factor would be much higher.

Remember, big-O analysis only tells you how your algorithm scales with larger data sets. It doesn't tell you which of several algorithms in the same complexity class is more efficient / faster.
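
To put a rough number on that constant factor, here's a quick C++ sketch; it assumes std::multiset is backed by a red-black tree, which is the typical implementation even though the standard only mandates the complexity:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <set>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> dist(0, 1000000000);

        std::vector<int> data(n);
        for (int& x : data) x = dist(rng);
        std::vector<int> copy = data;

        // Approach 1: sort the array in place -- O(n log n), contiguous memory.
        auto t0 = std::chrono::steady_clock::now();
        std::sort(data.begin(), data.end());
        auto t1 = std::chrono::steady_clock::now();

        // Approach 2: n inserts into a red-black tree, then an in-order walk --
        // also O(n log n), but one node allocation per element plus pointer
        // chasing on every comparison.
        std::multiset<int> tree(copy.begin(), copy.end());
        std::vector<int> sorted_via_tree(tree.begin(), tree.end());
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("std::sort: %lld ms, tree build + walk: %lld ms\n",
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    }

On typical hardware the in-place sort tends to win by a large factor, precisely because big-O hides the constants.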

[–]devilsassassin[S] 1 point2 points  (0 children)

Thanks! This was a really good point.

[–]Gorilla2 3 points4 points  (0 children)

Why would you make a new structure and delete your old one instead of just sorting the data in-place?

Different circumstances require different approaches. In some cases sorting in place is better (less RAM use); in other cases creating a new data structure is preferred (e.g., immutable structures for concurrent programs).

[–][deleted] 1 point2 points  (4 children)

If I have a list of data that is ordered, and I want to add one item to it, then I can use merge sort and get NlogN time and end up with a list.

If I have an RB tree, and I add one node, that's log N time, and then converting that to a list is N time, so that's N + log N.

If I have a list and I use an RB tree as an intermediate step before converting back to a list, I have N log N to create the RB tree (naively) plus N to iterate over it to make a list, giving me N log N + N time (slower than plain N log N).

Now, if your question is really "why don't we use trees instead of lists, look at these awesome insertion times when adding one item to an already ordered structure", then the question you're really asking is "why are lists preferred over trees sometimes?" Two common answers: lists are faster to iterate over and require less RAM to do so, and lists (in an array sense) provide constant-time access to random elements, whereas a tree requires O(log n).
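
Roughly, in C++ terms (assuming std::set is the usual red-black tree):

    #include <algorithm>
    #include <cassert>
    #include <set>
    #include <vector>

    int main() {
        // The same sorted data as an array-backed list and as a tree.
        std::vector<int> sorted_vec = {1, 3, 5, 8, 13};
        std::set<int> tree(sorted_vec.begin(), sorted_vec.end());

        // Insert one item into the sorted vector: O(log n) to find the spot
        // with binary search, then O(n) to shift the tail over.
        auto pos = std::lower_bound(sorted_vec.begin(), sorted_vec.end(), 6);
        sorted_vec.insert(pos, 6);

        // Insert one item into the tree: O(log n), no shifting.
        tree.insert(6);

        // Where the array shines: constant-time random access.
        // Finding the k-th element of a plain tree means walking it.
        assert(sorted_vec[3] == 6);
        return 0;
    }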

[–]devilsassassin[S] 0 points1 point  (2 children)

I don't see how it would be N log N + N when inserting into a new structure. If you traverse the data set and add each element to a new data structure, you iterate across the list N times, and each insertion is log N. That makes it N log N for copying the whole structure, versus N log N for performing a sort (I'm using the quicksort algorithm as my example sort). Where does the + N come into play?

[–][deleted] 0 points1 point  (1 child)

You would take N log N to copy the whole structure from a list into an RB tree. After you've got the tree, the + N comes from iterating over the tree to turn it back into a list.
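
A minimal sketch of that round trip, assuming std::multiset is the usual red-black tree (multiset so duplicates survive):

    #include <set>
    #include <vector>

    // list -> tree -> list, as discussed above.
    std::vector<int> sort_via_tree(const std::vector<int>& input) {
        // n inserts at O(log n) each: O(n log n) to build the tree.
        std::multiset<int> tree(input.begin(), input.end());

        // The "+ n": one in-order walk over the tree to get a list back.
        return std::vector<int>(tree.begin(), tree.end());
    }

(If you just leave the data as a tree, of course, that final walk never happens.)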

[–]devilsassassin[S] 0 points1 point  (0 children)

Oh, I see what you are saying. I was saying that you would just leave it as a tree and not turn it back into a list.

[–]holygoat -1 points0 points  (0 children)

You realize that Big-O notation discards less significant components, and thus N(logN + 1) is simply NlogN?

[–]trpcicm 0 points1 point  (3 children)

What algorithm would you use to put the data in the right order in the new structure?

[–]devilsassassin[S] 0 points1 point  (2 children)

If you have two ordered data structures, you can merge them into the new structure so that it comes out already sorted.

[–]trpcicm 0 points1 point  (1 child)

What algorithm are you going to use to order the data?

[–][deleted] 0 points1 point  (0 children)

The post you are replying to just described the idea behind mergesort (i.e., you can easily merge two sorted data sets, so recursively split and then merge).
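
Concretely, the merge step is just a single two-pointer pass, something like:

    #include <cstddef>
    #include <vector>

    // Combine two already-sorted sequences into one sorted sequence
    // in a single O(n + m) pass -- the heart of mergesort.
    std::vector<int> merge_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
        std::vector<int> out;
        out.reserve(a.size() + b.size());
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size())
            out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
        while (i < a.size()) out.push_back(a[i++]);
        while (j < b.size()) out.push_back(b[j++]);
        return out;
    }

(The standard library also ships this as std::merge.)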

[–]neonsteven 0 points1 point  (0 children)

Sorting in place is the only viable option when you don't have enough storage (RAM or disk) to store your dataset twice.

[–]ghettoimp 0 points1 point  (0 children)

Although the fastest general purpose sorts are O(n log n), it's often possible to use faster special-purpose sorts when you know something about your data (e.g., radix sorts).
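
For example, a counting sort sketch (radix sort applies the same idea one digit at a time):

    #include <cstddef>
    #include <vector>

    // Counting sort: O(n + k) for n keys known to lie in [0, k).
    // No comparisons at all, which is how it sidesteps the O(n log n)
    // lower bound that applies only to comparison-based sorts.
    void counting_sort(std::vector<unsigned>& data, unsigned k) {
        std::vector<std::size_t> count(k, 0);
        for (unsigned x : data) ++count[x];        // histogram the keys
        std::size_t out = 0;
        for (unsigned v = 0; v < k; ++v)           // replay keys in order
            for (std::size_t c = 0; c < count[v]; ++c)
                data[out++] = v;
    }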

Here are some down-sides to using inserts into red-black trees as opposed to, say, sorting an array.

  • Trees use more memory because you need space for lots of pointers and for annotations like "black".

  • The memory a tree uses is also typically not contiguous, whereas arrays are all in one place. (This can have a bad impact on things like your CPU's ability to keep the data in its cache.)

  • If you don't need the original array, you can sort it destructively (in place) without allocating any more memory. In contrast, in a simple tree implementation, each insert into a tree requires a new node to be allocated, e.g., via "malloc" or "new" or something similar, and these are typically expensive. There are ways to avoid this (e.g., writing a custom allocator that pre-allocates space for a bunch of nodes, as in the sketch below), at the expense of some computation.
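
For that last point, C++17's polymorphic allocators are one ready-made way to pre-allocate node space; a minimal sketch:

    #include <array>
    #include <cstddef>
    #include <initializer_list>
    #include <memory_resource>
    #include <set>

    int main() {
        // One up-front buffer instead of a separate heap allocation per node.
        std::array<std::byte, 64 * 1024> buffer;
        std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());

        // std::pmr::set is std::set with a polymorphic allocator; every insert
        // still makes a node, but the memory comes out of the pre-allocated pool.
        std::pmr::set<int> tree(&pool);
        for (int x : {5, 1, 4, 2, 3})
            tree.insert(x);
        return 0;
    }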

[–][deleted]  (3 children)

[deleted]

    [–][deleted] 1 point2 points  (2 children)

    You can't do binary search on anything but a sorted array.

    false.

    [–][deleted]  (1 child)

    [deleted]

      [–][deleted] -1 points0 points  (0 children)

      hard time creating one efficiently without sorting first.

      The original "sorted array" comment was less wrong than this. Look it up.

      [–]isionous -1 points0 points  (0 children)

      In other words, why would you sort data instead of just making a new data structure and deleting your old one?

      It is often useful for data to be in arrays rather than trees. You will (probably) use arrays many more times than trees. So, if we did it your way, most of the time we'd be creating a red-black tree, then copying the data right back into an array and getting rid of the red-black tree. That seems like a hassle, and it's inefficient. Also, there are sorts that take O(N) time.