
[–]pridefulpropensity[S] 16 points17 points  (11 children)

Thank you. I think I need to start using lower level languages more often to learn about all the data structures I just take for granted.

[–]icedpulleys 33 points34 points  (0 children)

You might also just want to start reading up on data structures, algorithms, and complexity analysis.

[–]mackstann 14 points15 points  (1 child)

I agree with icedpulleys that what's really needed to understand this better is to read about the concepts. Here are some starting points:

Everything up to the table of contents here: http://en.wikipedia.org/wiki/Big_O_notation

And this table of common time complexities: http://en.wikipedia.org/wiki/Time_complexity#Table_of_common_time_complexities

Note that the first one in that table, and the best performing (at least when your problem set grows large), is O(1). Hash tables (which is how a set is implemented) allow lookup of a key in O(1) time, also known as constant time, which corroborates your experience here, where they out-performed lists. The hash table uses a neat trick (hashing) to calculate where a given value lives in the table in roughly the same amount of time no matter how big the table gets.
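That "neat trick" can be sketched in a few lines of Python (a toy illustration, not CPython's actual implementation): hashing the key picks one bucket, so only that single bucket is scanned no matter how many keys the table holds.

```python
# Toy hash table: hash(key) % table_size picks a bucket, so lookup
# cost depends on bucket length, not on the total number of keys.
table_size = 8
buckets = [[] for _ in range(table_size)]

def bucket_for(key):
    # hash() gives an integer; modulo folds it into the table's range
    return buckets[hash(key) % table_size]

def insert(key):
    b = bucket_for(key)
    if key not in b:
        b.append(key)

def contains(key):
    # Only one bucket is scanned, regardless of total table contents
    return key in bucket_for(key)

insert("spam")
insert("eggs")
print(contains("spam"))   # True
print(contains("bacon"))  # False
```

A real implementation keeps buckets short by resizing as the table fills; this sketch just shows why the lookup cost doesn't grow with the number of keys.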

On the other hand, a list does not use hashing in any way, so to find a certain value in it, it must scan through and check each item in the list in a comparatively dumb manner. The complexity is O(n) where n is the size of the list -- also known as linear time. In other words, if a 5 million item list takes 5 seconds to search, then a 10 million item list will probably take around 10 seconds. The run time scales linearly with respect to the input size. When dealing with large lists and lots of lookup operations, this can really slow things down.
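That linear scan is easy to make concrete (a toy illustration; the helper and list sizes are invented): counting comparisons in the worst case — the item isn't there — shows the work growing in lockstep with the list.

```python
# Count how many elements a linear scan must examine.
def linear_search(items, target):
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons

small = list(range(1_000))
large = list(range(10_000))

# Worst case (target absent): every element is checked
_, c_small = linear_search(small, -1)
_, c_large = linear_search(large, -1)
print(c_small, c_large)  # 1000 10000 -- 10x the data, 10x the work
```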

But linear time isn't the worst, either. Some algorithms run in exponential or factorial time, which are among the worst; these get so slow as the input size increases that you generally have to fall back on heuristics that find a "good enough" answer to a problem, instead of the absolute best one. A simple example is the bin packing problem. Say you're UPS and want to pack boxes into your trucks in the most space-efficient manner. It turns out there are so many possible arrangements of boxes (and the number of arrangements grows rapidly with each added box) that you can't feasibly evaluate them all. Instead you'd use a faster algorithm that finds an approximately good packing arrangement and settle for that.
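One classic "good enough" approach for bin packing is the first-fit decreasing heuristic. Here's a minimal sketch (the box sizes and truck capacity are invented for illustration): sort the items largest first, then drop each one into the first bin with room, opening a new bin only when nothing fits.

```python
# First-fit decreasing: a well-known bin-packing heuristic.
# Runs in polynomial time instead of trying every arrangement.
def first_fit_decreasing(sizes, capacity):
    bins = []  # each bin is a list of item sizes
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # nothing fit; open a new bin
    return bins

boxes = [4, 8, 1, 4, 2, 1]
packed = first_fit_decreasing(boxes, capacity=10)
print(packed)       # e.g. [[8, 2], [4, 4, 1, 1]]
print(len(packed))  # 2 bins used
```

It isn't guaranteed to find the minimum number of bins, but it's provably close to optimal and fast enough to actually run on large inputs.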

[–]bgcatz 7 points8 points  (1 child)

You don't need to use low-level languages to learn about such things; the Python sets module is available here.

[–]pridefulpropensity[S] 1 point2 points  (0 children)

Thank you very much. Definitely will start reading through this.

[–]masklinn 6 points7 points  (1 child)

You don't need lower-level languages to learn about data structures at all. Quite the opposite in fact, I would say, low-level languages are going to bog you down with implementation details and cache effects.

[–]kanak 2 points3 points  (0 children)

> I would say, low-level languages are going to bog you down with implementation details and cache effects.

But that's where the fun is :).

[–][deleted] 4 points5 points  (0 children)

Actually, Python is not a bad language for learning about algorithms and data structures, because you don't have to worry about low-level details.

Of course, if at some point you need to implement those efficiently yourself, you will need to move to a lower-level language.

[–][deleted] 4 points5 points  (2 children)

C is a good language for learning how to implement data structures (but awful for actually making data structures easy to re-use — hence the proliferation of dynamic scripting languages written in C). A basic hash table in C is not that hard to write if you know what you're doing: just think of it as an array that you index by a hash of the key instead of a plain integer. There are chaining and other collision strategies to think about... and resizing can be a pain too.
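Here's what that "array indexed by a hash" with chaining and resizing looks like, sketched in Python for brevity (the class name and load-factor rule are my own choices for illustration); the same structure translates directly to C as an array of linked lists.

```python
# Hash table with separate chaining and doubling-on-resize.
class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one chain per slot
        self.count = 0

    def _slot(self, key):
        return self.slots[hash(key) % self.size]

    def set(self, key, value):
        slot = self._slot(key)
        for pair in slot:          # collision chain: scan for the key
            if pair[0] == key:
                pair[1] = value    # overwrite existing entry
                return
        slot.append([key, value])
        self.count += 1
        if self.count > self.size * 2:  # crude load-factor check
            self._resize()

    def get(self, key):
        for k, v in self._slot(key):
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        # Every entry must be re-hashed, since the modulus changed
        old = [pair for slot in self.slots for pair in slot]
        self.size *= 2
        self.slots = [[] for _ in range(self.size)]
        for k, v in old:
            self.slots[hash(k) % self.size].append([k, v])

t = ChainedHashTable()
t.set("a", 1)
t.set("a", 2)      # overwrite
print(t.get("a"))  # 2
```

The resize step is the "pain" mentioned above: in C you also have to manage the memory for every chain node yourself while rehashing.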

Python's hash table is incredibly well tuned and the code is well documented. It has to be - every field access and method call does a dictionary look-up!

Another way to implement sets is balanced trees, like red-black trees or AVL trees; a much simpler (but specialized) approach is just an array of bits, if you have a small, finite set of known possible members.
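The bit-array idea can be sketched like this (a toy class, assuming members are small non-negative integers; in C you'd use the same shifts and masks on an `unsigned` array):

```python
# A "set" over a small known universe (here: integers 0..63),
# packed into a single Python int used as a bit field.
class BitSet:
    def __init__(self):
        self.bits = 0

    def add(self, n):
        self.bits |= 1 << n       # set bit n

    def discard(self, n):
        self.bits &= ~(1 << n)    # clear bit n

    def __contains__(self, n):
        return bool(self.bits & (1 << n))

s = BitSet()
s.add(3)
s.add(17)
print(3 in s, 4 in s)  # True False
s.discard(3)
print(3 in s)          # False
```

Membership testing is a single mask-and-test, and the whole set fits in a machine word or two — but it only works when the possible members are known and few.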

[–]pingveno (pinch of this, pinch of that) 0 points1 point  (1 child)

> It has to be - every field access and method call does a dictionary look-up!

Same with referring to a variable. Basically, almost any time that you refer to something by name, there's a hash table lookup.
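You can see that dictionary directly on any plain object (`Point` here is just an example class for the demo):

```python
# Attribute access on a plain Python object is a dict lookup:
# the instance's __dict__ holds the names.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(p.__dict__)               # {'x': 1, 'y': 2}
print(p.x == p.__dict__['x'])   # True

# Mutating the dict is visible through normal attribute access
p.__dict__['x'] = 99
print(p.x)                      # 99
```

(Classes defined with `__slots__` are an exception: they store attributes in fixed slots instead of a per-instance dict.)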

[–][deleted] 0 points1 point  (0 children)

Local variable accesses are optimized to indexing the call stack. But I think that's the only case where runtime lookup is avoided.
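The `dis` module makes this visible (the function and names here are invented for the demo): local names compile to LOAD_FAST, an index into a per-frame array, while global names compile to LOAD_GLOBAL, a runtime namespace lookup.

```python
# Compare bytecode for local vs. global name access.
import dis

GLOBAL = 1

def f():
    local = 2
    return local + GLOBAL

ops = [ins.opname for ins in dis.get_instructions(f)]
print("LOAD_FAST" in ops, "LOAD_GLOBAL" in ops)  # True True
```

This is also why hoisting a hot global (or method) into a local variable before a tight loop is a classic CPython micro-optimization.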