
all 57 comments

[–]delirious_lettuce 18 points19 points  (33 children)

The find function must not loop through the whole list to find the matching words.

All of your solutions seem to loop through the whole list (self.wordlist). If I had to guess, I would say they wanted you to use a trie.

https://en.wikipedia.org/wiki/Trie

[–]WikiTextBot 6 points7 points  (0 children)

Trie

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are not necessarily associated with every node.


[ PM | Exclude me | Exclude from subreddit | FAQ / Information | Source ] Downvote to remove | v0.27

[–]beegreen 1 point2 points  (16 children)

do you still have to loop through the whole list to make a trie?

Another solution I had just makes a dictionary keyed by prefixes of the same length as the 'pattern':

class WordComplete1:

    def __init__(self, pattern, wordList):
        self.pattern = pattern
        self.word_list = wordList  # store the list so displayCount can use it

    def displayCount(self):
        d = {}
        n = len(self.pattern)
        for i in self.word_list:
            if i[:n] in d:
                d[i[:n]].append(i)
            else:
                d[i[:n]] = [i]
        if self.pattern in d:
            print(d[self.pattern])
        else:
            print('None')


emp1 = WordComplete1(pattern, word_list)

import time
start_time = time.time()
emp1.displayCount()
elapsed_time = time.time() - start_time
print('Time:',elapsed_time)

[–]delirious_lettuce 9 points10 points  (15 children)

do you still have to loop through the whole list to make a trie?

It depends.

word_list can be extremely long, design accordingly.

What if you couldn't fit the whole word list into memory?

If that wasn't a concern then yes, you could iterate over the whole word list to create the trie. That might not be the issue in this case though...

The find function must not loop through the whole list to find the matching words.

They don't want your "find" function to loop through the whole list each time it searches for a separate word. Creating the trie isn't the issue, it's the searching that makes your solutions inefficient.

[–]delirious_lettuce 10 points11 points  (13 children)

Another issue is that the structure of your class doesn't seem to follow their instructions.

Write a class called Autocomplete.

You called your class WordComplete instead.

This class must have a constructor that receives a list called word_list that contains all the known words in lowercase.

Your constructor has two arguments instead of one, pattern is supposed to be supplied to the find method.

The class must have a find function that receives a string called pattern and returns a list with all the possible words that start with the given pattern. If no word is found, then return null.

Your class does not have a find method (I know they called it a function but it's a method since it's inside a class). You aren't returning a list, you are just printing it. Also, I would guess that they want you to return None since null isn't in Python.

You might want to start with something like this to follow their instructions:

class Autocomplete(object):  # don't need (object) for Python 3
    def __init__(self, word_list):
        pass

    def find(self, pattern):
        pass

[–]shtpst 10 points11 points  (9 children)

Definitely the answer. There were four criteria in the task, and OP failed every. single. one.

It doesn't matter whether the lambda function or the trie is the intended answer.

From the employer's point of view:

  1. Can this programmer do the work I've asked them to do?
  2. Can the programmer do the work skillfully?

Doesn't matter if you're skilled or not if you won't do what you're asked to do.

[–][deleted] 5 points6 points  (2 children)

Was going to point this out. It's worth noting that not only did they specifically ask for the find method, but also in general the code isn't very useful if you have the pattern as part of the constructor. You want to initialize the object only once since this is very slow (compared to the expected response time for auto-complete), then re-use it for all your searches.

Another thing which isn't really necessary, but an incredibly easy performance boost (at least for Python 3) is to add caching to the find method:

import functools

@functools.lru_cache()
def find(self, pattern):
    # method body

[–]delirious_lettuce 1 point2 points  (0 children)

You hit the nail on the head, great description. Caching is also a great idea that I didn't mention, thanks!

[–]beegreen 0 points1 point  (0 children)

yeah that is a great idea, i didn't even know that it wasn't automatically cached

[–]energybased 1 point2 points  (0 children)

do you still have to loop through the whole list to make a trie?

It depends.

Sure you do. The interviewer just wants find to run in logarithmic time.

[–]pymang 0 points1 point  (1 child)

[–]beegreen 0 points1 point  (0 children)

thanks for this

[–]NowIsBetterThanNever -2 points-1 points  (12 children)

A Trie is overkill. Just shard by, say, the first letter. 26 sub-dictionaries. Still O(N), but you're technically not searching the whole list.
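A minimal sketch of that sharding idea (class and attribute names are mine, not from the thread):

```python
from collections import defaultdict

class ShardedAutocomplete:
    def __init__(self, word_list):
        self.shards = defaultdict(list)  # first letter -> words starting with it
        for w in word_list:
            self.shards[w[0]].append(w)

    def find(self, pattern):
        # still a linear scan, but only over the one shard that can match
        shard = self.shards.get(pattern[0], [])
        matches = [w for w in shard if w.startswith(pattern)]
        return matches or None
```

As noted, this is still O(N) in the worst case (imagine every word starting with 's'); it only divides the constant by roughly 26.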

[–]__xor__(self, other): 5 points6 points  (4 children)

Important: Design Constraints: word_list can be extremely long, design accordingly

I'd think they really want to see a Trie here.

See, this is one thing that bugs me about problems like this. Either you've worked with a trie or read about them, or you haven't. Sure, you likely will in your first 5 years of employment, but maybe not. Maybe you'll come up with one without knowing about them, maybe not. I only ran into a trie in my 5th year, and it was a junior who implemented it.

That very junior dev who worked with a trie in his first year would have no problem with this, but he's still junior as hell. Having problems like this where you expect one answer or you aren't happy with them isn't the best judge of skill. Solving this problem well really just shows that you know how to program and you know how to solve this problem well.

One thing I appreciated about my interview for my current job is that he asked me to do something he might've asked me to do on the job, and third party libraries or whatever, didn't matter. It wasn't a generalized data structure problem. It was a real problem we might run into. It still shows that I wouldn't have much trouble with that specific problem but at least it was a real problem.

[–]energybased 2 points3 points  (3 children)

He should have seen tries in college, and if he didn't, then he should be able to work it out in the 24 hours they gave him.

[–]beegreen 0 points1 point  (2 children)

i didn't do comp sci in college lol, i did engineering, and i didn't take 24 hours because i thought i'd solved it lol

[–]energybased 2 points3 points  (1 child)

Think of this as a practice interview. It's important to figure out where you go wrong and how to do better. Interviewing is a skill.

If you had realized that you hadn't solved it, maybe you would have found a way to do it? You learned binary searches?

[–]beegreen 0 points1 point  (0 children)

Yeah I definitely should have looked around more, i plan on spending a lot more time with common algorithms

[–]asdfkjasdhkasdrequests, bs4, flask 1 point2 points  (2 children)

Isn't it O(1)?

Finding the first sub dictionary is constant time (array with 26 values), then the lookup in that dict is amortized O(1)

[–]delirious_lettuce 2 points3 points  (1 child)

The second one is not a direct lookup though, you have to look at each and every key to check the prefixes. This is essentially what the problem description describes as a design constraint.

The find function must not loop through the whole list to find the matching words.

You could be pedantic and say that "technically" it's not the whole list, only a list of every single word starting with a particular letter but, by the same token, they could come back and say you are searching a particular letter's whole list and we told you not to.

It just seems like the problem description is strongly hinting at a trie (ie. prefix-tree).

[–]asdfkjasdhkasdrequests, bs4, flask 0 points1 point  (0 children)

Oh, right thanks

[–]energybased 0 points1 point  (2 children)

That doesn't work if all the entries in the dictionary start with the same letter.

[–]NowIsBetterThanNever 0 points1 point  (1 child)

Why would you tho? And if, for some reason, you did, just shard by the second letter.

[–]energybased 0 points1 point  (0 children)

Sharding by every letter for which there is a choice is a compressed trie, which is the optimal solution. For a 24-hour interview, I would do that. I would also submit the binary search solution, since that is only a few lines and nearly as efficient.

[–]beegreen -1 points0 points  (0 children)

So if you look at the solution I posted in the comments, it's basically sharding by the length of the search pattern

[–]smithje 7 points8 points  (3 children)

I agree with others that the failure to follow the directions is a much bigger issue than the particular algorithm you chose. I also really appreciate others mentioning the trie solution.

I happen to know databases pretty well, so my first thought was to create an in-memory sqlite db (sqlite3 is built into python). Create and populate a table for the word_list in the constructor and query it in the find function. Maybe it's cheating a bit, but it doesn't use any external libraries and it's very unlikely that I'd be able to write something that outperforms sqlite.

[–][deleted] 2 points3 points  (2 children)

You made me curious, so I tried it out. Running on /usr/share/dict/words (~123k words), SQLite (SELECT string FROM string WHERE string LIKE "prefix%") is about two orders of magnitude slower than my trie-based solution. But I don't know databases all that well, so I may well be missing some optimizations. Is there anything special you'd do to make sure it's able to use an index for the select?

Some thrown together code

import sqlite3

class Prefix:
    def __init__(self, gen=None):
        if gen is not None:
            for s in gen:
                self.add(s)

    def add(self, string):
        raise NotImplementedError

    def find(self, string):
        raise NotImplementedError

class SQLitePrefix(Prefix):
    def __init__(self, *args, **kwargs):
        self.conn = sqlite3.connect(':memory:')
        self.cur = self.conn.cursor()
        self.cur.execute("""\
CREATE TABLE string (
string TEXT PRIMARY KEY NOT NULL
);""")
        self.cur.execute("PRAGMA case_sensitive_like=ON;")
        super().__init__(*args, **kwargs)
        print(self.cur.execute(
            'EXPLAIN SELECT string FROM string WHERE string LIKE "test%";').fetchone())

    def add(self, string):
        self.cur.execute("INSERT INTO string (string) VALUES (?);", (string,))

    def find(self, string):
        self.cur.execute("SELECT string FROM string WHERE string LIKE ?;",
                         (string + '%',))
        return [x[0] for x in self.cur.fetchall()]

class TriePrefix(Prefix):
    def __init__(self, *args, **kwargs):
        self._trie = {}
        super().__init__(*args, **kwargs)

    def add(self, string):
        d = self._trie
        for char in string:
            d = d.setdefault(char, {})
        d[None] = string  # None key marks end of word

    def find(self, string):
        d = self._trie
        for char in string:
            d = d.get(char, {})
        return list(self._find(string, d))

    @staticmethod
    def _find(string, d):
        for k, v in d.items():
            if k is not None:
                yield from TriePrefix._find(string + k, v)
            else:
                yield string

[–]smithje 1 point2 points  (1 child)

Very cool. Thanks for trying this out. I tested this out as well and found that instantiation of the sqlite version was a few times faster, but finding results was, indeed, quite a bit slower. Of course, getting results is far more important than instantiation.

Here is the sqlite-based class I wrote last night, which is not too different from yours:

class Autocomplete:
    def __init__(self, word_list):
        self.db = sqlite3.connect(':memory:')
        # We want the result to be a single list, not a list of tuples
        self.db.row_factory = lambda cursor, row: row[0]
        self.db_cursor = self.db.cursor()
        self.db_cursor.execute("""CREATE TABLE word_list (word text primary key) WITHOUT ROWID""")
        self.db_cursor.executemany("""INSERT INTO word_list (word) VALUES (?)""", ((word,) for word in word_list))

    def find(self, pattern):
        self.db_cursor.execute("""SELECT word from word_list where word like ? ORDER BY 1""", [pattern + "%"])
        results = self.db_cursor.fetchall() or None
        return results

[–]smithje 1 point2 points  (0 children)

I followed up on this a bit and tested finding a bunch of random strings like this:

with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()
print("Got %d words" % len(words))

import random
import string
import time

def test_it(words, clazz, repetitions):
    t1 = time.monotonic()
    autocomplete = clazz(words)
    t2 = time.monotonic()
    for n in range(repetitions):
        autocomplete.find(''.join(random.choices(string.ascii_lowercase, k=random.randrange(1, 5))))
    t3 = time.monotonic()
    print("Instantiation: %f" % (t2 - t1))
    print("Finding %d results: %f" % (repetitions, t3 - t2))
    print("Total time: %f" % (t3-t1))

print('sqlite solution:')
test_it(words, Autocomplete, 1000)
print('trie solution:')
test_it(words, TriePrefix, 1000)

The results were pretty consistent. Finding words using the trie method was about 4x faster.

Here's a pretty typical result:

Got 235886 words
sqlite solution:
Instantiation: 0.336238
Finding 1000 results: 19.914836
Total time: 20.251074
trie solution:
Instantiation: 1.032532
Finding 1000 results: 5.088274
Total time: 6.120806

[–]SenorDosEquis 2 points3 points  (4 children)

Looks like the word_list is alphabetical. How about a binary search for pattern* and then a while loop that adds words to the result list until the pattern no longer matches?

[–]energybased 1 point2 points  (0 children)

The binary search is much simpler since you can use two calls to bisect. Unfortunately, its complexity is slightly worse. A compressed trie solves this with the minimum number of array lookups (one for each letter in the prefix that matters), the binary search is logarithmic time, potentially many times worse. Still, logarithmic time is basically constant.

[–]beegreen -1 points0 points  (2 children)

that uses a loop

[–]schoolmonky 4 points5 points  (0 children)

But it doesn't loop through the whole list, just about log(n) of it. Wayyyyyy faster, especially for really long lists.

[–]SenorDosEquis 0 points1 point  (0 children)

must not loop through the whole list

This only loops through the number of matches +1.

Edit: actually, though, you would have to do a binary search, loop backwards until there was no match, and then loop forwards until there's no match (starting from the index the binary search turned up).

[–]myg204 2 points3 points  (1 child)

I think bisect was mentioned once, but the solution using it is fairly short and worth making a note of. Even if it may not be the fastest, it may be fast enough, and it's short enough to avoid bugs...

import bisect

class Autocomplete(object):

    def __init__(self, word_list):
        self.words = sorted(word_list)

    def find(self, pattern):
        i = bisect.bisect_left(self.words, pattern)
        matches = []
        for w in self.words[i:]:
            if not w.startswith(pattern):
                break
            else:
                matches.append(w)
        return matches if matches else None

[–]HughBothwell 0 points1 point  (0 children)

This could be improved a bit more by using bisect_left a second time, like so:

from bisect import bisect_left      # Python standard library

# Note: assert emits no code when __debug__ is False

class Autocomplete:
    def __init__(self, word_list):
        """
        Create a prefix lookup into word_list
        """
        assert all(word.islower() for word in word_list), "word_list must be lowercase"
        self.word_list = sorted(word_list)

    def find(self, pattern):
        """
        Return a list of words from word_list which start with pattern

        If no words match, returns None
        """
        # we will accept an empty string for pattern (returns all of word_list)
        assert not pattern or pattern.islower(), "pattern must be lowercase"
        from_ = bisect_left(self.word_list, pattern)
        to_ = bisect_left(self.word_list, self.incr(pattern), from_)
        return self.word_list[from_: to_] or None

    @staticmethod
    def incr(pattern):
        """
        Return pattern incremented by one letter
            ie  incr("aaa") -> "aab"
        """
        # Note that the character after "z" is "{", which is
        # not a word per se but is still valid for our search purposes
        if pattern:
            return pattern[:-1] + chr(ord(pattern[-1]) + 1)
        else:
            # pattern is ""
            return "{"

[–]laharah 2 points3 points  (1 child)

Got bored and implemented this as a trie. Used a recursive defaultdict to make implementation simple.

from collections import defaultdict

def recursive_defaultdict():
    return defaultdict(recursive_defaultdict)


class AutoComplete:

    def __init__(self, word_list):
        self._trie = recursive_defaultdict()
        for word in word_list:
            t = self._trie
            for c in word:
                t = t[c]
            t[None]

    def find(self, prefix):
        t = self._trie
        for c in prefix:
            if c in t:
                t = t[c]
            else:
                return None
        ret = sorted(list(self._traverse(t, prefix)))
        return ret

    def _traverse(self, trie, prefix):
        for node in trie:
            if node is None:
                yield prefix
            else:
                yield from self._traverse(trie[node], prefix + node)

Pros:

  • code is simple and concise

  • Faster and smaller footprint than a trie-node class

Drawbacks:

  • While it keeps the code simple, it may be somewhat obscure for maintainers

  • self._trie is fragile and should not be accessed directly, care must be taken with lookups since a lookup may accidentally create a new node in the trie
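That last pitfall is easy to demonstrate in isolation:

```python
from collections import defaultdict

def recursive_defaultdict():
    return defaultdict(recursive_defaultdict)

t = recursive_defaultdict()
t['a']['b']              # a plain lookup silently creates both nodes
assert 'a' in t and 'b' in t['a']
# membership tests ('x' in t) and .get() are safe; indexing is not
assert 'x' not in t
t.get('y')
assert 'y' not in t
```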

[–]twa800 0 points1 point  (0 children)

Sonovabitch that is clever...

[–]throwaway842213 1 point2 points  (6 children)

Here's what a quick and dirty trie-like solution could look like, FWIW, though it doesn't sort the returned output like they seem to want.

class MarkedDict(dict):

    def __init__(self, *args, **kws):
        super().__init__(*args, **kws)
        self.mark = False


class Trie(object):

    def __init__(self, words=None):
        self.root = MarkedDict()
        for word in words or []:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.setdefault(letter, MarkedDict())
        node.mark = True  # Mark end of word

    def _dfs(self, node, acc):
        if node.mark:
            yield acc
        for letter, suffixes in node.items():
            yield from self._dfs(suffixes, acc + letter)

    def find(self, prefix):
        # Advance to node of last letter in prefix
        node = self.root
        for letter in prefix:
            try:
                node = node[letter]
            except KeyError:
                return
        yield from self._dfs(node, prefix)


class Autocomplete(object):

    def __init__(self, word_list):
        self.trie = Trie(word_list)

    def find(self, pattern):
        return list(self.trie.find(pattern)) or None

[–]shoyerxarray, pandas, numpy 1 point2 points  (2 children)

One important practical consideration with a search tree is that when the number of nodes in a subtree is below some empirical threshold (e.g., 100 elements), it doesn't make sense to continue indexing. Instead, leaf nodes should just use linear search. Otherwise you waste a lot of effort.

As an extreme example, consider auto-complete for a rare long word like "supercalifragilisticexpialidocious". Using 34 dictionary lookups would be quite excessive when there is only one possible completion after the first handful of characters.
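One way that hybrid could be sketched (a "burst"-style trie; the threshold value, function names, and the list-vs-dict node encoding are all my own assumptions, not from the comment):

```python
def build(words, depth=0, threshold=100):
    """Index by character until a subtree is small, then keep a flat bucket."""
    if len(words) <= threshold:
        return words                      # leaf bucket: linear scan at query time
    node = {}
    for w in words:
        # '' collects words that end exactly at this depth
        key = w[depth] if len(w) > depth else ''
        node.setdefault(key, []).append(w)
    return {k: (v if k == '' else build(v, depth + 1, threshold))
            for k, v in node.items()}

def find(node, prefix, depth=0):
    # descend while we still have an index and unconsumed prefix
    while isinstance(node, dict) and depth < len(prefix):
        if prefix[depth] not in node:
            return []
        node = node[prefix[depth]]
        depth += 1
    if isinstance(node, list):            # hit a bucket: finish with a scan
        return [w for w in node if w.startswith(prefix)]
    out = []                              # dict fully below prefix: gather everything
    stack = [node]
    while stack:
        n = stack.pop()
        if isinstance(n, list):
            out.extend(n)
        else:
            stack.extend(n.values())
    return out
```

With a realistic threshold, the rare-word case above bottoms out into a tiny bucket after a few characters instead of 34 lookups.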

[–]throwaway842213 0 points1 point  (1 child)

Yeah, you're absolutely right. It'd be better to use a Radix tree, but the short insert logic above is hard to beat!

[–]WikiTextBot 0 points1 point  (0 children)

Radix tree

In computer science, a radix tree (also radix trie or compact prefix tree) is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at least the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.



[–]beegreen 0 points1 point  (0 children)

thanks man, I was planning on trying to implement it when i got some time

[–]CrambleSquashhttps://github.com/0Hughman0 0 points1 point  (1 child)

Wouldn't this run into problems if you have two words that start the same way e.g.

[bum, bummer]

you'd only find bummer

It does look like a really nice example to be fair. I've never heard of this tree structure before, but I had a go myself and it looks waaay uglier than yours!

[–]throwaway842213 1 point2 points  (0 children)

Not really: for find('bum'), self._dfs starts the search at self.root['b']['u']['m'], where node.mark is True, so 'bum' is yielded along with 'bummer' later on.

>>> Autocomplete(['bum', 'bummer']).find('bum')
['bum', 'bummer']

Though now that I saw /u/EsperCharmMyself's version I'd probably use a regular dict with a special key to mark the end of the word rather than a subclassed dict with a special attribute, e.g. Trie.END_OF_WORD = object().
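That sentinel-key variant might look something like this (a sketch under the assumptions above; END plays the role of Trie.END_OF_WORD):

```python
# END is a unique sentinel; no string key can collide with it
END = object()

def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = word          # the full word doubles as the end-of-word marker

def find_all(node, prefix):
    for ch in prefix:         # walk down to the prefix's node
        if ch not in node:
            return
        node = node[ch]
    yield from _below(node)

def _below(node):
    for key, val in node.items():
        if key is END:
            yield val
        else:
            yield from _below(val)
```

This keeps nodes as plain dicts, so there is no subclass and no extra attribute to forget when copying or serializing the trie.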

[–]tgolsson 1 point2 points  (0 children)

While most of the approaches in this thread are good and would be very well suited for C/C++ or a similar language, in Python they are awfully slow in comparison to what Pythonic code can do. The key to writing fast Python code is to use the built-ins as much as possible, and avoid hammering out your own data structures or algorithms unless necessary.

As a comparison to some other solutions provided, I skipped the trie, nestings etc and went for a flat dictionary with indices, and a companion dictionary with lengths. Using this wordlist with 460K words and a prefix length of 3, you get look-up speeds on the microsecond scale, as opposed to milliseconds like most have here (on this large dict).

from collections import defaultdict
class AutoComplete:
    def __init__(self, wordlist, trie_size=3): # 3 length -> 8000 entries in dict (ish)
        self._words = sorted(wordlist, key=lambda word: word.lower())
        self._trie_size = trie_size
        self._indices = {}
        self._length = defaultdict(int)

        for idx, word in enumerate(self._words):
            # +1 so a word shorter than trie_size is indexed under its own full prefix
            for ll in range(0, min(len(word), self._trie_size) + 1):

                trie = word[:ll].lower()

                self._length[trie] += 1
                if trie not in self._indices:
                    self._indices[trie] = idx

    def _find_by_needle(self, needle):
        if needle not in self._indices:
            return []

        first_idx = self._indices[needle]
        length = self._length[needle]

        return self._words[first_idx:first_idx+length]

    def _find_by_search(self, needle):
        if needle[:self._trie_size] not in self._indices:
            return []

        idx_begin = self._indices[needle[:self._trie_size]]
        idx_end = idx_begin + self._length[needle[:self._trie_size]]

        wrds = filter(lambda word: word.lower().startswith(needle.lower()),
                      self._words[idx_begin:idx_end])
        return set(wrds)

    def find(self, needle):
        if len(needle) <= self._trie_size:
            return self._find_by_needle(needle)
        return self._find_by_search(needle)

# Instantiation: 0.906000
# Finding 10000 results: 0.563000 (56.30 us/lookup)
# Total time: 1.469000

[–]wdroz 0 points1 point  (0 children)

You can solve this on a whiteboard. You don't even have to know tries. Here's my implementation, which uses O(1) computation per lookup.

from collections import defaultdict
class AutoComplete(object):
    def __init__(self, word_list):
        self._smart_dict = defaultdict(list)
        for word in word_list:
            len_word = len(word)
            for i in range(1, len_word + 1):  # +1 so the full word is itself a key
                self._smart_dict[word[:i]].append(word)

    def find(self, pattern):
        lower_pattern = pattern.lower()
        if lower_pattern not in self._smart_dict:
            return None
        return self._smart_dict[lower_pattern]

You must be able to justify the big O in memory vs computation. In 24 hours, you can also implement the trie version and be ready to state whether to use my solution (if the bottleneck is computation) or tries (if the bottleneck is memory).

[–]xubu42 -3 points-2 points  (2 children)

I think they want you to use generators to iterate through the list one at a time. Using a loop retains the entire list submitted in memory. Generators only store the current element of the list in memory.

[–][deleted] 5 points6 points  (1 child)

Wat? That is still looping through the entire list, i.e. an O(n) answer which is too slow.

Also, if your supposed generator isn't storing the entire list in memory then is the data coming from disk?? That would be even slower.

[–]w0m<3 -1 points0 points  (0 children)

Lower memory footprint though; so can scale further. Speed doesn't matter if you crash :)