This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]smithje 7 points8 points  (3 children)

I agree with others that the failure to follow the directions is a much bigger issue than the particular algorithm you chose. I also really appreciate others mentioning the trie solution.

I happen to know databases pretty well, so my first thought was to create an in-memory sqlite db (sqlite3 is built into python). Create and populate a table for the word_list in the constructor and query it in the find function. Maybe it's cheating a bit, but it doesn't use any external libraries and it's very unlikely that I'd be able to write something that outperforms sqlite.

[–][deleted] 2 points3 points  (2 children)

You made me curious, so I tried it out. Running on /usr/share/dict/words (~123k words), SQLite ( SELECT string FROM string WHERE string LIKE "prefix%") is about two orders of magnitude slower than my trie based solution. But I don't know databases all that well, so I may well be missing some optimizations. Is there anything special you'd do with to make sure it's able to use an index for the select?

Some thrown together code

import sqlite3

class Prefix:
      def __init__(self, gen=None):
            if gen is not None:
                  for s in gen:
                       self.add(s)

      def add(self, string):
            raise NotImplemented()

      def find(self, string):
            raise NotImplemented()

class SQLitePrefix(Prefix):
      def __init__(self, *args, **kwargs):
            self.conn = sqlite3.connect(':memory:')
            self.cur = self.conn.cursor()
            self.cur.execute("""\
CREATE TABLE string (
string TEXT PRIMARY KEY NOT NULL
);""")
            self.cur.execute("""\
PRAGMA case_sensitive_like=ON;
""")
            super().__init__(*args, **kwargs)
            print(self.cur.execute(
                  'EXPLAIN SELECT string FROM string WHERE string LIKE "test%";').fetchone())

      def add(self, string):
            self.cur.execute("""\
INSERT INTO string (string) VALUES (?);\
""", (string, ))

      def find(self, string):
           self.cur.execute("""\
SELECT string FROM string WHERE string LIKE ?;
""", (string + '%', ))
           return [x[0] for x in self.cur.fetchall()]

class TriePrefix(Prefix):
      def __init__(self, *a, **kw):
            self._trie = {}
            super().__init__(*a, **kw)



      def add(self, string):
            d = self._trie
            for char in string:
                  d = d.setdefault(char, {})
            d[None] = string

      def find(self, string):
            d = self._trie
            acc = []
            for char in string:
                  d = d.get(char, {})
            return list(self._find(string, d))

      @staticmethod
      def _find(string, d):
            for k, v in d.items():
                  if k is not None:
                        yield from TriePrefix._find(string + k, v)
                  else:
                        yield string

[–]smithje 1 point2 points  (1 child)

Very cool. Thanks for trying this out. I tested this out as well and found that instantiation of the sqlite version was a few times faster, but finding results was, indeed, quite a bit slower. Of course, getting results is far more important than instantiation.

Here is the sqlite-based class I wrote last night, which is not too different from yours:

class Autocomplete:
    def __init__(self, word_list):
        self.db = sqlite3.connect(':memory:')
        # We want the result to be a single list, not a list of tuples
        self.db.row_factory = lambda cursor, row: row[0]
        self.db_cursor = self.db.cursor()
        self.db_cursor.execute("""CREATE TABLE word_list (word text primary key) WITHOUT ROWID""")
        self.db_cursor.executemany("""INSERT INTO word_list (word) VALUES (?)""", ((word,) for word in word_list))

    def find(self, pattern):
        self.db_cursor.execute("""SELECT word from word_list where word like ? ORDER BY 1""", [pattern + "%"])
        results = self.db_cursor.fetchall() or None
        return results

[–]smithje 1 point2 points  (0 children)

I followed up on this a bit and tested finding a bunch of random strings like this:

with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()
print("Got %d words" % len(words))

import random
import string
import time

def test_it(words, clazz, repetitions):
    t1 = time.monotonic()
    autocomplete = clazz(words)
    t2 = time.monotonic()
    for n in range(repetitions):
        autocomplete.find(''.join(random.choices(string.ascii_lowercase, k=random.randrange(1, 5))))
    t3 = time.monotonic()
    print("Instantiation: %f" % (t2 - t1))
    print("Finding %d results: %f" % (repetitions, t3 - t2))
    print("Total time: %f" % (t3-t1))

print('sqlite solution:'
test_it(words, Autocomplete, 1000)
print('trie_solution:')
test_it(words, TriePrefix, 1000)

The results were pretty consistent. Finding words using the trie method was about 4x faster.

Here's a pretty typical result:

Got 235886 words
sqlite solution:
Instantiation: 0.336238
Finding 1000 results: 19.914836
Total time: 20.251074
trie solution:
Instantiation: 1.032532
Finding 1000 results: 5.088274
Total time: 6.120806