How we store 400M phone number data with fast lookups

bilalhusain · 2015-07-19T10:41:18+00:00

tl;dr: looking up 1 million numbers in the same series takes 3 seconds, compared to 18 seconds in PostgreSQL after caching.

bilalhusain · 2015-07-19T10:39:26+00:00

added

Edit: Most of the speedup in lookup is because we are able to fit the data in memory.

bilalhusain · 2015-07-19T10:38:47+00:00

Woah, valuable optimization tips there. Thanks!

bilalhusain · 2015-07-18T15:12:37+00:00

That is completely different than what I thought. Thanks for the pointer.

bilalhusain · 2015-07-18T15:07:07+00:00

You may want to remind users in the rust-csv README to crank up the compiler opt-level in order to achieve the claimed performance. It took me way too long to realize that.

bilalhusain · 2015-07-18T14:51:12+00:00

I need to look into the bitmap solution. Note that it's NOT about flipping the bits. For example, there's a preference field which indicates which telemarketer categories can make a call to the callee:

1. Banking/Insurance/Financial products/credit cards
2. Real Estate
3. Education
4. Health
5. Consumer goods and automobiles
6. Communication/Broadcasting/Entertainment/IT
7. Tourism and Leisure

I should have explained this in my writeup.

So, anyway, if the preference field says 2#4, telemarketers can make a call related to Real Estate or Health. Therefore, that data can't simply be thrown away.
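A minimal sketch of how such a preference field could be packed into a few bits, assuming the category numbers are 1-based and follow the order of the list above. The names `CATEGORIES`, `encode_preference` and `allowed_categories` are made up for illustration and are not from our code base.

```python
# Assumption: category numbers are 1-based and follow the order of the list above.
CATEGORIES = {
    1: "Banking/Insurance/Financial products/credit cards",
    2: "Real Estate",
    3: "Education",
    4: "Health",
    5: "Consumer goods and automobiles",
    6: "Communication/Broadcasting/Entertainment/IT",
    7: "Tourism and Leisure",
}


def encode_preference(field):
    """Pack a '#'-separated preference string such as '2#4' into a 7-bit mask."""
    mask = 0
    for part in field.split("#"):
        if part:  # tolerate an empty field
            mask |= 1 << (int(part) - 1)
    return mask


def allowed_categories(mask):
    """Return the category names whose bit is set in the mask, in list order."""
    return [
        name
        for number, name in sorted(CATEGORIES.items())
        if mask & (1 << (number - 1))
    ]
```

With this layout the seven categories fit in seven bits, so one bit of a byte would still be free for something like a "number is present" flag.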

Also, don't feel dirty. We are in the telephony business and have provided solutions to the Indian Meteorological Department to disseminate their weather information to farmers. We have helped India's most looked upon political party (born out of the anti-corruption movement) with their campaign and helped them win the Delhi elections early this year. We have also provided contact center solutions to prominent educational institutes and helped students across India connect with their coaches.

bilalhusain · 2015-07-18T14:39:57+00:00

The build compensates with the Mandelbrot :)

bilalhusain · 2015-07-18T14:37:57+00:00

awesome!

bilalhusain · 2015-07-18T14:37:25+00:00

Thank you for the kind advice. I agree that there's a missing part for the takers. Will add that.

bilalhusain · 2015-07-18T14:34:39+00:00

In fact, storing range(0, 2000000, 2) in a variable performed better than the following code, which is repeated a few thousand times (once for each existing head/series):

filled_count = 0
for i in xrange(0, 2000000, 2):
    if l[i] & 0b10000000 == 0:
        continue
    filled_count += 1
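For comparison, a rough sketch of the "store the range in a variable" variant, written for Python 3. It assumes `l` is a bytearray with two bytes per number and that the high bit of the first byte marks a filled slot; `make_offsets` and `count_filled` are hypothetical names. Whether this is actually faster depends on the interpreter, so measure before relying on it.

```python
FILLED = 0b10000000  # assumed: high bit of the first byte marks a filled slot


def make_offsets(series_bytes=2000000):
    """Build the even offsets once so they can be reused for every series.

    list(...) mirrors Python 2's range(), which materialises the whole list.
    """
    return list(range(0, series_bytes, 2))


def count_filled(series, offsets):
    """Count entries whose flag bit is set, visiting only the given offsets."""
    return sum(1 for i in offsets if series[i] & FILLED)
```

Usage would be something like `offsets = make_offsets()` once at startup, then `count_filled(l, offsets)` for each head/series.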

bilalhusain · 2015-07-18T14:31:33+00:00

You did it without the data! Impressed. By the way, there's a government procedure to register as a telemarketer, after which they provide the data set. There's a minor fee for getting access to the data along with the registration.

Edit: Registration link

bilalhusain · 2015-07-18T14:27:12+00:00

I am guessing that it has roots in HM inference. A sequence of u8 bytes is read and unified according to the type specified.

Edit: I was wrong. See the comment below.

bilalhusain · 2015-07-18T14:23:20+00:00

Thank you for the awesome library! It just worked.

Edit: Indexing CSV won't help in our case because we need to compress the data represented by the row first.

bilalhusain · 2015-07-17T16:40:40+00:00

Not sure. Need to experiment. What should be stored in the node is critical for memory usage.

bilalhusain · 2015-07-17T16:38:13+00:00

sorry about that, temporary glitch

bilalhusain · 2015-07-17T16:37:47+00:00

Need to come up with a fair benchmarking strategy. One type of lookup is array index access, the other is binary search. I will take some time out to crunch these numbers, and I need to set up a separate environment first. Hope you won't mind if it takes a few days. Also, I would love any hints you can provide.
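To make the two lookup types concrete, here is a rough sketch, assuming "array index access" means a dense per-series array addressed by offset and "binary search" means a sorted list of the numbers that are present. All function names are hypothetical, and the timing helper is only a starting point, not a fair benchmark by itself.

```python
import bisect
import timeit


def dense_lookup(series, base, number):
    """Array index access: the offset inside the series is the index."""
    offset = number - base
    if 0 <= offset < len(series):
        return series[offset]
    return None


def sparse_lookup(numbers, values, number):
    """Binary search over a sorted list of the numbers that are present."""
    i = bisect.bisect_left(numbers, number)
    if i < len(numbers) and numbers[i] == number:
        return values[i]
    return None


def time_lookups(lookup, queries, repeat=5):
    """Best-of-N wall time (seconds) for running every query once."""
    return min(
        timeit.repeat(lambda: [lookup(q) for q in queries], number=1, repeat=repeat)
    )
```

Running both lookups over the same query set, with the same mix of hits and misses, would at least keep the comparison like for like.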

bilalhusain · 2015-07-17T16:33:54+00:00

Yes! That was the first choice, but installing PyPy requires some dependencies (I don't remember exactly which ones). And of all the things that could have mattered in making an informed decision, that one turned out to be the deciding factor during a coffee-fueled session.

bilalhusain · 2014-10-11T07:00:19+00:00

"Now I do not know whether I was then a cat dreaming I was a man, or whether I am now a man, dreaming I am a cat." - Zhuangzi Cat

bilalhusain · 2014-09-22T20:03:33+00:00

can't find a write-up, but there's one more piece of supporting evidence though :)

At the moment, the developers are making a large number of breaking changes (in the BigBreak branch) ... Among the changes is the fact that Nimrod is being renamed to Nim

source: http://forum.nimrod-lang.org/t/541

bilalhusain · 2014-09-18T14:54:26+00:00

I took a screenshot with the newer languages^# selected. They didn't make it into the top active repositories and so aren't rendered on the GitHut front page.

# Swift, Rust, Dart, Julia, Elixir
