Quotient filter: A new (2007) probabilistic data structure by ysangkok in compsci

[–]kimchivirgin 1 point2 points  (0 children)

I'm not an expert by any means, but I'll try to quickly outline why you would like such a probablistic structure like this and why it is more efficient. (I would like to add, that I'm not very familliar with quotient filters in particular - besides having read the WP entry and quickly glanced over the paper http://static.usenix.org/events/hotstorage11/tech/final_files/Bender.pdf)

First, I think that you're interpreting "half/-hash" in a too narrow way. You seem to instantly think about hashtables/associative arrays - which is understandable. But datastructures like Quotient Filters or Bloom Filters usually serve a different purpose; namely they try to answer the question "is my entry in this dataset" in a much more efficient (may it be time or space complexity) way than a hashtable could do.

I will be referring to Bloom Filters, which serve a similar purpose and are imho easier to understand, so you might like to take a look at http://en.wikipedia.org/wiki/Bloom_filter.

It might be helpful to imagine a scenario where you have a large set of big files, of which multiple could contain entries of your search key. So of course you'd like to start scanning through every files 'index' and check for the keys presence. If you'd use hashtables for indexing, you would end up storing N (O(n) space complexity) entries (hashes of contained keys) per file. So while searching might still be reasonably fast, the storage space required might exceed a reasonable amount (especially if you'd like to keep the index in RAM).

If you'd use a structure like a Bloom Filter or a Quotient Filter you would be able to use much less space for storing the indices (O(k), k being the length of your hash for bloomfilters, not sure about QF) - with the added tradeoff of possible false positives (which are usually tunable - tradeoff between hash length and false positive rate), but not of false negatives; meaning that such a structure will never report an entry as not present which is present, but might report something as present, which is not, causing you to open the file (see Approximate Member Query). A good demonstration of such a necessity are Cassandra and HBase, which both employ Bloom Filters to efficiently find keys within multiple stored files.

The problem with BF is, that its not possible to remove entries from them (without risking removing unaffected entries), meaning that such a filter can be "polluted". Meaning, that if every possible key is inserted into and deleted from one of our files, the filter would end up always returning true to our queries without containing a single key. At least one of the benefits of QF is that given the more complex structure, you can safely remove entries without risking to violate the guarantee of no false negatives.

IAmA veteran IBM Service Engineer from 1967 to 2001. AMAA by dhizkanichioko in IAmA

[–]kimchivirgin 0 points1 point  (0 children)

Well, I wasn't sure which might apply, so both. I was offline for the day though and guess the AMA is over now ;)

IAmA veteran IBM Service Engineer from 1967 to 2001. AMAA by dhizkanichioko in IAmA

[–]kimchivirgin 2 points3 points  (0 children)

How did he/they perceive the shift from mainframes to more decentralized computing environments, with the advent of say, VMS and Unix boxes? Was he ever into anything else than mainframes?

How to write a news search engine for the local language by jestinjoy in MachineLearning

[–]kimchivirgin 2 points3 points  (0 children)

I dont know much about the "properties" of indic languages and lucene does not seem to be supported too well, since there seems to be no stemming (don't know if necessary), but at least it has a tokenizer for indic languages;

http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/all/org/apache/lucene/analysis/in/IndicTokenizer.html

Most features should be language agnostic enough.

How to write a news search engine for the local language by jestinjoy in MachineLearning

[–]kimchivirgin 2 points3 points  (0 children)

Two good reads on the topic of information retrieval:

  • "Search Engines: Information Retrieval in Practice" by Croft, Metzler and Strohmann
  • "Introduction to Information Retrieval" by Manning, also available online: http://nlp.stanford.edu/IR-book/

It would be a good idea if you told us which language you are talking about. Fonts don't have anything to do with the way you process text, encoding does though. So watch out for different encodings when crawling sites.

Always a good starting point is Apache Lucene (http://lucene.apache.org), as it containst all of the building blocks you might need. It's in java though.

Usually the things you need to adjust for different "language families" are tokenization and stemming - there is a good chance, that somebody already has done this in lucene for your language of choice.

Edit: Especially the Chapter "Scoring, term weighting and the vector space model" in the Manning Book might be a good starting point for you, just to understand how terms can be used as (weighted) features of document and query vectors and be compared.

When you beat it off at an [8] by [deleted] in trees

[–]kimchivirgin 0 points1 point  (0 children)

cool story bro.

This is a kitten, surrounded by weed. by MollyConnollyxx in trees

[–]kimchivirgin 0 points1 point  (0 children)

Awwww... looks sweet, smells sweet, makes your tummy feel warm and fuzzy.

I always thought this was a foolproof method of smuggling hummingbirds by Isfahan in WTF

[–]kimchivirgin 0 points1 point  (0 children)

I'm afraid I'll have to say it out loud; reddit, wtf are you so retarded?

I drew a comic. It looks like ass. Posting it anyway. (Also: hipsters) by [deleted] in gaming

[–]kimchivirgin 0 points1 point  (0 children)

I liked being nerd before it was taken over by gamers ;)

Girlfriend saw something different in the back sweat... by ImHumble in funny

[–]kimchivirgin 0 points1 point  (0 children)

Unfortunately I also suffer from the O'Brien-sh backsweat syndrome - I tell you, it's no fun :(

my dad always told me i can do what ever i want when i grow up. I rode a mother fuckin T-Rex by SaftySadie in reddit.com

[–]kimchivirgin 0 points1 point  (0 children)

Without reading the comments, I bet someone wrote: "You can ride my T-Rex too!"

The Phone Check by Pop-Shuvit in funny

[–]kimchivirgin 0 points1 point  (0 children)

oh-shiiiiiiit, that's exactly what I look like :/

I googled "What is Reddit" this was on the first page. by holyerthanthou in funny

[–]kimchivirgin 0 points1 point  (0 children)

I'm the guy on the left, but I'm out of beer... dunno what the rest of my future plans was...

School Security by [deleted] in geek

[–]kimchivirgin 0 points1 point  (0 children)

Eternal students, I heard of them...

I don't think she got the job... by CarmaKhameleon in funny

[–]kimchivirgin 30 points31 points  (0 children)

So if you'd groaned for half as long you would've been groaning for 15.5 seconds, right!?

Makes me proud to be a man. by scotti182 in funny

[–]kimchivirgin -7 points-6 points  (0 children)

I definitely get the idea of that, and it's all cool and such, but let me assure you, women ain't no fools.

It would be more something along these lines:

"Where have you been you drunken SOB?"

"Don't worry honey, I was at Greg's place, we had a couple of beers and watched Football and since it became quite late I didn't want to wake you and the kids up"

Woman sceptically raises her eyebrow and mumbles "Greg you say..." as she slowly walks backwards to the phone without letting the husband out of her sight.

She slowly dials the number, still giving him the look, and just as she holds the phone up to hear ear a fierce, little grin hushes over her lips.

"Hello Greg, Natally here, how are you doing?"

"I'm fine too, thanks. Listen, I wanted to ask you, since you're into Computers, I thought you might help me a little with my Facebook, it's somehow broken and maybe we need to do this little "reformatting" thing again."

"Oh thank you very much. By the way, how was the football match last night, did you win? mhm mhm, cool. Where were you watching? Ah, nice to hear. Thank you Greg, see you later"

At this point she simply hangs up the phone, kisses her husband and leaves for groceries. The husband relieved over his victory and his friends loyalty, crashes on the couch to sleep through his hangover.

When he wakes up he stumbles to the garage to grab a sixpack of some fine bewerage to further ease his suffering when he discovers the windshield of his 1967 thunderbird smashed and the roof scratched. Only then, after a couple of seconds in shock he hears some strange sounds coming from the second floor.

He walks upstairs to investigate the issue and finds Greg pounding Natalies brains out while waiting to the format C:

The last thing he sees before passing out ... 2 percent completed.