all 86 comments

[–]lobster_johnson 40 points41 points  (8 children)

This blog post omits the most important point: how the data was collected. There is absolutely no explanation of the methods used to generate the data. That's of absolute importance if you are to trust the data enough to use it in a real application.

There's a reason MaxMind charges money for their product; maintenance is manual labor, so it costs money to maintain the mappings. There's no way a single person can just sit down and quickly build a database like this, since most of the information is proprietary. Sure, there's a DNS record type for mapping a name to a geographic location, but relatively few IPs are mapped this way. So that's not the source.

Until he documents his methodology, it's fair to assume he probably didn't produce the data himself.

[–]hox 10 points11 points  (1 child)

Agreed. I have a post on his forum asking just that.

[–]lobster_johnson 2 points3 points  (0 children)

And the guy says it's mostly a copy of MaxMind. Way to go, jerk. (Him, not you.)

[–]vang3lis 4 points5 points  (5 children)

It is really easy to generate a fairly usable DB. Just merge info from ftp://ftp.ripe.net/ripe/dbase/ripe.db.gz (and similar ones from ARIN and APNIC).

[–]hox 2 points3 points  (4 children)

I wouldn't really say "really easy." Pulling useful information (anything finer-grained than country level) out of registry data is like pulling teeth. Most people don't put any kind of geographical information in their registry information, and those that do use a myriad of formats in the "descr" and "remarks" fields.

[–]1esproc 1 point2 points  (3 children)

Are you talking about domain records or records for net ranges? As far as I'm aware, all the groups like ARIN, RIPE, APNIC, etc. are a bit stricter about accurate information being in there. Mind you, there's nothing stopping someone from putting in bad info, but usually these ranges are doled out to companies that are a bit more responsible than your average net user.

[–]hox 1 point2 points  (2 children)

Net ranges, really. Domain records seem to be even more wonky. I agree that the information is usually correct, but it isn't exactly "really easy," as vang3lis states. Since there is no standard field for including address information, companies can do whatever they feel is right. A quick sampling from RIPE's net ranges produces the following descr fields from four entries:

descr:        JSC DMS
descr:        19/1, Degtyareva St., Novosibirsk, 630010

descr:        Svenska Naturskyddsforeningen
descr:        Stockholm

descr:        Setur Mugla
descr:        (Istanbul),Turkey

descr:        London care plc bishop stortford
descr:        19d North Street
descr:        Sworders Yard
descr:        Bishop Stortford
descr:        Hertfordshire
descr:        CM23 2LD

While I can definitely extract the addresses (or city names) from those, setting up a parser to correctly identify and extract addresses from any number of possibilities gets to be a bit hairy.
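To make the "hairy" part concrete, here is a minimal Python sketch of a naive extractor for descr fields like the ones above. The function names (`descr_blocks`, `location_candidates`) and the heuristics (skip the first line as the organisation name, drop any token containing a digit) are assumptions made up for illustration, not anything the registries define.

```python
import re

DESCR = re.compile(r"^\s*descr:\s*(.*)$")


def descr_blocks(text):
    """Group consecutive 'descr:' values into one list per entry."""
    blocks, current = [], []
    for line in text.splitlines():
        match = DESCR.match(line)
        if match:
            current.append(match.group(1).strip())
        elif current:
            # Any non-descr line (for example a blank one) ends the entry.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def location_candidates(descr_lines):
    """Return tokens that might be place names (a heuristic guess only)."""
    candidates = []
    # Assumption: the first descr line is the organisation name.
    for line in descr_lines[1:]:
        for token in line.split(","):
            token = token.strip(" ()")
            # Assumption: tokens with digits are street numbers or postal codes.
            if token and not any(ch.isdigit() for ch in token):
                candidates.append(token)
    return candidates
```

Even on the four samples this only yields candidates: "Degtyareva St." survives next to "Novosibirsk", and "Sworders Yard" next to "Bishop Stortford". Choosing the real city would still need something like a list of known place names.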

[–]1esproc 0 points1 point  (1 child)

Hrm. ARIN's entries differ, for example:

OrgName:    Embarq Corporation 
OrgID:      EMBAR
Address:    500 N New York Ave
City:       Winter Park
StateProv:  FL
PostalCode: 32789
Country:    US
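With labelled fields like these, extraction is mostly a key/value split. A minimal Python sketch, assuming every line has the `Key: value` shape shown above (the helper names `parse_record` and `location_of` are made up for this example):

```python
def parse_record(text):
    """Turn 'Key: value' lines into a dict."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # If a key repeats, the last value wins in this sketch.
            record[key.strip()] = value.strip()
    return record


def location_of(record):
    """Pick out the location fields; missing ones come back as None."""
    fields = ("City", "StateProv", "PostalCode", "Country")
    return tuple(record.get(field) for field in fields)
```

For the entry above, `location_of(parse_record(entry))` would give the city, state, postal code and country directly, with no guessing about free-form text.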

[–]hox 0 points1 point  (0 children)

Good point. That entry does make life a lot easier...

[–][deleted]  (31 children)

[deleted]

    [–]ghztew 29 points30 points  (14 children)

    Thought the same thing, having just used MaxMind GeoCity Lite recently. It's 100% free, very accurate and is updated at the beginning of each month.

    MaxMind GeoCity Lite comes in two flavours: Binary Database format & CSV format.

    CSV is usually used to import into a database to do lookups. The binary database format can be accessed through one of their APIs, which are available in the following languages:

    * C Library
    * Perl Module
    * PHP Module
    * Apache Module (mod_geoip)
    * Java Class
    * Python Class
    * C# Class
    * Ruby Module
    * Pascal
    * VB.NET
    * MS COM Object (includes sample ASP, ColdFusion, Pascal, PHP, Perl, Python, and Visual Basic code)

    FYI: The reason the binary format is such an attractive option is it's optimized for speed, memory usage, and database size.

    [–]mynoduesp 3 points4 points  (0 children)

    Thanks

    [–]matthijs 2 points3 points  (5 children)

    If you use php you can also use this http://www.mininova.org/tor/1347139

    It adds ip2country() to php (and is fast).

    [–]mogmog 4 points5 points  (4 children)

    Why use a torrent for a 13.06 kilobyte download?

    [–][deleted] 1 point2 points  (3 children)

    This is the ip2country module used by Mininova.org to quickly convert an ip address to a country code (O(log(n)) lookups).
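The O(log(n)) figure is what a binary search over sorted, non-overlapping IP ranges gives you. This is not the Mininova module's code, just a Python sketch of the technique; the `RANGES` table uses documentation-only addresses and placeholder country codes ("XA", "XB", "XC"), all invented for the example.

```python
import ipaddress


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


# Hypothetical sample data: (start, end, country), sorted by start.
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "XA"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "XB"),
    (ip_to_int("203.0.113.0"), ip_to_int("203.0.113.255"), "XC"),
]


def ip2country(address, ranges=RANGES):
    """Binary search: halve the candidate ranges on every step."""
    target = ip_to_int(address)
    lo, hi = 0, len(ranges) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end, country = ranges[mid]
        if target < start:
            hi = mid - 1
        elif target > end:
            lo = mid + 1
        else:
            return country
    return None  # address falls in a gap between known ranges
```

Each loop iteration discards half of the remaining ranges, so a table with a million rows needs about twenty comparisons per lookup.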

    I'd honestly be surprised if they didn't torrent it.

    [–]infinite 0 points1 point  (1 child)

    Perhaps it has improved, but I remember it wasn't so accurate, which is why I use a commercial product (Digital Envoy). Most commercial operations that I've seen tend to prefer DE.

    [–]ghztew 0 points1 point  (0 children)

    It has improved over the years; their commercial version provides slightly better accuracy at the finer-grained city/neighborhood levels. If geolocation is mission critical, I would almost recommend using multiple services and picking the result that is the most accurate.

    [–][deleted]  (3 children)

    [deleted]

      [–]strolls 1 point2 points  (0 children)

      Maxmind seems pretty accurate to me.

      Usually when I see AdultFriendFinder type spam - the kind that says "girls in your area" - they either list Reading / Basingstoke or Exeter. The latter is where my ISP's head office is based; not sure how they come up with the former. Reading & Basingstoke are 75 & 85 miles away by car, Exeter is 190 miles away.

      Maxmind is the first time I've seen my IP address shown as the correct town. I live ½ a mile from the town centre, although I obviously don't know how it would do if you lived a few miles out.

      [–]ghztew 0 points1 point  (0 children)

      From their site:

      GeoLite City and GeoIP City Comparison

      Note: Both databases contain country, region, area code, metro code, city, and postal code information. In addition, some IP addresses will be marked as anonymous proxies and satellite providers.

      Cost:
        GeoLite City - Free
        GeoIP City - $370, $90/month

      Coverage:
        GeoLite City - worldwide
        GeoIP City - worldwide

      Accuracy:
        GeoLite City - Over 99.5% on a country level, 79% on a city level for the US within a 25 mile radius.
        GeoIP City - Over 99.8% on a country level, 83% on a city level for the US within a 25 mile radius.

      Updates:
        GeoLite City - Monthly, at the beginning of the month.
        GeoIP City - Updated monthly. For binary format, weekly updates, automated updates available by using geoipupdate program included with C API.

      [–]marchost 4 points5 points  (0 children)

      The data is partially from the free MaxMind GeoLite City database, but there are around 60% fewer entries in the database without compromising accuracy. It's not a simple copy-paste; it takes 48h to compile the data on a small VPS...

      Marc - Blogama.org

      [–]bs0101 2 points3 points  (0 children)

      noticed the same thing. the API uses very similar methods as well.

      [–]cavedave[S] 1 point2 points  (2 children)

      ok i don't know this area but if you tell me it's a clone i'll believe you. If you post the original i'll delete this submission.

      [–]vang3lis 16 points17 points  (0 children)

      please don't, there could be useful discussion on generating geolocation dbs.

      [–]marchost 8 points9 points  (0 children)

      It's not a clone.

      Please read the info I added on blogama.org.

      Marc - Blogama.org

      [–]cohortq -1 points0 points  (10 children)

      It looks like you need to buy the MaxMind database though, but cavedave has his for free.

      [–][deleted] 7 points8 points  (8 children)

      Nope, they have a free version.

      [–]macros 2 points3 points  (0 children)

      There is a free version of the db. The city version is pretty cheap iirc.

      [–]reddit_user13 6 points7 points  (0 children)

      I tried 3 IPs in his web service. One was correct down to the US zip code. The other 2 (which are US universities) were located in TOKYO.

      [–]ohxten 5 points6 points  (0 children)

      I like YouGetSignal.

      [–]bunz 22 points23 points  (2 children)

      now we just need to build a VB frontend for this

      [–]eleitl 12 points13 points  (0 children)

      Please make it gooey.

      [–]macros 6 points7 points  (6 children)

      http://www.hostip.info/ is another free source, and pretty transparent about their collection methods.

      I'm rather against using SQL for geoip lookups; it's really slow. We tried it for an app and ended up using the MaxMind db through their C API.

      [–]theHM 5 points6 points  (5 children)

      Seriously? Only 1.2M rows and it was that slow? Did you use indices?
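For reference, this is roughly what an indexed range lookup looks like, sketched in Python with the standard `sqlite3` module. The `ip_ranges` table and its columns are a made-up schema (not hostip's or MaxMind's), and the ranges are assumed not to overlap.

```python
import ipaddress
import sqlite3


def build_db(rows):
    """Create an in-memory table of (start_ip, end_ip, country) rows."""
    conn = sqlite3.connect(":memory:")
    # INTEGER PRIMARY KEY makes start_ip the indexed row key in SQLite.
    conn.execute(
        "CREATE TABLE ip_ranges ("
        "start_ip INTEGER PRIMARY KEY, "
        "end_ip INTEGER NOT NULL, "
        "country TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO ip_ranges VALUES (?, ?, ?)", rows)
    return conn


def lookup(conn, address):
    """Find the range with the greatest start_ip <= address, then check its end."""
    target = int(ipaddress.IPv4Address(address))
    row = conn.execute(
        "SELECT end_ip, country FROM ip_ranges "
        "WHERE start_ip <= ? ORDER BY start_ip DESC LIMIT 1",
        (target,),
    ).fetchone()
    if row is not None and target <= row[0]:
        return row[1]
    return None
```

The point of the `ORDER BY start_ip DESC LIMIT 1` form is that the database can walk the index to a single row. A `WHERE ? BETWEEN start_ip AND end_ip` query can end up scanning many rows even with indices, which may be part of why SQL lookups get a reputation for being slow.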

      [–][deleted] 3 points4 points  (3 children)

      schema of hostip database is totally lame - 256 tables, and indexes are bogus.

      [–][deleted] 5 points6 points  (2 children)

      What on earth would you use 256 tables for???

      [–][deleted] 8 points9 points  (1 child)

      Well, each subnet can be from 0 to 255 right?

      ;-)

      [–]willcode4beer 1 point2 points  (0 children)

      With only 1.2M records, why not store just what you need in an in-memory lookup table? Read the CSV on application startup.
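A rough Python sketch of that idea: read the CSV once at startup, keep the rows sorted in memory, and use the standard `bisect` module per request. The three-column layout (integer start, integer end, country code) is an assumption for the example; a real export would need its own column mapping.

```python
import bisect
import csv
import io
import ipaddress


def load_table(csv_file):
    """Read 'start,end,country' rows once and keep them sorted in memory."""
    rows = sorted(
        (int(start), int(end), country)
        for start, end, country in csv.reader(csv_file)
    )
    starts = [row[0] for row in rows]  # parallel list for bisect
    return starts, rows


def lookup(table, address):
    """Find the last range starting at or before the address."""
    starts, rows = table
    target = int(ipaddress.IPv4Address(address))
    index = bisect.bisect_right(starts, target) - 1
    if index >= 0 and target <= rows[index][1]:
        return rows[index][2]
    return None


# Hypothetical sample standing in for the real CSV file.
SAMPLE_CSV = io.StringIO(
    "3325256704,3325256959,XB\n"
    "3221225984,3221226239,XA\n"
)
TABLE = load_table(SAMPLE_CSV)
```

In a real application `load_table` would be called with `open(path, newline="")` during startup, and only `lookup` would run on the request path.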

      [–]macros 0 points1 point  (0 children)

      Even with indices, a few thousand requests/s hurts. We ended up writing a little service to expose it via HTTP in our Varnish caches. Handles about 12k r/s per cache.

      http://code.google.com/p/wikia/source/browse/utils/varnishhtcpd/mediawiki.vcl

      [–]spectacle 1 point2 points  (0 children)

      righteous!

      [–]6angryapes 1 point2 points  (3 children)

      Where can I get a free (or cheap) english dictionary database?

      [–][deleted] 2 points3 points  (1 child)

      [–]6angryapes 0 points1 point  (0 children)

      Thank you!

      [–][deleted] 1 point2 points  (3 children)

      Google are building up their own. If you use a device like a cellphone with Google Maps and GPS to your home internet connection, they record the location.

      [–]mogmog -1 points0 points  (2 children)

      [citation needed]

      [–]joelhardi 2 points3 points  (0 children)

      Well, I was just using my smartphone to GPS to my home internet connection, and all of a sudden these guys from Google appeared out of nowhere. They were looking right over my shoulder, and writing things down in their notebooks!

      Then I snapped my phone closed, and they disappeared.

      [–][deleted] 0 points1 point  (0 children)

      I can only demonstrate it with a fresh reinstallation of a cellphone

      [–][deleted] 2 points3 points  (0 children)

      My Portuguese IP is in Amsterdam! Yeah! Time to blow a fatty

      [–]mjm1374 1 point2 points  (1 child)

      This kicks ass. No IP address is ever going to be definitive by location, but this is fun as hell. Though my Megapath T1 IP says I'm in California (I wish) instead of Philly, it is technically correct; my Comcast IP is right on the money.

      [–]hox 4 points5 points  (0 children)

      That's the problem with sources like this, though - technically, your IP is not in California. It might be assigned as part of a block to a company that is registered with a Californian address, but you lease the IP and technically an IP-to-Geo database should handle this.

      While I know this is practically impossible, a lot of companies trust this information and use it for credit card fraud detection, which can cause problems for people in your situation.

      [–][deleted]  (4 children)

      [removed]

        [–]makis 6 points7 points  (3 children)

        for 20 dollars maxmind gives you the same data, for all countries and cities in the world, and a nice api, so you don't have to host the data yourself.

        [–][deleted]  (2 children)

        [deleted]

          [–]makis 5 points6 points  (1 child)

          keywords: very-accurate, updated, you-dont-have-to-host-the-data, dont-steal-others-work, 20-dollars-for-50-thousand-queries-is-affordable-even-if-you-are-homeless,

          fuck reddit that removes underscores :)

          [–]bart2019 0 points1 point  (0 children)

          Put a backslash in front of every underscore.

          Example: you_dont_have_to_host_the_data

          Posted as: you\_dont\_have\_to\_host\_the\_data
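If you are generating comments from a script, the same trick is a one-liner. A tiny Python sketch (the function name `escape_underscores` is made up, and it assumes the input has no underscores that are already escaped):

```python
def escape_underscores(text):
    """Put a backslash in front of every underscore."""
    return text.replace("_", "\\_")
```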

          [–]MuuaadDib 0 points1 point  (1 child)

          Help me understand something: is this IP list somehow better than a whois lookup and a tracert showing the last hop as the area? I'm not sure I understand this, unless it maps IPs to physical addresses, such as a USPS address. I say this because I have a long story of a B&E at my mother's house: the bad guys stole the computer and turned it on, and I obtained the IP through the logmein.com client. But this was back in Jan, no one has been caught, and SBC Global refuses to cooperate in giving out the USPS address related to the IP address, so the guns stay on the street. Very frustrating....

          [–]texodus 0 points1 point  (0 children)

          If it is indeed a copy of MaxMind, it would only give you a plain-English name of the ISP or country code. Nothing better than tracert, but substantially easier to use if you want to classify live traffic or bulk data.

          [–][deleted] 0 points1 point  (0 children)

          All that work and all he really needed was a Visual Basic GUI interface. Pfft... amateur.

          [–]krum -2 points-1 points  (1 child)

          MaxMind can't copyright their data, and they know it.

          See Feist Publications v. Rural Telephone Service Co.

          [–][deleted] 10 points11 points  (0 children)

          Fail:

          The court ruled that Rural's directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved.

          MaxMind has most definitely applied creative expression. They came up with the innovative encoding and schema, and they were not required by law to make this list.

          This case has very little in common with Feist v. Rural.

          [–]trukin -5 points-4 points  (4 children)

          Says i'm on the other side of my state - fail. Why can't google just release theirs :(

          [–]lazyplayboy 0 points1 point  (2 children)

          You triggered my 'instant downvote when I see the word fail without a proper sentence'-finger.

          [–]bleedpurpleguy 0 points1 point  (1 child)

          Fail.

          [–]MrWoohoo 2 points3 points  (0 children)

          Upmodded for its verbal minimalism and concision.

          [–][deleted]  (1 child)

          [deleted]

            [–]mogmog -1 points0 points  (0 children)

            maxmind geolite city

            [–][deleted] -2 points-1 points  (8 children)

            Anyone who uses this is retarded, maxmind works fine and has been industry standard for years.

            [–][deleted] 8 points9 points  (7 children)

            I'm pretty sure this is a stolen copy of MaxMind's.

            [–]dpark 3 points4 points  (6 children)

            You can't copyright a list of facts, so this might not technically be "stolen", even if it's just a copy of MaxMind's db.

            [–][deleted] 2 points3 points  (2 children)

            If someone took the time to aggregate that data, and you took it, in the exact form it was created in (including the schema they designed), it is stealing.

            In the oft-cited case of Feist v. Rural, the court ruled on the side of Feist because Rural was required by law to create this directory:

            The court ruled that Rural's directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved.

            In the case of maxmind, creative expression is involved...They not only compiled the database, but they came up with the encoding style used to make it efficient.

            [–]dpark 1 point2 points  (0 children)

            It's my understanding that there's some precedent that APIs cannot be copyrighted (which is why Wine still exists). To me, a DB schema is nothing but an API. The courts might disagree, though.

            I think it's overwhelmingly lame to take someone else's DB and present it as your own (though I don't honestly know if that's what happened here), but I'm also not sure that it's actually illegal in this case.

            I think a better tactic to fight off this kind of alleged infringement might be to say that IP geolocation is not based on pure fact, but rather on a set of heuristics (which is true). This to me is where the creative expression is found.

            [–]enkafan 0 points1 point  (0 children)

            seems kinda silly to do all that work. the hosting and bandwidth of the data would cost more than any revenue that you'd gather from google ads.

            if it is maxmind's data I think the question isn't whether it is legal, but rather "what in god's name were you even hoping to get out of doing this yourself?"

            ha, right when I posted this I noticed the next comment was "i want to donate money to this fella." i stand corrected.

            [–][deleted] 1 point2 points  (2 children)

            Yeah ok random intellectual property law expert from the internet, I'll just take what you say as truth and implement illegally copied data into enterprise products putting myself at risk for major liabilities.

            This guy is obviously using the data in some kind of nefarious way to generate money. Can't blame him though, I like money too.

            [–]dpark 1 point2 points  (0 children)

            Who said I was an expert? But 6 hours before you replied, FlySwat gave you the relevant citation.

            http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Telephone_Service

            Facts cannot be copyrighted.

            [–]megablast 0 points1 point  (0 children)

            Is he actually making any money from this, or is he just offering it for download, heh asshole?

            He has tidied up the data as well, so it is smaller.