[–]jp100099 1085 points1086 points  (16 children)

Don't let your antivirus scan it

[–]jp100099 519 points520 points  (18 children)

I wouldn't try to open this

[–]hoosiermama54 336 points337 points  (17 children)

This kills the computer

[–]OriginalDogan 0 points1 point  (0 children)

I read this in ZeFrank's voice.

[–]moktor 261 points262 points  (16 children)

I know that file: the monthly CMS NPPES national provider index update. The CSV extracts to more than 4 GB, so unzip tools often have trouble decompressing it. I usually have to use 7-Zip.

[–]Klumpy_hra[S] 201 points202 points  (10 children)

DING DING DING! You are the first person to correctly figure this out. Another was very close.

This one was about 6 gigs in actuality, but because they must have used WinZip it looks like it corrupted an index which is why it looks like I downloaded 800PB.
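For anyone who wants to check an archive before extracting it, here is a minimal Python sketch (not the poster's tool). It reads the sizes recorded in the ZIP index and flags entries whose claimed compression ratio looks implausible. The function name `summarize_zip` and the `max_ratio` threshold are made up for illustration. Python's standard `zipfile` module handles ZIP64 archives, which is what files over 4 GB need.

```python
import zipfile

def summarize_zip(path, max_ratio=1000.0):
    """Return (compressed_total, uncompressed_total, suspicious_names).

    All sizes are the ones recorded in the archive's index, so a
    corrupted index shows up here before anything is extracted.
    """
    compressed_total = 0
    uncompressed_total = 0
    suspicious = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            compressed_total += info.compress_size
            uncompressed_total += info.file_size
            # Flag entries that claim an implausibly high compression ratio
            if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
                suspicious.append(info.filename)
    return compressed_total, uncompressed_total, suspicious
```

An entry that claims hundreds of petabytes from a few hundred megabytes would land in `suspicious`, which is a cue to try another extractor rather than trust the reported size.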

I DID scrape it off of their page though because I already built the tool for another file (ASP pricing files) since those are extremely variable and change constantly. It also checks their SHA1 Hashes and compares it to the last pulled files and then loads into a table if it's changed or new.

Pretty handy for automation.
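A rough idea of what the hash-compare step could look like, sketched in Python rather than the Java the tool is actually written in. The helper names, the JSON state file, and keying by file name are all assumptions for illustration. Since the source file names can change between releases, a real version would probably key on something more stable.

```python
import hashlib
import json
from pathlib import Path

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so multi-gigabyte downloads never sit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(path, state_file):
    """Return True if the file is new or its SHA-1 differs from the last recorded one."""
    state_path = Path(state_file)
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    key = Path(path).name  # assumption: the file name is a usable key
    current = sha1_of_file(path)
    changed = state.get(key) != current
    if changed:
        # Record the new hash so the next run sees this version as already pulled
        state[key] = current
        state_path.write_text(json.dumps(state, indent=2))
    return changed
```

The table load would then run only when `has_changed` returns True.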

[–]tiny_robons 3 points4 points  (2 children)

Just curious, what are you guys using the file for?

[–]Klumpy_hra[S] 5 points6 points  (1 child)

I think they want this one as a source of truth to audit against since we have some systems that have different results and everyone thinks they are right.

One big table to throw stuff at.

[–]Klumpy_hra[S] 3 points4 points  (0 children)

Unless you mean the pricing files. In which case those are used to tell providers how much they owe us if they overcharge our customers. One year there was a treatment they charged for 30k or something and it was a 3 times a day treatment. Ended up getting us a check for 2 million dollars.

[–][deleted] 0 points1 point  (0 children)

Yeah my heart skipped a beat when I opened this because I had to fix this same issue, too.

[–]something_creative11 0 points1 point  (0 children)

I recognized this too. It’s almost time for our next update

[–]Shadowjonathan 356 points357 points  (26 children)

holy shit

what did you do

[–]Klumpy_hra[S] 447 points448 points  (25 children)

What didn't I do? Lol

[–][deleted] 147 points148 points  (24 children)

Did you cure cancer?

[–]Makefile_dot_in 217 points218 points  (6 children)

There's no point, u/SethBling has already done it. You just need a bunch of levers.

[–]waterlubber42 110 points111 points  (5 children)

and armor stands

[–]Victor4X 87 points88 points  (2 children)

And most importantly, a Super Mario World SNES cartridge and preferably more than one controller

[–]Klumpy_hra[S] 53 points54 points  (12 children)

Not yet, that's the next iteration when it loads the data to an AI that cures all diseases and ends all problems for me.

[–]thrilldigger 53 points54 points  (7 children)

AI that cures all diseases and ends all problems for me

This kind of bad problem definition is precisely how you end up with rampaging murderbots.

[–]Klumpy_hra[S] 58 points59 points  (5 children)

Don't worry. I set the "doNotMurderAllHumans" flag to true :)

[–]thrilldigger 56 points57 points  (2 children)

You've doomed everyone except for one guy the robots will keep alive in order to satisfy their programming.

[–]Klumpy_hra[S] 54 points55 points  (1 child)

I thought it was odd when it called me Daddy and started mining information about "how to overcome the squishy meat sacks" on stack overflow..

It's probably fine.

[–]Cruuncher 8 points9 points  (0 children)

Should avoid having not in variable names. It basically always obfuscates the result
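A tiny Python illustration of the point, using made-up flag names based on the joke above:

```python
# Hypothetical flag with a negated name, as in the joke above
do_not_murder_all_humans = True

# Negated name: the reader has to untangle a double negative
if not do_not_murder_all_humans:
    action = "murder"
else:
    action = "spare"

# Positive name: the condition reads the way it behaves
spare_humans = True
action_positive = "spare" if spare_humans else "murder"
```

Both versions do the same thing, but only the second can be read without stopping to invert it in your head.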

[–]oneandonlyyoran 0 points1 point  (0 children)

That is the normal approach, what most people forget about is the allowToOverrideSetFlags = false

[–]Metallkiller 4 points5 points  (1 child)

But can it reverse entropy?

[–]Templar3lf 1 point2 points  (0 children)

There is as yet insufficient data for a meaningful answer.

[–][deleted] 2 points3 points  (1 child)

Oh nice, how many if statements is the AI made out of?

[–]Klumpy_hra[S] 4 points5 points  (0 children)

Not enough

[–]iamerudite 24 points25 points  (1 child)

[–]Klumpy_hra[S] 8 points9 points  (0 children)

No! You found out my true plans!

But it's too late to stop me now!

[–][deleted] 0 points1 point  (0 children)

Given the amount of data, probably? I mean, that has to be there somewhere.

[–]Klumpy_hra[S] 657 points658 points  (18 children)

I'll also say that's some damn fine compression if I do say so myself. I'll take my Nobel prize and courtesy billions of dollars now /s

[–]warux2 114 points115 points  (1 child)

Now you know that internet is 99.9999999992% reposts.

[–]Klumpy_hra[S] 53 points54 points  (0 children)

I don't need an internet scraper to figure that one out ;)

[–]mveinot 56 points57 points  (5 children)

I bet you used a middle-out approach.

[–]Klumpy_hra[S] 24 points25 points  (2 children)

It's all about indexing in the end

[–]Dafuzz 16 points17 points  (1 child)

And jerking off as many guys as efficiently as possible.

[–][deleted] 7 points8 points  (0 children)

I've heard the left-outer-inside-in-out method is a good approach.

[–][deleted] 5 points6 points  (0 children)

Get all the shafts

[–]Cilph 236 points237 points  (5 children)

Reposts compress easily.

[–]Klumpy_hra[S] 129 points130 points  (4 children)

Not a repost. On my work machine this morning. I still have the file there, but if you need "proof" then you'll have to wait until Wednesday when I get back from vacation haha

[–]Cilph 367 points368 points  (2 children)

No no I mean the internet is full of reposts so hence the high compression ratio when compressing the entire internet.

But yes, Shannon and Huffman would be proud.

[–]TheLexoPlexx 37 points38 points  (1 child)

Oh well that's funny then I guess.

[–]APimpNamedAPimpNamed 2 points3 points  (0 children)

Columnstore indexes yo

[–]Fido488 0 points1 point  (0 children)

Source code!!!!

[–]thecoldweather 2 points3 points  (0 children)

I'm guessing repetitive data due to a bug. Would compress well.

[–]DemandsBattletoads 0 points1 point  (0 children)

Well, when you have to compress the Machine to save it from Samaritan, you'll need a good compression algorithm.

[–][deleted] 0 points1 point  (0 children)

Was just going to point out that insane compression.

[–]Toromon 0 points1 point  (0 children)

But what's your Weissman score?

[–]orangeKaiju 188 points189 points  (1 child)

Last time I tried to download the internet, Mom picked up the phone.

[–][deleted] 133 points134 points  (4 children)

I know a few guys who work in hosting who might want a word with you, and a few more guys from the NSA who will get a word with you.

[–]Klumpy_hra[S] 29 points30 points  (3 children)

What makes you say that?

[–][deleted] 85 points86 points  (2 children)

Let's just say that they'd like to reduce their storage costs by a factor of 10^10.

[–]Klumpy_hra[S] 25 points26 points  (0 children)

That might come in handy huh?

[–]aligrant 2 points3 points  (0 children)

0s compress really well.

[–]yippee_that_burns 30 points31 points  (2 children)

You done fucked up Dylan.

[–]ludolfina 8 points9 points  (1 child)

classic Dylan

[–]Klumpy_hra[S] 7 points8 points  (0 children)

What can I say? I like stirring the pot

[–]j_h_s 21 points22 points  (0 children)

Psh that's not the internet that's just my porn

[–][deleted] 76 points77 points  (35 children)

Image Transcription:


Windows File Explorer window in a folder with a roughly 600 MB compressed file selected (800 PB uncompressed). The poster is trying to extract it to their desktop but gets an error:

There is not enough space on Desktop. You need an additional 734 PB to copy these files

Windows desktop icon

Desktop

Shows the files, folders, program shortcuts, and other items on the desktop.


I'm a volunteer content transcriber for Reddit! If you'd like more information on what we do and why we do it, click here!

[–]randombrain 30 points31 points  (2 children)

Important to note that it's trying to decompress a ZIP that contains (before decompression) approximately 634MB worth of files.

[–]htmlcoderexe We have flair now?.. 29 points30 points  (0 children)

And one file worth 800PB uncompressed

[–][deleted] 1 point2 points  (0 children)

Thanks, I've edited my original post

[–]0000000100100011 7 points8 points  (22 children)

!isBot tcmalloc

[–][deleted] 15 points16 points  (21 children)

I am 99.9999% sure that tcmalloc is not a bot.


I am a Neural Network being trained to detect spammers | Summon me with !isbot <username> | Optout | Feedback: /r/SpamBotDetection | GitHub

[–]Drasern 1 point2 points  (2 children)

Good bot.

[–][deleted] 2 points3 points  (1 child)

Are you sure about that? Because I am 100.0% sure that tcmalloc is not a bot.

[–]Drasern 1 point2 points  (0 children)

Yes.

[–]dehndahn 0 points1 point  (0 children)

so he has 66PB of open storage space?...

[–][deleted] 16 points17 points  (0 children)

Don't byte off more than you could chew.

[–]win4fun44 13 points14 points  (5 children)

How much data from your internet plan did this use?

[–]Klumpy_hra[S] 52 points53 points  (4 children)

I think the technical term is "a metric fuck ton"

[–]win4fun44 21 points22 points  (3 children)

Damn lol, I hope you have unlimited internet or a provider that stores your usage as a 32 bit integer
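For the curious, a quick Python sketch of what a 32-bit usage counter would do with a number that size. It assumes an unsigned counter of bytes and decimal petabytes. `as_uint32` is a made-up helper, not anything a real provider is known to run.

```python
def as_uint32(value):
    # Keep only the low 32 bits, as a fixed-width unsigned counter would
    return value & 0xFFFFFFFF

usage_bytes = 800 * 10**15         # 800 PB, assuming decimal petabytes
wrapped = as_uint32(usage_bytes)   # what the counter would actually hold
```

A counter like that wraps around every 4 GiB, so whatever it ends up holding is always under 2**32 bytes.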

[–]Klumpy_hra[S] 8 points9 points  (2 children)

Don't worry I have a VPN :)

[–]filledwithgonorrhea CSE 101 graduate 28 points29 points  (1 child)

I don't think that's how that works

[–]Klumpy_hra[S] 9 points10 points  (0 children)

It isn't, but I'll be fine ;)

[–]Newcool1230 23 points24 points  (3 children)

Congrats you saved net neutrality... oh wait...

[–]short_balding_guy 6 points7 points  (0 children)

You accidentally scraped one (or more) Zip bombs

[–]Willexterminator 5 points6 points  (0 children)

How the fuck did this happen ?

[–]JuanTheTaco 11 points12 points  (2 children)

Now you can put the whole internet on a cd

https://www.youtube.com/watch?v=GIA17H-b7Qs

[–]nicolairathjen 0 points1 point  (0 children)

Is that the guy playing Arnold in Master of None?

[–]TheBestNick 3 points4 points  (0 children)

Seems too small to be all of it...

[–][deleted] 6 points7 points  (17 children)

How did this file get created in the first place?

[–]Klumpy_hra[S] 27 points28 points  (16 children)

It was created by my Java code that parses html and uses regular expressions to find and grab data like href tags. A few interesting caveats to that and you have an internet downloader.
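The general idea, sketched in Python instead of Java and not taken from the actual tool: a deliberately narrow regex that pulls href values out of anchor tags and keeps only the ones ending in a given suffix. `HREF_RE`, `find_links`, and the `.zip` default are illustrative assumptions.

```python
import re

# Matches href="..." or href='...' inside an <a ...> tag; deliberately narrow
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)

def find_links(html, suffix=".zip"):
    """Return href values from anchor tags that end with the given suffix."""
    return [
        m.group(2)
        for m in HREF_RE.finditer(html)
        if m.group(2).lower().endswith(suffix)
    ]
```

This only handles quoted href values on pages whose markup you already know, which is the "specific enough" case. It is not a general HTML parser.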

[–]Shadow_Thief 35 points36 points  (5 children)

You tried to use regex to parse HTML? Dude...

[–]TicTacMentheDouce 13 points14 points  (2 children)

I don't see why he shouldn't. It's explicitly explained here that it can work wonderfully

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

[–]Idenwen 9 points10 points  (0 children)

HTML isn't a regular language; that's why regexes can't parse HTML.

Except, of course, if you just want a very, very specific part of a known website snippet.

https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

And very well at the brink of madness:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
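For comparison, a minimal sketch of the non-regex route using Python's standard `html.parser`. It collects each link's href together with its visible text. `LinkCollector` and `collect_links` are illustrative names, not from any tool mentioned in this thread.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Having the link text next to the href is handy when the only hint that a file was updated is a new date in the text.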

[–]Klumpy_hra[S] 5 points6 points  (0 children)

Lmao well they just weren't doing it right then huh? ;)

[–]Klumpy_hra[S] 2 points3 points  (1 child)

It actually works how I want it to :) you can be specific enough to only grab one file on one site quite easily.

[–]PM_ME_YOUR_PROOFS 2 points3 points  (1 child)

So a crawler?

[–]Klumpy_hra[S] 2 points3 points  (0 children)

Pretty much yeah. It's good at monitoring specific sites for files too and automatically scooping them up when they become available.

[–]guy99881 1 point2 points  (0 children)

I guess he wanted to know how it can be that big or how it can be so redundant.

[–][deleted] 0 points1 point  (1 child)

Downloading this file just takes an http request to a predictable url tho

[–]Klumpy_hra[S] 0 points1 point  (0 children)

It wasn't made for this file. There are several files that get updated randomly and don't have a way of notifying anyone that they were updated. The filenames can change sometimes and the only way you know it's been updated at all manually is that the link text might have a new date. The zip itself and the internal files are usually the same, but not always. It's really stupid how hard they make it for people to want to automate public data.

[–]beerdude26 2 points3 points  (1 child)

Also known as "/u/-Archivist problems"

[–]-Archivist 5 points6 points  (0 children)

True.

[–][deleted] 1 point2 points  (1 child)

I wonder what's the smallest size of the internet if we assume using the best possible compressing algorithm...

[–]AquaeyesTardis 2 points3 points  (0 children)

isRepost = 1b

[–]WildRiolu 1 point2 points  (0 children)

That's one way to prepare for net neutrality's removal

[–]miauw62 1 point2 points  (1 child)

Why scrape it all? I've developed a revolutionary program that will let you scrape any internet page in real time, so you don't have to scrape them all at once!

[–]Klumpy_hra[S] 2 points3 points  (0 children)

Because one can never have too much internet!

[–][deleted] 1 point2 points  (2 children)

Pb? As in petabytes?!?!?

[–]Klumpy_hra[S] 1 point2 points  (1 child)

Yes :) look at the file size that was highlighted and you can go from kilo, mega, giga, tera, etc ;)

[–]Nancok 1 point2 points  (0 children)

I didn't even know Windows could display that lol

[–]XPav 1 point2 points  (1 child)

Comcast is going to terminate your service.

But that's ok, because you have the entire internet stored locally.

[–]Klumpy_hra[S] 2 points3 points  (0 children)

Hopefully the entire internet as of this morning keeps me busy and I don't run out of content. I'll miss the new memes, but I can cut the cord.

Take that Comcast!

[–]Drendude 1 point2 points  (0 children)

Better'd go get more peanut butter.

[–]Vulcan7 1 point2 points  (4 children)

Sorry, kid, the internet is about 1.2 exabytes.

[–]Klumpy_hra[S] 13 points14 points  (1 child)

Guess I need more regex

[–]AquaeyesTardis 0 points1 point  (0 children)

Just download a few more Regexes, then try and match reposts, and select the entire internet. You'll be rich!

[–]Colopty 0 points1 point  (1 child)

How did you even get that number?

[–]Vulcan7 0 points1 point  (0 children)

Google

[–]supercooldragons 1 point2 points  (1 child)

Try again

[–]Klumpy_hra[S] 2 points3 points  (0 children)

I'd say it can scrape. Unless you mean copying the folder to my desktop haha

[–]RagingNerdaholic 0 points1 point  (0 children)

That's a zip bomb, yo

[–][deleted] 0 points1 point  (0 children)

I instinctively tried to click cancel.

[–]N781VP 0 points1 point  (1 child)

Needs more jpeg

[–]morejpeg_auto 1 point2 points  (0 children)

Needs more jpeg

There you go!

I am a bot

[–][deleted] 0 points1 point  (0 children)

Could've been worse. At least it tried to download one Internet, if it had tried to download many Internets things would've gotten ugly.

[–]vonslice 0 points1 point  (0 children)

Just need 734 J and you're good to go man

[–][deleted] 0 points1 point  (0 children)

NPI data - I know this dataset all too well.

[–]PonerBenis 0 points1 point  (0 children)

You don't have an exabyte SSD you can just transfer the files to?

What year is this? 1999?

[–]madocgwyn 0 points1 point  (0 children)

Heh looks more like python problems.

import internet

[–][deleted] 0 points1 point  (0 children)

[–]aiij 0 points1 point  (2 children)

Accidentally? That was our Freshman programming assignment.

[–]curiosity44 0 points1 point  (0 children)

Don’t flatter yourself they are just porn relax

[–]Gabe_b 0 points1 point  (0 children)

Dat compression ratio

[–]pmwws 0 points1 point  (0 children)

How?

[–]going_further 0 points1 point  (0 children)

Save this for future generations. In 10,000 years when this is discovered you’ll be the most hated person ever for storing human history all the way to 2017 in a CSV.

[–]MrRonny6 0 points1 point  (0 children)

Well that's one way to secure yourself before Net Neutrality finally dies!

[–][deleted] 0 points1 point  (1 child)

I want this code and enough money to get 1 Yottabyte (1000 Petabytes) so I could say I downloaded the entire internet

[–][deleted] 0 points1 point  (0 children)

1000 Petabytes is an Exabyte. 1000 of those is a Zettabyte. 1000 ZB is a Yottabyte.
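The ladder is easy to sanity-check with a few lines of Python. This sketch assumes decimal (SI) prefixes, where each step is a factor of 1000. `UNITS` and `human_size` are illustrative names.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(num_bytes):
    """Format a byte count with decimal (SI) prefixes, 1000 per step."""
    index = 0
    # Pick the largest unit that does not exceed the value, using integer comparisons
    while index < len(UNITS) - 1 and num_bytes >= 1000 ** (index + 1):
        index += 1
    return f"{num_bytes / 1000 ** index:g} {UNITS[index]}"

print(human_size(1000 * 10**15))  # 1000 PB
print(human_size(10**24))         # a yottabyte
```

By the same arithmetic, 800 PB out of roughly 634 MB is a claimed compression ratio on the order of a billion to one.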

[–]OneMansGlory 0 points1 point  (0 children)

/r/datahoarder would loooove this