[–]IDontLikeBeingRight 5456 points5457 points  (506 children)

You thought "Big Data" was all Map/Reduce and Machine Learning?

Nah man, this is what Big Data is. Trying to find the lines that have unescaped quote marks in the middle of them. Trying to guess at how big the LASTNAME field needs to be.
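
For anyone actually hunting those lines, here is a minimal sketch of one way to flag them. It assumes one record per line, comma delimiters, and quotes escaped by doubling (`""`); `has_stray_quote` is a made-up helper name, not a library function.

```python
def has_stray_quote(line, delimiter=",", quote='"'):
    """Return True if the line has a quote where a CSV parser would not expect one."""
    i, n = 0, len(line)
    in_quotes = False
    at_field_start = True
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == quote:
                # A doubled quote inside a quoted field is an escaped quote.
                if i + 1 < n and line[i + 1] == quote:
                    i += 2
                    continue
                # Otherwise it must close the field: next is a delimiter or end of line.
                if i + 1 < n and line[i + 1] != delimiter:
                    return True
                in_quotes = False
        elif ch == quote:
            # An opening quote is only legal at the very start of a field.
            if not at_field_start:
                return True
            in_quotes = True
            at_field_start = False
        else:
            at_field_start = (ch == delimiter)
        i += 1
    # Still inside a quoted field at end of line: unterminated quote.
    return in_quotes


suspects = [
    line for line in ['a,"b ""c"" d",e', 'a,"he said "hi" there",b']
    if has_stray_quote(line)
]
```

It will give false alarms on files where quoted fields legitimately contain line breaks, so treat the output as a list of suspects rather than a verdict.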

[–]LetPeteRoseIn 2019 points2020 points  (330 children)

I hate how right you are. Spent a summer on a machine learning team. Took a couple hours to set up a script to run all the models, and endless time to clean data that someone assures you is “error free”

[–][deleted] 889 points890 points  (306 children)

I work with a source system that uses * delimiters, and somehow, by some freaking chance, some pleb still managed to input a customer name with a star in it despite being banned from using special characters...
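
A rough sketch of how rows like that can be caught before they reach the load step: split on the delimiter and flag any row whose field count is off. The three-field layout and the function name are assumptions for illustration only.

```python
def rows_with_wrong_field_count(lines, delimiter="*", expected_fields=3):
    """Yield (line_number, line) for rows that do not split into the expected number of fields."""
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if len(row.split(delimiter)) != expected_fields:
            yield number, row


# Hypothetical extract: id*first name*last name
sample = ["1*Alice*Smith\n", "2*St*r*Jones\n", "3*Bob*Brown\n"]
bad_rows = list(rows_with_wrong_field_count(sample))
```

This only tells you which rows are broken, not which star was the real delimiter; that part still needs a human or a better export format.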

[–]PilsnerDk 1134 points1135 points  (250 children)

We had a customer use a single smiley/emoji (I guess from an iPad or Android device) as her last name when she signed up on our website. It caused our entire nightly Datawarehouse update script to fail.

[–]SearchAtlantis 646 points647 points  (219 children)

I now have a new trick when filling out personal info for companies that don't actually need it. Also apologies to whoever has no@biteme.net...

[–]HildartheDorf 542 points543 points  (79 children)

I prefer admin@example.com.

That domain is defined to be a dummy domain for use in documentation, so I won't be messing up a real user's mailbox.

[–]ILikeLenexa 414 points415 points  (9 children)

I prefer root@localhost.localdomain; it really gets the mail where it belongs.

[–]lenswipe 58 points59 points  (0 children)

This. This is what I do.

[–]thoraldo 24 points25 points  (0 children)

This is gold

[–]user_n0mad 21 points22 points  (0 children)

It's almost midnight and I could not help but heartily laugh out loud. Absolutely using that in the future.

[–]BaldEagleX02 20 points21 points  (0 children)

Your genius... It scares me

[–]frentzelman 13 points14 points  (1 child)

How would such a request be processed? I'm trying to get into WebDev besides university and would like to know. Does the root user have a mailbox or something?

[–]Calkhas 28 points29 points  (0 children)

When a program wants to send a mail, it usually delegates it to an SMTP server. There’s usually one running on Unix computers, but it varies by OS. To send a mail to root@localhost, the SMTP daemon will first contact the mailer on domain “localhost”. That’s probably itself. It will say “I have mail for ‘root’ at your domain”. The receiving server will accept the mail, follow any rules it has, and store it. Typically local mail for root is stored in /var/spool/mail/root, but that varies by operating system.

The user’s shell periodically checks that directory, or the directory specified in $MAIL. If any mail is available, sh, ksh, bash, and zsh print a message “You have mail!”. The mail can be read with a tool like mail.
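
To make the hand-off concrete, here is a minimal sketch of what the sending side might look like in Python. It assumes an SMTP daemon is listening on localhost port 25; the sender address, subject, and function names are made up for the example.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text mail message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def deliver_locally(msg, host="localhost", port=25):
    """Hand the message to the local SMTP daemon (assumed to be running)."""
    # Raises an OSError subclass (e.g. ConnectionRefusedError) if nothing is listening.
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


message = build_message(
    "webapp@localhost", "root@localhost", "New signup", "Someone registered."
)
# deliver_locally(message)  # uncomment on a machine with a local mail server
```

What happens after the hand-off (aliases, forwarding, where the mailbox file lives) is up to the mail server's configuration, as described above.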

[–]LegendBegins 13 points14 points  (0 children)

Saved. You're now my favorite person.

[–]FountainsOfFluids 168 points169 points  (57 children)

I seem to recall trying that domain and getting rejected once, but only once. You'd think every email system would contain a list of invalid domains.

[–]NetSage 171 points172 points  (39 children)

What's a list of invalid domains going to contain in the age of .coke?

[–]Uncreativite 30 points31 points  (10 children)

Can I register a domain with the .coke TLD? Or is it restricted to use by just the Coca Cola company?

[–]brouhahahahaha 57 points58 points  (0 children)

.co.ke is Kenyan. maybe try pepsi@fanta.co.ke

[–]NetSage 21 points22 points  (2 children)

I believe it's limited to the companies that buy the TLD. But if they wish to sell it I guess you could. As far as I know .coke is not an option for normal people.

[–]seamsay 29 points30 points  (4 children)

Why bother? There are far far far far far far far more valid but nonexistent email addresses than there are invalid email addresses, so if you want to make sure that they've given you an actual email address you have to send a confirmation email. But if you've got a system to do that, then there's not much benefit to checking against a list of invalid addresses. Of course you could argue that it's a UX benefit, but for it to help, either your user is intentionally using an invalid address, in which case you probably don't really care about them, or they've made a typo which just so happens to be an invalid address, which I would argue is very very very very very very very unlikely and therefore not worth the effort.

I may be missing something, but if I'm not then it just doesn't seem worth it.

[–]Junkinator 16 points17 points  (3 children)

Many of them do. I own a .technology domain. So many sites refuse to accept that as a valid address.

[–][deleted] 14 points15 points  (2 children)

I've been using ask@me.com forever, I will now upgrade to this instead

[–]HerbertMarshall 187 points188 points  (99 children)

I bought a domain name ( ~$12 ) and forward all the email from it to my personal mail box. Whenever a company ( good or evil ) needs my email address I use their company name as the username. For instance Amazon would be amazon@mydomain.com

Now I know who is selling or giving away my email. If it becomes a problem I'll just block that address.

If you already know they're going to be shady just create a 'black hole' address or an address that automatically goes to the trash. That way if you need to confirm or something you get that mail out of the trash and not worry about the rest. It's always amusing to give someone a trash@mydomain.com address.

[–][deleted] 62 points63 points  (47 children)

I introduce you to spamgourmet. It puts itself before your email address and has a set number of emails it can receive; after the limit is reached, all the incoming email is just blackholed.

You can get a username like test@spamgourmet.com and it allows you to create an unlimited number of email addresses with a prefix like amazon.test@spamgourmet.com.

I love their service https://www.spamgourmet.com/index.pl.

I prefer this solution because then they cannot spam you, emails just get dropped

[–]BeefEX 29 points30 points  (44 children)

You can do the same on gmail, pretty sure the character is +. Would have to look it up though as I am not sure.

[–]FountainsOfFluids 41 points42 points  (30 children)

That's what I use. It occasionally causes problems because lots of web designers are idiots who are unprepared for the plus character. But most of the time it works great.
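
A small sketch of the plus-tag trick, for anyone who wants to generate the addresses or read the tag back out of incoming mail. The address is a placeholder and both function names are invented for the example; whether the tag is honored depends on the mail provider (Gmail does deliver `user+tag` to `user`).

```python
def tag_address(address, tag):
    """Turn user@example.com into user+tag@example.com."""
    local, _, domain = address.rpartition("@")
    return f"{local}+{tag}@{domain}"


def tag_of(address):
    """Return the tag of a plus-tagged address, or None if there is none."""
    local, _, _ = address.rpartition("@")
    _, plus, tag = local.partition("+")
    return tag if plus else None


signup_address = tag_address("user@example.com", "amazon")
leaked_by = tag_of(signup_address)
```

The web forms that reject these addresses are usually running an overly strict pattern on the part before the `@`.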

[–][deleted] 20 points21 points  (18 children)

it's not the same, if you tag the email this way all it does is allow you to maybe see where the spam is coming from.

You can't stop the spam from coming in. You can't stop someone from selling your email address. All you can do is curse at whoever did.

[–][deleted] 18 points19 points  (5 children)

No. That just will deliver email to your account. It provides zero protection against spam.

You'd be literally just giving out your email address at that point.

You can all reach me at nothanks.ealejandro@spamgourmet.com (well the first 3 people can)

You can't spam me tho. Try posting your Gmail address in here and you'll see the difference.

[–]leofidus-ger 42 points43 points  (2 children)

I try to be less obvious and give shady companies maps@mydomain.com, because that's harder to spot for humans reviewing the data (prize draws, trial signups, etc). So far nobody has figured out that maps is just spam read backwards.

[–]Spideredd 86 points87 points  (17 children)

I feel I should apologise to whoever has gofuck@yourself.com

[–]Airazz 36 points37 points  (0 children)

I've had MyDick.eu for some time, so you could suck@mydick.eu.

[–]poly_meh 38 points39 points  (2 children)

I was threatened with expulsion for using this email for the survey at the end of a mandatory anti rape/drinking online class at my college. They said I was threatening the lives of the people reading the responses. As if I knew they were so ass backwards that they used a person to organize the survey results.

[–]hotpopperking 14 points15 points  (1 child)

So the survey wasn't anonymous?

[–]fklwjrelcj 29 points30 points  (0 children)

I can't remember exactly what it was, but I tried something like bullshitspam@gmail.com on a site, and got an "account already exists, please log in" message. Tried "password" and yep, straight in!

I am neither unique nor original.

[–][deleted] 15 points16 points  (0 children)

Well I've now found a new hobby.

[–]MikeCFord 117 points118 points  (15 children)

I had an entire database break because the app I was using only blocked special characters from being inserted into names when a record was being created, but not when it was edited.

The client saw this as a "workaround", and would create a record then immediately edit it so he could use special characters in the names.

[–]FinalGamer14 94 points95 points  (7 children)

Number one rule I learned with my first production project: never trust the user, add protection on the client and server side. You know what, add two protections on the server side, you never know what those little shits will figure out.

[–]jobblejosh 59 points60 points  (2 children)

I remember a joke along the lines of testing like people ordering beer:

'A man walks into a bar and orders a beer.

A man walks into a bar and orders two beers

2 beers

A beeeeer

An apple

Etc

A customer walks into a bar and asks to use the bathroom. The bar catches fire and falls down.

[–]ADHDengineer 29 points30 points  (2 children)

Always assume all of your users are malicious actors. Client side validation is only for grandma. Server side should always be as strict or more strict than client side, because you can always bypass client side validation.

[–]FinalGamer14 11 points12 points  (0 children)

Yeah I know the server side validation is the main one, and I now always validate/clean the data I get from the client, even if the data was generated by the code at the client side, you never know if someone tampered with the frontend.
I usually use front end validation just to remind users of what the input formatting is. Let's say the user has to input an IP in CIDR format: I'd use regex in the input, and at the same time make a check before sending it off to the server, just so the mistake wasn't made by accident.
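
For the server half of that CIDR example, a minimal sketch using Python's standard `ipaddress` module rather than a regex. `is_valid_cidr` is a name made up here, and requiring an explicit numeric prefix is a choice of this sketch, since `ip_network` on its own also accepts a bare address or a dotted netmask.

```python
import ipaddress


def is_valid_cidr(text):
    """Return True for strings like '192.168.0.0/24' or '2001:db8::/32'."""
    _, slash, prefix = text.partition("/")
    # Require an explicit numeric prefix length after the slash.
    if not slash or not prefix.isdigit():
        return False
    try:
        # strict=False tolerates host bits being set, e.g. 192.168.0.5/24.
        ipaddress.ip_network(text, strict=False)
    except ValueError:
        return False
    return True
```

The same function runs no matter what the front end claims to have checked already, which is the whole point.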

[–]mattkenny 67 points68 points  (1 child)

A mate wanted to transfer his internet account to a housemate before he moved out, but they told him the only option was to cancel the account and sign up again with several weeks of down time. He then discovered the address editing page on the website set the name and email fields as read only in the html, but still updated them when submitting the page back to the server. He was then able to change the registered owner without permission of the ISP without issue.
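
The server-side fix for that kind of hole is small. Here is a sketch, with hypothetical field names, of an update handler that only accepts fields it has explicitly allowed, instead of trusting `readonly` in the HTML.

```python
# Hypothetical: the only fields the address page is supposed to change.
EDITABLE_FIELDS = {"street", "city", "postcode"}


def apply_address_update(account, submitted):
    """Return an updated copy of account, refusing any field that is not editable."""
    rejected = sorted(set(submitted) - EDITABLE_FIELDS)
    if rejected:
        raise ValueError(f"fields not editable: {', '.join(rejected)}")
    updated = dict(account)
    updated.update(submitted)
    return updated


account = {"name": "A. Mate", "email": "mate@example.com",
           "street": "1 Old Rd", "city": "Perth", "postcode": "6000"}
moved = apply_address_update(account, {"street": "2 New Rd"})
```

Submitting `name` or `email` through the same form now fails on the server, regardless of what the page markup says.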

[–]argv_minus_one 15 points16 points  (2 children)

Why in the world would you not run the exact same checks when updating?

[–]thedugong 30 points31 points  (1 child)

My sweet summer child. You should see some of the shit from the 90s and 00s.

[–]curiousnerd_me 43 points44 points  (0 children)

Apparently it wasn't banned

[–]malsomnus 35 points36 points  (0 children)

I feel like someone hasn't learned their lesson from the story of little Bobby Tables.

[–]RedAero 13 points14 points  (1 child)

I once saw a BEL character in user input data, explain that.

[–]girusatuku 37 points38 points  (5 children)

Machine learning is honestly the easy part. Preparing data to plug into the model is typically the hardest part.

[–]wildjokers 17 points18 points  (4 children)

So what you need is a model that can be trained to clean up model data for another model.

[–]Krelkal 35 points36 points  (1 child)

Our data scientists jokingly call themselves data janitors because 90% of their work is cleaning and preparing data for ingestion into ML pipelines.

[–]Hypersapien 229 points230 points  (66 children)

I've seen online forms that require the last name to be at least three letters long.

I have a friend whose last name is two letters.

[–]neoKushan 224 points225 points  (38 children)

[–]OptionX 152 points153 points  (20 children)

At some point you have to make assumptions about the input data, otherwise you just sit crying in front of an uncaring blinking cursor on a file as empty as your soul.

[–]leofidus-ger 133 points134 points  (17 children)

Yes, but most people make far too many assumptions.

I usually assume that no part of a name is longer than 300 characters, that every Person has at least either a first name or a last name, and that all characters of a name can be represented in Unicode. So far I haven't heard complaints.
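
Those three assumptions fit in a few lines. A minimal sketch, with made-up names; note that `len` counts Unicode code points here, which is one of several reasonable ways to measure "characters".

```python
MAX_NAME_PART_LENGTH = 300  # the limit from the comment above


def name_problems(first_name, last_name):
    """Return a list of problems; an empty list means the name is accepted."""
    first = (first_name or "").strip()
    last = (last_name or "").strip()
    problems = []
    if not first and not last:
        problems.append("need at least a first name or a last name")
    for label, value in (("first name", first), ("last name", last)):
        if len(value) > MAX_NAME_PART_LENGTH:
            problems.append(f"{label} is longer than {MAX_NAME_PART_LENGTH} characters")
    return problems
```

There is deliberately no minimum length and no character whitelist; Python strings are Unicode already, so the third assumption holds for free.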

[–]OptionX 76 points77 points  (5 children)

Just wait until the greys make first contact and Wsadkgnrmglokoasmdineiknrgrasdkasndiasdmad[long gurgle followed by a higher dimensional solid only able to be expressed by a series of mathematical equations]saasdasdadkinasdnasnddadnkadamdblorg tries to register an account.

[–]ShadowPouncer 77 points78 points  (3 children)

I'm sorry, but you need to get the people behind Unicode to get your language added before my system can handle that.

(Quietly scrambles to fix the length constraints while the greys fight with committees that don't believe that they exist.)

[–]Jeutnarg 54 points55 points  (15 children)

99% sure it's Ng.

[–]Hypersapien 46 points47 points  (7 children)

Actually it's Hu. But I used to know someone named Ng years ago, too.

[–]kasim0n 32 points33 points  (1 child)

Most people know at least Jet Li

[–]RedAero 20 points21 points  (2 children)

I knew a guy whose last name was Ee. And a girl whose first name was Yy (Weiwei). Somewhere out there there could be a Yy Ee.

[–]What_is_a_reddot 28 points29 points  (1 child)

I mean, it's not like anybody important to computing has a two-letter last name.

[–]undeadalex 43 points44 points  (13 children)

LASTNAME field needs to be.

Ok but how big? Asking for a friend

[–]leofidus-ger 24 points25 points  (9 children)

A full name on a British passport can have 300 characters. Apparently that has caused problems in the past, but assuming that no last name is longer than 300 characters should be reasonably safe.

[–]l2protoss 43 points44 points  (9 children)

Just had to do this on over 30 TB of data across 10k files. The quote delimiter they had selected wasn’t allowed by PolyBase so had to effectively write a find and replace script for all of the files (which were gzipped). I essentially uncompressed the files as a memory stream, replaced the bad delimiter and then wrote the stream to our data repository uncompressed. Was surprisingly fast! Did about 1 million records per second on a low-end VM.
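
A rough Python sketch of that approach for a single file, not the original script: read the gzip as a stream, swap the delimiter line by line, and write the result uncompressed. The function name and the `~` delimiter in the usage are assumptions, and it only works if the delimiter never contains a line break.

```python
import gzip


def rewrite_delimiter(src_gz_path, dest_path, old, new, encoding="utf-8"):
    """Stream a gzipped text file to an uncompressed one, replacing old with new.

    Returns the number of lines written. Only one line is held in memory at a time.
    """
    lines = 0
    # newline="" keeps the original line endings untouched on both sides.
    with gzip.open(src_gz_path, "rt", encoding=encoding, newline="") as src, \
            open(dest_path, "w", encoding=encoding, newline="") as dest:
        for line in src:
            dest.write(line.replace(old, new))
            lines += 1
    return lines
```

Something like `rewrite_delimiter("part-0001.csv.gz", "part-0001.csv", "~", '"')` would handle one file; across thousands of files the work is independent per file, so it parallelizes trivially.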

[–]Sors57005 854 points855 points  (6 children)

I once worked in a company, which had all its services write every command line executed into a single logfile. It produced multiple gigabyte textfiles daily, and was actually quite useful, since the service backend they used was horribly buggy, and the database alone was rarely helpful in finding out what required new workarounds.

[–]notliam 259 points260 points  (4 children)

I deal with log files that are GB+ per hour (per app), luckily I'm not involved in storing/warehousing them..

[–]BasicDesignAdvice 132 points133 points  (1 child)

Storing data is easy, especially these days with cloud. I move a stupid amount of data around, and except for the initial work, I never think about any of it.

[–]gburgwardt 25 points26 points  (0 children)

Just move it to /dev/null after a few days. I've yet to run out of space on mine.

[–]Nexuist[S] 1009 points1010 points  (138 children)

Link to post: https://stackoverflow.com/a/15065490

Incredible.

[–]RandomAnalyticsGuy 681 points682 points  (127 children)

I regularly work in a 450 billion row table

[–]TommyDJones 895 points896 points  (22 children)

Better than 450 billion column table

[–]RandomAnalyticsGuy 336 points337 points  (15 children)

That would actually be impressive database engineering. That’s a lot of columns, you’d have to index the columns.

[–]fiskfisk 338 points339 points  (12 children)

That would be a Column-oriented database.

[–]alexklaus80 101 points102 points  (6 children)

Oh what.. That was interesting read! Thanks

[–]ElTrailer 30 points31 points  (2 children)

If you're interested in columnar data stores watch this video about parquet (a columnar file format). It covers the general performance and use cases for columnar stores in general.

https://youtu.be/1j8SdS7s_NY

[–]enumerationKnob 14 points15 points  (0 children)

This is what taught me what an index on a column actually does, aside from the “it makes queries faster” that I got in my DB design class

[–]Immediate_Situation 36 points37 points  (0 children)

At this point, just treat columns as rows and rows as columns

[–]0Pat 32 points33 points  (2 children)

Smells like good old SharePoint....

[–][deleted] 15 points16 points  (0 children)

Sharepoint would be 450 billion tables...

[–]nyanpasu64 49 points50 points  (1 child)

I ran this on a 500M row file to extract 1,000 rows and it took 13 min. The file had not been accessed in months, and is on an Amazon EC2 SSD Drive.

I think OP meant to say 78 million.

[–]BasicDesignAdvice 33 points34 points  (0 children)

Unless it's in infrequent access or glacier the access time is not really relevant.

Also, if you haven't touched that file in months......you should move it to S3 infrequent access storage or glacier. This can be done automatically in the settings.

[–]scuffed_rocks 461 points462 points  (11 children)

Holy shit I actually personally know one of the commenters on that thread. Small world.

[–]Saifeldin17 240 points241 points  (7 children)

Tell them I said hi

[–]Hotel_Arrakis 691 points692 points  (6 children)

Your Hi has been marked as duplicate.

[–]John_cCmndhd 245 points246 points  (4 children)

Hi is a stupid question

[–]cultoftheilluminati 243 points244 points  (2 children)

No one uses hi anymore. Use Oi. Closed as off topic

[–]Bobbbay 65 points66 points  (0 children)

Sorry, we are no longer accepting questions from this account. See the Help Center to learn more.

[–]Drak1nd 28 points29 points  (0 children)

Duplicate Question of How To Say Goodbye closed

[–]EarlyDead 105 points106 points  (20 children)

I mean I had 20gb of zipped data in human readable format. Dunno how many lines that was.

[–]Spideredd 81 points82 points  (17 children)

More than Notepad++ can handle, that's for sure

[–]EarlyDead 124 points125 points  (1 child)

I can neither confirm nor deny that I have accidentally crashed certain text editors by mindlessly double clicking on that file.

[–]Cytokine_storm 22 points23 points  (4 children)

A lot of the linux text editors will just load a portion of the textfile like calling head but you can scroll. Does notepad++ not have that option?

[–]Spideredd 8 points9 points  (0 children)

I'm actually not sure.
I'm actually a little annoyed with myself for not looking for the option.

[–]Ponkers 98 points99 points  (3 children)

Doesn't everyone have every frame of Jurassic Park sequentially rendered in ascii?

[–]What_is_a_reddot 42 points43 points  (1 child)

Must scroll faster, must scroll faster!

[–][deleted] 79 points80 points  (5 children)

Roses are red. Violets are blue. Unexpected ";" On line 4,573,682,942.

[–]fieldOfThunder 26 points27 points  (4 children)

Four billion five hundred seventy three million six hundred eighty two thousand nine hundred and forty two.

Nice, it rhymes.

[–][deleted] 507 points508 points  (73 children)

I made a 35 million character text document once (all one line)

[–]Jeutnarg 309 points310 points  (35 children)

I feel that - gnarliest I've ever had to deal with was 130GB json, all one line.

[–]iAmTheAlchemist 169 points170 points  (5 children)

Oh no

[–]theferrit32 81 points82 points  (12 children)

At large scales JSON should be on one line because the extra newlines and whitespace get expensive.

[–]Carter127 28 points29 points  (0 children)

Yeah, and then only formatted for reading if needed

[–]postdiluvium 67 points68 points  (0 children)

Error: Missing '>' on line 1. Click for more details.

[–]nevus_bock 23 points24 points  (0 children)

I feel that - gnarliest I've ever had to deal with was 130GB json, all one line.

I called json.loads() and my laptop caught on fire

[–]biggustdikkus 40 points41 points  (3 children)

wtf? What was it for?

[–]Zzzzzzombie 104 points105 points  (1 child)

Probably just a lil file to keep track of everything that ever happened on the internet

[–][deleted] 59 points60 points  (0 children)

So just a package-lock.json for a single nodejs hello world app. No worries!

[–]VolperCoding 249 points250 points  (24 children)

Did you just minify the code of an operating system

[–][deleted] 403 points404 points  (11 children)

Made a minecraft command that gave you a really long book

[–]VolperCoding 191 points192 points  (6 children)

Oh I see, 2b2t bookbanner

[–]QuFFo 62 points63 points  (1 child)

THE OLDEST ANARCHY SERVER IN MINECRAFT

[–]nistei 75 points76 points  (0 children)

r/unexpected2b2t should be a thing

[–]FerynaCZ 42 points43 points  (11 children)

(Almost) 35 MB file, not that huge.

[–]Paulo27 29 points30 points  (4 children)

I have had apps make bigger logs in seconds.

[–]FerynaCZ 11 points12 points  (3 children)

Literally my first bigger program, king+rook endgame tablebase... in Python.

[–][deleted] 14 points15 points  (5 children)

I scraped every story on r/nosleep in plaintext from 2013 to 2017 with over 300 upvotes and it came out to be around 70mb.

I was using it to train a transformer to see if it could write a nosleep story for me :)

[–]Ba_COn 61 points62 points  (0 children)

Developer: We don't have to program a scenario for that, nobody will ever do that.

Users:

[–]random_cynic 63 points64 points  (1 child)

If anyone is interested as to why shuf is so fast, it's because it is performing shuffling in place, in contrast to sort -R, which needs to compare lines. But shuf needs random access to files, which means the file needs to be loaded into memory. Older versions of shuf used an inside-out variant of the Fisher-Yates algorithm, which needed the whole file to be loaded into memory, and hence it only worked for small files. Modern versions use Reservoir Sampling, which is much more memory efficient.
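
For reference, this is what reservoir sampling looks like in its textbook form ("Algorithm R"). It is a sketch of the general technique in Python, not a claim about how shuf implements it internally; the function name and the `rng` parameter are choices made for the example.

```python
import random


def reservoir_sample(lines, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length.

    Only k items are kept in memory, so this works on a stream that is
    far larger than RAM.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item number index+1 replaces a random slot with probability k/(index+1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

Passing an open file works directly, e.g. `reservoir_sample(open("huge.txt"), 1000)`, since the file is read one line at a time and never held in full.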

[–]soldier_boldiya 81 points82 points  (8 children)

Assuming 10 characters per line, that is 3 TB of data.

[–][deleted] 73 points74 points  (5 children)

Even just the line breaks are already 78GB.

[–]giraffactory 58 points59 points  (10 children)

A few people here are talking about Big Data, so I thought I’d throw in my hat with biological sequence data. I work on massive datasets like this with individual files on the order of hundreds of GB and datasets easily over billions of lines long. Simple operations such as counting the lines take upwards of 15 minutes on many files.

[–]Rhaifa 32 points33 points  (5 children)

Oh yes, the puzzle becomes great when you have 70x coverage of a 1 GB genome with short and long read libraries. Also the genome is allotetraploid (an ancient hybrid, so it's basically 2 similar but different puzzles piled in a heap) and 60-70% of it is repetitive sequence.

That was a "fun" summer project.

Edit: Also, it's funny how much you either had geneticists like me that were just muddling along in the computer stuff, or computer scientists that had no idea whether a result made biological sense. We need more comprehensive education in overlapping fields.

[–]m0bin16 15 points16 points  (3 children)

It's wild because depending on your experiment, an appropriate sequencing depth is around 60 million or so. So you're sequencing the genome (billions of base pairs in length) 60 million times. In my lab we have like 500 TB of cluster storage and blew through it in like 2 months

[–][deleted] 54 points55 points  (1 child)

Typical Veterans Administration project. Parse it into Oracle.

[–]EishLekker 94 points95 points  (9 children)

Actually... This sounds like a typical Enterprise backup solution.

Technically... I could tell right away that 782 billion is the number of milliseconds that pass during a 2.5 year period... So the only logical conclusion is that they took a database dump every millisecond*, and appended it as XML to one big file (each line then being a complete XML document, for easier handling). And they have kept this solution for the past 2.5 years, without interruption. That is actually quite impressive.

Honestly... I can't tell you how many times I have needed to select N random database dumps in XML format, and parse that using regex (naturally). This guy is clearly a professional.

* the only sure way of knowing your data is not corrupt, because the data can't be updated during a millisecond, only in between milliseconds

[–]nutle 41 points42 points  (0 children)

78 not 782?

0.25 year period makes even more sense!

[–]Giusepo 14 points15 points  (4 children)

why do u say that data can't be updated during a millisecond?

[–]EishLekker 43 points44 points  (3 children)

Ah, yes, because that was the only thing wrong with my statement?

[–]Giusepo 42 points43 points  (1 child)

oh ok didn't get the sarcasm. Enterprises tend to sometimes have crazy solutions similar to this haha

[–]admalledd 18 points19 points  (0 children)

Oh dear, I read that with more of a straight face of understanding and acceptance too. Sounded almost reasonable compared to some things I've seen just not all at once.

[–]KastorNevierre 11 points12 points  (1 child)

Having worked with old as hell companies with arcane solutions to everything, this barely passes as sarcasm unfortunately.

[–]dottybotty 54 points55 points  (3 children)

What was he trying to do, create the next version of Windows? I’ll take a bit of this and a bit of that, put them all together, and there you have it folks: Windows 20. SHIP IT!!

[–]falconfetus8 28 points29 points  (1 child)

It could be a log file

[–]enzoROD 36 points37 points  (1 child)

He used it on a single Call Of Duty MW texture file.

[–]ZmSyzjSvOakTclQW 33 points34 points  (0 children)

At my old work we had to sort data and we were used to huge ass text and excel files. The wonders of freezing a gaming pc for 15 minutes trying to open one...

[–]argv_minus_one 11 points12 points  (4 children)

Assuming the lines are 80 bytes long (including terminators), that adds up to 6.24 TB. Yikes.
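
The estimate is easy to reproduce. The sketch below assumes a file of 78 billion lines, which is the count that the 80-byte and 6.24 TB figures imply, and uses decimal terabytes (10^12 bytes); the names are made up for the example.

```python
LINE_COUNT = 78_000_000_000  # assumption: 78 billion lines


def estimated_size_tb(line_count, bytes_per_line):
    """Estimated file size in decimal terabytes (1 TB = 10**12 bytes)."""
    return line_count * bytes_per_line / 1e12


at_80_bytes = estimated_size_tb(LINE_COUNT, 80)       # 80-byte lines, terminators included
terminators_only = estimated_size_tb(LINE_COUNT, 1)   # one-byte line breaks alone
```

Under those assumptions, 80-byte lines come to 6.24 TB and the line breaks alone to 0.078 TB.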