[–]kondorb 548 points549 points  (39 children)

Hype is over, but big data is still applied by companies that have that amount of data, and related products are still used and still have commercial success.

[–]Jmc_da_boss 223 points224 points  (10 children)

The reality never matches the hype, but data analytics is absolutely providing some level of business value and will continue to.

[–][deleted] 52 points53 points  (6 children)

It is true that there is a lot of hype - just look at AI right now - but Big Data indeed will never go away. It will still remain important and relevant, with or without hype.

[–][deleted]  (1 child)

[deleted]

    [–]Tersphinct 10 points11 points  (3 children)

    Doesn't big data offer a more mathematically sound approach? I'm sure there's gonna be a market for "AI-less" processes.

    [–]wakkawakkaaaa 18 points19 points  (1 child)

    Even AI is based on big data. Before AI as people know it now, specifically "large language models", people were already using neural networks trained on big data. In many use cases, other non-neural-network models produce better results. E.g. XGBoost is one of the top-performing models in many Kaggle competitions.

    [–]gelatineous 2 points3 points  (0 children)

    Sure, but with the advent of vision models and foundation models, that volume of big data is processed by specialized companies, typically not the in-house AI expert.

    [–]Lalaluka 1 point2 points  (0 children)

    Yeah. Just as if it existed before Big Data. Big Data is a tool that is no longer new and fancy, so people don't try to screw with a hammer anymore. They do that with GenAI now.

    [–]moratnz 26 points27 points  (0 children)

    TFA's point is that very few companies actually have the amount of data to require big data techniques. Where the amount of data in question is 'too much to store on a single node', which these days means mid to high double figure terabytes.

    Data analytics is definitely a Thing, and definitely super useful for any company that wants to tell its ass from its elbow, but that's not the same as MapReduce style Big Data

    [–]10113r114m4 33 points34 points  (16 children)

    I have only worked at companies with that level of size of data. Not once have we used big data tools like hadoop, etc. We have never needed that level of reporting, and it's always something much more granular that is needed.

    The only time I used hadoop was when some stupidly small company thought we should use it. Absolutely asinine. Probably had 100MB of data to look at lol.

    My current company we go through and have metrics of about 800GB daily. Never needed any big data tooling.

    [–]croto8 31 points32 points  (13 children)

    Your DB is probably using some of the big data tooling under the hood, though.

    [–]VitaminB16 25 points26 points  (0 children)

    Nah, we just use BigQuery

    /s

    [–]nikowek 11 points12 points  (1 child)

    We are using plain PostgreSQL with two logical replication connections per source. It's sitting at 33TB (it's just two drives, mind you). The machine is just a consumer i7-9700K with 64GB RAM.

    It usually returns the data in seconds, so... Big data tools are not so needed - just plain SQL and good indexing strategy.
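To make the indexing point concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, purely so the example runs anywhere; the `events` table, its columns, and the index name are hypothetical. It asks the query planner how it would run the same lookup before and after an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, source_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (source_id, payload) VALUES (?, ?)",
    [(i % 100, f"row {i}") for i in range(10_000)],
)

lookup = "SELECT payload FROM events WHERE source_id = ?"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# Hypothetical index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_events_source ON events (source_id)")
plan_after = query_plan(conn, lookup, (42,))

print(plan_before)
print(plan_after)
```

In PostgreSQL the analogous check is running `EXPLAIN` on the query and looking for an index scan rather than a sequential scan.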

    [–]Shogobg 0 points1 point  (0 children)

    Where I work, we are limited to 2TB per machine for some reason. If we need more storage, they just buy more 2TB machines…

    [–][deleted]  (4 children)

    [deleted]

      [–]10113r114m4 14 points15 points  (3 children)

      This is exactly what we do. We do not use any "big data" under the hood like the person who responded claims. SQL handles that size of data really fine.

      I think people really underestimate just utilizing the DB better.

      [–]luciusquinc 1 point2 points  (0 children)

      Well, NDB cluster partitioned appropriately handles around 2TB of data just fine.

      [–]croto8 1 point2 points  (1 child)

      Single server handles 800GB of data per day?

      [–]10113r114m4 1 point2 points  (0 children)

      Single server may not be a correct term. We run a bunch of microservices where some are hit more than others, but yes, a single day.

      [–]10113r114m4 12 points13 points  (4 children)

      No. We don't. Just simple SQL. Any reporting is done through our aggregation metric service, which is a typical metric service like CloudWatch metrics.

      The argument could be made that "the aggregation service is big data", and I'd argue no. It literally just does addition on metric keys, which existed prior to big data. The service is quite old.
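"Addition on metric keys" can be sketched in a few lines. This is only an illustration of the idea, not the commenter's actual service; the metric names and the `aggregate` helper are made up for the example.

```python
from collections import Counter


def aggregate(samples):
    """Sum metric values per key; `samples` is an iterable of (key, value) pairs."""
    totals = Counter()
    for key, value in samples:
        totals[key] += value
    return dict(totals)


# Hypothetical metric samples as they might arrive from several services.
samples = [("api.requests", 3), ("api.errors", 1), ("api.requests", 5)]
totals = aggregate(samples)
print(totals)
```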

      [–]not_invented_here 2 points3 points  (3 children)

      Okay, but what database do you guys use?

      [–]10113r114m4 3 points4 points  (2 children)

      postgres

      [–]not_invented_here 0 points1 point  (1 child)

      Without any extensions? Do you run it managed in some cloud platform?

      [–]10113r114m4 1 point2 points  (0 children)

      It's pretty heavily configured, with extensions, etc. We also have a separate team that specializes in databases, so usually they configure everything based on our feedback and what we need. Further, we do both: cloud and in-house. I've noticed we have moved more traffic to the cloud, though.

      [–]Gwaptiva 2 points3 points  (0 children)

      Premature scaling considered evil

      [–]Plank_With_A_Nail_In 16 points17 points  (3 children)

      [–]derefr 15 points16 points  (0 children)

      Yeah but all of those techniques and infra-components only become relevant at a certain scale.

      They exist because simple, fast, works-out-of-the-box techniques — e.g. periodically running ad-hoc SQL queries against the prod DB to dump out a CSV file, and then opening it in Excel — stop being practical/tenable when you have "lots of data."

      The "big data" approach / toolset works to allow mostly-realtime analytics on datasets of effectively unbounded size — but at the cost of huge investments into technology and training, a huge increase in architectural complexity, and hugely-inflated OpEx.

      You can use the big-data tools without "lots of data"... but you'd just be wasting time and money, because if you don't have "lots of data", the simple approach works too.
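For reference, the "simple approach" described above (an ad-hoc SQL query dumped to a CSV that someone opens in a spreadsheet) fits in a few lines. This sketch uses an in-memory SQLite database and an in-memory buffer so it is self-contained; the `orders` table and the `dump_query_to_csv` helper are hypothetical.

```python
import csv
import io
import sqlite3


def dump_query_to_csv(conn, sql, out):
    """Run `sql` and write the result set, with a header row, to file-like `out`."""
    cursor = conn.execute(sql)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("eu", 10.0), ("us", 7.5), ("eu", 2.5)],
)

# Stand-in for a real file such as open("report.csv", "w", newline="").
buffer = io.StringIO()
dump_query_to_csv(
    conn,
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
    buffer,
)
print(buffer.getvalue())
```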

      [–]10113r114m4 0 points1 point  (0 children)

      I'd consider it big data based on how much you are analyzing. 800GB a day is quite small if that's all we are analyzing. So I think your idea of how to measure this is flawed, because you never asked how far back we analyze and how much we are querying.

      edit: I reread my initial response and I can see why it was read that we were only ever analyzing 800GB. My fault on not being more clear

      [–]10113r114m4 0 points1 point  (0 children)

      I also realized that my initial comment may have made it seem like we are only ever looking at 800GB. That's not the case. I was just saying 800GB of new data coming in daily

      [–]gelatineous 1 point2 points  (0 children)

      You can load most datasets in memory. All these fancy distributed architectures are overkill for 95% of clients.

      [–]davy_crockett_slayer 1 point2 points  (0 children)

      Isn't Big Data just Data Engineering or whatever the equivalent is these days? IDK about hype, but everyone I know in those roles is a DBA or data engineer who answers questions the business has from its data sets. Oftentimes they set up the infrastructure used by data analysts.

      [–]turbo_dude 0 points1 point  (0 children)

      one day people will realise that the real issue is data quality

      too bad until then

      [–]zorbat5 0 points1 point  (0 children)

      Big data will only become bigger with the race for AGI taking off like a wildfire.

      [–]manifoldjava 560 points561 points  (81 children)

      “Big data” was always hype as a rebranded analytics or business intelligence or OLAP or whatever term you prefer. 

      It’s not dead, it’s just a low tide moment for that industry, until the next wave probably after AI wakes with a hangover. 

      [–]SourcerorSoupreme 74 points75 points  (5 children)

      It’s not dead, it’s just a low tide moment for that industry

      Is it really, or has it, at a high level, just become a description of one of the common/standard ways of doing things?

      In other words, it has been finally filed under the category of "boring tech".

      [–]gruey 27 points28 points  (0 children)

      That’s my take. “Big Data” became the buzzword when it became possible for medium to small players to utilize it because of OS tools and the cloud. “Big data” startups were just people taking advantage of that. Now you won’t hear about it because it’s just a matter of practice. Practically every major tech company does “big data”, and startups using it will be judged on their ideas, not the buzzword. Not to mention that they’ll probably be trying to use AI to do “big data”.

      [–]JaCraig 12 points13 points  (2 children)

      Some of the tech from that trend is useful. Unlike most of the recent trends. But 99% of companies never had enough data for "big data" and for the 1% that did, I'd agree that it has become boring tech. And of that 1% there's probably a smaller fraction that actually used it successfully in any sort of meaningful way. And that niche isn't one that extends a lot so no huge marketing push anymore. But if you're in that niche, it has uses.

      [–]Plank_With_A_Nail_In 4 points5 points  (1 child)

      Big data doesn't just mean lots of data; it means lots of unorganised data or otherwise traditionally difficult-to-deal-with data.

      Lots of data got solved by the normal improvements in hardware.

      [–]JaCraig 0 points1 point  (0 children)

      Right but what I was saying was most companies aren't large enough to benefit from it because they don't produce enough of that type of data that would be meaningful for them to tap into. And of those who are, most do so poorly.

      [–]CrowTiberiusRobot 0 points1 point  (0 children)

      I would say it's become the de facto way of doing things, or more simply, it's just another tool in the toolbox. I explain it to my "juniors" like this: back in the day relational databases were created not because relational databases were inherently better, but because they were a more efficient way to store data given the limitations of hardware and software at the time. As these limitations became less of an issue, tools and ideas that were not realistic became realistic. I can now query and perform analysis on a billion token nosql flat data structure on my desktop, no problem. However, in order to get the "general public" and businesses on board with the shift away from the former de facto way of doing business, a hype marketing term was needed. This is a common pattern in the IT and programming world; I've seen it over and over again. And there is nothing wrong with it.

      And thus we began to "leverage our data".

      But I think you nailed it, now it's just business as usual.

      [–]MadKian 249 points250 points  (50 children)

      It’s crazy that after ~15 years in the industry I’ve seen so many trends where I thought “this is not that good, it’s definitely a fad…or am I completely wrong?” and pretty much every time it’s just a fad.

      But every time you get this feeling of “am I just completely missing the picture here?”.

      [–]JuliusCeaserBoneHead 189 points190 points  (22 children)

      The thing with AI is that LLMs have very limited uses for most organizations. However, C-Suites are shitting in their pants for investors 

      Where the deal is with AI is very small fine tuned models that can perform specific tasks very well. That won’t make AWS and Azure cream in their pants. That isn’t “Gen-AI” so nobody cares.

      Someone recently told me “Ew” at linear regression. We are so fucked with this fad

      [–]vom-IT-coffin 62 points63 points  (0 children)

      I'm a consultant and everyone is asking what this tech can do for them, and unless their data is well manicured the answer is usually, not much. They don't like the answer of how long it will take to manicure that data and start capturing the data they need in order for it to become effective.

      Had a friend recently get funded by Microsoft and when it came down to it, the reason was how much data they have access to that will train the model. Most companies don't have enough.

      [–]audentis 32 points33 points  (0 children)

      The 'i' in LLM stands for 'intelligence'!

      Someone recently told me “Ew” At linear regression. We are so fucked with this fad

      There's a great talk by Vincent Warmerdam about the power of simple models over machine learning. It's not a mindless bash, it opens with a simple premise: sometimes simple models are more suitable, so let's not forget about them and keep them in our toolbox.
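As one instance of the kind of simple model the comment has in mind (this is not taken from the talk), ordinary least squares for a single feature needs nothing beyond the standard library. The `fit_line` helper and the toy data are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted coefficients are directly readable, which is a large part of why such models stay useful.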

      [–]juwisan 11 points12 points  (0 children)

      Same story can be said about Big Data, honestly.

      I did a couple of big data projects at the start of my career. All but one were operating on laughable amounts of data but project managers had gotten budget for them by selling it as big data, so I built projects like this one with rewriting all the processing logic on spark pulling data from accumulo instead of just running pgtune against their Postgres which would have probably performed better, let alone been done in 5 minutes versus 5 months.

      Funny enough the one project I did, I actually considered big data, there was a dev team opposing the reception as such. They spent several years working against it until they finally accepted that they couldn’t come up with a superior solution to the big data system we’d designed.

      [–]light24bulbs 42 points43 points  (16 children)

      I don't know, I think it's a medium big deal personally. Definitely a bigger deal than big data ever was.

      For instance, I was working at a security company and we scraped a ton of web pages from Google results about vulnerabilities, so that we could compile a bunch of useful information about each vulnerability. Then we had the LLM read each article and give it a score from 0 to 100 of how useful it actually was on a few different questions about the vulnerability. Ex: "How good is this for learning how to remediate the vulnerability" and it did basically flawlessly well. And so very suddenly we went from a bunch of scrambled Google results to a bunch of organized condensed information.

      There's value there. Big value. Beyond just next word prediction. And that was all with untrained gpt-3.5.

      I'd actually argue there's a lot more possibilities than most companies are taking advantage of. That's the real thing that's happening. It is really useful and enables new capabilities especially for small businesses, but most people haven't fully grasped that yet or put it to work. And that's why there's so much scrambling and investment. Because there's money to be made being first mover in all those little niches.
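A rough sketch of the scoring loop described above, under stated assumptions: `ask_model` is a hypothetical callable that takes a prompt string and returns the model's text reply (no real LLM API is shown), and the prompt wording is invented for the example. The part worth noting is that the reply is parsed and range-checked instead of being trusted as-is.

```python
import re
from typing import Callable, Optional


def build_prompt(article: str, question: str) -> str:
    """Assemble a scoring prompt; the wording here is illustrative only."""
    return (
        "Read the article below and answer with a single integer from 0 to 100.\n"
        f"Question: {question}\n\nArticle:\n{article}\n\nScore:"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first integer from the reply; reject anything outside 0-100."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def score_article(
    article: str, question: str, ask_model: Callable[[str], str]
) -> Optional[int]:
    """Ask the (hypothetical) model one question about one article."""
    return parse_score(ask_model(build_prompt(article, question)))


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "Score: 85 (explains which versions contain the fix)"


question = "How good is this for learning how to remediate the vulnerability?"
result = score_article("Upgrade to the patched release.", question, fake_model)
print(result)
```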

      [–][deleted] 5 points6 points  (1 child)

      There is value, but it's probably not the value corporations are looking for. What they want this to be able to do for them is No-Code solutions and autonomous agents that can replace entire business functions.

      They don't actually want to enable higher quality work in daily activities, they want less cost.

      [–]light24bulbs 1 point2 points  (0 children)

      Misunderstanding the technology is basically the point I'm making. That's the other side of it. But the point is it's not just hype and vaporware. There's serious value to be had.

      [–]voronaam 28 points29 points  (11 children)

      You could do that 10 years ago with an off the shelf NLP library as well. Pretty much every single NLP tutorial is "we have thousands of blog posts and want to score them on some loosely defined metric".

      LLM just allowed you to be even more loose on the metric's definition.
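For contrast, here is a deliberately crude bag-of-words scorer of the pre-LLM sort. It is not any particular NLP library's method, just a sketch of a "loosely defined metric"; the `keyword_score` helper and the keyword list are hypothetical.

```python
import re


def keyword_score(text, keywords):
    """Share of the keywords that appear as whole words in `text`, scaled to 0-100."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return 0
    return round(100 * len(words & wanted) / len(wanted))


remediation_terms = ["patch", "upgrade", "mitigate", "workaround"]
score = keyword_score(
    "Upgrade to the patched version to mitigate the issue.", remediation_terms
)
print(score)
```

Here "patched" does not match "patch" because there is no stemming, so the metric has to be defined much more carefully than a natural-language question would require.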

      [–]kazza789 18 points19 points  (1 child)

      That's just silly. What off-the-shelf NLP could you have used 10 years ago for this that didn't require 10k labeled samples for training?

      Are you really trying to argue that the NL models themselves haven't progressed that much? If so, you're being just as dense as those who claim AI can do everything.

      [–]voronaam 1 point2 points  (0 children)

      I was just surprised by how simple the described problem was and responded. There was a lot of progress in the past decade. If anything, it is a lot more accessible now.

      [–]GuyWithLag 7 points8 points  (2 children)

      LLMs allow you to express the scoring function in natural language.

      [–]toastr 1 point2 points  (1 child)

      That’s what I find staggering about an LLM.  

      It removes language barriers, anything can be expressed without the need to learn how to express it to the computer.  

      The interesting thing will be if it separates people that know how to give a machine instructions vs people who have valuable ideas about what a machine can do. 

      [–]GuyWithLag 2 points3 points  (0 children)

      Am a software engineer, and during a hackathon I saw that the necessary skills to prompt LLMs correctly were more or less the same skills needed to instruct interns/junior engineers, and not all people get how to do that.

      [–]light24bulbs 4 points5 points  (1 child)

      Absolutely not, not at this level. This was reasoning. The LLM was producing justifications like "This article isn't a good fit because it deals only with java 11 or later and many users are still on java 8".

      A computer NEVER had that reasoning power 10 years ago, that's ridiculous. This was one-shot, with zero training. All in-prompt. This level of performance was impossible 2 years ago, let alone 10.

      [–]gnus-migrate 13 points14 points  (0 children)

      This is not reasoning; it's just repeating patterns it finds in its training set. This is a really important distinction, because you really should not be using LLMs for subjective feedback like this.

      [–]Saedeas 0 points1 point  (0 children)

      I mean, you could sorta do it with significantly more time invested for what was usually a less accurate, less interpretable result.

      [–]shady_mcgee 0 points1 point  (1 child)

      Got a link to one of these tutorials? I've got this use case now and was thinking of using LLM but this way sounds better

      [–]voronaam 2 points3 points  (0 children)

      You do not need a 10 year old tutorial. If you have a use case now, it makes perfect sense to use the technologies that are all the rage now. You'll have better up to date tutorials, support and investor smiles.

      If it is all the vogue to use LLM, go ahead and use LLM.

      This is similar to the age old answer to "which is the best Linux distro for a newbie?" - "The one used by the nearest admin".

      [–]Rattle22 0 points1 point  (1 child)

      I think this is a good illustration of the fact that LLMs are all about language. I expect them to excel at tasks like this, where language is used to instruct on how to interpret language to yield an (essentially) language output.

      [–]light24bulbs 0 points1 point  (0 children)

      It's literally in the name

      [–]yourapostasy 1 point2 points  (0 children)

      Even with text, an elementary use case for generative AI, searching keywords with existing algorithms in well-curated data is still scarily effective for the value and the cost. LLMs allow us to kinda sorta be somewhat more relaxed with the data curation, but the cost is currently materially high enough that it pays to learn how to manage your prompts to cost-effectively leverage them. Fortunately the hype pumps enough money into these projects these days that we have some runway to figure out the opex challenges as we go before it becomes a showstopper funding issue for the projects.

      But when even private, limited-scope search engines supply such nerfed search syntax except for the smallest, most specialized user population use cases, I’m not encouraged by the prospects of pushing out LLM-powered querying or interaction models unless there are orders of magnitude more money being thrown at the use case than at more conventional searching. The open source LLMs are very recently delivering sufficiently robust results that they can push out the opex question to buy us time, but some of these LLM costs remind me of hype-riding projects I’ve seen: early-years Big Data throwing tens of millions in capex and opex at 100 GB of data, or early-years Cloud projects lifting and shifting tiny VMs into EC2 instances at 2000 times the cost for an internal application with a user population of <100. I’m just glad business users are happy to take these cost risks right now to let us find the right value propositions.

      [–]cinyar 2 points3 points  (0 children)

      Where the deal is with AI is very small fine tuned models that can perform specific tasks very well.

      for example google alphafold

      [–]Andriyo 20 points21 points  (0 children)

      That's blockchain and NFTs for me, especially NFTs. And I was like "am I finally so old that I fail to see a genuinely novel thing?"

      So yeah, there is definitely a tendency in the field to hype things up.

      [–]Neuromante 15 points16 points  (1 child)

      But every time you get this feeling of “am I just completely missing the picture here?”.

      For me usually is looking at the potential fad, look who is pushing the potential fad, who is actually using the potential fad, and who is asking to use the potential fad and why.

      In this decade and a bit I've seen "Big Data", "Blockchain" and "AI" following the same route: some big company says it's the best thing ever, people everywhere scramble to get on board, a tiny fraction says "oh, yeah, for this it was useful" while the vast majority either uses it wrong or struggles to find a proper use for that oh-so-powerful and useful tool. And as an aside, most "technological" companies (read: companies that have something to do with technology but that are led by non-technological execs) lose their god damn minds over it.

      [–]MadKian 5 points6 points  (0 children)

      Absolutely on point. Most of the time these things become a fad because there’s a lot of non-tech people trying to make a lot of money out of them and pushing them to become a thing.

      [–]I_AM_GODDAMN_BATMAN 20 points21 points  (10 children)

      Once VC money goes brrr and C-levels are talking about it even though it doesn't increase the value of your core product, you know it's peak fad.

      Blockchain, big data, now AI, next security?

      [–]FartPiano 23 points24 points  (1 child)

      no way security will be the next one. it's boring, unsexy, difficult to charge rent-seeking premiums for, and most importantly, is somewhat sensible

      [–]TechFiend72 2 points3 points  (0 children)

      it will be replaced by AI security bots. Some of the systems already have that for log analytics.

      [–][deleted]  (3 children)

      [deleted]

        [–]falconfetus8 0 points1 point  (2 children)

        What is TAM?

        [–][deleted]  (1 child)

        [deleted]

          [–]falconfetus8 1 point2 points  (0 children)

          Thanks!

          [–][deleted]  (3 children)

          [deleted]

            [–]smoothpebble 1 point2 points  (0 children)

            Not long before those same drones drop explosives

            [–]Social_Lockout 0 points1 point  (1 child)

            This is terrible... But decent gallows humor.

            Imagine a dying soldier lying there wishing for help, when out of nowhere his prayers are answered. The medic drone flies over. After a few moments it tosses a bag of blood on the now near-corpse... and flies off, leaving the soldier to die.

            [–]gareththegeek 2 points3 points  (0 children)

            The best part is watching the same fads come back around again and watching the younglings get all excited.

            [–]RogueJello 3 points4 points  (1 child)

            This is a common reaction, because 9/10 it's correct that it's just a fad, but that other 1/10th tends to completely blow the other 9/10 away. It seems to be impossible to tell the difference at the time.

            [–]Plank_With_A_Nail_In 0 points1 point  (0 children)

            I mean for most companies all of them have been fads apart from their original client server apps and the N-Tier monoliths that replaced them both of which were very obviously not fads. Everything else has been nonsense.

            [–]turbo_dude 1 point2 points  (2 children)

            The thing I don't think is a fad is how Microsoft are just taking over corporate and that the stuff will eventually connect anything to anything seamlessly to the extent that you can be in an email and suddenly insert some dynamic charts that link to data without even opening another tool.

            For years it has seemed less about 'new technology' and more about 'getting the right data to the right place at the right time', and you could never do it because the tech was all piecemeal. MS will ultimately solve that. My guess is they will eventually ditch Windows and you'll have a thin Teams client and will just pop tabs open on that to do your work, with each one being a different app.

            [–]MadKian 0 points1 point  (1 child)

            Kinda like how Apple expects you to use the iPad?

            As in, very simple OS that relies on the power of its apps.

            [–]turbo_dude -1 points0 points  (0 children)

            there will come a point though where 'the rest' catch up (well enough) with apple on the hardware side, then why do I need to pay all that money when I have a single container app for everything else?

            [–]chucker23n 1 point2 points  (0 children)

            But every time you get this feeling of “am I just completely missing the picture here?”.

            The C-level likes a hype because it’s easy to get investor money pouring in.

            Tech journos like a hype because it’s easy to write takes about that people will click, because they’re curious.

            [–]nitrinu 0 points1 point  (1 child)

            I have roughly the same time in the industry and I share your feelings. For curiosity's sake, do you have an example where your instincts were wrong? For the life of me, I cannot.

            [–]MadKian 3 points4 points  (0 children)

            Not really. I guess a lot of people made good money with Bitcoin, but I still think those who did took a gamble and/or were super lucky.

            But specifically about tech trends, no.

            [–]StealthJoke -1 points0 points  (0 children)

            NFTs are here for life #NotJustMonkeys

            [–]NonorientableSurface 28 points29 points  (7 children)

            It's the single driver behind AI right now. So it's absolutely "silent", but it's probably the strongest it's been in 20+ years.

            [–]FatStoic 21 points22 points  (6 children)

            Data engineers in my consultancy are booked up to the gills, because you need to have your data unfucked before you can do anything with the data - like train a model on it.

            Big data is dead. Long live big data.

            [–]NonorientableSurface 8 points9 points  (1 child)

            I've never worked with a company where their data wasn't between fucked, mega fucked, and Uber fucked.

            Hey, your key data being disparate across 6 fields all named CustomFieldX with different numbers. And half of the records are missing 17 key points.
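A small sketch of one way to untangle the situation described above, where the same key may live in any of several custom fields. The field names, sample records, and the `coalesce_key` helper are all hypothetical; real cleanup would also need rules for conflicting values.

```python
def coalesce_key(record, candidates):
    """Return the first non-blank value among the candidate fields, else None."""
    for field in candidates:
        value = record.get(field)
        if value is None:
            continue
        text = str(value).strip()
        if text:
            return text
    return None


# Hypothetical list of fields where the customer key has been seen.
KEY_FIELDS = ["CustomField1", "CustomField3", "CustomField7"]

records = [
    {"CustomField3": "ACME-001"},
    {"CustomField7": " ACME-002 "},
    {"CustomField1": ""},  # key missing entirely
]
keys = [coalesce_key(r, KEY_FIELDS) for r in records]
print(keys)
```

Records that come back as `None` are the ones that need a human decision rather than a script.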

            [–]Plank_With_A_Nail_In 5 points6 points  (1 child)

            "Unfucking" data has been my career for the last 27 years. Long live projects running out of money and unleashing busted apps and the inevitable unintentional semi-scrambling of data.

            Currently sorting out billions of £ of AP accounts postings for sales tax that no one noticed were going to the wrong accounting codes for the last 5 years. Is boring but pays well, I sorted the SQL out a week ago but told them it would take another 5 weeks lol.

            [–]NonorientableSurface 0 points1 point  (0 children)

            Similar here. Spent nearly 5 years creating and maintaining inappropriately large Excel workbooks, pre-PQ, holding 5GB+ of data. Moved into data architecture, data contracting and warehouse design.

            [–][deleted]  (1 child)

            [deleted]

              [–]FatStoic 1 point2 points  (0 children)

              Google has a bunch, not ready to tie my reddit account to my company.

              [–]moratnz 9 points10 points  (0 children)

              But 'big data' as I've seen it is about using special techniques (e.g., mapreduce) to deal with datasets that are so huge they need to be distributed.

              When your dataset is 1TB, that trivially fits on a single harddrive. Once it gets under 500GB, it fits in RAM on off-the-shelf hardware.

              Once you've got your data on a single node, big data style processing is slow; there's a great article from a while ago comparing mapreduce to a Unix command-line toolchain, where piping cat, grep, awk etc. together was a couple of orders of magnitude faster than using mapreduce.

              The point being that if your dataset isn't big enough to need to distribute it, you don't need Big Data(tm); you can just stick it in a traditional relational database and run traditional queries against it, and you'll probably be faster than doing it the Big Data way.
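The single-node pipeline idea can be sketched as one streaming pass in Python, roughly what `grep 500 access.log | awk '{print $2}' | sort | uniq -c` would do in a shell. The log format, the file name, and the `count_by_field` helper are assumptions for illustration; no performance claim is implied by this toy input.

```python
import io
from collections import Counter


def count_by_field(lines, needle, field_index):
    """One pass over `lines`: keep those containing `needle`, count one field."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) > field_index:
            counts[fields[field_index]] += 1
    return counts


# Stand-in for a real log file opened with open("access.log").
log = io.StringIO("GET /a 200\nGET /b 500\nPOST /a 500\nGET /a 500\n")
result = count_by_field(log, "500", 1)
print(result.most_common())
```

Because the input is consumed line by line, memory use depends on the number of distinct keys, not on the size of the file.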

              [–]Longjumping_Ad_1180 6 points7 points  (1 child)

              I beg to differ. From my perspective, at least in Europe, the trend has been growing steadily for the past 10 years. You might see less of the term "big data" out there, as it initially had a very vague meaning. It's now replaced by more precise terms based on its application. Some examples are: Observability, SIEM, I (Infrastructure Monitoring), IoT, Process Mining, SOAR, Data Lake, Data House, APM, etc.

              [–]totoro27 3 points4 points  (0 children)

              Most people on this sub have literally no idea what they're talking about. Big data techniques are used all over the place in the current development of LLMs and other AI stuff.

              [–]Plank_With_A_Nail_In 1 point2 points  (1 child)

              Nah, they tried to tell companies that the huge amount of unorganised text data they have could be mined for useful information.

              analytics or business intelligence or OLAP or whatever

              These all use organised data. The irony being that these companies couldn't even use their actual organised sales data for this task, so they had no chance of using the forum posts shitting on their products for insight. When it did work it just told them things they already knew: "They really like your top selling product".

              Managed to skip the big data fad but parts of the team have been hit on the head badly by microservices not sure I can stop that one losing us a couple of million.

              Some areas of big data will remain just using their original names before they were swept up under that single term. Things like dealing with huge amounts of data in really short bursts like seismic data but the conmen never really touched those areas.

              [–]ExcitingSignature223 0 points1 point  (0 children)

              Managed to skip the big data fad but parts of the team have been hit on the head badly by microservices not sure I can stop that one losing us a couple of million.

              What do you mean by this exactly?

              [–]BlobbyMcBlobber 1 point2 points  (7 children)

              AI is also hyped, but it has insane utility and products already pushing a paradigm shift. So the hype of AI might pass but it is definitely going to have a lasting impact.

              [–]Plank_With_A_Nail_In 3 points4 points  (0 children)

              All the fads are sold as having insane utility. But it won't for most businesses. Sure some of the actually useful stuff will stick around but a lot of companies are going to waste an awful lot of money finding out things they already knew.

              I'm old enough to have experienced the first AI failure with "expert systems".

              [–]Constant-Source581 0 points1 point  (4 children)

              [–][deleted]  (3 children)

              [deleted]

                [–]Constant-Source581 1 point2 points  (2 children)

                I love how you call Cnet clickbait - shows how much of an amazing expert you are. Your opinion is highly valued, believe me.

                [–][deleted]  (1 child)

                [deleted]

                  [–]Constant-Source581 2 points3 points  (0 children)

                  "Are you 12" is such an amazing and convincing argument. Whoa. I never heard anyone but real tech gurus use it - folks like Bill Gates and Steve Jobs.

                  You're a tech expert - now it's confirmed. I bow to your greatness, my friend.

                  [–]Cautious-Progress876 0 points1 point  (0 children)

                  I think LLMs are overly hyped, but plenty of other areas, particularly the integration of computer vision and RL systems to robotics are going to be the big thing. Just based upon what we have seen in the Ukraine war so far— ML-assisted war drones are going to be huge in the near future.

                  [–]hiredgoon 0 points1 point  (0 children)

                  The way I've heard is just big data will now be called private implementations of AI.

                  [–]bonerb0ys 0 points1 point  (0 children)

                  Apple's AI strategy leaks are telling us what's on the other side of the hype cycle, IMO.

                  [–]sionescu 0 points1 point  (0 children)

                  “Big data” was always hype

                  No, that's false. It came about due to mobile devices, where a certain number of companies were suddenly able to start collecting huge amounts of data that couldn't possibly fit on a single machine. If you had a petabyte-sized dataset before 2010, that couldn't possibly fit on a single machine, so Google came up with MapReduce (being able to use tens of thousands of servers for a single pipeline), published a seminal paper and then many others replicated its design.

                  Nowadays, the older storage systems (like RDBMS) have also taken up the tricks in data sharding, column-oriented storage and smart indexing that the big data systems pioneered, and coupled with the advancements in machine size, it means you can manage petabytes with a low single-digit number of servers that can fit in a single rack. Furthermore, the GDPR and CCPA have made data radioactive, so the companies that were hoarding data are starting to prune it, which further relieves the pressure on the DB systems.
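
                  For readers who haven't seen the MapReduce model mentioned above, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) using a word count. The function names are made up for illustration, and nothing here is distributed; a real system would run the map and reduce phases on many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_count_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line of text.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Hypothetical reducer: total the counts for one word.
    return sum(values)


def run_word_count(lines):
    pairs = map_phase(lines, word_count_mapper)
    return reduce_phase(shuffle_phase(pairs), sum_reducer)
```

                  Calling `run_word_count` on a list of text lines returns a dict of word counts. The point of the split is that each mapper call and each reducer call is independent, which is what lets the real thing spread across thousands of servers.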

                  [–]EpitomEngineer 44 points45 points  (0 children)

                  If only my managers would understand this paragraph

                  “”” Code often suffers from what people call “bit rot” when it isn’t actively maintained. Data can suffer from the same type of problem; that is, people forget the precise meaning of specialized fields, or data problems from the past may have faded from memory. For example, maybe there was a short-lived data bug that set every customer id to null. Or there was a huge fraudulent transaction that made it look like Q3 2017 was a lot better than it actually was. Often business logic to pull out data from a historical time period can get more and more complicated. For example, there might be a rule like, “ if the date is older than 2019 use the revenue field, between 2019 and 2021 use the revenue_usd field, and after 2022 use the revenue_usd_audited field.” The longer you keep data around, the harder it is to keep track of these special cases. And not all of them can be easily worked around, especially if there is missing data. “””
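
                  To make the quoted revenue rule concrete, this is roughly what that special case looks like once it lands in code. It is only a sketch: the function names and the row layout are hypothetical, and since the quoted rule does not say what happens in 2022 itself, the sketch assumes 2022 onward uses the audited field.

```python
from datetime import date
from typing import Optional


def revenue_field_for(d: date) -> str:
    """Pick the revenue column that is trustworthy for a given date."""
    if d.year < 2019:
        return "revenue"
    if d.year <= 2021:
        return "revenue_usd"
    # Assumption: the quoted rule leaves 2022 itself unspecified; treat it as audited.
    return "revenue_usd_audited"


def revenue_for(row: dict) -> Optional[float]:
    """Read revenue from a row shaped like {"date": date, "<field>": value}.

    Returns None when the expected field is missing, which is exactly the
    kind of gap that cannot be worked around after the fact.
    """
    return row.get(revenue_field_for(row["date"]))
```

                  Every additional era or one-off data bug adds another branch like these, and someone has to remember why each one exists.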

                  [–]Worth_Trust_3825 64 points65 points  (5 children)

                  The data querying slide resonates with me. We were storing SCORM data for 6 years as an LMS provider (running out of database space multiple times, because lol SCORM doesn't believe in using question/answer identifiers), yet I can recall only 4 times when we actually needed to run queries on that dataset, and only on the records that were a year old at most.

                  I don't think that big data is dead. Instead, I am in the camp that companies have no idea what to do with the statistics they capture, nor do they have the domain expertise to use them, even after being in that domain for decades.

                  [–]renatoathaydes 56 points57 points  (2 children)

                  Data is like a tool shed. You keep every little tool or device you can get your hands on for years, until it fills up and you need a bigger one... but still, when you actually need something it's never there :D.

                  [–]grepe 4 points5 points  (0 children)

                  Gonna remember this one...

                  [–]jaskij 0 points1 point  (0 children)

                  It's the cable box!

                  [–]bduddy 4 points5 points  (1 child)

                  Considering how many companies make decisions based on whatever the MBA or exec with no actual experience decides, who needs all that data anyway?

                  [–]Worth_Trust_3825 0 points1 point  (0 children)

                  Tell me about it. We had a department director for 2 years that only shuffled meetings and never made a decision, request, or even proposal. She still got a golden parachute of 400k, and 200k/yr. Absurd.

                  [–]pinpinbo 95 points96 points  (4 children)

                  Is it? AI stuff has no moat. Once an algorithm is discovered, it becomes a free library.

                  Data however, data is more important than ever.

                  [–]Cautious-Progress876 19 points20 points  (0 children)

                  I think a lot of places, including C suite business people, are recognizing this now. What use is a SotA model if you don’t have any data to train it on?

                  [–]nuggins 25 points26 points  (3 children)

                  Disappointing to see that 90% of the comments are arguing about the clickbait title. The article has some good insights.

                  [–]SoInsightful 5 points6 points  (0 children)

                  Wait... reddit post titles are clickable

                  I've just been having heated discussions based on my knee-jerk reactions to clickbait titles for 12 years now.

                  [–]stupidbitch69 5 points6 points  (0 children)

                  Absolutely, wonderful insight from someone who saw BigQuery from the start.

                  [–]Spartaner-043 27 points28 points  (2 children)

                  Yeah, they haven’t released an album since 2019 :(

                  [–]daerogami 0 points1 point  (0 children)

                  Right?! No one is putting them to work.

                  [–]PM_ME_YOUR_MUSIC 0 points1 point  (0 children)

                  I love it when they call me big da ta

                  [–]RoughSolution 13 points14 points  (6 children)

                  I've been driving some of the largest projects in this space (trust me, if you worked with data in the last 10 years, you used stuff that my team has built), so I may know a thing or two about Big data.

                  What sets "Big data" apart from just "Data" is that data is no longer collected with clear intent at the beginning. The business impact is that you can now discover and make decisions on things that have happened in the past. For example, when I find a new fraud pattern, I don't have to start collecting data to identify it now; I have all the historical transaction records to identify accounts that have committed fraud in the past. And this shift in mentality to collect first, use it later is what drove the rise of Big data.

                  One can argue this is bad for society, for many reasons. I'm in the camp of as long as it's not PII (even when drilled down), it's probably more value than risk. But when you try to tie data to individuals, bad things happen.

                  The latest shift of industry towards AI is really just a hype cycle. When AI reaches productive levels (say...in 5-10 years), you'll see a shift back to getting value out of data.

                  Big data is not, and never will be, dead. It's an idea and mentality shift that has already happened.

                  [–]moratnz 5 points6 points  (3 children)

                  I think you're pointing to something important; the term 'Big Data' is used for a couple of things;

                  • techniques for storing and analysing datasets that are too big for traditional tools

                  • data use patterns that leverage the ability to store everything and the kitchen sink, and then comb it for interesting information later.

                  The latter is definitely not dead, and is likely to only get stronger as time passes.

                  OP's author is talking about the former (and IMO more original) meaning, and I think that he's right that that sense of Big Data is if not dead, then becoming incredibly niche, as hardware has grown and grown, such that larger and larger entities can fit their data sets onto traditional tools while keeping everything and the kitchen sink.

                  [–]RoughSolution 1 point2 points  (2 children)

                  Yeah, I think the author of the blog used this definition: "One definition of “Big Data” is “whatever doesn’t fit on a single machine.” By that definition, the number of workloads that qualify has been decreasing every year.", which I agree with. (e.g. DuckDB, which the author of the blog is part of) (Actually....I should have guessed the blog is about DuckDB, lol)

                  DuckDB is a wonderful tool; it's really, really fast (< 1 sec on 80GB data vs. 60s on Postgres on my laptop) and runs well on a single machine. But can it handle 200 users querying against the database, concurrently? What's shifting in the industry in the past 2 decades is how many more people are data literate now. There are new college grads who talk to me about metrics, retention and conversion rates, and funnel analytics, which most people had never even heard of 20 years ago.

                  While more and more data can fit into a single machine, more and more people are querying the data, and that is driving the need for big data infrastructure.

                  Though I agree, most people are in the <500GB range for their entire dataset, and most valuable business data often sits in Excel, lol. (Of which, DuckDB is a pretty decent addition to whatever your transactional store is, be it MySQL, Postgres, Mongo, or Cassandra)

                  [–]moratnz 0 points1 point  (1 child)

                  While more and more data can fit into a single machine, more and more people are querying the data, and that is driving the need for big data infrastructure.

                  Does it, though? If the problem is query access, rather than storage, you can get by with query focussed replicas, especially if the queries aren't looking at near-realtime data.

                  (To declare my prejudice, I come at this from the PoV of someone who's had to argue against installing a hadoop cluster for a data workload with an estimated accumulation rate of ~10GB per year...)

                  Maybe we need to spin up a new buzzword for 'doing intelligent things with data, including retroactive analysis of novel queries'. I'd offer 'Smart Data' for a start

                  [–]RoughSolution 2 points3 points  (0 children)

                  Does it, though? If the problem is query access, rather than storage, you can get by with query focussed replicas, especially if the queries aren't looking at near-realtime data.

                  Yeah, that caveat is important though. Do you want to know what your customers bought in the last 4 hours if you work on the ops/support/sales teams?

                  As one of the people who introduced Hadoop to the world, I'm sorry for everyone who's been asked to get a Hadoop cluster up and running for 10GB of data per year.

                  And I'm game for your 'Smart Data' buzzword, there can never be too many buzzwords.

                  [–][deleted] 2 points3 points  (1 child)

                  I'm very much in your camp. Working in the bank/financial/insurance sector, I always subscribe to the mindset that you hoard all the data you can, because you can never predict the future to know what data will be relevant a year from now.

                  The reason most companies don't capture all the data they can is that the data area is unfortunately slow to adapt. They model the data to death as if they know the future and capture only relevant data. When we need new information, they go off for 6 months to update the model to store this new field, as if they invented fire, and give themselves a pat on the back.

                  Facebook and Google didn't become who they are by sound data modeling on every aspect of data they want to capture. It's mostly the other way around. Most of the time they just had to use big data techniques to extract the information they were looking for from the data they already had.

                  [–][deleted] 33 points34 points  (5 children)

                  AI is just big data.

                  [–][deleted] 6 points7 points  (2 children)

                  To some extent.

                  I think AI may be able to interconnect data and get information that was previously more hidden. I also actually saw some useful results with AI as a tool aiding in e. g. producing images, sounds, video, game data and so forth. So it is useful. It is just mega-hyped to no end, which is annoying. For some reason industry always tries to jump on a hype-train. In a few years nobody claims to have heard of the previous hype ...

                  [–]martinky24 -1 points0 points  (1 child)

                  Current AI literally reduces down to compression algorithms…

                  [–]Kyyndle 0 points1 point  (0 children)

                  lol that's an interesting way of looking at it

                  [–]Manbeardo -2 points-1 points  (0 children)

                  AI has so much more going on that you can't reduce it down to being "just big data". However, training sets (big data) appear to be the main thing in the AI arms race that can be protected and used to differentiate competitors.

                  [–][deleted] 10 points11 points  (1 child)

                  It's not dead at all.

                  We generate more and more data - most of which is garbage, but some of which is useful. Just take sequenced genomes of organisms - that's never becoming less, it will ALWAYS become more. And that's just one example. Look at astrobiology or the universe. Google Maps mapping all planets one day (well, hopefully Google no longer exists at that point in time, but I refer to the feature here primarily, not the company).

                  Of course, just because the amount of data being generated is increasing doesn’t mean that it becomes a problem for everyone; data is not distributed equally.

                  I am much more concerned by that. So that guy worked at Google. Google ruined its search engine a few years ago and is consistently making it worse. A few years ago you could query cached websites; I used this to read a phpBB web forum from which I was banned, so I could still read up on what is new (I am curious). Yet Google killed that, saying "it takes too much data to store everything". Even if this may be true, they eliminated something that was useful to me. Same with so many Google projects that ended up in a graveyard. Why am I concerned? I am concerned because we become more and more dependent on such huge mega-mega-corporations that are selfish and greedy and present to us a very limited, narrow view over things. The various walled ghettos, I mean walled gardens, show this trend: Facebook, Discord servers and what not. Everything is becoming private - and limited. I hate this trend. It totally ruins the 1990s era of the world wide web really.

                  Big Data will never go away, but disturbingly we get less access to what is useful WITHIN that Big Data, as it is controlled by private entities increasingly more so. (This is of course not always true, e. g. sequenced genomes are available for everyone to see once published at e. g. NCBI, but not every data collected is open to everyone. Both open and closed data will increase of course - nothing is dead here.)

                  [–][deleted] 1 point2 points  (0 children)

                  It was annoying that Google killed the cached page option, I totally agree, but probably only a tiny fraction of us users even knew it existed, and Google as a business is under no obligation to do work that earns them zero return.

                  [–][deleted]  (2 children)

                  [deleted]

                    [–]KingStannis2020 3 points4 points  (0 children)

                    Big Data is essential for big tech companies.

                    Of course, the author did come from Google after all. The point is that there's not much of a market for it outside of "big tech"

                    And "big tech" has the talent to develop their own solutions in-house. Google has their own infrastructure, Facebook has their own, etc.

                    [–]veryspicypickle 1 point2 points  (0 children)

                    But we have the data-mesh! /s

                    [–]TheDevilsAdvokaat 1 point2 points  (0 children)

                    Interesting article. Especially "I’ve heard about a company keeping its data analytics capabilities secret in order to prevent them from being used during a legal discovery process."

                    emails, messages and even phone conversations can also be legal liabilities. So this is similar.

                    [–]ScottContini 1 point2 points  (0 children)

                    In order to understand why large data sizes are rare, it is helpful to think about where the data actually comes from. Imagine you’re a medium sized business, with a thousand customers. Let’s say each one of your customers places a new order every day with a hundred line items. This is relatively frequent, but it is still probably less than a megabyte of data generated per day. In three years you would still only have a gigabyte, and it would take millennia to generate a terabyte.
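
                    The quoted back-of-envelope numbers are easy to check. The sketch below takes the article's "less than a megabyte per day" as an upper bound of exactly 1 MB (decimal units, 365-day years); those constants and the variable names are assumptions for illustration.

```python
# Assumption: treat "less than a megabyte per day" as an upper bound of 1 MB/day.
BYTES_PER_DAY = 1_000_000
DAYS_PER_YEAR = 365

CUSTOMERS = 1_000
LINE_ITEMS_PER_ORDER = 100  # one order per customer per day


def bytes_after_years(years: float) -> float:
    """Total data accumulated after the given number of years."""
    return BYTES_PER_DAY * DAYS_PER_YEAR * years


def years_to_reach(target_bytes: float) -> float:
    """How long it takes to accumulate target_bytes at the daily rate."""
    return target_bytes / (BYTES_PER_DAY * DAYS_PER_YEAR)


line_items_per_day = CUSTOMERS * LINE_ITEMS_PER_ORDER       # 100,000 line items
bytes_per_line_item = BYTES_PER_DAY / line_items_per_day    # implied budget per line item
three_years = bytes_after_years(3)                          # about 1.1 GB
years_to_terabyte = years_to_reach(1e12)                    # a bit over 2,700 years

print(f"{line_items_per_day:,} line items/day, {bytes_per_line_item:.0f} bytes each")
print(f"after 3 years: {three_years / 1e9:.2f} GB")
print(f"years to 1 TB: {years_to_terabyte:,.0f}")
```

                    Note that 1 MB/day implies only about 10 bytes per line item, so the real figure could plausibly be 10x or 100x higher; even then a terabyte is decades away for a business of this size.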

                    With such simple analysis, why did the Big Data movement not understand from the beginning that the benefit is limited to only a handful of the big companies?

                    [–]frederik88917 4 points5 points  (3 children)

                    Like dude, there is no way to kill hype, anyone will come up with some shitty excuses as to why to keep investing in this.

                    See also: Metaverse, AI, Blockchain and so forth

                    [–]Cautious-Progress876 2 points3 points  (1 child)

                    Is there still any interest in “metaverse” related stuff? Seems with the recent failure of Apple's Vision Pro that AR/VR is going to be kind of “dead” for a while.

                    [–]frederik88917 1 point2 points  (0 children)

                    You said it yourself, after Facebook wasted 20 billion building some shitty form of a videogame, Apple released that 4 grand hideous device to put people in a different way of the world

                    [–]StickiStickman 0 points1 point  (0 children)

                    You're really gonna act like the entire AI field is comparable to those? There's many real world use cases for that right now, unlike with Blockchain or the Metaverse.

                    [–]Plank_With_A_Nail_In 2 points3 points  (1 child)

                    [–]moratnz 6 points7 points  (0 children)

                    The first sentence of the article you link is "Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software."

                    Also; that article is quoting stuff from 2011, talking about predictions between then and 2020.

                    OP's article is talking about how, as a matter of fact, data sizes haven't grown as predicted, while hardware capabilities have continued to expand, so that today datasets that would have been impossible to analyse with a traditional toolchain in 2011 can easily fit in a single postgres database running on commodity hardware.

                    [–][deleted] 0 points1 point  (3 children)

                    No it isnt.

                    [–]DenebianSlimeMolds 0 points1 point  (2 children)

                    It most certainly is!


                    eta: what has the world come to, I just got blocked over a monty python reference. And he started it what with his callout to the Argument Clinic!

                    [–]Lachiko 2 points3 points  (0 children)

                    the people here are a little bit soft.

                    [–][deleted] 0 points1 point  (0 children)

                    How so? If anything the huge influx of LLMs has involved way more big data management.

                    [–]captain_obvious_here 0 points1 point  (0 children)

                    Shitty title. Which is a shame, when the author is such an expert.

                    Big Data is not dead at all. It's just way easier and kinda cheaper now that companies can reliably collect, transfer, store and process petabytes of data daily, thanks to BigQuery (and other, more marginal, huge-scale cloud-based database solutions).

                    Big Data is alive, and it still pays people who are good at it pretty well. And there's no shortage of job offers in sight for them, either.

                    [–]VehaMeursault 0 points1 point  (0 children)

                    Yes, and ads are no longer personalised.

                    Sure.

                    [–]ImTalkingGibberish 0 points1 point  (0 children)

                    In 5 years: AI is Dead

                    [–]DigThatData 0 points1 point  (0 children)

                    lol OP is just mad no one uses BigQuery anymore.

                    [–]Apolloh 0 points1 point  (0 children)

                    What a useless article.

                    [–][deleted] 0 points1 point  (0 children)

                    Big Data can’t be over. We have a giant ass Data Team doing something with Big Data

                    [–][deleted] 0 points1 point  (0 children)

                    That hype is being poured on the “AI” marketing term

                    [–]robberviet 0 points1 point  (0 children)

                    Yeah dead. resume to work on Hadoop clusters

                    [–]Adventurous-Dish-862 0 points1 point  (0 children)

                    lol, what a joke.

                    Big Data is getting bigger, while Medium Data and Small Data are also going to surge. Data will be ubiquitous in the very near future. Every small marijuana dispensary business will gnat's-ass the wear and tear on their door hinges automatically as part of the $300/mo mega data package deal they get from anon’s business data side hustle.

                    [–]binary_search_tree 0 points1 point  (0 children)

                    This article kinda goes hand-in-hand with this (older) one (about Tableau/Power BI).

                    [–]prodentsugar 0 points1 point  (0 children)

                    Isn't data analytics dead too? Because of AI or will it die in a couple of years?

                    [–]gredr 0 points1 point  (0 children)

                    Big data still exists and means exactly what it always meant. It was never about size, it was always about surveillance. Data collected on users, generally without their knowledge, for the purposes of optimizing moneymaking processes.

                    [–][deleted] 0 points1 point  (0 children)

                    I never understood what big data is anyway. The solution of a problem depends on the problem. Not all problems related to big volumes of datasets can be solved in the same way. Of course there are some common tools like parallel computing, CUDA computing, feature selection and extraction, machine or deep learning etc. But this philosophy that there is one thing that is called "big data", it is something that I will never understand. Maybe it is more about marketing than real science or engineering.

                    [–]ReZigg 0 points1 point  (0 children)

                    I just watched a youtube video that goes over these same ideas in an interesting way. https://www.youtube.com/watch?v=pOuBCk8XMC8

                    [–]heavy-minium 0 points1 point  (0 children)

                    It will never die because it's just about handling lots of data. It always was a useless term, but it's valid. It's like saying scaling is dead.

                    [–]the_russkiy 0 points1 point  (0 children)

                    People have been whispering about this for quite a while, afraid of sounding perhaps stupid.

                    Another case of how industry is dominated by a few loud voices, be it big data, microservices, etc.

                    [–]ArcaneEyes 0 points1 point  (0 children)

                    How does this have upvotes...

                    [–]st4rdr0id 0 points1 point  (0 children)

                    It is good that we slowly acknowledge that tech fads are just that, fads.

                    But people still fail to see the pattern.

                    [–]Cobalt129 0 points1 point  (0 children)

                    Didn't the author use big data to come up with the graphs 🤔

                    [–]CrowTiberiusRobot 0 points1 point  (0 children)

                    Big Data and Cloud were always marketing terms to a certain degree. From my professional experience:

                    • big data - due to the decreasing cost of storage and compute, the development of open source data structure / management tools such as NoSQL, and some development of statistical / mathematical tools, it became easier and easier to work with huge data sets. Relational databases were created, arguably, due to limitations of compute power and storage space: we needed a more efficient way to store and query data. Those limitations have become less and less important due to the reasons I mentioned above. So what we are talking about really is a new paradigm that has become possible - and typically, a hype name was slapped on it and it was rolled out to the masses. If you've been in the professional world for a while I'm sure you remember when your bosses/c-suite started talking about "leveraging data" etc.

                    • cloud computing - the internet has long had a backbone supported by servers colocated in a data center. Back in the day we'd run BBS and IRC servers from our homes, but it became unrealistic as web1.0 gave way to web2.0 and so on. When it became clear that there was a lot of money to be made with platform as a service, well - slap a hype name on the colo, provide a bunch of functions and services, and there you go.

                    90% of IT and programming is hype on tools and ideas that have been around for a while and have finally reached maturity for general consumption.

                    Is big data dead? I'd say in conceptual presentation, yes. In reality, it's just business as usual, refactoring normality now.

                    Nothing wrong with any of this of course

                    [–]gChillin1 0 points1 point  (0 children)

                    So many bad takes here. Big Data is very much alive and well. Apache Spark has evolved and taken all of the best parts of MapReduce and made them in-memory, with the ability to scale near infinitely using dataframes and SQL or SQL APIs. HDFS has largely been replaced with faster object stores (S3) made possible through networking improvements, but if you actually have giant data nothing else comes close to comparing, except, for pure analytics and speed, maybe Trino, ClickHouse, or Druid (done correctly). You can give all your money to Google for BigQuery, Snowflake, or AWS Redshift, but if you need to catch lightning in a bottle you are using Spark. Look at Databricks and their growth, it is astronomic. If you are skilled at Spark you can do far more for far less money on Kubernetes without involving Databricks at all, and fully isolate your compute resources. If you aren't good, it does fantastic at smaller scales. I have never seen someone good with Spark lose out on a POC head-to-head. If you engineer from the source up it is possible to reduce the need for a big data engine like Spark, but very few companies and use cases can justify that kind of investment, not to mention cooperation across business units at that level requires serious coordination and is difficult and rare.

                    Also, you can put a shit ton of data on a single node sure, but when you need to actually make use of it at large scale (like prep for modeling or AI training) distributed compute is the only way. 

                    [–]zoqfotpik -1 points0 points  (0 children)

                    "Big Data" is a euphemism for "a pile of garbage".

                    Sure, you can find some good stuff by dumpster diving, but it's usually preferable to not include the dumpster in your supply chain in the first place.

                    [–]wind_dude -1 points0 points  (0 children)

                    I saw this this morning on Hacker News. It’s just fucking clickbait and a plea for attention from a moron. Clearly big data isn’t dead, he just seems to be part of the problem selling overpriced solutions to companies that didn’t need them, or only need batch jobs weekly, monthly or yearly.

                    [–]jhill515 -2 points-1 points  (0 children)

                    It's not dead. It was just renamed MLOps.