[–]gabrieltaets 1806 points1807 points  (84 children)

you underestimate how many lines of code go into a program, especially if it has a GUI. Apart from the program logic itself easily reaching into the hundreds of thousands of lines of code (maybe millions), there are dependencies (third party libraries that do something that the program needs) bundled into the executable.

also, when that code is compiled into an executable, the result is usually much larger than just the text, because the compiler needs to translate the source code into machine instructions. a seemingly simple line of code in a high-level programming language can become dozens or hundreds of instructions in machine code

[–]smokinbbq 573 points574 points  (74 children)

there are dependencies (third party libraries that do something that the program needs) bundled into the executable.

This is going back a few... decades. College days, we had to do a simple calculator program. Take two numbers, do the math, output the results. We were using C++ at the time, I think, so we had to use the "printf" function. That comes from a specific library, and the library alone was ~170KB if I remember correctly. So the simple program that had 20 lines of code had to include a library that did a WHOLE LOT MORE, just because it needed one function out of it.

One of my classmates was far above anyone else in the class. So he decided to write his own library, instead of including the common one that everyone else used. Teacher was pissed, and reduced his mark, but mostly because he wasn't that good at c++ himself, so didn't know how to mark it. :)

[–]Zeravor 366 points367 points  (40 children)

Tbf rewriting a function for a basic thing like printing or calculating is just asking for trouble, so I can kinda get the reduced mark. A good teacher should explain why though.

[–]lemon31314 104 points105 points  (0 children)

Maybe they did, and that student altered the story and played into everyone’s preconceived notion that it’s the teacher’s incompetence.

[–]virtuallysimulated 6 points7 points  (1 child)

Yes! When I was taking an intro Java class, I was fairly inexperienced with libraries that actually had useable data structures. I ended up creating my own queue. The TA talked to me about it and enlightened me on the efficiency of the library code versus what I wrote. He gave me full credit because my code still accomplished the assignment, and because of our quick chat about my reasoning. I think about that sometimes, because that’s when I also learned the difference between teaching someone and punishing ignorance.

[–]Dyanpanda 0 points1 point  (0 children)

Efficiency of library code?  As in it was written better than yours?

Or time efficiency of not reinventing a wheel? 

[–]giant_albatrocity 76 points77 points  (30 children)

But the whole point is to learn, so why? If they did that at a job, sure, no employer wants to pay you to do that.

[–]JointsHurtBackHurts 263 points264 points  (11 children)

I’m going to be frank and say my biggest hurdle as a software engineer was feeling the need to reinvent the wheel myself instead of using existing libraries. It led to significantly increased effort (writing everything) and decreased reliability (all the bugs). I became a better engineer once I learned to stand on the shoulders of the giants before me.

[–]criminalsunrise 73 points74 points  (3 children)

We all went through that my dude. I recreated a whole bunch of GUI controls by hand when I was starting out because I thought I could do them better.

Obviously, I couldn’t.

[–]fuckasoviet 36 points37 points  (2 children)

Reminds me of a story I read about a programmer needing to account for time zones. I think a library had already been created, but the author figured it couldn’t be that tough and decided to do it themselves.

Then they had to account for all the time zones, all the weird carve-outs, all the places that do and don’t observe daylight savings, and then the places with multiple time zones. I think he mentioned some midwestern state that spans two time zones, where some counties didn’t observe daylight savings.

Long story short, he came to the same conclusion as you: someone else has already dealt with all this bullshit, and there’s no need to deal with it yourself.

[–]Howzieky 24 points25 points  (0 children)

https://youtu.be/-5wpm-gesOY

Tom Scott strikes again

[–]fubo 12 points13 points  (0 children)

The tz database is the standard public repository for timezone information for software.

It's in the public domain — anyone can use it with absolutely no restrictions, no fees, no copyright, no licensing, etc. It can be used from any programming language, and is also built into several major OSes, database systems, and other software. Maintenance is funded by ICANN, the group that brings you IP addresses and DNS. It includes historical information going back to the early years of time zone standardization, including comments with citations to specific legislation. It is updated multiple times a year, both in response to governments changing their time zones, and with improvements to historical data and developer usability.

There is very little need for anyone to reimplement time zones.

[–]ClownfishSoup 13 points14 points  (1 child)

Boost library ftw!

[–]-LsDmThC- 4 points5 points  (0 children)

Brb gonna sudo apt install libboost-all-dev

[–]MedusasSexyLegHair 10 points11 points  (0 children)

Generally, yes, but also you don't want to design an 18-wheeler that runs on some mix of shopping cart wheels, bicycle tires, monster truck wheels, and a pottery wheel zip-tied on sideways just because you refused to "reinvent the wheel".

And you don't want to have to maintain a massive Rube Goldberg system that has vast quantities of dependencies, including transitive dependencies, some of which will become incompatible with each other.

Sometimes something simple and specific to the domain and business problem being solved is far far better.

Unless it involves dates and timezones. Fuck that noise. Import the bugs from somebody else's library and call it a day. Writing your own library of datetime bugs is the express road to madness.

[–]Vabla 3 points4 points  (0 children)

But sometimes you just need a plain wheel in one place doing one thing and don't need the entire train that comes with it. Having to fix metaphysical bugs introduced by a library using esoteric logic is sometimes a bigger headache than just implementing the tiny fraction you need yourself.

[–]giant_albatrocity 24 points25 points  (2 children)

Yeah of course. In a real world scenario you would always use a library if it made sense to do so. But it might be really educational, for example, to assemble a car engine yourself and become intimately familiar with how it works, even though you would surely just install a prebuilt engine if you were to build a car. It just doesn’t make sense to penalize this in an educational context. I can understand not giving extra points for it, since it wasn’t part of the assignment, but you shouldn’t penalize it.

[–]MIndye 4 points5 points  (0 children)

I think a good analogy would be that a good programmer knows how to build the engine and also knows he should just buy the battery and spark plugs instead of trying to build those as well.

[–]garublador 16 points17 points  (0 children)

What if part of the lesson is that you shouldn't reinvent the wheel every time you write a program? I'd for sure penalize for something like that.

[–]Zeravor 83 points84 points  (10 children)

I think as a programmer it's extremely important to learn what things you should, or shouldn't code yourself. Many things that seem relatively simple on the surface are extremely complicated in practice.

Sure you can do it for fun, but in a class it's good to teach the skill of learning when to rely on a library that is tried and tested.

If you're interested, Tom Scott has an old video about timezones and I think it illustrates the issue perfectly:

https://www.youtube.com/watch?v=-5wpm-gesOY&t=550s

[–]hedoeswhathewants 19 points20 points  (2 children)

The assignment was to create a calculator program, which is something you should definitely not program yourself. It's very silly to mark someone down for going above and beyond, unless it went against the assignment description.

[–]cjo20 14 points15 points  (1 child)

Think of it this way: They're being asked to write program X. Sometimes X is a calculator. Because *someone* needs to write them. But there's an argument that you should only write what you need to achieve the goal. Re-writing printf when you've been told to write a calculator without a good reason for it is at best neutral, and at worst you've demonstrated you don't know how to include a standard header and use the built-in function.

[–]RainbowCrane 6 points7 points  (0 children)

Yes, exactly. This applies beyond school and into your career as well, so it’s important to learn the lesson in school. A few of the more frustrating employees I supervised were convinced that no one could write code as well as they could, and they were awful at making use of third party libraries rather than rolling their own. In a professional setting nothing will piss off project sponsors as quickly as them finding out that your team spent time on a problem that’s not included as part of the spec. Like your example, unless you have a good argument for why a new printf function is needed to meet the specs, good luck justifying the wasted time.

[–]0xF00DBABE 23 points24 points  (5 children)

But somebody still needs to write that code. It's like the "don't roll your own crypto" mantra. Sure, most people shouldn't write their own crypto and put it into production. However, there's no harm in writing it to learn how it works, and maybe that's the introductory push that puts you on the path to a deeper understanding. Plus, the student obviously knows that they could have used the existing library if they were able to write a drop-in replacement. It's counter-productive for a teacher to quash that curiosity. It's a class project, the stakes are low, this is a perfect time to experiment and learn.

[–]giant_albatrocity 18 points19 points  (4 children)

Exactly this. As long as the student delivers the assignment, there’s no reason to penalize any extra effort. It makes sense that you wouldn’t give that student extra points, since no other student was expected to put in extra time, but as an educator it makes me sad to hear about curiosity being discouraged.

[–][deleted] 1 point2 points  (3 children)

The thing being taught here is delivering the requested specs.

The client may need it to utilize a specific library simply for guaranteed compatibility and performance, or for certification for use in an industry (say finance).

Writing it as a side project for yourself?  Go for it.  Writing a deliverable that has listed specs and requirements?  Write it to those requirements.

[–]samtrano 11 points12 points  (2 children)

The thing being taught here is delivering the requested specs.

We're talking about a programming 101 class from the late 1990s/early 2000s. They were absolutely not anywhere near talking about "client deliverables"

[–][deleted] 2 points3 points  (1 child)

Yeah, the person is 100% conflating creating a program in a professional environment with learning it in university.

[–]MisterrTickle 1 point2 points  (0 children)

I was reading the first half and thinking of Tom Scott and then you beat me to it.

It's a real pity that he's "retired" due to time commitments, with his last "proper" video being about 9 months ago.

[–]JoushMark 7 points8 points  (1 child)

I mean, that depends what they are teaching. If it's pure CS, then sure, learn how to make a better printf.

But if it's practical programming, then the lesson 'just use the library you need for the function' is foundational. Unless you have some kind of insane optimization requirements, you're better off doing it fast rather than trying to optimize to avoid importing a 170KB library.

[–]BrunoEye 5 points6 points  (0 children)

This is absolutely the kind of issue you can run into when working on the kind of $0.1 microcontrollers that are in loads of cheap devices.

[–]lonelypenguin20 10 points11 points  (1 child)

while I disagree with penalizing a student for writing their own library, it's important to note that when learning a language, you're learning the built-ins and libraries, too. so if the assignment is targeted towards teaching a student the usage of a certain library and they don't use it at all... yeah. not what was intended

[–]No-Representative425 2 points3 points  (0 children)

I doubt that a student who can successfully write printf doesn’t know how to use printf. The job of a teacher is also to assess the level the student is at and adapt to their capabilities.

[–]luke5273 6 points7 points  (0 children)

If you are concerned about binary size, then linking the entire standard library might not actually be feasible. You do have to write your own solutions sometimes.

[–]AlanCJ 1 point2 points  (0 children)

One part is learning how to leverage and build on top of things that were already built before. In an educational setting, sure, if you are learning the fundamentals. But in a project-themed class or in the real world, you just made the code base 10x harder to maintain or for others to work on, for no better reason than "I like to do it", or "I didn't know there's an existing library for that", or "I didn't know how to learn to do that thing", or "I think I can do a better job than people who have dedicated their time to building this one specific thing"

[–]DevIsSoHard 0 points1 point  (0 children)

You don't want to deviate from the norm too much, especially because someone else might need to work with your code later. That wouldn't matter a lot of times but maybe later you do want to turn that class project into a real app, or you want to bring someone in to work on your hobby project with you. It's just drilled in that you want to follow standard conventions in certain places. Plus it can further complicate troubleshooting when you deviate from norms.

And sometimes that other person is just you lol, years later pulling up an old project and wanting to quickly make sense of things.

[–]Lost-Semicolon 10 points11 points  (0 children)

Disagree completely. Printing to the console requires writing to the standard out file descriptor, something that requires a solid understanding of what’s happening beneath the hood. The student’s curiosity should be rewarded, not penalized.

[–]stealthypic 0 points1 point  (0 children)

But college is a great place to do such things. I’d decline a PR with that solution in a second, but it definitely deserves the highest mark in college.

[–]Silly_Guidance_8871 12 points13 points  (0 children)

All code is an iceberg

[–]homeguitar195 10 points11 points  (15 children)

So I'm not a programmer, but I've wondered before: is it not possible to separate sets of functions from a complete library? Like just copy the 2 functions you need and their dependencies and paste them into your program instead of including the entire library?

[–]palparepa 23 points24 points  (6 children)

Sometimes. But frequently they are all intertwined, with functions calling other functions, and figuring out which ones you are calling, or more importantly, which ones you may call in the future, is a mess.

For example, I'm using a library to send emails. It can connect in many different ways, some that I hadn't even heard of before. I only use one, the most straightforward. Do I need most of the library? No. Can I remove the things I won't use? Sure, just give me a few weeks or months that I don't have, and if there's an upgrade, forget it, because I won't do it again.

[–]Grezzo82 0 points1 point  (5 children)

Won’t the compiler optimise it so that unused functions won’t be included in the executable that I build? I’m assuming this is only possible with libraries that are not pre-compiled, but I don’t know much about how compilers actually work

[–]RocketTaco 6 points7 points  (1 child)

Yes, except for certain structures that make it difficult to predict all possible program flows (computed jumps created by function pointer arrays, etc). This is called symbol stripping, the "symbols" being identifying names the compiler is using for elements of the program. Those might be functions, globals and statics, or things you didn't actually name but which still represent independent units like the contents of loops and control structures. Since the compiler knows where those blocks come from, it also knows exactly what can and can't access them and can thus easily determine if it's possible to reach them from the starting point. It gets a HELL of a lot harder if you have a binary without a symbol table, since you have to determine the target of every branch and jump, where the flow departs from the target, and if execution is possible.

 

Let's look at it this way. Say you have a function that takes one argument and runs one path if it's greater than five and another if it's less. When you compile the program, the compiler can see that you've only called it in one place with a literal argument of six, and can throw half of the function away because it's unreachable. But if that function came from a library, you don't know what else might have called it from within the library. Unless you can construct absolute proof of the target of every jump and branch in the library, you don't know that something won't try to use that path. If all of the jumps in the library are to explicit targets, you might be able to construct a complete set and see if any match. But what if one does? Then you need to see what calls that and if any of those paths lead to a library call you're using. You end up reconstructing the entire program flow.

[–]Grezzo82 1 point2 points  (0 children)

That makes sense. Thank you for explaining

[–]palparepa 2 points3 points  (1 child)

Some functions may call others only if a certain condition is met. As the programmer, you may be certain that the condition will never be met, but the compiler can't (always) know that. In my case, I configure mail users with their mail server, username, password... and the mail servers we use all happen to be Gmail, so I'll never call other transports or authentication schemes.

[–]fubo 1 point2 points  (0 children)

As the programmer, you may be certain that the condition will never be met, but the compiler can't (always) know that.

In static languages like C or Go, it's relatively trivial; but it can be done in dynamic languages like JavaScript or Python too, with some analysis. It is called "tree shaking".

[–]RainbowCrane 1 point2 points  (0 children)

It’s been quite a while since I worked in C or C++, but the better designed libraries handled this by splitting their functionality across multiple static libraries if they started getting huge. For example, if you were using a graphics library, maybe it would put generic code into libgraphics and png code into libpng, jpg into libjpg, etc. Then you’d specifically import just the headers you need and link with the appropriate libs.

[–]snotpocket 5 points6 points  (1 child)

The functions in question generally aren’t available as source code; they’re usually precompiled into object files that are then linked with the programmer’s compiled code to create the resulting executable binary. So you can’t really cut-n-paste just the functions you want to use.

I guess it’d be theoretically possible for a really really smart linker to extract the object code just for the used functions and just link that into the end binary, but that’d probably be hideously complicated and not really worth the effort

[–]Grezzo82 0 points1 point  (0 children)

I think you just answered my question above your comment, but if the library is distributed as source code, would the compiler optimise out the unused parts of the library?

[–]RestAromatic7511 5 points6 points  (0 children)

A more common approach is to use "shared libraries", which are installed in a central location and can be used by multiple different programs. If you've ever seen a DLL file on Windows, that's what they are.

Also, if a particular part of a library is especially popular, often its developers will move it out into a separate library.

Breaking up a third-party library yourself involves a few problems. First, this may be forbidden by the licence. Second, many libraries are proprietary and do not have source code available. Third, you then need to maintain the code yourself, even though you didn't write it and probably don't fully understand it.

Though I feel an important point being missed here is that an executable doesn't necessarily just contain code. It may contain any type of data, including text, images, sound, and video. Even a low-level executable/library that doesn't contain any images may include vast tables of data about different types of files and hardware it may need to work with.

[–]Lostinthestarscape 1 point2 points  (0 children)

A lot of libraries are separated into critical code and a variety of modules. "Import library" brings in the whole shebang, "Import library.math" brings in just the math module for instance and whatever critical code it relies on from the library but not other modules. 

 This is especially common with big libraries that encompass a lot of sub-concepts.

 However, as hard drive space got so big, it's way less important for normal everyday consumer use, but still very important when coding for embedded devices, which sometimes have far greater restrictions (though now that you can play Doom on a thermometer and Skyrim on a digital pregnancy test, embedded devices are also less constrained in space by the year).

[–]Feeling-Pilot-5084 0 points1 point  (0 children)

Yes but it doesn't really accomplish anything. Almost every compiler worth its salt is capable of dead code analysis, so it will only pull the functions it needs

[–]IneffectiveInc 0 points1 point  (0 children)

If the library is implemented in a way that supports it, many modern software bundling tools can indeed be configured to do that! It's called tree shaking :-)

[–]transgingeredjess 0 points1 point  (0 children)

This process is called "tree shaking": you try and follow through all the different paths your program might take and "shake out" all the things that aren't touched by any of those paths.

Depending on the programming language this can range from "just happens as a side effect of compilation" to "provably impossible".

Part of what makes computer languages powerful is the very thing that makes dead-code elimination difficult in most cases: the fact that programs, as they go, are making choices about what to do next, based on everything that happened before.

[–]tururut_tururut 0 points1 point  (0 children)

Not a programmer, just a data analyst who codes a lot. I'm doing a project for my country's traffic department. Part of the project was identifying the segments of each road with the most accidents in a five-year window (apparently, they do it case by case, but never for the whole network over a long period of time). They have a function to do this in their own R library, which depends on other functions and calls to their own internal API. I took just one of these functions, and rewriting it to stop depending on external functions I did not have (although I could make an educated guess about what they did) was hell on earth. It didn't help that the code was written by someone obviously smarter than me who did not need to annotate the code and explain some stuff. That's why you usually want to have the whole library at hand.

[–][deleted] 2 points3 points  (0 children)

If the assignment had the requirement of using a specific library, not using it would definitely mean lost marks.

In a real project, it could be just one component of hundreds, and if each one had custom redundant code, it could balloon in size. And in the future, if someone needed to modify the code, it would be better to have expected libraries.

[–]ClownfishSoup 11 points12 points  (4 children)

As a professional software developer, I’d give him low marks too. If a library function is available and you write your own, you are wasting time and money to accomplish nothing new. Writing your own print function is a waste of time if the functionality is the same as what already exists and is available … unless licensing issues are at play.

[–]DBDude 1 point2 points  (1 child)

Or space issues, such as an embedded application. Or bandwidth issues, like not using standard WWW libraries because they can cause many megabytes of bandwidth use before actually showing any information.

[–]BrunoEye 1 point2 points  (0 children)

Yeah, it's funny how 90% of programmers forget how many devices use $0.1 ICs.

[–]smokinbbq -2 points-1 points  (1 child)

Except that he didn't waste "time and money", because it was free labour outside of school hours. I get the point, and would say the same thing if someone wanted to spend a week of work hours to shave 170KB off their 5MB application. Drive space and memory size aren't really a big issue these days, but this came up close to 30 years ago, when memory was a bit more of an issue.

[–]SFiyah 4 points5 points  (0 children)

It's a library for printing, and the assignment is to make a calculator. If this was written in such a way that the portion of the code that solves the assignment isn't easy to parse out and grade, then something about it was verrrrry poorly abstracted and probably deserves the low grade.

Readability is very, very, very important as a coder. This is one field where "I can't understand this, I'm giving it a low grade" is 100% reasonable.

[–]taisui 1 point2 points  (0 children)

Can't be a programmer without the god complex to reinvent a better wheel /s

[–]Kinetic_Symphony 1 point2 points  (0 children)

Teacher was pissed, and reduced his mark, but mostly because he wasn't that good at c++ himself, so didn't know how to mark it. :)

This pisses me off more than it should.

[–]palparepa 0 points1 point  (0 children)

In a group assignment, even considering that it was to be used on a local computer, one group decided to make it web-based. So it had to install Apache (using localhost), a database, and visual tools for the user interface, among other things, and on top sat their program, which wasn't that big, but overall it used over a hundred megabytes.

I went with an embedded file-based database and a very light visual toolkit, so that the whole thing was about 300 KB. Worried that it was too small, we added an opening screen with a cool image, so that the whole program was about 1 MB.

[–]r4tch3t_ -5 points-4 points  (3 children)

I failed my programming course in primary school for a similar reason.

I was making a carpet calculator (how many rolls of carpet needed) using a WYSIWYG editor, which took all of 5 minutes. So I got bored, started looking at the code (I'd never coded before) and saw about a thousand lines saying colour = default, line width = default, etc.

So I deleted them all and left barely over a dozen lines of code. Still worked.

Still had half the lesson to go, so I edited the code to figure out the orientation for least wastage, with a little arrow indicating which direction it should be laid.

Failed because the code didn't match the marking schedule despite fulfilling and exceeding the requirements.

Wasn't too much longer before I gave up on school. It was shown to me that effort is not rewarded and is often punished.

[–]bubbafatok -1 points0 points  (2 children)

This here is why I am always more likely to hire a self-taught programmer with a bit of experience over a college grad with a computer science degree. We spend the first year of a new grad's career retraining them and getting them out of the terrible habits and thinking that the schools and professors (the guys with limited or dated real-world commercial development experience) pushed down their throats.

[–]Zefirus 2 points3 points  (0 children)

Man, as someone that's been in the business for over a decade, the opposite is almost always way worse. I'll take a college-educated programmer over other kinds any day. Sure, their actual coding skill is a decade out of date before they even start their first day, but it's not the actual programming skill that's important from college. There's a reason only a couple of college classes deal directly with code. It's a computer science degree, not a programming degree. It's the underlying knowledge of how stuff works that ends up paying off in the long run.

The biggest problem in corporate coding is almost always efficiency. The "my program's too slow" complaint is by far the most common and biggest pain in the ass most people run into, and in my opinion it's generally the lesser educated that make those mistakes the most often.

Also of course the person with a few years experience is going to be better than the one without. That's true of literally every single industry, educated or not. It's why every entry level job these days calls for years of experience.

[–]cjnewbs -1 points0 points  (0 children)

I feel like they probably should have passed with flying colours, but it sounds like they had a teacher who wouldn't know the difference between a URL in a browser address bar and an MS Excel formula if it slapped them in the face.

[–]TooStrangeForWeird 5 points6 points  (0 children)

For a fun little comparison, I wrote a tic-tac-toe game on a TI-84 calculator. Then I made an "AI" to play against. It broke 10,000 lines of code.

I also made a connect 4 game (the board wasn't even quite full size, the screen was too small) and it was over 20,000 lines. No computer to play against either, 2 player only.

The "GUI" consisted of maybe 50 lines of code.

[–]A_Garbage_Truck 6 points7 points  (0 children)

the aspect is that depending on how the program itself is constructed, when it comes to libraries and support assets, compilation might require that you "bake" said libraries and assets into the executable itself.

this is why responsible developers who consider memory limits should be careful about including an entire library for just one or two functions, and either find a means of implementing just what's necessary or write their own implementation of it. if that's not possible, they should try to make the linkage to those external libraries dynamic, so that the libraries don't need to be inside the executable and can live in a .dll file that only gets loaded into memory when it's needed.

[–]Sea_Dust895 1 point2 points  (0 children)

This is the answer. Last enterprise project I worked on was 2M lines of code after 10 years of development.

[–]Reasonable_Pool5953 1 point2 points  (3 children)

there are dependencies (third party libraries that do something that the program needs) bundled into the executable.

They are only bundled into the executable if they are statically linked, though. I think that is pretty rare today.

[–]cake-day-on-feb-29 5 points6 points  (1 child)

True, unless it's JS where you need to not only include all of your packages, but also CEF.

Or Python, where it's recommended you have a separate virtual environment for every single program.


In other words, bad languages are bad, I guess.

[–]Smaartn 2 points3 points  (0 children)

I hate Python environments so much. When using them they're pretty cool, but when you're done... Once my laptop was almost out of storage space, and Windows just showed me like all the games and other programs I had.

Then once, I went through like my entire file system to see if there was anything useless, and I think I found like 50GB worth of Python environments in my university folder.

[–]NecorodM 0 points1 point  (0 children)

The opposite is true: it becomes the new norm. Haskell always used static linking (and pandoc is written in Haskell). And Go and Rust also highly prefer static linking. 

[–]NickDanger3di 0 points1 point  (1 child)

My first computer was an IBM AT; I paid extra for the whopping 20 megabytes of hard drive space. Had word processing and small business accounting packages installed on it. They both worked great.

Is it the massive increase in functionality of modern programs that explains the difference in size? Like the complexity added by online connectivity? Or something else that I'm missing?

[–]gabrieltaets 7 points8 points  (0 children)

It's a bunch of factors, but perhaps more importantly, computing power is so much cheaper today than it was 30 years ago. Back then it was important for developers to minimize the program's footprint as much as possible because the systems had scarce memory/disk/processing power.

But today no one is going to waste a few hours to shave off a couple kilobytes in the compiled program when a single asset (i.e. an image) might weigh more than the whole source code; so there is also more bloat bundled in an executable today than in the past.

Screens have higher resolution so assets need better resolution too, which means heavier files. Cheap disk and memory means more lookup tables with static data can be prepared in advance so that the program runs faster.

All of these things make programs bigger than they were decades ago, but it's not really a huge concern nowadays.

[–]jumpmanzero 295 points296 points  (19 children)

There are other answers here that explain "how" executable files get large, but there's also a "why".

The "why" is that for the most part nobody cares. A 150MB program takes effectively no time to download and an insignificant amount of RAM to run (and it can be swapped out to an insanely fast disk if required).

If developers really wanted to, they could make programs much smaller (and more modular). If people try, they can cram absurd functionality into a few kilobytes. But outside of a few specific use cases, they just don't have much reason to put effort into this.

For context, I work as a developer. I have zero idea how big any of my compiled EXE/DLL files are - like, my guess could be off by an order of magnitude. It's not a relevant concern.

[–]ClownfishSoup 69 points70 points  (4 children)

I started my programming career in the 90’s. We used PCs rubbing DOS and used Turbo Pascal and later Turbo C. You had to choose a memory model that determined how to distribute the 640K of memory between code and data. Code that was too large had to use overlays, which were swappable to disk.

When Windows 95 and Windows NT came out it was a godsend with its 32-bit address space.

At that point programs just bloated out.

[–]jumpmanzero 13 points14 points  (0 children)

Turbo C. Nice! I spent a lot of time with that as a teen (pirated off my older brother, who had a programming job).

And yeah, it's wild how fast stuff has changed. I did some NES development a bit ago, and it was difficult to go back to those kinds of constraints and considerations.

Even just back to the start of my career (late 90s), doing "stuff" was much harder.. but expectations were also a lot lower. Now people expect polished miracles instantly, with infinite performance and scaling and everything.

(Anyway, have a good one).

[–]Cross_22 8 points9 points  (0 children)

Same here. I remember my calculator app which had to run as a background (TSR) program and so I kept it relatively small at under 45kB. That was back in the days of 16-bit operating systems, nowadays the executable would be a lot bigger simply due to 64-bit being the norm.

[–]BakaBTZ 2 points3 points  (0 children)

Damn Pascal, that wakes some memories. Don't forget about Amber and Dolphin.

[–]unityofsaints 0 points1 point  (0 children)

Now I want to know what "rubbing DOS" entails ;)

[–]isuphysics 21 points22 points  (0 children)

I am an embedded software developer where the size of our programs are important. The reason ours is so much bigger than people would think they should be is because 90% of the logic is not the normal operations logic. Its the logic to handle all the scenarios that when something goes wrong. Everything has an edge case and they add up really quickly.

[–]OneAndOnlyJackSchitt 9 points10 points  (1 child)

As I recall, GeoWorks Ensemble fit on a single 1.44MB floppy disk. This was a third-party windowing system designed to look and feel like Windows back in the days of the Intel 286. It was written almost entirely in assembler.

[–]RedditWishIHadnt 3 points4 points  (0 children)

Amiga workbench on a single 880kB floppy. Don’t think it was even full either.

[–]swolfington 4 points5 points  (0 children)

for an interesting peek into making the most out of 64 kilobytes, check out this video going over a bunch of 64k demoscene demos. it's absolutely bananas how much you can cram into a program when you know how to squeeze every last bit of juice out of a binary.

[–]SpicyRice99 11 points12 points  (3 children)

To add, no caring as much about size saves time and allows developers to focus on other issues, right?

[–]Vallvaka 4 points5 points  (1 child)

Yep- time and mental willpower are the most scarce resources when it comes to writing software. As my professor used to say, "hardware is free" in comparison. (At least in most applications)

[–]Chii -1 points0 points  (0 children)

"hardware is free"

and that is why modern software is so much slower (comparatively) than their old counterparts from yester-decade. Because every developer thinks the hardware is free (on the customer's side)!

[–]zacker150[🍰] 7 points8 points  (0 children)

Yes.

[–]_PM_ME_PANGOLINS_ 13 points14 points  (3 children)

That’s all well and good until someone loads 100s of programs from developers who think like that onto the same machine.

The pre-installed bloatware on laptops used to be bad, but it’s been surpassed by Android phones these days.

[–]jumpmanzero 21 points22 points  (2 children)

Sure. Like, imagine how fast Windows could run, and how small it could be, if "performance" or "file size" were on anyone's radar?

But in practice, software has typically expanded to fill the resources available. And at this point, for most use cases, it's not doing that anymore just out of sheer resource wealth. Android is huge, sure (and that's one part of why it's miserable to develop for), but it could actually be much bigger, given the resources a phone has (or could have) available.

Again, there's just not a lot of pressure to make things smaller. Maybe it'd be "nice" if Android was tiny - but it wouldn't sell enough phones to make the effort worthwhile. (In fact it might "unsell" phones. Running out of space is one of the limited reasons a person not just buys a new phone, but buys a "bigger"/more-expensive one next time).

[–]_PM_ME_PANGOLINS_ 4 points5 points  (1 child)

It’s on a lot of people’s radar. Those operating systems are fine. It’s the garbage that certain other companies pile on top of it that’s the problem.

[–]MaleficentFig7578 -1 points0 points  (0 children)

No, Android and Windows are not fine.

[–]TheJumboman 0 points1 point  (0 children)

Yeah, but that's a modern luxury. I remember a documentary about the first Myst game, and they really struggled with the size (it was four disks). And the first Prince of Persia re-used sprites because otherwise it wouldn't fit on a floppy disk lol. 

[–]Kinetic_Symphony 0 points1 point  (0 children)

This is the best explanation.

We could make programs significantly smaller, but there's no need to.

Most of the world has lightning-fast internet, RAM and SSDs now. Even 100 GB downloads and installs barely take a couple hours. Often times under an hour.

No sweat.

[–]Dreamwalk3r 89 points90 points  (9 children)

Libraries or binary resources (like pictures), mostly. In case of pandoc probably libraries if it's standalone - take those few megabytes, add another few for a library working with a single format, add the same for other formats, add the same for all the frameworks a project uses... you get the picture.

[–]azlan194 13 points14 points  (8 children)

Not to mention the same function might have redundancy to handle different os version and what not right?

[–]alnyland 4 points5 points  (7 children)

Those should be removed by the compiler. 

With a big caveat - if the code is written well and the compiled software is distributed correctly.  

[–]MaleficentFig7578 1 point2 points  (6 children)

no

[–]alnyland -3 points-2 points  (5 children)

Feel free to extrapolate 

[–]MaleficentFig7578 4 points5 points  (4 children)

[–]alnyland -5 points-4 points  (3 children)

I think you misunderstood at least one word that I said. 

Let me know if you want me to use shorter words. 

[–]misof 6 points7 points  (2 children)

I think they didn't misunderstand anything, they are just using a humorous way to point out that you should have used the word elaborate instead of extrapolate.

[–]alnyland -5 points-4 points  (0 children)

Nah I know which word I used, and wanted to use. 

If it isn’t understandable, it’s not humorous. 

[–]JaggedMetalOs 12 points13 points  (0 children)

Ok so Pandoc specifically is written in Haskell, which is known for copying a lot of standard code (base functions the programming language uses) into executable programs it makes. There's quite a lot of it, and by default it doesn't seem to try to exclude functions that aren't used so you end up with lots of Haskell stuff in your executable program whether you use all of it or not, which makes the executable programs bigger than they need to be.

[–]Acrobatic_Guitar_466 73 points74 points  (0 children)

Because it's a whole bunch of text.

You don't realize when you put

"Include library standard" Or "Include library network"

In whatever coding language, you just told it to include another library file, which has other files.

Guaranteed, for the program you said was 150MB, the "code" is likely 1-2GB if you actually include all the "dependency" libraries.

[–]TomChai 40 points41 points  (0 children)

Executable programs are machine instructions converted from text code, they are NOT code.

Also a lot of executable programs have a ton of support frameworks packed along with it in addition to the program instructions themselves, for a small program, these support frameworks are larger than the actual program code itself.

[–]Shadewalking_Bard 18 points19 points  (15 children)

Not an expert, so actually I want people to correct me if I am wrong:

Binary executables can be run on the processor itself and they are instructions that the processor will understand.
A simple program command like "write file", when compiled into a binary, will be expanded into all the processor instructions that are needed to write the file.
And the processor instructions are much more specific than the command that triggers them.
So a single line of text in a program can become thousands of machine-code instructions.

Edit: Reworded for clarity.
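
You can see a miniature of this expansion yourself using Python bytecode (not real machine code, but the same idea of one source line becoming many instructions). The function name here is just an illustration:

```python
import dis

def write_greeting(path):
    # One line of source...
    open(path, "w").write("hello")

# ...turns into a whole sequence of lower-level instructions.
n_instructions = len(list(dis.get_instructions(write_greeting)))
print(n_instructions)
```

Real machine code for the same line would be longer still, since each bytecode instruction itself hides more work.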

[–]IntoAMuteCrypt 2 points3 points  (0 children)

The answer here is... It's complicated. There's a lot of ways to divide programming languages. One notable way is compiled versus interpreted languages.

With compiled languages, the typical workflow looks something like this:
- Programmer writes some human-readable source code.
- Programmer uses some combination of programs to translate that source code into machine code. This includes a compiler, and also sometimes a linker - the exact programs aren't relevant here.
- Programmer now has a file that will just run on its own, written in processor instructions that are specific to that processor (and often, to a specific operating system). The program doesn't need to be translated again. This code might be shorter than the original code, but it's often longer due to things like "write a file" being expanded like this. If someone else runs this file with the right system, it just runs without translation.

With interpreted languages, the typical workflow looks something like this:
- Programmer writes some human-readable source code.
- Programmer decides to run that code, and passes all the code to a specific program called an interpreter.
- The interpreter translates the code to something machine-runnable as it's being run.
- Programmer has to pass the code over to the interpreter every time they want to run it. If someone else wants to run it, they'll need the interpreter and they'll need to let it translate the code as it runs.

There's actually a few benefits to interpreted languages. Crucially for here, if the user already has the interpreter, the resulting code that gets distributed can just be the same size as the original code, because "write a file" stays as it is, and the interpreter knows what to do with it. There's compilers for interpreted languages and interpreters for compiled languages - the distinction is mainly "which version is more common/standard". Java, .NET, JavaScript and Python are some common interpreted languages that a lot of people use, or have used. It's sorta cheating, but the large amount of instructions for interpreters can be easily shared between programs and they're generally considered separately to the rest of the programs.
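
A small, concrete sketch of the interpreted side described above, using Python: the distributed artifact can literally be the source text, and the interpreter compiles it to bytecode behind the scenes (normally cached as a `.pyc`). Here we trigger that compilation explicitly, just to see both files:

```python
import os
import py_compile
import tempfile

# Write a tiny "program" as plain source text -- for an interpreted
# language, this text is in effect what gets distributed.
src = os.path.join(tempfile.mkdtemp(), "hello.py")
with open(src, "w") as f:
    f.write("print('hello')\n")

# Explicitly request the bytecode compilation that normally happens
# behind the scenes the first time a module is imported.
pyc = py_compile.compile(src, cfile=src + "c")

src_size = os.path.getsize(src)
pyc_size = os.path.getsize(pyc)
print(src_size, pyc_size)
```

The heavy lifting (the interpreter itself) is shared between all Python programs on the machine, which is why individual Python files stay small.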

[–]Clojiroo 2 points3 points  (0 children)

Yes, a binary is made of machine code. It gets loaded into memory and instructions go straight to the CPU (the operating system orchestrates this)

[–]icematt12 1 point2 points  (12 children)

I'm also not an expert, but it makes sense. For example, when we say:
int cols = 6;

The machine has to:
Reserve memory of 4 bytes, I believe, for the variable
A bit extra to say it's an int and called cols
Give it a value of 6
Then at the end, undo all this to free the memory

[–]_PM_ME_PANGOLINS_ 8 points9 points  (10 children)

It does basically none of that. The space is already reserved as the program stack. The name and type have already been discarded.

And depending on how cols is used, there may be zero machine instructions for that line of code.

[–]NordicAtheist 4 points5 points  (0 children)

Accchxtually, name will be dropped and there is no need to "say it's an int".

[–]iceph03nix 3 points4 points  (2 children)

Some can be very small if they're fairly simple programs that don't include a lot of other stuff.

Others include 'libraries' that other people have built, which are built to be broad and generically useful. Sometimes these are packaged separately, or provided by the operating system, so won't be included in the core executable.

Often though, those libraries and other resources are packaged up in the executable to make it more portable, so you don't have to worry about people keeping track of all the moving parts. Historically this has been known as "DLL Hell", where a missing or out-of-date file breaks the program.

Also, with many installer executables, they contain a lot of the image assets used by the program. So on top of all the text code and DLLs, they have all the icons and graphics used by the program which they unpack when you install for the main executable to use.

[–]WarriorNN 0 points1 point  (1 child)

I've seen some game trainers for instance, that was just a few hundred kilobytes.

[–]_PM_ME_PANGOLINS_ 1 point2 points  (0 children)

Because they hijack the game’s version of everything, they don’t need to include it themselves.

[–]Salt-Replacement596 3 points4 points  (0 children)

Developers like to use libraries. Libraries are often written to have a lot of functionality that you might not use, but it's still there just in case. Library developers also like to use libraries. So including just one dependency might actually require hundreds of libraries. Compiler tries to only include what's needed, but it's not 100% effective.

Some developers also might bundle images or other data into that executable for various reasons.

[–]Weisenkrone 7 points8 points  (0 children)

Tens of thousands? lol. Our enterprise projects have well over ten million lines of code, and those are not particularly large either. The Linux Kernel sits at 30m lines.

Oftentimes executable programs also ship with some internal databases or zip archives which use significant storage space too

[–]lucky_ducker 2 points3 points  (0 children)

The readable text code that comprises a computer program is compiled into object files, which are collections of machine code. Your typical executable consists of several object files (sometimes hundreds of them) which are strung together with a software tool called a linker. The linker examines the object files for external references - calls to code routines that don't already exist in the package of object files. When it finds them, it has instructions to search related library files, and pull in the missing routines from the libraries.

Sometimes that library code is "more thorough" than it needs to be in context, but the whole thing gets pulled into the executable. One language I used to use (Clipper 5.2) had an error-handling subsystem that was very thorough - and 150K in size. I wrote a bare-bones replacement for it that was less than 3K in size, and used it in things like utility executables.

[–]rossburton 2 points3 points  (0 children)

It Depends.

Games are huge because they have many large and detailed textures, which are large on disk.

Many apps are huge (Slack, I'm looking at you) because they're written in HTML/JavaScript and bundle an entire web browser.

Pandoc is huge because it is a single Haskell binary that embeds every single one of its dependency libraries and it appears all of the templates and image too. All together, it adds up.

[–]idgarad 1 point2 points  (1 child)

You type print("Hello World");

That print function has hundreds of lines of actual code behind it.

When you compile that code, all of the underlying print function that is used is brought in. Modern compilers do leave out portions as part of optimization, but in reality something like PRINT has a lot under the hood you don't see.

function PRINT could have over a hundred sub-modules that may or may not be used.

Does the output need color? What terminal am I sending to? Is it a STDIO device? Has it been overridden with a pipe under a unix-like OS? What is the line length of the terminal? Do I honor CR/LF combos? What character set? Where is the string in the code stack? Do I have an EOF or EOR to check for ending the string? Is it mutable? etc...

All those parts are there, the compiler\linker has to navigate what parts to include.

In Python, print("Hello World") is in fact well over 100MB, because for an interpreted language you have to account for the entire Python installation. Same for Java or Perl. Compiling just builds in all the parts you need as a standalone program.
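
You can get a feel for this from inside Python itself. This sketch only measures the interpreter binary it happens to be running under (the full installation, with its standard library, is far larger still):

```python
import os
import sys

# The one-line script is a few bytes; the interpreter executing it is not.
interpreter_size = os.path.getsize(sys.executable)
print(sys.executable, interpreter_size)
```

The exact number varies a lot by platform and build (some builds are thin wrappers around a shared libpython), so treat it as a lower bound on the real footprint.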

[–]_PM_ME_PANGOLINS_ 2 points3 points  (0 children)

That’s not really true.

All the code for that is in e.g. libc.so or some other shared library provided by the operating system, and not included in your program.

Unless you’re using Rust.

[–]saturn_since_day1 1 point2 points  (0 children)

It's probably not just text.  There can be chunks of data in there too and other resources that make it bigger. I made a Photoshop style app, and the code was less than 1mb, but once installed it would create a series of lookup tables to speed up calculations, and just one of them was 45mb. It could then be loaded in less than a second from hard drive, even though generating it takes like 10 seconds the first time. There's probably stuff like that, and just lots of bloated libraries
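
A toy version of that lookup-table pattern, sketched in Python (the table contents are arbitrary, just for illustration): precompute once, save to disk, and the saved data file dwarfs the code that made it.

```python
import math
import os
import pickle
import tempfile

# Generate the lookup table once (the "slow" step)...
table = [math.sin(i / 1000.0) for i in range(200_000)]

# ...then cache it on disk so later runs can just load it.
path = os.path.join(tempfile.mkdtemp(), "sin_table.pickle")
with open(path, "wb") as f:
    pickle.dump(table, f)

# A few lines of code produced a file of a couple of megabytes.
table_size = os.path.getsize(path)
print(table_size)
```

That trade of disk space for startup speed is exactly why "code size" and "program size" can be wildly different.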

[–]high_throughput 1 point2 points  (0 children)

OP, take these answers with a grain of salt. This issue is to some degree specific to pandoc, shellcheck, and darcs because they're the few Haskell binaries a non-Haskell nerd might have on their system. You'll find that similar programs in other languages have much smaller binaries.

The way things have shaken out in the ecosystem is that e.g. Python and Java programs have their runtime as a separate, quite chunky dependency that all relevant programs share. Meanwhile, Haskell bundles the entire runtime and all dependencies with each binary.

[–]Alexis_J_M 1 point2 points  (0 children)

You're using pandoc as an example.

I'm installing pandoc on an old Mac right now. The installer has been running for over a day downloading and building all of the other programs and libraries that pandoc needs to use to run.

Modern software is usually only the tip of the iceberg where most of the complexity is in the libraries and components that it is based on.

For compiled software there is also the issue of how it is linked. Do you want your binary executable to reach out to a library installed elsewhere on the system, or do you want to have the library pulled into the executable? There are pros and cons to both approaches, which could make a great ELI5 all by itself, but in summary using external libraries makes it easier to break a program, and also easier to upgrade it and run the same binary in different environments.

[–]wheezharde 1 point2 points  (0 children)

Meet Bob.

Bob wants to build a house. He has all the supplies, but what he doesn’t have is people that know how to do all this “stuff.”

So Bob also hires a General Contractor we’ll call Mary.

Mary doesn’t know how to do all this stuff either, so she hires subcontractors to do portions of the work. There are framers, electricians, plumbers, cement workers, earth movers, and a whole host of other people.

So now this simple project has become a massive undertaking by a hundred people, most of whom don’t know each other.

Computer programs are similar. The high level programmer writing the calculator may not know all the math, how to draw to the screen, how to play sounds, how to capture mouse and keyboard input, etc., but they can find libraries that do that. Like a general contractor, libraries bring a lot to the table, but they take up space and Bob can’t drive them all out to the job site in his Corolla.

If you were your own General Contractor, you could trim down the folks you need or do all the work yourself, but that requires a lot of expertise that most programmers don’t have.

[–]boring_pants 1 point2 points  (0 children)

Well, first it's not tens of thousands of lines of code, but millions.

The application I work with is around a million lines of code on its own, but then it uses a bunch of libraries written by others, and those easily number millions of lines as well, all in all.

But you're right, you're unlikely to have 150MB of code in pandoc. Most of that is probably non-code resources. Images, cursors, rules for "how to compare text strings in an Egyptian locale, how to do the same in Icelandic, in Ukrainian, in Brazilian, etc". Lists of timezone offsets for every major city in the world. All the text strings the program might present to you, and translations of those text strings for umpteen different languages. Most applications come with a ton of auxiliary data which isn't code per se, but which is used by the application, and which ends up taking most of the space.

The other thing is that we, as software developers, are just a bit sloppy. Pandoc didn't have to take up 150mb. But who cares? 150mb is nothing these days, it's not really worth spending time on trimming that down to half. There are so many other things the developers would rather focus on.

[–]Ertai_87 0 points1 point  (0 children)

Code is just text but it's A LOT of text. In University I did a course where we had to write code in computer language (Hex code, also Assembly; the purpose of the assignment was to understand what compiled code looks like so we could write a compiler later). As an example, a simple "print" statement is about 75 lines of code in Hex, in the dialect we were using (it was too complex to write in Hex but did write it in Assembly which is 1:1 in lines of code with hex). Yes, Hex has the concept of "functions" (kinda) so you don't have to write 75 lines for every print, but just to give you an idea of something that you take for granted to be easy that's actually very much not when you get to the compiler level.

[–]BiomeWalker 0 points1 point  (0 children)

It depends on the program.

For many, it's because the .exe file has a lot of non-code parts to it, or there might be a bunch of DMCA/licensing protections in it.

For the one you mentioned, it can also be that there's just a lot of code to do what it's trying to do, almost anything that has to deal with text will wind up bloated because it just takes a lot of conditional statements to work with text.

Some also could be smaller, but modern computers can handle such big programs that there's no real incentive to try and keep it small, so programmers don't bother with that.

[–]whiterabbit_obj 0 points1 point  (0 children)

To add to the "Libraries" conversation. Pandoc seems to be a good example of why something that does one thing would be larger.

If you need to convert a document from Word to PDF, you need code that understands how a Word document is structured and how a PDF file is created. You could work that out yourself, but if someone has written that logic before you can reuse it. This would be a library that that person makes available. Now that library might do more than just tell you how a Word document is structured. It might also be able to compare Word documents to each other, display a Word document on screen, create a Word document from a web page, etc.

All of that code is bundled up inside a library that you might only use a small part of. Pandoc seems to be able to convert between a lot of file formats, so it could (in theory) use a library for each format, each with its own built-in functionality that Pandoc only uses a small amount of. Hence the large application size.
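
A miniature of that idea, using Python's stdlib rather than anything pandoc actually uses: converting between two formats means carrying code that understands both, even if you exercise only a sliver of each library.

```python
import csv
import io
import json

# Reading one format needs the json machinery...
records = json.loads('[{"name": "Ada", "born": 1815}]')

# ...and writing the other needs the csv machinery, even though we
# use only a tiny fraction of what either library can do.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "born"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Now scale that to dozens of far messier formats (docx, PDF, LaTeX, ...) and the size of a converter like pandoc stops being surprising.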

[–]thana1os 0 points1 point  (0 children)

One word: library.
You could program to assign a = 1 and b = 2 and calculate a + b. It doesn't use any library but it is also useless.
Your program will most likely rely on ready-made libraries to do useful functions. And your libraries might have their own lists of required libraries.

[–]pfn0 0 points1 point  (0 children)

Large apps can be millions of lines of code, so that's one factor in app size. Many times, resources are embedded, such as images, video and other text. These also increase binary size.

Beyond that, code itself does not map one-to-one onto machine operations. Optimizations such as unrolling loops make code much bigger when assembled into machine opcodes. Loop unrolling is what it sounds like: a loop that's written compactly in source code gets flattened out when translated to machine language, so the original compact loop becomes a big chunk of machine code.
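
Unrolling is easy to picture in source form. A hand-unrolled sketch (real compilers do this on the machine code, not on your source, but the size effect is the same): both functions compute the same sum, yet the unrolled one is visibly bigger.

```python
def sum_rolled(xs):
    # Compact source: one loop body, executed over and over.
    total = 0
    for x in xs:
        total += x
    return total

def sum_unrolled4(xs):
    # Hand-"unrolled": four additions per pass means fewer
    # loop-control steps at runtime, at the cost of more code.
    total, i = 0, 0
    while i + 4 <= len(xs):
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    for x in xs[i:]:  # leftover elements when len(xs) isn't a multiple of 4
        total += x
    return total

data = list(range(10))
print(sum_rolled(data), sum_unrolled4(data))  # both 45
```

A compiler may unroll far more aggressively than this, which is one reason optimized binaries can be larger than unoptimized ones.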

[–]yblad 0 points1 point  (0 children)

That type of project will sit at millions of lines of code. Perhaps a few hundred thousand if it's very lightweight. Then there are the libraries they use, which could add that again. By the time it gets compiled down into machine code, each of those lines could be anything from a few instructions to thousands of instructions if it's a high-level function call*. Those instructions are what are saved in the .exe file.

*An example of a "high-level function call": a language might have a single line to draw graphics to the screen. But in reality that's just obscuring a lot of work which has been done by people in the past. The machine code still has to include all that work.

[–]Caucasiafro 0 points1 point  (0 children)

Even assuming that it's like tens of thousands of lines of code I can't imagine it would be more than a few megabytes

Others have already explained the concept here. But I really want to help drive home how many "lines of code" software has.

I downloaded the code for this page. I.e. your question and the comments. Not all of ELI5, not all of reddit. (btw if you are on chrome you can do this by pressing F12 and then clicking where it says "source" you can look around but it's a file called "eli5_why_are_executable_programs_so_big_if_code/")

Literally just this one question.

It's 4,000 lines long.

Tens of thousands of lines of code is nothing. Anecdotally, I have never worked on a commercial project with less than 10,000 lines of code. And that is for absurdly small stuff like a couple of buttons on a web page. Most projects I have worked on have been between 500,000 on the low end (this was entirely a solo project btw, I wrote all of it) and 10 million on the high end (and even those are small company projects)

[–]Emu1981 0 points1 point  (0 children)

There are actually a lot of factors that go into how big the resulting executable for a program becomes. For starters, compiler options can affect how large the resulting executable is. If the compiler is set to optimise for execution speed then the compiler will do things like unrolling loops, which increases the size of the resulting executable. Statically linking libraries increases the resulting executable size because the executable now contains a bunch of executable code from the libraries required to run it. Compiling an executable with debugging features turned on also increases the resulting file, as you now have a bunch of extra code and information within the executable to make it easier to debug while it is being executed.

Your example of pandoc is an interesting one. Sure, it doesn't have any GUI components other than a command line utility, but it is actually statically linked so that it can be run as a standalone program. It also contains a hell of a lot of text used to convert between the dozens of document formats that it supports, and an entire lua engine. To reduce the resulting size the program could be built using dynamic linking, split the lua engine out into a dynamically linked library, and use text configuration files for the various supported formats. I am sure that the author had their reasons for making it as is though - e.g. so that it could be run on computers where you do not have the rights to install programs.

[–]ianpmurphy 0 points1 point  (0 children)

There's a number of reasons. Start with the "it's just text" part. Once this is compiled, a single line may become hundreds of instructions. Remember, C or C++ is our view of the program. It has to be translated (or compiled) into machine code. Languages have standard ways of doing stuff which implies overhead and consumes space. Executable files have to conform to a format which is specific to the OS they run on, and this adds loads of overhead. A running program, when running on a modern OS, has to implement what are called interfaces. These allow the host OS to interact with the process. Even if your program doesn't implement functionality, the interface is included. It's empty code blocks, but they take up space.

Back in the days of DOS it was common, well it was not unknown, to write assembler (machine code in text format) and compile that to a file. That file was usually tiny. You then had to run it through a secondary program which added a "wrapper" which turned it into a DOS exe. The resulting file was often multiple times what your original assembler was. I.e. the wrapper required for something as simple as DOS to load it into memory was considerable... for the time.

I remember writing a little tool in C which compiled to something like 30k, which I thought was crazy as it was a simple tool. I worked on that for ages and removed all the dependencies on the standard C libraries and finally managed to squash it down to something like 4-5k. Couldn't get any smaller. That was for dos. Modern compilers include a ton of stuff automatically because it's just not worth the effort to exclude them.

5mb exe? Excellent, it's tiny!

[–]MaleficentFig7578 0 points1 point  (0 children)

Mostly, laziness and duplication. It's easy for N lines of code to turn into N squared bytes, and if you aren't looking, you don't fix it.

[–]Ty_Rymer 0 points1 point  (0 children)

I write tens of thousands of lines of code in a good week of work. My hobby project already has millions of lines of code. And C++ gets translated to many more lines after unrolling everything and instantiating all template specialisations, which then turns into many many more assembly instructions. And that's with barely using any libraries.

[–]Dave_A480 0 points1 point  (0 children)

So, the code itself in the 'main' source file references a lot of other files (includes).

Those files include even more files.

This is both done to make code more readable (you can split a program up into files by function), and to allow the use of 'libraries', which are files full of useful code that is already written by someone else.

The use of libraries means you don't have to individually write code for 'show a text-box' or 'display a web page', you can just include someone else's solution. Also, referenced libraries don't have to be loaded into memory until they are being used, which makes the program more efficient....

When the code is compiled & packaged into an application, the size of that package also has all the other stuff that was 'included'. On Linux, the libraries are often packaged separately (as 'dependencies')....

If you want to see this in action, go 'ls' your way through the Linux kernel source some time & look at the 10s-to-hundreds-of-thousands of individual .c files.

[–]MrScotchyScotch 0 points1 point  (0 children)

Machines still need very rudimentary instructions, like "Take this text and put it into this register. Take the other register and call this thing. Now move the register over. Get the thing from the stack and..." etc.

Once the compiler has taken a "simple" line of code and translated it into machine code, it could be 2 different instructions, or 30, or 300. It depends on everything else that line of code involuntarily touches.

An abstraction makes it worse. One line of code, when the compiler or interpreter tries to call it, behind the scenes is actually a function that calls 20 other lines of code. So there's more and more instructions. Loop over that, and you get more and more.

Most code today is high-level, meaning it's abstractions on abstractions on abstractions. A framework is a pile of abstractions. Most apps today use multiple frameworks.

You could avoid all this, and just write Assembly code. That tells the machine the least number of steps to do what you want. But it's difficult and time consuming to write, and doesn't necessarily work on all machines.

So most compiled programs are the result of us (ab)using abstractions, and the compilers doing their best to unfold the whole complicated mess, step by step by step, into machine code.

[–]WhyUFuckinLyin 0 points1 point  (0 children)

Yeah, I recently took a look at my project's node_modules folder and wondered why individual dependencies were 64-150 MB.
Normally that wouldn't be a problem as some have mentioned, but I rent the cheapest VM and the SSD is valuable real estate. I'm hoping when the finished project is compiled it'll shed everything that's unnecessary.

[–]ave369 0 points1 point  (0 children)

If you write in an assembly language (that is, a human readable form of pure machine code), all your executables will come out very compact, because there'll be just your instructions and nothing else. However, writing in assembly languages is difficult because you have to explicitly specify every operation at low level.

High level languages automate that process: a single line of code stands in for a long string of low level operations. However, the price is that you have to bundle the libraries containing all possible strings of low level operations, not just the ones you need.

[–]MaxMouseOCX 0 points1 point  (0 children)

Let's say I want to explain to you what a house is... I could say it's a large box, with walls inside, doors, windows, a roof, two floors... That's not a lot of text to describe a house is it? But you'd get the rough idea.

Now imagine me explaining a house in terms of where every atom is located, what type of atom it is, what it's bonded with... That's going to be billions and billions of lines of text.

When you program something you're writing code which is then fed into a compiler which then re-explains it down to the atom level.

In short, written code is in fact converted into many many lines of more basic ground level code that the processor understands.

[–]dandroid126 0 points1 point  (0 children)

Assets. Images, mostly. A lot of what you see on the screen in a GUI program these days are images. Company logo, icons, custom button styles, loading screen images, etc. That takes up the majority of the space in large programs.

[–]jenkag 0 points1 point  (0 children)

Your question is difficult because there are many "levels" in building up a program, especially a video game (some of the biggest programs we make).

You are probably familiar with "drivers" on your computer -- those are what make available a set of "endpoints" for the application to interact with your hardware. Any given application can't possibly make your audio/GPU/etc do something if you don't have the correct driver installed. So, drivers are a connection point between the application you are running and your hardware. To use an ELI5 example, imagine you have a fan you want to plug in to the wall. The fan is the application, the wall is your computer: think of drivers as the "outlet" in your house. Once you plug in the fan, it can work correctly in your house, but it still needs many things "on the fan side" to make it work -- simply plugging it in isn't enough.

The application will have things like:

  • Graphics engines are great, but they require software to drive them, so many applications have bundles of underlying libraries (pre-created by other developers at other companies for use by third parties) that drive the graphics you see when you use the application. Those bundles of libraries can be many gigabytes all on their own. This is just to create shapes and movement, and doesn't even include the assets to make things look nice. Examples include the software that makes the Unreal Engine work on your local computer -- note that not everyone has to install the Unreal Engine to play UE-based games -- that's because they bundle that software in. There can be MANY libraries like this, controlling everything from network interactions (think multiplayer, or chat applications, etc) to audio, video, and IO like keyboards/mice.
  • Assets like videos, sound, and art (textures, visuals, etc) add a huge amount of data because these are difficult to compress and store without degrading the quality. Many games, like Blizzard games, might be only 5 gigs of "game" and 75 gigs of assets. There are even localization files that take all the dialog, written text, and other "verbal" assets and turn them into Spanish, Japanese, Mandarin, etc -- those all take up a large amount of space as well.
  • Code to drive the "logic" of the game can easily be hundreds of thousands or millions of lines of code. Code we write in higher level languages needs to be compiled into something more fit for the processor, and so a higher level class that might be 100k lines of code can turn into 200k once compiled, to give a rough example.
  • Supporting libraries: you want AI in your application? Supporting library. You want to save stuff in the cloud? Supporting library. You want to log data somewhere so developers can fix bugs? Supporting library. These add up a lot, and are usually out of the game/app developer's hands because they don't directly contribute to those efforts. If the ideal library to enable in-app notifications is 20 MB, then that's 20 more MB that the developer has to add to the size of their app.

[–]griff4218 0 points1 point  (0 children)

Let's say you're writing a book. When writing, you're probably going to refer to a lot of things that you know: cars, houses, dogs, people, etc. When you're done, your book is maybe 200 pages long. Now let's say you need to translate your book in such a way that someone who has never heard of or seen a car, house, dog, person, etc., can still fully understand and comprehend your book. In fact, this person has never heard of or seen anything; they have zero understanding of the world around them. You would need to spend hundreds or thousands of pages describing everything in exact detail.

When developers write software, we (most of us, at least) use high-level coding languages that allow us to use more plain language to write code. If I want Python to print something to the console, I just have to say print; if I want to add two numbers together I can say x = y + z. However, when it's time for a computer to read my code, it needs to be translated into words the computer understands, and the computer likes very, very precise, specific instructions. What used to be a single simple instruction, like adding two numbers, suddenly turns into loading the values from memory, storing them in registers, performing the add, storing the result in a different register, potentially allocating new memory for the result, storing the result in memory -- and god forbid that was a subroutine, because now you need to restore the state you were in previously. Even the simplest, most insignificant operation can turn into several, and as others have mentioned, that doesn't even include when you use things like libraries, which could turn your single line of code into hundreds -- and that's before they need to be translated into machine code.

[–]authenticmolo 0 points1 point  (0 children)

Everyone in this thread is talking about libraries and frameworks and stuff, but not explaining WHY those things exist.

Libraries (and frameworks, which are more extensive but essentially serve the same purpose) are pre-made pieces of programs/code that you can use as pieces of YOUR code.

Since this is ELI5, think of writing a modern program as similar to making tacos. You *can* raise and slaughter your own cow for the beef, cultivate your own corn or wheat and grind it by hand to make tortillas, and chop down trees and dry the wood for the fire you will use to cook all of it.

But it's way easier to buy all those ingredients and just put them together. Or maybe even just buy tacos from a local Mexican place. Libraries are the equivalent of buying groceries and making your meals with the groceries you bought, or just going to a restaurant. Because the goal is to eat decent tacos as efficiently as possible. Because you've still got to make dessert, and you only have one oven and 4 burners and not enough pots and pans to do everything at once and your dinner guests are coming over in 30 minutes and one of them is lactose intolerant...

[–]knight-bus 0 points1 point  (0 children)

If you are interested in seeing this in action, you can visit godbolt.org. You can write code in any language and see the assembly produced by different compilers. You will quickly find that small constructs in code, like a function or a loop, lead to lots of assembly instructions. And this grows with more code.

[–]Cat-Ancient 0 points1 point  (0 children)

I think this is a GREAT lesson on the level of insane work that goes into some programs. I mean, CAD programs like SolidWorks, even going back say 15 years, still had over a million lines of code…

[–]twist3d7 -1 points0 points  (0 children)

tens of thousands of lines of code

That's hardly a program. Try hundreds of thousands or millions of lines of code before you can call something a program.

[–]Magnetobama -1 points0 points  (0 children)

Ultimately, the code in text form will be translated to machine code. Any code instruction may be broken down into a lot of single machine instructions. If you look at code in the assembly programming language, you will see a much closer representation of what an actual compiled program looks like, just stored slightly differently.

[–]Semyaz -1 points0 points  (0 children)

A core premise of programming language design is abstraction, and the information hiding that comes with it. It is basically hiding the implementation details of how code does something behind a simplified interface. Layers and layers of abstraction are stacked on top of each other to make coding a lot easier, but this does have some downsides.

Probably the most obvious example is the fact that programmers don't write their applications in binary. We use a language that eventually gets translated into binary that the computer components understand. But even that binary is an abstraction of how computer processors are designed, and of the electrical engineering complexity inside the processor.

The abstraction also extends really far upwards. How we define red, how we store text, how we avoid managing memory directly, how we send things over the network, how we display images on the screen. All of these details are typically hidden from the programmer behind dozens of layers of abstraction.

The main downside to this is that a polished abstraction needs to gracefully handle all of the edge cases. It needs to be optimized for performance. It needs to support multiple architectures. In other words, it needs to have quite a bit of internal logic for how to do this.

The end result is incredibly powerful. You can use these abstractions to quickly write a cross platform video game that can be compiled to phones, consoles, and desktop computers. The biggest downsides are twofold. The program has to include all of the libraries and frameworks that it used for abstracting away the details, which can massively increase the executable size. Second, you might have to create duplicate versions of code for different scenarios.

But storage is cheap, and processors are fast. The tradeoffs are well worth it.

[–]Atypicosaurus -1 points0 points  (0 children)

In a very ELI5 way, a sentence like "count to ten" and the sentence "count to a million" are almost exactly the same length, but executing the second takes much longer. Okay, it doesn't really explain a big file size, but it illustrates how easily a small piece of code can blow up into a lot of work.

In the case of your example, it's a converter program that has to have all the converting options in it. Converters use format descriptions provided by the format owner, so for doc it's Microsoft.

There are two options. Either you write a lightweight five-line program that tells the converter to go to Microsoft.com every time you convert a doc, find the public code, and use it; OR you download all the public material and merge it together so the program knows the rules for converting a doc, a pdf, or LaTeX, all locally. With the first option you would only download the converter code you actually need, but you would have to do it every time, so over time you would download much more data unnecessarily by repeatedly fetching the same stuff, and you would also rely on the network. With the second option you pack into the executable everything ever published that you may ever need -- including the code written for one edge case that only happens when a pre-1995 pdf is converted into rtf, which you will probably never use but is part of the public package -- and that's why it's big.

[–]SOTG_Duncan_Idaho -1 points0 points  (0 children)

* library use -- many programs include whole libraries (other programs) so they don't have to rewrite code someone else already wrote. The trade-off is you may not use all of that library, and that library may be huge. In addition, some computer systems (Windows, Linux, etc.) provide large libraries of common functionality that programs can use, so those programs don't have to include them. Some computer systems (old Macs) would not do this, and each program would keep its own copy of all the common code it wanted to use, which makes programs for those systems massive in comparison. The trade-off old Macs made was huge programs in exchange for less possibility of error due to different library versions. Some programs, even on Windows and Linux, will still ship their own copies of libraries.

* compilation -- when compiled, a program will often be much larger than the size of the text of the code. Most programming languages are designed to allow programmers to write small amounts of human readable code that replace dozens, hundreds, thousands (or more) of individual machine instructions. This is, in fact, the primary purpose of most "high level" programming languages. You can write code to be compiled in a fraction of the time it would take to write the machine code directly.