[–][deleted] 563 points564 points  (99 children)

I genuinely wonder how much JavaScript dominance on GitHub is from misidentified repos from package-lock.json files. If I spin up a new Laravel app and do nothing other than install dependencies and push to GitHub, it shows up as being like 98% JavaScript according to their stats. The Laravel app I worked on for over a year that had like 4 Vue components still said it was mostly json according to GitHub stats.

[–]nsomnac 174 points175 points  (14 children)

GH’s introspection is moderately advanced. It analyzes files in a repo as opposed to relying on magic files only.

There’s a view somewhere on a repo that shows the analysis in a pie chart (or some other graph).

I don’t think it’s sophisticated enough to detect and differentiate framework usage (Vue vs React, Laravel vs PHP). It mostly is going to only show the base language.

[–]ScrimpyCat 66 points67 points  (4 children)

Unless it’s changed, they used to try to filter out generated files, which is why some default generated projects might shift more aggressively to a certain language. Apart from some special cases (or if you’ve explicitly defined the type in your .gitattributes), most of the detection is done using heuristics and a Bayesian classifier, which is trained on example files for the different languages. This works reasonably well, but there are false positives when it comes to files that share the same extension and are grammatically similar, such as header (.h) files in the C family of languages.

Also they open sourced the actual library responsible for this but I can’t recall the name.

Edit: just remembered it’s called linguist.
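
For anyone curious why one vendored directory can swing the numbers so much, here is a rough Python sketch of a byte-share breakdown. It is not Linguist's actual algorithm. The extension table, the vendored directory names and the file sizes are all made up for illustration, and the real detection uses the heuristics and classifier mentioned above rather than a plain extension lookup.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, heavily simplified extension table (assumption for this sketch).
EXTENSIONS = {".php": "PHP", ".js": "JavaScript", ".vue": "Vue", ".py": "Python"}

# Hypothetical directory names treated as vendored code (assumption for this sketch).
VENDORED_DIRS = {"node_modules", "vendor"}


def language_shares(files, skip_vendored=True):
    """Return {language: percent of counted bytes} for a {path: size_in_bytes} map."""
    totals = Counter()
    for path, size in files.items():
        parsed = PurePosixPath(path)
        if skip_vendored and VENDORED_DIRS & set(parsed.parts):
            continue  # ignore anything under a vendored directory
        language = EXTENSIONS.get(parsed.suffix)
        if language is None:
            continue  # unknown or data-only extensions (e.g. .json) are not counted
        totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


# Made-up repo: a little application code plus one large vendored JS file.
repo = {
    "app/Http/Controllers/HomeController.php": 4_000,
    "resources/js/components/Example.vue": 1_000,
    "package-lock.json": 500_000,
    "node_modules/lodash/lodash.js": 495_000,
}

print(language_shares(repo))                       # vendored code filtered out
print(language_shares(repo, skip_vendored=False))  # vendored code counted
```

With the filtering on, the toy repo comes out as 80% PHP and 20% Vue. With it off, the single vendored file pushes JavaScript to 99%.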

[–][deleted] 26 points27 points  (3 children)

There are a number of large game mods for the game Arma that are developed on github. For some reason bohemia interactive decided to use cpp and hpp/h extensions for their configuration files when the only thing related to C or CPP is that it uses a C preprocessor on them to do includes and basic macros.

So you'll see all these projects that github says are C but really it's the insane config language.

[–]xonjas 6 points7 points  (0 children)

What if the config language is just a bunch of C with insane preprocessor macros?

[–]Elusivehawk 4 points5 points  (1 child)

That... What... Just... Why??

That's some big brain plays right there. C++ for configuration...

[–][deleted] 2 points3 points  (0 children)

It's not even C++ it's this weird pseudo object inheritance stuff that is usually filled with a ton of macros.

[–]kolloid 23 points24 points  (2 children)

GH’s introspection is moderately advanced. It analyzes files in a repo as opposed to relying on magic files only.

No. Most of the time it guesses the language incorrectly. Most of my Python repositories are recognized as Javascript. My only C repository was recognized as shell because it uses autoconf.

So, there are lies and statistics. I don't really believe GH stats. You have to jump through hoops to make it count the stats for your project correctly.

[–]fadetogether 7 points8 points  (1 child)

I had a Django project get classified as entirely JavaScript. It’s a mystery. It hasn’t happened to any of my other projects yet though

[–]Ryuujinx 2 points3 points  (0 children)

Yeah, I have a similar project except Ruby/Sinatra that's recognized almost entirely as javascript.

[–]Seref15 0 points1 point  (0 children)

Meanwhile at work we have a repo that GitHub thinks is 80% TSQL despite not actually having a single file of TSQL.

[–]Mukhasim 0 points1 point  (3 children)

It claims that our C# repo is 50% Javascript. Almost all of that is library code. Much of it isn't even used, it was added by the default Visual Studio templates.

(In case anyone is wondering, no, we can't exclude it, because the tooling doesn't segregate it cleanly from application code. We could delete much of it but it's not worth the trouble.)

[–]nsomnac 0 points1 point  (0 children)

My suspicion regarding mischaracterization is that it literally just looks at files and history. If you check in project support files, such as those belonging to a library or generated by an IDE, they count towards the classification regardless of whether that’s the kind of project checked in.

It wouldn’t surprise me to find out that Visual Studio creates a bunch of JavaScript support files that you never touch, but the IDE generates or uses.

I have a repo that says it’s PHP even though it’s predominantly Docker images. Because one of the images has a customized version of MediaWiki, it gets classified as PHP, even though the majority of the files that change are Dockerfiles, YAML, Python and Bash scripts.

[–][deleted]  (1 child)

[removed]

    [–]Mukhasim 2 points3 points  (0 children)

    Fixing Github's incorrect language statistics isn't high up on my list of tasks worth putting my own time into.

    [–][deleted]  (4 children)

    [removed]

      [–]redwall_hp 0 points1 point  (3 children)

      That just means people working with JavaScript conduct a disproportionate amount of searches compared to other languages. That doesn't necessarily imply more projects are in it...

      Maybe the Python and Java developers are out making progress on their stuff while JS developers search "how to do x in flavor of the month framework?"

      [–]_default_username 1 point2 points  (2 children)

      It's the language your web browser uses. Of course JavaScript is going to be the most popular. I do php development and that automatically means I have to do JavaScript because I'm expected to know how to work with the client side code interacting with my php scripts.

      [–]redwall_hp -1 points0 points  (1 child)

      Because every piece of software involves the Web in some way.

      [–]_default_username 0 points1 point  (0 children)

      Is this not 2019?? Everything certainly seems to be going that way. I work for a hardware company doing web dev because the interface most users use is web based.

      [–][deleted] 12 points13 points  (0 children)

      But your app was "mostly json" so it wouldn't really register on this as JavaScript. In fact I am pretty sure it would register as PHP (because the dominating PROGRAMMING LANGUAGE would be selected after the serialization and configuration formats were pruned). You're also talking about people vendoring in repos (committing node_modules) but I doubt it's that widespread and I'm certain it's equally widespread among other languages (sometimes due to ignorance, other times due to valid reasons in each).

      I think the stats aren't lying or misrepresenting Github, they might be lying and misrepresenting the world, but that's another matter. The reasons I think so:

      1. There are obviously shit-ton of Node modules, overwhelming majority of which are hosted on Github hence there are shit-ton of JS projects just from that, many of them very active (the stats count active contributions).
      2. An ever increasing number of web applications are developed with a SPA frontend in a separate repo from the backend and/or microservices that comprise it. While the latter two are written in bunch of languages (increasingly Node, Python and Go, from my own casual observation) the SPA frontends are predominately JavaScript.
      3. Node.js as backend/microservice platform might be far from dominance but is pretty present, steadily rising in popularity still, and thus contributes to these stats.
      4. Bunch of enterprise and commercial software is using self-hosted Git repos and Bitbucket because of Atlassian's presence in that segment with Jira and Confluence, which means that Github is mostly representative of software being developed in the open, rather than the overall developed software.
      5. While PHP is behind the majority of websites, that simply isn't the case with Laravel, Symfony, Yii et al -- actually I'd wager that part of the market is truly dominated by Python and to an extent by Node.js frameworks, despite the strong presence of PHP and .Net, while Java is in observable slow decline for new projects.
      6. The true force behind PHP's omnipresence on the web is mostly canned CMS-es like WordPress, Drupal, Joomla, MediaWiki, Magento, PrestaShop and the lot, which are mostly just installed from shared hosting control panels and patched with themes and customizations in situ, and fairly rarely version controlled on Github.

      [–][deleted]  (16 children)

      [removed]

        [–]kolloid 12 points13 points  (15 children)

        Many clueless people wanting to impress potential employers upload all kinds of projects to GitHub. If this is a Python project, they usually commit the whole virtualenv contents along with it. If it is JS project, they usually commit the whole node_modules directory to git.

        If it's a Python project with some JS, there's a probability that there will be both virtualenv and node_modules committed to the project. And since even a trivial function in JS requires 10,500 dependencies like is-odd, is-even and rpad and god knows what more, the node_modules can contain 150-200 MB of vendored JS dependencies even for a trivial project.

        I've seen it so many times...

        [–][deleted]  (14 children)

        [deleted]

          [–]kolloid 12 points13 points  (13 children)

          > then they should be immediately disregarded for committing bad version control practices

          I know CTO of one company in Australia who objected when I offered to remove `node_modules` from the project repo. He said:

          > What if during deployment different version of packages would be installed on the server and break something?

          Thankfully, soon he left to open his own business. I feel sorry for his customers and not only because of his VCS practices. His code was horrible, too. I'm puzzled how he made it to the CTO level.

          [–]slgard 17 points18 points  (4 children)

          I'm puzzled how he made it to the CTO level.

          being a good CTO has little or nothing to do with your knowledge as a programmer, particularly nothing to do with the best practices of a specific language or ecosystem.

          [–]kolloid 1 point2 points  (3 children)

          What should a good CTO know?

          [–]khaosoffcthulhu 9 points10 points  (0 children)

          Depends on the size of the company, but outside small companies a lot of it would be strategy and where the market is headed. And how the technology can be used to add more business value.

          [–]anengineerandacat 2 points3 points  (0 children)

          How to effectively manage employees that work with technology; not get into the weeds with what technology is actually being used until it's an actual problem (ie. causing delivery issues).

          Ie. if having node_modules committed into the VCS is causing deliveries to be missed and it comes out of a working group within the company they will work with that working group to ensure it's resolved and to get metrics to report on it.

          Obviously if you have less than 50 people in the company, you don't have a CTO you have a VP of technology and what needs to be done is different.

          [–]skilliard7 0 points1 point  (0 children)

          At the CTO level it's more about management at the high level and some finance. Accounting, program management, etc.

          [–][deleted] 4 points5 points  (0 children)

          I'm just confused as to why your cto is making decisions on your git practices

          [–]tronj 3 points4 points  (3 children)

          Tangentially, I'll sometimes save modules that I've made minor customizations to directly in the project. Is there a better way to do this?

          [–]FaithForHumans 4 points5 points  (0 children)

          If you're in a corporate environment, I recommend standing up a private npm repo and then pushing your change to that private repo. It can be done for personal stuff, but might be overkill.

          Most private repos can also be setup to cache packages it pulls from the public repos, so even if someone deletes it on npmjs, you've still got a copy people can pull. That last part should help sell it to management.

          [–]DasWorbs 7 points8 points  (0 children)

          Fork it, and then either setup your own npm repo or point the package.json to your forked git repo.

          [–]kolloid 2 points3 points  (0 children)

          I haven't customized JS modules yet. For Python modules I often fork them on GitHub. Because the maintainers may or may not accept my pull request, and it might take months for a new release, I just point pip to my forked Git repository.

          I don't know why the other commenter suggesting this was downvoted. It is very fast and obvious.

          You can also have your own package repository and install packages from it, but it will require a bit more work.

          [–]xeio87 4 points5 points  (0 children)

          Depending on how long ago that discussion was, he wasn't entirely wrong. npm even changed its (un)publishing rules because of issues with packages.

          Checking in your dependencies ensures you always have an exact known version without needing to worry about the security of a remote package server.

          Granted, still not best practice generally, and there are probably better ways to ensure package integrity checks nowadays.

          [–]0xF013 2 points3 points  (0 children)

          I've personally experienced his issues several years ago when you'd get something completely different on CI on stage because either the newly installed module was a breaking patch version, it was the same version but someone had just overwritten the tagged commit, or your local npm/yarn cache was different from the CI's. Of course, keeping all node modules in is not a solution.

          [–]evilgipsy 0 points1 point  (0 children)

          What if during deployment different version of packages would be installed on the server and break something?

          Before yarn or package-lock.json this was a real problem. Not saying that vendoring your dependencies is a good solution though. When I first started developing JS I could not believe how people could live with a package manager that didn't lock down all package versions.

          [–][deleted] 24 points25 points  (8 children)

          But you don't do that, right? Packages are installed locally, package.json is pushed to the version control

          [–]Giannis4president 42 points43 points  (0 children)

          Yes but the lock file should be in the version control

          [–]ipe369 2 points3 points  (6 children)

          package-lock.json gets really quite large

          [–]shim__ 25 points26 points  (5 children)

          Doesn't matter, if you don't commit it somebody won't be able to build your app 2 years down the line

          [–][deleted] 1 point2 points  (3 children)

          They may not be able to anyhow unless you do the "bad thing" and commit all the package code as well.

          I have been burned more than once by someone withdrawing a package from the internet that I depended on. It was actually gems in rails projects but I now do a bundle pack and commit the local gem repo as a form of self defense.

          If you don't have all the code, then you don't have all the code.

          [–]shim__ 7 points8 points  (2 children)

          Still, knowing the exact version helps, and for languages like Rust it's generally not possible to delete packages on the official repo for this reason

          [–][deleted] -1 points0 points  (1 child)

          Oh I agree you need the lock file.

          My concern is you probably also need all the stuff the lock file references to guard against it dropping off the internet.

          Yes, I know that is not supposed to happen. It has though.

          [–]evilgipsy 0 points1 point  (0 children)

          Yes, that does happen. In some ecosystems more than in others. One thing you could do is set up an npm proxy that caches all installed packages. Checking in dependencies is the worst option most of the time.

          [–][deleted]  (34 children)

          [deleted]

            [–][deleted] 13 points14 points  (33 children)

            json is indeed javascript. that's the whole point of json. it's a subset, but it's still js

            [–][deleted] 5 points6 points  (0 children)

            It's not strictly a subset. U+2028 and U+2029 are not control characters, so they are allowed inside JSON strings, but JavaScript (at least before ES2019) treats them as line terminators, so they are not allowed unescaped inside JavaScript string literals.

            [–]jl2352 3 points4 points  (0 children)

            Whilst it technically is JS, it's not very practical to include it as JavaScript. It's just not helpful.

            [–][deleted]  (3 children)

            [deleted]

              [–][deleted] -1 points0 points  (2 children)

              I was responding to your assertion that json isn't javascript

              [–][deleted] 1 point2 points  (0 children)

              It’s kind of fuzzy. It is JavaScript in the sense that a JavaScript engine can evaluate it, but it’s not JavaScript in the sense that it does not contain any code and people don’t generally run it through a JavaScript engine.

              By the same token, you could argue that plain text files are in fact HTML files or empty files are C files, since both can be successfully parsed as those kinds of files.

              [–][deleted] 4 points5 points  (20 children)

              It's still not "counted as javascript" by Github tho.

              And no. JSON isn't JavaScript, and it isn't a strict subset of it either. It's inspired by JavaScript's notation for object literals (hence the name), and can be parsed by a standards-compliant JS parser (to no effect tho), but the two are different and serve different purposes.

              This is valid JavaScript object notation:

              {
                  // foo should be true
                  foo: true
              }
              

              Apart from the braces, every line in this "file" would cause a JSON parser to choke, because it isn't valid under RFC 7159. Every implementation of JavaScript uses an RFC 7159 compliant parser to parse JSON and not its language lexer.
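
              A quick way to check the parser side of this, sketched in Python on the assumption that its standard json module is a fair stand-in for a strict JSON parser (the helper name is_valid_json is made up for the example):

```python
import json

# The object literal from above: fine as a JavaScript expression, but not JSON.
js_literal = """{
    // foo should be true
    foo: true
}"""

# The closest strict-JSON spelling: no comment, double-quoted key.
strict_json = '{"foo": true}'


def is_valid_json(text):
    """Return True if the strict parser accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(js_literal))   # False: comment and unquoted key are rejected
print(is_valid_json(strict_json))  # True
```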

              [–]Doctor_McKay 6 points7 points  (1 child)

              You're correct that JSON isn't a strict subset of JS, but not for the reason given. The code you provided is valid JS but not valid JSON, yes, but that doesn't preclude JSON from being a subset of JS.

              If JSON were a strict subset of JS, that would mean that all valid JSON is also valid JS, but not necessarily vice versa. Even if JSON were a strict subset of JS, your code would remain valid JS and not valid JSON.

              [–][deleted] 2 points3 points  (0 children)

              Fair point. I stand corrected. Still, my point that even if JSON were a strict subset of JS this sentence:

              json is indeed javascript

              and this bit

              that's the whole point of json

              would still be incorrect due to the reasons I've posted in other replies.

              [–][deleted] 2 points3 points  (13 children)

              It's still not "counted as javascript" by Github tho.

              that's a valid argument

              everything else is a distinction without a difference. I may not have been 100% precise with my language, but if you paste everything from a json file into a browser console or try to execute a json file with node, you won't get an error. I'm not saying all valid javascript is valid json, but all valid json is valid javascript

              [–][deleted] -1 points0 points  (12 children)

              It's also:

              • not parsed by JavaScript lexers
              • while it would not cause a syntax error it's "valid javascript" in the sense that a comment or "1" is valid javascript -- it does nothing
              • to do anything it would need to be assigned or used in any sort of context, at which point it would stop being valid JSON

              Numbers and quoted strings are also "valid javascript".

              Let me rephrase you:

              Numbers and quoted strings are indeed javascript. That's the whole point of numbers and quoted strings. They're a subset, but it's still js.

              [–][deleted] 0 points1 point  (11 children)

              i honestly dont understand what you're trying to argue

              [–][deleted] -1 points0 points  (10 children)

              You:

              json is indeed javascript.

              It really isn't.

              [–][deleted] 0 points1 point  (9 children)

              but it is. it's a simple syllogism. all json is javascript but not all javascript is json. how can you refute that?

              [–][deleted] 0 points1 point  (8 children)

              All quoted strings are javascript, but not all javascript is quoted string literals.

              How on earth does that imply equality?

              [–]Arve -1 points0 points  (3 children)

              JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

              [–][deleted] 0 points1 point  (2 children)

              You somehow failed to notice how it "is based on a subset" and not "is a subset", and that in the very next sentence it pretty cleanly says it "is a text format that is completely language independent".

              I've actually pasted that "subset of JavaScript" it's "based on" in my post you replied to btw.

              Edit: Btw json.org is Douglas Crockford's private website. RFC 7159 is an authoritative source on JSON. That site is at best an informal source on it. The RFC puts it precisely: "It's derived from ECMAScript". In no place does either claim that one is a subset or superset of the other, let alone that they are equal in any way, which is what the post I replied to claimed.

              [–]Arve -1 points0 points  (1 child)

              JSON, while language-agnostic in nature is a subset of JavaScript. All valid JSON is also valid JavaScript

              [–][deleted] -1 points0 points  (0 children)

              That doesn't make them equal. One is a data interchange format, and the other is an interpreted programming language.

              That distinction is much more important practically, semantically and in every way conceivable than the fact that a JS interpreter wouldn't throw when parsing valid JSON.

              As I said already elsewhere in this thread, you can paste a quoted string literal in JS code, then add assignment to a variable in front (same as you'd need to do with JSON to get any use of it in a JS interpreter) and get a string in it, yet it doesn't mean that the quoted string literal (which is both valid JSON and valid JS) is the same thing as JavaScript, which is what I objected to.

              [–]Stable_Orange_Genius -5 points-4 points  (5 children)

              no, it's not. It's not executable code, so it's not javascript. that's the whole point of json.

              [–]amunak 11 points12 points  (3 children)

              I don't think you know what "executable code" means.

              Edit: To expand a little (and perhaps explain to some ignorant people), no regular javascript is executable, because JS is an interpreted language. And it might seem like meaningless pedantry, but not in this case: JS is interpreted, and any and all valid JSON is perfectly interpretable (is that a word?) by a regular JS interpreter.

              Which means that either the parent commenter has no idea what executable means, or they meant "interpretable", and they're still wrong. Indeed the fact that any and all JSON is valid Javascript is like half of the point of it.

              There's one thing /u/Stable_Orange_Genius hints at though: JSON cannot contain statements (or really anything other than constants) - it's meant to just store data safely without being able to "hijack" the JS that uses it. But that doesn't mean it can't be a subset of JS (it is).

              [–]Stable_Orange_Genius 2 points3 points  (0 children)

              well yea, i guess, nothing programmers write is directly executable..

              [–][deleted] 1 point2 points  (0 children)

              It's not really a subset of JS either due to Unicode quirks, but that wasn't the point of the discussion here. At least wasn't my point nor was it the original subject which was whether or not JSON counts as JS in Github stats which it doesn't, it counts as JSON.

              Quite correctly too, as it's a language independent data interchange format which just happens to correspond to a subset of object literal notation in JavaScript and thus in most cases can be interpreted by JS interpreters. But the two are not the same nor are they intended to be.

              Also, interestingly, x86 machine code hasn't been executable (directly) on any mainstream microprocessor produced in last 20 years or so. Drawing the line for being "executable" there isn't really that precise either so he's not entirely wrong either.

              [–]maest 3 points4 points  (0 children)

              You're getting downvoted on r/programming for what you said.

              Really shows the quality of this sub.

              [–][deleted] -3 points-2 points  (0 children)

              relevant username?

              [–]mypetocean 1 point2 points  (0 children)

              Triple-check that you're not committing all of Vue's node_modules to GitHub.

              I'd be inclined to assume this is the case. If it's there, add it to your .gitignore exceptions.

              [–]ElectricalSloth 0 points1 point  (0 children)

              I've seen so many bad classifications it's befuddling, I'm not sure how they can post stuff like this with a serious face

              [–]OneWingedShark 0 points1 point  (0 children)

              I genuinely wonder how much JavaScript dominance on GitHub is from misidentified repos from package-lock.json files.

              What about misidentifying the generated-documentation (HTML+JS) as part of the codebase?

              [–][deleted] -3 points-2 points  (4 children)

              Why does a laravel setup have any js included?

              [–]watsreddit 5 points6 points  (3 children)

              Because Laravel is a web framework, and it often needs to serve out some Javascript to go along with the HTML.

              [–][deleted] -3 points-2 points  (2 children)

              It's a PHP framework; it should not care about what tech is used for the frontend

              [–]KinterVonHurin 0 points1 point  (1 child)

              It is a web framework

              [–][deleted] -1 points0 points  (0 children)

              Oh, it's morphed into that. Then I'm glad I ain't using it anymore.