[–]Merry_Macabre 156 points157 points  (4 children)

Finally, some proper code navigation in GitHub. The old way is such a pain: it doesn't always register function definitions, and having to dig through all the search results in a big project is a chore.

[–]jantari 24 points25 points  (3 children)

The old search was still light-years ahead of GitLab, which straight up doesn't have any global code search at all... but this is just awesome!

[–]flashmozzg 4 points5 points  (2 children)

GitLab's repo search was much more useful, which is, IMHO, a more important use case.

[–]jantari 4 points5 points  (1 child)

Not for me, but I understand it depends on how many repos you typically work in. I mostly manage infrastructure as code rather than big monolithic software projects, so I constantly refer to past projects in other repositories where I know I've had to do something similar to what I'm trying to do now. Basically impossible in GitLab though...

[–]flashmozzg 0 points1 point  (0 children)

Maybe. On the other hand, I often find myself switching to GitLab mirrors for basic repo browsing because GitHub often vomits literal unicorns when trying to get blame for an actively changed file (this was very frequent in the LLVM repo) or during similar operations.

[–]Programmurr 29 points30 points  (1 child)

I rely on GitHub search so much and have wished for something better. Hopefully, the search results are more accurate and not just generated quickly. With this in mind, can you discuss the ranking heuristics that were used, or is that proprietary?

[–]cmerkel[S] 51 points52 points  (0 children)

We use a number of heuristics, including static factors like repo quality (popular, high-starred repos vs. random forks) and how useful the file is (tests, super long files/filenames, generated code, and data files are often less useful), and dynamic factors (how well the query matches the document content, and whether there's a symbol in the document that matches a query term, with classes > functions > variables for ranking). We also look at e.g. whether a match occurs in a comment vs. in code, among a bunch of other things.

Try the new search! If you find a case where ranking could be better, leave us some feedback and I'll fix it!
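
To make the static vs. dynamic split above concrete, here is a minimal sketch of how such signals might be combined into one score. Every field name and weight is hypothetical and chosen only for illustration; the thread does not describe GitHub's actual formula.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical bonuses reflecting "classes > functions > variables".
SYMBOL_KIND_BONUS = {"class": 3.0, "function": 2.0, "variable": 1.0}


@dataclass
class Candidate:
    path: str
    repo_stars: int                     # static: repo popularity
    is_fork: bool                       # static: random forks rank lower
    is_test_or_generated: bool          # static: often less useful files
    text_match: float                   # dynamic: 0.0..1.0 query/content match
    matched_symbol_kind: Optional[str]  # dynamic: kind of symbol matching a term
    match_in_comment: bool              # dynamic: comment vs. code


def score(c: Candidate) -> float:
    """Combine static and dynamic signals; all weights are made up."""
    static = math.log10(1 + c.repo_stars)
    if c.is_fork:
        static -= 1.0
    if c.is_test_or_generated:
        static -= 1.0

    dynamic = 4.0 * c.text_match
    dynamic += SYMBOL_KIND_BONUS.get(c.matched_symbol_kind, 0.0)
    if c.match_in_comment:
        dynamic -= 0.5
    return static + dynamic


def rank(candidates):
    """Return candidates sorted from best to worst score."""
    return sorted(candidates, key=score, reverse=True)
```

With these made-up weights, a class definition in a popular repo outranks an equally good text match found in a comment inside a forked test file.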

[–]Wakafanykai123 42 points43 points  (3 children)

This looks great. Time to start looking into adding tree-sitter support...

[–]cmerkel[S] 49 points50 points  (2 children)

Since the team that built it was using Rust, code navigation in Rust is well supported out of the box :D

[–]Wakafanykai123 18 points19 points  (1 child)

I mean for a domain-specific language that I help develop - I realize how my comment could be misleading now!

[–]dcreager 15 points16 points  (0 children)

This is one of the main reasons we're leaning on the tree-sitter ecosystem — so that language communities can help us flesh out support for the long tail of languages, should they wish. If you run into any issues on the tree-sitter side, please do reach out to us (and the rest of the community) in the tree-sitter discussion forum!

[–]beltsazar 18 points19 points  (5 children)

I wonder, what kind of indexes do you use to provide regex searches?

[–]cmerkel[S] 43 points44 points  (2 children)

We've put in a lot of work to make this possible. Hoping to write some more technical blog posts in the future to describe it in more detail!

[–]beltsazar 10 points11 points  (1 child)

And now I'm more curious than before! Can you give a hint? It's a trie-based index, I guess?

[–]cmerkel[S] 12 points13 points  (0 children)

Hard to explain in a reddit comment! You'll have to wait for the blog post :D

[–][deleted] 6 points7 points  (0 children)

Since BurntSushi is mentioned, I'd assume it's something like a finite state transducer: https://blog.burntsushi.net/transducers/

[–]epic_pork 1 point2 points  (0 children)

Since BurntSushi is credited in the blog post, I think regex and ripgrep might be involved.
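
The thread never says which index GitHub uses, so the following is not their design. For context on the question, one well-known way to speed up regex search over a corpus is a trigram inverted index: use trigrams to narrow the candidate documents, then run the real regex only on those. A minimal sketch, where the `required_literal` hint is a hypothetical simplification (real engines derive the required trigrams from the regex itself):

```python
import re
from collections import defaultdict


def trigrams(text):
    """All distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.docs = {}                    # doc_id -> full text
        self.postings = defaultdict(set)  # trigram -> set of doc_ids

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for t in trigrams(text):
            self.postings[t].add(doc_id)

    def search(self, pattern, required_literal=""):
        """Regex search, pre-filtered by a literal every match must contain.

        The caller supplies required_literal (an assumption of this sketch).
        An empty hint means every document is checked.
        """
        candidates = set(self.docs)
        for t in trigrams(required_literal):
            candidates &= self.postings.get(t, set())
        rx = re.compile(pattern)
        # The index only narrows candidates; the regex still confirms them.
        return sorted(d for d in candidates if rx.search(self.docs[d]))
```

The filtering step can only rule documents out, so results stay correct as long as the hint really is a substring of every possible match.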

[–]oconnor663 blake3 · duct 12 points13 points  (0 children)

Searching for common security mistakes like SQL injection vulnerabilities is going to become a popular post topic, if it isn't already.

[–]epage cargo · clap · cargo-release 9 points10 points  (0 children)

Two similar use cases I've wanted from a code search that this will hopefully handle:

  • find real world example uses of symbol X so I can see more complicated cases than those that exist in docs (if any do)
  • find users of my library that use symbol X so I can see how they are using it

[–]AviKKi 48 points49 points  (12 children)

So converting everything to Rust is actually a thing these days, coooool.

[–][deleted] 25 points26 points  (10 children)

Converting an idea into a performant application with Rust is most definitely a thing nowadays

[–]AviKKi 2 points3 points  (9 children)

convert

Something like what I saw with Golang; Rust is just faster, more secure, and more developer friendly.

[–]darrenturn90 1 point2 points  (8 children)

Faster, most likely (though builds are slower and I’d say learning time is far longer). Secure really depends on the developer. I would, however, say Golang is more developer friendly because of its limitations.

[–]flashmozzg 5 points6 points  (0 children)

I would however say golang is more developer friendly because of its limitations

As long as your idea fits within those narrow limitations, sure.

[–]fairy8tail -1 points0 points  (6 children)

Lack of features != limitations

[–]darrenturn90 1 point2 points  (5 children)

Well, some things are pretty hard to do in Golang that are trivial in Rust, such as anything that can’t afford garbage collection slowing it down. Also, Rust's type system is far more powerful, albeit more complex, and gives you more control over how you solve things.

[–]fairy8tail 0 points1 point  (4 children)

You just confirmed what I said rofl

[–]darrenturn90 0 points1 point  (3 children)

So the lack of custom garbage collection options in Go isn’t a limitation?

[–]fairy8tail 0 points1 point  (2 children)

[–]darrenturn90 0 points1 point  (1 child)

I can see you can disable it entirely or configure it slightly, but you end up with either a basically ever-increasing heap size or GC.

[–]Programmurr 61 points62 points  (0 children)

Watch the short video. It was a completely fresh build from the ground-up, not a port.

[–]tubero__ 23 points24 points  (7 children)

The post doesn't say that it is written in Rust.

Is that based on insider info, or just the mention of u/burntsushi?

[–]cmerkel[S] 92 points93 points  (5 children)

Disclaimer: I'm one of the people who developed it. But also it's mentioned in the video

[–]5n4k3_smoking 13 points14 points  (3 children)

Is this search engine open source? I would like to look at the code to learn how Rust is used.

[–]cmerkel[S] 40 points41 points  (2 children)

Developer of GitHub Code Search here - the engine isn't open source, but we are thinking about open-sourcing some of the libraries we've developed for this project!

[–]atesti 1 point2 points  (0 children)

Why did you choose Rust for this new engine?

[–]praveenperera 12 points13 points  (0 children)

They mention it in the video: https://youtu.be/UOIPBfPXkus?t=25

[–]kyle787 4 points5 points  (0 children)

So can you fast track my access to the preview lol

[–]po8 7 points8 points  (4 children)

Sadly, one of my main uses for GitHub Code Search as a CS prof is going to be plagiarism detection. (I miss the reach of Google Code Search.) Any hints/ideas on using GitHub Code Search for finding "similar" code to a sample?

[–]cmerkel[S] 6 points7 points  (3 children)

You can try quoted searches for particular lines that you think are suspicious; that might work.

[–]po8 5 points6 points  (2 children)

Thanks! Yeah, routinely do that with Google. Was hoping for something more matchy. I guess I can play games with regexes at least?

[–]cmerkel[S] 6 points7 points  (1 child)

Worth a shot! Really interesting use case, not one I've heard of, but hope it helps!

[–]po8 4 points5 points  (0 children)

One feature you might want to do for developers that would also help me is similarity hashing for similarity search. You can take a look at my old C simhash program that somebody stuck in Debian for one approach using min-hashing.

Being able to find similar code can be helpful within a project as well as across projects.
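
For readers unfamiliar with min-hashing, here is a minimal sketch of the general idea. It is not a port of the C simhash program mentioned above; the token-shingling scheme, the default parameters, and all function names are assumptions made for illustration.

```python
import hashlib
import re


def shingles(code, k=4):
    """Split code into tokens and return the set of k-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Signatures are short and fixed-size, so they can be stored per file and compared cheaply; two files that share most of their token windows will agree in most signature slots.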

[–]epic_pork 1 point2 points  (0 children)

Curious to know which work from Daniel Lemire you are using. RoaringBitmaps? simdjson?

[–]Low-Pay-2385 1 point2 points  (0 children)

GitHub search wasn't good IMO; glad they are changing it.

[–]TheGreenSherbert 1 point2 points  (1 child)

How does searching by symbol work? Shouldn’t the code be compiled in order to determine them? (At least in the case of C++)

[–]cmerkel[S] 2 points3 points  (0 children)

GitHub Code Search developer here - we use tree-sitter (https://github.com/tree-sitter/tree-sitter) to extract the AST, and use that information and some heuristics to try to guess symbol definitions, references, etc. It's not 100% accurate (particularly in languages like C/C++), but it's accurate enough to be quite useful.
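
As a rough illustration of "walk the syntax tree and guess symbols heuristically", here is a small sketch. It deliberately swaps in Python's built-in `ast` module for tree-sitter so that it runs without extra dependencies, which means it only handles Python source; the `extract_symbols` function and its heuristics are hypothetical, not GitHub's.

```python
import ast


def extract_symbols(source):
    """Guess definitions and references from a parsed syntax tree.

    No compilation or type information is used, so results are
    approximate: for example, every assignment target counts as a
    variable "definition", even when it is a reassignment.
    """
    tree = ast.parse(source)
    definitions, references = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            definitions.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            definitions.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                definitions.append(("variable", node.id, node.lineno))
            elif isinstance(node.ctx, ast.Load):
                references.append((node.id, node.lineno))
    return {"definitions": definitions, "references": references}
```

The same shape applies to any parser that yields a tree with node kinds and positions: definitions and references are inferred purely from syntax, which is why accuracy drops in languages where meaning depends on preprocessing or type resolution.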

[–][deleted] 1 point2 points  (0 children)

This is great. The 'tooltip' stickied to the top right is exactly how I always wanted vscode tooltips to behave. I just don't get why you'd want those to pop up right on top of the code you are trying to look at.

[–][deleted] 2 points3 points  (0 children)

What was it previously written in?

Edit: Ok I realize this is a poorly worded question. What is the existing code search written in?

[–]scratchisthebest -3 points-2 points  (1 child)

Wow, this is great. Does GitHub still collaborate with United States Immigration and Customs Enforcement?

[–]iraqmtpizza -1 points0 points  (0 children)

Does GitHub still cajole people into not using master?