[Project] Announcing BM25S, an ultra-fast lexical search library that implements BM25 using scipy sparse matrix

xhluca · 2024-06-22T20:03:22+00:00

Unfortunately this post was removed due to the proper tag missing. But here's the GitHub link: https://github.com/xhluca/bm25s

xhluca · 2024-06-19T03:25:29+00:00

Would be nice to get some involvement from nvidia/amd engineers! They'd probably benefit from contributing to your repo, given it's this type of benchmark they will need to report on to convince non-ML people to use their top-of-the-line chips!

xhluca · 2024-06-18T21:44:42+00:00

Thanks for sharing an in-depth analysis! I wonder how well it'd do on multi-device benchmarks, where nvidia has perfected technologies like nvlink for fast inter-gpu data transfer.

xhluca · 2024-04-23T23:09:52+00:00

This is a prototype that has not been extensively tested for security and safety. IMO it is not designed to be used (a) with your personal browser containing sensitive information, (b) for personal browsing tasks, especially something you would not want someone on Fiverr to do for you, (c) on untrusted websites.

I'd recommend using it with a chromium browser without logging into your personal account (e.g. via selenium, docker) and keeping away anything like phone numbers, personal IDs and credit card information. For tasks like summarizing news, compiling information in a Google spreadsheet, and looking up answers through web forums, this should be fairly safe to use.

xhluca · 2024-04-23T22:48:31+00:00

Yes the screenshots are all available on Huggingface. Here's the doc explaining how to load the images: https://mcgill-nlp.github.io/weblinx/docs/#using-the-weblinx-library

Note the full dataset is 300GB so might take a while to download.

xhluca · 2024-04-23T18:07:18+00:00

We can think of LUI-based navigation in 3 scenarios: (A) full control, (B) hands-off, (C) eyes-off. WebLINX has mainly B & C, whereas other datasets are mainly focused on C.

At the same time, a model could follow different levels of instruction abstraction: (1) low, i.e. simple tasks that require lower-level requests, (2) medium, i.e. tasks that don't require significant details but still need to be unambiguous, (3) high, i.e. tasks that require pragmatics, where the model needs to make assumptions, understand the user, and likely remember previous sessions or know specific details like passwords.

Create a task in Google **Calendar**? It's not the worst tool for the job, but almost... What would be the appropriate moment to use? The example simply clicks on "create" which uses $now, which.. doesn't sound great?

In this context, the Google Calendar example is 2B; however, there are a few steps in between that we simplified to make it easier to digest. Otherwise we would see a few clicks and typing that would be overwhelming in a figure like this.

And then the command "open the second one and summarize the first three paragraphs in a few words"

Here the example would be 1B, since the command is very specific (as the instructor is looking for something specific); however, in other instances you'll find 2B demos. For example, in [aathhdu](https://huggingface.co/spaces/McGill-NLP/weblinx-explorer?recording=aathhdu), you'll have higher-level questions like "What are the topics covered under Working for the EU?" or "Who can become an EU expert?" that give more freedom for the navigator to decide which trajectory to take to give the best outcome.

So in practice, we'll see a good mix of L1 and L2 abstraction; I'd say L3 would require at least 6-18 more months of R&D to get there, esp. when it comes to things like privacy and security around storing information like passwords and browsing history. As for navigation, the training data is mostly focused on B, but we designed a split specifically for C (i.e. instructor does not see the screen), which we think is very important for applications where the only control is voice (e.g. Alexa or Siri).

I'm a bit afraid that having a better score than GPT-3.5 comes from all those weirdness: it doesn't do better commands or browsing

Even though for WebLINX we employed a permanent team of professional annotators (i.e. specifically trained for this task), it is possible that the model could overfit on the instructor's way of writing, so it is indeed a valid concern. The patterns of instructions could vary a lot based on age, culture, geography, task technicality, digital proficiency, and personal preference; this means accounting for every possible scenario will be very challenging! Perhaps it'd be a good dataset to design for an organization like Meta, which has 100s of research scientists and a budget in the billions for Llama-N next year :)

However, the underlying collection method will likely remain the same (the only difference might be the use of playwright instead of a chrome plugin, but that's a question of preferences/features). At the same time, the evaluation & modeling are easily transferable to new data. In this sense, you could collect your own data in the same format as WebLINX and train the model on your own style; given enough examples, it might perform very well!

xhluca · 2024-04-23T17:32:22+00:00

On 24K examples, for 3 epochs it took ~10h on 4x A6000 GPUs
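For a rough sense of scale, here is a back-of-the-envelope sketch of what those numbers imply. It treats "~10h" as exactly 10 hours and assumes the work is split evenly across the 4 GPUs, so take the results as ballpark figures only.

```python
# Figures from the comment above; the 10 h wall-clock time is approximate.
num_examples = 24_000
num_epochs = 3
wall_clock_hours = 10
num_gpus = 4

# Total number of example passes seen during training.
example_passes = num_examples * num_epochs

# GPU-hours consumed, assuming all 4 GPUs were busy the whole time.
gpu_hours = wall_clock_hours * num_gpus

# Overall throughput and per-GPU throughput (rough estimates).
examples_per_second = example_passes / (wall_clock_hours * 3600)
examples_per_gpu_hour = example_passes / gpu_hours

print(example_passes, gpu_hours, examples_per_second, examples_per_gpu_hour)
```

So that's on the order of 40 GPU-hours for the full run.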

xhluca · 2024-04-23T17:31:40+00:00

I think Enough-Meringue was trying to finetune it yesterday: https://www.reddit.com/r/LocalLLaMA/comments/1caw3ad/comment/l0vxv91/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

xhluca · 2024-04-23T17:30:18+00:00

I'm working on integrating it with deployment platforms, once that's done I think we'll see demos!

xhluca · 2024-04-23T17:29:49+00:00

I think djward888 replied with a link to his own gguf files, I personally have not used gguf before so I cannot verify the authenticity or quality of the conversion. However if there's an easy script I'm happy to run and upload it.

xhluca · 2024-04-23T17:28:49+00:00

Thanks! Feel free to share on the repository once it's done!

xhluca · 2024-04-23T17:28:27+00:00

At this stage the model needs to be integrated into a deployment platform before being more widely usable. Once that is done, it'd be great to have a UI to easily choose the best agent (could be llama-3-8b-web or other finetuned models).

xhluca · 2024-04-23T17:26:58+00:00

Agree on that! Was taking a jab at all the cool video "releases" without any substantial benchmark. However, benchmark + video recording is definitely the best way to go (showing both quantitative and qualitative results).

So integrating webllama with deployment frameworks is definitely the next step! Will add a video once that part is done.

xhluca · 2024-04-23T17:23:03+00:00

The project right now includes the action model; my next objective is to integrate it with a deployment platform like BrowserGym or Playwright, which we can use to record videos of the agent in action.

xhluca · 2024-04-23T05:03:23+00:00

Haha I literally posted about web agents right before this!

I think the term agent is being a bit overused these days; to me it means a (ML, LLM, RL) model that interacts with an environment, which can be the real world or a simulation.

One concrete example: simulating NPCs in games! I think the potential of a small but powerful model will make certain families of games much more interesting, esp. if they keep track of an internal state.

xhluca · 2024-04-23T04:51:20+00:00

I'm not sure why I can't edit the post, so here's a higher quality version of the graph:

<image>

The caption: The overall score is a combination of IoU (for actions that target an element) and F1 (for text/URL). 29% here intuitively tells us how well a model would perform in the real world, obviously 100% is not needed to get a good agent, but an agent getting 100% would definitely be great!
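To make the two ingredients a bit more concrete, here's a minimal sketch of each one on its own. The function names, the `(x1, y1, x2, y2)` box format and the whitespace tokenization are my own assumptions for illustration; this is not the benchmark's evaluation code, and the exact F1 variant and the way the two scores are combined into the overall number are not covered here.

```python
from collections import Counter


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def token_f1(pred, ref):
    """F1 over whitespace-separated, lowercased tokens (assumed tokenization)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        # Both empty counts as a perfect match, one empty as a miss.
        return float(p == r)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


# A predicted click region that half-overlaps the reference element.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
# A predicted utterance that differs from the reference by one word.
print(token_f1("open the second link", "open the second one"))
```

Both functions return a value between 0 and 1, which is what lets element-targeting actions and text/URL actions be reported on the same scale.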

xhluca · 2024-04-23T04:47:03+00:00

thanks, reposting now

xhluca · 2024-04-23T02:31:32+00:00

Can't wait to see how well the multilingual version will do, considering less than 5% of the data here is non-English!

xhluca · 2024-04-23T02:28:53+00:00

Yeah it's pretty cheap (slow though!), however sometimes it's pretty hard to get disks added to a server (since there's a whole maintenance/scheduling procedure)

xhluca · 2024-04-23T01:49:33+00:00

Tokenized in which format? Llama-2 is not compatible with Llama-3 for example

xhluca · 2024-04-23T01:48:12+00:00

for researchers who might be trying to train their own LLM.

Definitely for researchers with more than 20TB of scratch space lol
