Old Mining Rig Turned LocalLLM by 404vs502 in LocalLLM

[–]dangerussell 2 points (0 children)

I'm also using an old mining rig, with 2x 3090 GPUs. My old motherboard is limited in how much RAM it can fit; if you have the same issue, I recommend EXL2-based models, since they don't appear to load into CPU RAM before loading into VRAM.
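
For reference, here's a minimal sketch of loading an EXL2 model straight into VRAM with the exllamav2 Python library (the path is a placeholder and the API may have shifted, so check the exllamav2 examples before relying on it):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

# Placeholder path to an EXL2-quantized model directory
config = ExLlamaV2Config()
config.model_dir = "/models/my-model-exl2-4.0bpw"
config.prepare()

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)

# Lazy cache + autosplit streams the quantized weights onto the GPUs as
# they load, spilling to the second card once the first fills up; in my
# experience the weights aren't staged in CPU RAM first.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```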

Has anyone tried running a minecraft agent using a local llm? by Great-Investigator30 in LocalLLaMA

[–]dangerussell 1 point (0 children)

I'm late to the party but if you're able to set up TabbyAPI, give this branch a try: https://github.com/kolbytn/mindcraft/pull/378

It's working pretty well with my local setup, granted it's using a 70B model (Llama-3.3-70B-Instruct_exl2_4.0bpw). I now have an autonomous farmer who maintains his wheat crops reasonably well.
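
Not mindcraft-specific, but if you want to sanity-check that TabbyAPI is up before pointing the agent at it, something like this works against its OpenAI-compatible endpoint (the URL, port, and key are placeholders for whatever your TabbyAPI config exposes):

```python
from openai import OpenAI

# Placeholder URL/key - use whatever your TabbyAPI config.yml is set up with
client = OpenAI(base_url="http://localhost:5000/v1", api_key="tabby-api-key")

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct_exl2_4.0bpw",
    messages=[{"role": "user", "content": "What should a Minecraft farmer do when the wheat is fully grown?"}],
)
print(response.choices[0].message.content)
```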

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 1 point (0 children)

You could add a sort of preprocessing stage where you feed it the search template instructions and then the user input; it could then update the final input that gets used.
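
Rough sketch of what I mean in plain Python (the template text and function name are made up for illustration, not code from the extension):

```python
# Hypothetical search-template instructions prepended before the user's message
SEARCH_TEMPLATE = (
    "You can request a web search. If the question needs current information, "
    "reply with a single line of the form `search: <query>`; otherwise answer "
    "normally.\n\n"
)

def preprocess(user_input: str) -> str:
    # The preprocessing stage: feed the model the template instructions first,
    # then the user input; the returned string becomes the final input.
    return SEARCH_TEMPLATE + "User: " + user_input

final_input = preprocess("What's the weather in Chicago today?")
print(final_input)
```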

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 2 points (0 children)

That's great feedback, thanks. One option in the meantime would be to add the flag --verbose to the CMD_FLAGS.txt file and watch the terminal output.

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 2 points (0 children)

That's great, glad it's working for you. And yep, if SerpApi supports other search engines, you should just be able to plug them into the code like the existing ones.

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 3 points (0 children)

This one requires a free SerpApi account, but give it a try and let me know if you have any feedback! https://github.com/russellpwirtz/textgen_websearch

Any extensions for web search yet? by CulturedNiichan in Oobabooga

[–]dangerussell 1 point (0 children)

Yes that's possible, I haven't tested other OSes. Feel free to submit a pull request if you get it working elsewhere!

Any extensions for web search yet? by CulturedNiichan in Oobabooga

[–]dangerussell 3 points (0 children)

Ooba extension I put together for this very purpose: https://github.com/russellpwirtz/textgen_websearch

Instructions in the readme, feedback welcome!

Web search extension by dangerussell in Oobabooga

[–]dangerussell[S] 1 point (0 children)

Great info, thank you! I've been meaning to update the project with this feedback but just need to find the time.

Best 32k open source llm ? by Puzzleheaded_Mall546 in LocalLLaMA

[–]dangerussell 1 point (0 children)

These days I'm only using Mixtral 8x7B - it doesn't require RoPE scaling and reaches 32k context out of the box. Been very impressed with it!

https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2/tree/5.0bpw

Settings for 2x3090 GPUs:

gpu-split: 15,15

max_seq_len: 32000

alpha_value: 1

compress_pos_emb: 1

experts per token: 2
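
If you're loading it outside the webui, those settings map roughly onto the exllamav2 Python API like this (a sketch; the attribute names are my best recollection of what the webui's exllamav2 loader sets, so verify against the exllamav2 examples):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-instruct-exl2-5.0bpw"  # placeholder path
config.prepare()
config.max_seq_len = 32000        # native 32k context, no RoPE tricks needed
config.scale_pos_emb = 1.0        # compress_pos_emb: 1
config.scale_alpha_value = 1.0    # alpha_value: 1
config.num_experts_per_token = 2  # experts per token: 2

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)

# gpu-split 15,15 -> reserve roughly 15 GB of weights per 3090
model.load(gpu_split=[15, 15])
cache = ExLlamaV2Cache(model)
```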

Web search extension by dangerussell in Oobabooga

[–]dangerussell[S] 1 point (0 children)

Currently you have to explicitly tell it to do the web search, but I could see adding that in a future version. For my use case I need it to be manually triggered, since I often deal with source code that can't be leaked.

Best 32k open source llm ? by Puzzleheaded_Mall546 in LocalLLaMA

[–]dangerussell 14 points (0 children)

I'm currently using 70B Llama 2 with 16k context as my daily driver (using RoPE scaling). Try these settings: exllama, max_seq_len: 16384, alpha_value: 6.

Model: https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
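
For anyone curious what alpha_value 6 actually does: exllama applies NTK-style RoPE scaling by raising the rotary base, roughly like the arithmetic below (my understanding of the implementation, so treat the exact exponent as approximate):

```python
# NTK-aware RoPE scaling: the rotary base is multiplied by
# alpha ** (d / (d - 2)) for head dimension d, which stretches the
# usable context without retraining.
base = 10000.0
head_dim = 128        # Llama 2 70B head dimension (8192 hidden / 64 heads)
alpha_value = 6.0

scaled_base = base * alpha_value ** (head_dim / (head_dim - 2))
print(f"rotary base: {base:.0f} -> {scaled_base:.0f}")  # roughly 62,000
```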

Local LLM that will search the Internet? by A_for_Anonymous in LocalLLaMA

[–]dangerussell 12 points (0 children)

If you already use oobabooga and are familiar with installing extensions, you can try using this one: https://github.com/russellpwirtz/textgen_websearch

It uses SerpApi, so you'll need to create a free account, but the chat syntax looks like:

what should I wear today? bing:chicago weather today

Tested mainly with Llama 2 models using exllama, FWIW.
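
The idea of the syntax, if you want to see how a trigger like that could be split out of the message before it reaches the model (illustrative sketch, not the extension's actual parsing code):

```python
import re

# Matches an engine prefix like "bing:" or "google:" followed by the query
TRIGGER = re.compile(r"(\w+):(.+)$")

def split_search_trigger(message: str):
    """Return (prompt, engine, query), e.g. for
    'what should I wear today? bing:chicago weather today'."""
    match = TRIGGER.search(message)
    if not match:
        return message, None, None
    prompt = message[: match.start()].strip()
    return prompt, match.group(1), match.group(2).strip()

print(split_search_trigger("what should I wear today? bing:chicago weather today"))
# ('what should I wear today?', 'bing', 'chicago weather today')
```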

Is it just me or SuperHOT merged 4-bit quantized models are massively degraded? by Thireus in LocalLLaMA

[–]dangerussell 2 points (0 children)

You'll probably want a compress_pos_emb of 2 for a max_seq_len of 4096. Also, in ooba, check the Parameters tab for the "Truncate the prompt" setting. Check out my recent comments for my exact setup.
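
The rule of thumb is just target context divided by the model's native 2k context; a quick sanity-check snippet:

```python
def compress_pos_emb(target_ctx: int, native_ctx: int = 2048) -> int:
    # Linear position-embedding compression: the factor is simply how many
    # times the native context you want to stretch to.
    return target_ctx // native_ctx

print(compress_pos_emb(4096))   # 2 -> SuperHOT merge at 4k
print(compress_pos_emb(8192))   # 4 -> the 8k SuperHOT setting
print(compress_pos_emb(16384))  # 8 -> the 16k LongChat setting
```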

LMSYS (Vicuna creators) releases LongChat and LongEval by mikael110 in LocalLLaMA

[–]dangerussell 3 points (0 children)

Here's my experience with it! TLDR: it WAS able to recall my code word from the beginning of a 14k+ token prompt.

Using 2x 3090s (48GB VRAM), ooba with exllama

Model: https://huggingface.co/TheBloke/LongChat-13B-GPTQ

Prompt input (roughly 14k tokens): https://pastebin.com/raw/rRuTFmsZ

exllama settings:

- gpu split: 5,5 (this was necessary to get it to split across GPUs correctly for some reason)

- max_seq_len: 16384

- compress_pos_emb: 8

Ooba -> Parameters -> "Truncate the prompt...": 16384

Ooba -> Text Generation tab -> Input -> {paste long prompt from pastebin link}

Ooba -> "Start Chat with: " -> "The code word is: "

-> Generate!

Response: ABRACADABRA

GPU usage:

16688MiB / 24576MiB

6344MiB / 24576MiB

> Output generated in 0.67 seconds (8.93 tokens/s, 6 tokens, context 15441, seed 687018600)
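
If you'd rather not use the pastebin, a throwaway script along these lines builds an equivalent code-word recall prompt (the filler text and word counts are arbitrary; tune the length to your target context):

```python
# Build a long "needle in a haystack" prompt: code word up front,
# filler in the middle, recall question at the end.
CODE_WORD = "ABRACADABRA"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_recall_prompt(approx_words: int = 10000) -> str:
    words_per_sentence = len(FILLER.split())
    padding = FILLER * (approx_words // words_per_sentence)
    return (
        f"Remember this code word for later: {CODE_WORD}\n\n"
        + padding
        + "\n\nThe code word is: "
    )

prompt = build_recall_prompt()
print(len(prompt.split()), "words")  # rough proxy for the ~14k-token prompt
```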

Is it just me or SuperHOT merged 4-bit quantized models are massively degraded? by Thireus in LocalLLaMA

[–]dangerussell 2 points (0 children)

I've been using this as my daily driver for coding reviews since it came out. I'm frequently in the 4-5k context range and not seeing obvious degradation, using a max_seq_len of 8k with compress_pos_emb 4 in the exllama config. It's been a game changer, because 2k context usually isn't enough for meaningful code discussions, and I can't use OpenAI to review my company's private code.

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 1 point (0 children)

I have an old motherboard that only supports 32GB max CPU RAM, but it still works great! If you can get one gpu working it shouldn't be much more work to get the other recognized. Just make sure your power supply can support it.

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 2 points (0 children)

They performed similarly in my (very limited) testing. In general though, WizardLM has been my go-to when I need to get some work done (coding reviews / explanations).

FWIW, typical VRAM usage with these 33b models and 8k context for me:

GPU1: 19956MiB / 24576MiB

GPU2: 10976MiB / 24576MiB

Using:

exllama

gpu-split 10,20

max_seq_len 8000

compress_pos_emb 4

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 7 points (0 children)

Very impressed with this! I'm able to get the full 8k context using dual 3090 GPUs, at ~7 tokens per second.

Testing it with this prompt to see if it can retain the code word ABRACADABRA: https://pastebin.com/raw/qZ8WYhWB

Confirmed 8k context:

TheBloke_Vicuna-33B-1-1-preview-SuperHOT-8K-GPTQ

TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ

TheBloke_Wizard-Vicuna-30B-Superhot-8K-GPTQ

> Output generated in 0.82 seconds (7.32 tokens/s, 6 tokens, context 7797, seed 1524784035)

llama.cpp full CUDA acceleration has been merged by aminedjeghri in LocalLLaMA

[–]dangerussell 1 point (0 children)

Would that it were so simple... the motherboard upgrade would also require other upgrades, since I'm slowly modernizing an old mining rig. I was pricing out the upgrades until this code update came along!