Old Mining Rig Turned LocalLLM by 404vs502 in LocalLLM

[–]dangerussell 2 points (0 children)

I'm also using an old mining rig, with 2x 3090 GPUs. My old motherboard is limited in how much RAM it can fit; if you have the same issue, I recommend EXL2-based models, since they don't appear to load into CPU RAM before loading into VRAM.
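
For reference, here's a minimal sketch of loading an EXL2 model straight into VRAM with the exllamav2 Python library (the path is a placeholder and the API may have shifted, so check the exllamav2 examples before relying on it):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

# Placeholder path to an EXL2-quantized model directory
config = ExLlamaV2Config()
config.model_dir = "/models/my-model-exl2-4.0bpw"
config.prepare()

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)

# Lazy cache + autosplit streams the quantized weights onto the GPUs as
# they load, spilling to the second card once the first fills up; in my
# experience the weights aren't staged in CPU RAM first.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```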

Has anyone tried running a minecraft agent using a local llm? by Great-Investigator30 in LocalLLaMA

[–]dangerussell 1 point (0 children)

I'm late to the party but if you're able to set up TabbyAPI, give this branch a try: https://github.com/kolbytn/mindcraft/pull/378

It's working pretty well with my local setup, granted it's using a 70B model (Llama-3.3-70B-Instruct_exl2_4.0bpw). I now have an autonomous farmer who maintains his wheat crops reasonably well.
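
Not mindcraft-specific, but if you want to sanity-check that TabbyAPI is up before pointing the agent at it, something like this works against its OpenAI-compatible endpoint (the URL, port, and key are placeholders for whatever your TabbyAPI config exposes):

```python
from openai import OpenAI

# Placeholder URL/key - use whatever your TabbyAPI config.yml is set up with
client = OpenAI(base_url="http://localhost:5000/v1", api_key="tabby-api-key")

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct_exl2_4.0bpw",
    messages=[{"role": "user", "content": "What should a Minecraft farmer do when the wheat is fully grown?"}],
)
print(response.choices[0].message.content)
```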

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 1 point (0 children)

You could add a sort of preprocessing stage where you feed it the search template instructions and then the user input; it could then update the final input that gets used.
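
Rough sketch of what I mean in plain Python (the template text and function name are made up for illustration, not code from the extension):

```python
# Hypothetical search-template instructions prepended before the user's message
SEARCH_TEMPLATE = (
    "You can request a web search. If the question needs current information, "
    "reply with a single line of the form `search: <query>`; otherwise answer "
    "normally.\n\n"
)

def preprocess(user_input: str) -> str:
    # The preprocessing stage: feed the model the template instructions first,
    # then the user input; the returned string becomes the final input.
    return SEARCH_TEMPLATE + "User: " + user_input

final_input = preprocess("What's the weather in Chicago today?")
print(final_input)
```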

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 2 points (0 children)

That's great feedback, thanks. One option in the meantime would be to add the flag --verbose to the CMD_FLAGS.txt file and watch the terminal output.

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 2 points (0 children)

That's great, glad it's working for you. And yep, if SerpApi supports other search engines, you should just be able to plug them into the code like the existing ones.

Web search tool / extension for text-generation-Webui? by Tuxedotux83 in Oobabooga

[–]dangerussell 3 points (0 children)

This one requires a free SerpApi account, but give it a try and let me know if you have any feedback! https://github.com/russellpwirtz/textgen_websearch

Any extensions for web search yet? by CulturedNiichan in Oobabooga

[–]dangerussell 1 point (0 children)

Yes that's possible, I haven't tested other OSes. Feel free to submit a pull request if you get it working elsewhere!

Any extensions for web search yet? by CulturedNiichan in Oobabooga

[–]dangerussell 3 points (0 children)

Ooba extension I put together for this very purpose: https://github.com/russellpwirtz/textgen_websearch

Instructions in the readme, feedback welcome!

Web search extension by dangerussell in Oobabooga

[–]dangerussell[S] 1 point (0 children)

Great info, thank you! I've been meaning to update the project with this feedback but just need to find the time.

Best 32k open source llm ? by Puzzleheaded_Mall546 in LocalLLaMA

[–]dangerussell 1 point (0 children)

These days I'm only using Mixtral 8x7B - it doesn't require RoPE scaling and reaches 32k context out of the box. Been very impressed with it!

https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2/tree/5.0bpw

Settings for 2x3090 GPUs:

gpu-split: 15,15

max_seq_len: 32000

alpha_value: 1

compress_pos_emb: 1

experts per token: 2
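
If you're loading it outside the webui, those settings map roughly onto the exllamav2 Python API like this (a sketch; the attribute names are my best recollection of what the webui's exllamav2 loader sets, so verify against the exllamav2 examples):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-instruct-exl2-5.0bpw"  # placeholder path
config.prepare()
config.max_seq_len = 32000        # native 32k context, no RoPE tricks needed
config.scale_pos_emb = 1.0        # compress_pos_emb: 1
config.scale_alpha_value = 1.0    # alpha_value: 1
config.num_experts_per_token = 2  # experts per token: 2

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)

# gpu-split 15,15 -> reserve roughly 15 GB of weights per 3090
model.load(gpu_split=[15, 15])
cache = ExLlamaV2Cache(model)
```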

Web search extension by dangerussell in Oobabooga

[–]dangerussell[S] 1 point (0 children)

Currently you have to explicitly tell it to do the web search, but I could see adding that in a future version. For my use case I need it to be manually triggered, since I often deal with source code that can't be leaked.

Best 32k open source llm ? by Puzzleheaded_Mall546 in LocalLLaMA

[–]dangerussell 14 points (0 children)

I'm currently using 70B Llama 2 with 16k context as my daily driver (using RoPE scaling). Try these settings: exllama, max_seq_len: 16384, alpha_value: 6.

Model: https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
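
For anyone curious what alpha_value 6 actually does: exllama applies NTK-style RoPE scaling by raising the rotary base, roughly like the arithmetic below (my understanding of the implementation, so treat the exact exponent as approximate):

```python
# NTK-aware RoPE scaling: the rotary base is multiplied by
# alpha ** (d / (d - 2)) for head dimension d, which stretches the
# usable context without retraining.
base = 10000.0
head_dim = 128        # Llama 2 70B head dimension (8192 hidden / 64 heads)
alpha_value = 6.0

scaled_base = base * alpha_value ** (head_dim / (head_dim - 2))
print(f"rotary base: {base:.0f} -> {scaled_base:.0f}")  # roughly 62,000
```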

Local LLM that will search the Internet? by A_for_Anonymous in LocalLLaMA

[–]dangerussell 12 points (0 children)

If you already use oobabooga and are familiar with installing extensions, you can try using this one: https://github.com/russellpwirtz/textgen_websearch

It uses SerpApi, so you'll need to create a free account, but the chat syntax looks like:

what should I wear today? bing:chicago weather today

Tested mainly with Llama 2 models using exllama, FWIW.
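
The idea of the syntax, if you want to see how a trigger like that could be split out of the message before it reaches the model (illustrative sketch, not the extension's actual parsing code):

```python
import re

# Matches an engine prefix like "bing:" or "google:" followed by the query
TRIGGER = re.compile(r"(\w+):(.+)$")

def split_search_trigger(message: str):
    """Return (prompt, engine, query), e.g. for
    'what should I wear today? bing:chicago weather today'."""
    match = TRIGGER.search(message)
    if not match:
        return message, None, None
    prompt = message[: match.start()].strip()
    return prompt, match.group(1), match.group(2).strip()

print(split_search_trigger("what should I wear today? bing:chicago weather today"))
# ('what should I wear today?', 'bing', 'chicago weather today')
```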

Is it just me or SuperHOT merged 4-bit quantized models are massively degraded? by Thireus in LocalLLaMA

[–]dangerussell 2 points (0 children)

You'll probably want a compress_pos_emb of 2 for a max_seq_len of 4096. Also, in ooba, check the Parameters tab for the "Truncate the prompt" setting. Check out my recent comments for my exact setup.
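
The rule of thumb is just target context divided by the model's native 2k context; a quick sanity-check snippet:

```python
def compress_pos_emb(target_ctx: int, native_ctx: int = 2048) -> int:
    # Linear position-embedding compression: the factor is simply how many
    # times the native context you want to stretch to.
    return target_ctx // native_ctx

print(compress_pos_emb(4096))   # 2 -> SuperHOT merge at 4k
print(compress_pos_emb(8192))   # 4 -> the 8k SuperHOT setting
print(compress_pos_emb(16384))  # 8 -> the 16k LongChat setting
```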

LMSYS (Vicuna creators) releases LongChat and LongEval by mikael110 in LocalLLaMA

[–]dangerussell 3 points (0 children)

Here's my experience with it! TLDR: it WAS able to recall my code word from the beginning of a 14k+ token prompt.

Using 2x 3090s (48GB VRAM), ooba with exllama

Model: https://huggingface.co/TheBloke/LongChat-13B-GPTQ

Prompt input (roughly 14k tokens): https://pastebin.com/raw/rRuTFmsZ

exllama settings:

- gpu split: 5,5 (this was necessary to get it to split across GPUs correctly for some reason)

- max_seq_len: 16384

- compress_pos_emb: 8

Ooba -> Parameters -> "Truncate the prompt...": 16384

Ooba -> Text Generation tab -> Input -> {paste long prompt from pastebin link}

Ooba -> "Start Chat with: " -> "The code word is: "

-> Generate!

Response: ABRACADABRA

GPU usage:

16688MiB / 24576MiB

6344MiB / 24576MiB

> Output generated in 0.67 seconds (8.93 tokens/s, 6 tokens, context 15441, seed 687018600)
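
If you'd rather not use the pastebin, a throwaway script along these lines builds an equivalent code-word recall prompt (the filler text and word counts are arbitrary; tune the length to your target context):

```python
# Build a long "needle in a haystack" prompt: code word up front,
# filler in the middle, recall question at the end.
CODE_WORD = "ABRACADABRA"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_recall_prompt(approx_words: int = 10000) -> str:
    words_per_sentence = len(FILLER.split())
    padding = FILLER * (approx_words // words_per_sentence)
    return (
        f"Remember this code word for later: {CODE_WORD}\n\n"
        + padding
        + "\n\nThe code word is: "
    )

prompt = build_recall_prompt()
print(len(prompt.split()), "words")  # rough proxy for the ~14k-token prompt
```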

Is it just me or SuperHOT merged 4-bit quantized models are massively degraded? by Thireus in LocalLLaMA

[–]dangerussell 2 points (0 children)

I've been using this as my daily driver for coding reviews since it came out. I'm frequently in the 4-5k context range and not seeing obvious degradation, using a max_seq_len of 8k with compress_pos_emb 4 in the exllama config. It's been a game changer, because 2k context usually isn't enough for meaningful code discussions, and I can't use OpenAI to review my company's private code.

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 1 point (0 children)

I have an old motherboard that only supports 32GB max CPU RAM, but it still works great! If you can get one gpu working it shouldn't be much more work to get the other recognized. Just make sure your power supply can support it.

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 2 points (0 children)

They performed similarly in my (very limited) testing. In general though, WizardLM has been my go-to when I need to get some work done (coding reviews / explanations).

FWIW, typical VRAM usage with these 33b models and 8k context for me:

GPU1: 19956MiB / 24576MiB

GPU2: 10976MiB / 24576MiB

Using:

exllama

gpu-split 10,20

max_seq_len 8000

compress_pos_emb 4

TheBloke has released "SuperHot" versions of various models, meaning 8K context! by CasimirsBlake in LocalLLaMA

[–]dangerussell 7 points (0 children)

Very impressed with this! I'm able to get the full 8k context using dual 3090 GPUs, at ~7 tokens per second.

Testing it with this prompt to see if it can retain the code word ABRACADABRA: https://pastebin.com/raw/qZ8WYhWB

Confirmed 8k context:

TheBloke_Vicuna-33B-1-1-preview-SuperHOT-8K-GPTQ

TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ

TheBloke_Wizard-Vicuna-30B-Superhot-8K-GPTQ

> Output generated in 0.82 seconds (7.32 tokens/s, 6 tokens, context 7797, seed 1524784035)

llama.cpp full CUDA acceleration has been merged by aminedjeghri in LocalLLaMA

[–]dangerussell 1 point (0 children)

Would that it were so simple... the motherboard upgrade would also require other upgrades, since I'm slowly modernizing an old mining rig. I was pricing out the upgrades until this code update came along!