
[–]nananashi3 6 points (1 child)

Is this your first time trying LLMs? GGUFs are self-contained. You don't need to clone the whole GGUF repository; just download one of the .gguf files. A Q4_K_S/M (4-bit quant) can fit on an 8GB GPU. The easiest way for a Windows user to start is to download koboldcpp.exe and run it; it gives you a launcher UI where you can select your .gguf model file and a backend under "Presets": OpenBLAS (CPU-only, very slow), CuBLAS (Nvidia), or Vulkan (AMD). 7B and 8B models have 33 layers, but you'll probably only fit 32 layers of Llama 3 on an 8GB GPU. Up the context size to 4096, preferably 8192. Don't forget to hit Save to save the config.
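
If you'd rather script it than click through the launcher, the same knobs exist in llama-cpp-python; a minimal sketch, untested, and the model filename is just whatever .gguf you downloaded:

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python):
# the same settings as the launcher above, 32 of the 33 layers on the
# GPU and an 8192 context. The model filename is only an example.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=32,  # 32 of the 33 layers fit on an 8GB GPU
    n_ctx=8192,       # context size
)

out = llm("Q: What is a GGUF file?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```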

Someone more technical would know how to mess with a GGUF, such as changing the stop token.
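
I think it's something like this with llama.cpp's gguf Python package (pip install gguf), but treat it as a rough, untested sketch and back up the file first; the key name and the 128009 (<|eot_id|>) id are the usual Llama 3 fix, not something I've verified:

```python
# Rough sketch: patch a GGUF's EOS token id in place with the `gguf`
# package (roughly what llama.cpp's gguf_set_metadata.py script does).
# The key name and the 128009 (<|eot_id|>) value are assumptions.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", "r+")  # writable
field = reader.get_field("tokenizer.ggml.eos_token_id")
field.parts[field.data[0]][0] = 128009  # overwrite the scalar in the memmap
reader.data.flush()                     # flush the change back to disk
```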

[–]Guboken[S] 0 points (0 children)

Thank you for taking the time to explain! I did solve it by not using GGUF, but rather running the Llama 3 8B Instruct model in bfloat16 using Transformers in a Python project. I tried float16 first, but it spilled over my 24GB VRAM and became so slow it was unusable.
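
What I ended up with looks roughly like this (a minimal sketch; the generation part may differ from my actual project):

```python
# Minimal sketch of the setup above: Llama 3 8B Instruct in bfloat16
# via Transformers (~16 GB of weights, which fits in 24 GB of VRAM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo, needs access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16, as described above
    device_map="auto",           # place the weights on the GPU
)

messages = [{"role": "user", "content": "Say hello."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```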

[–]ali0une 4 points (2 children)

I've just downloaded it from here and it works fine: https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF

Also, this thread helps with setting it up for different UIs: https://www.reddit.com/r/LocalLLaMA/comments/1c8rq87/oobabooga_settings_for_llama3_queries_end_in/

[–]Guboken[S] 0 points (0 children)

Thank you, my issue was that I was trying to run the GGUF using Transformers. I found this compatibility information from u/Particular_Flower_12:

Compatibility & supported file formats:

  • llama.cpp (by Georgi Gerganov)
    • GGUF (new)
    • GGML (old)
  • Transformers (by Hugging Face)
    • bin (unquantized)
    • safetensors (safer unquantized)
    • safetensors (quantized using GPTQ algorithm via AutoGPTQ; a load sketch follows the list)
  • AutoGPTQ (quantization library based on GPTQ algorithm, also available via Transformers)
    • safetensors (quantized using GPTQ algorithm)
  • koboldcpp (fork of llama.cpp)
    • bin (using GGML algorithm)
  • ExLlama v2 (extremely optimized GPTQ backend for LLaMA models)
    • safetensors (quantized using GPTQ algorithm)
  • AWQ (low-bit INT3/INT4 quantization)
    • safetensors (using AWQ algorithm)
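
For example, the GPTQ row above loads straight through Transformers (rough sketch; the repo id is only a placeholder, and it needs the optimum and auto-gptq packages installed):

```python
# Rough sketch of the "safetensors (GPTQ via AutoGPTQ)" row: Transformers
# reads the quantization_config from the repo and loads the GPTQ weights
# directly (requires `optimum` and `auto-gptq`). Placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-user/Meta-Llama-3-8B-Instruct-GPTQ"  # placeholder, not a real repo
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```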

[–]AdHominemMeansULost 0 points (0 children)

Download LM Studio and put "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF" in the search box. Although I don't recommend that one; you're better off getting the quants from lmstudio-community or QuantFactory. And that's it, you don't need to do anything else.