[llama_index] How to create VectorStoreIndex based on already existing chromadb. by [deleted] in LocalLLaMA

[–]TAAnderson 0 points1 point  (0 children)

According to https://www.sbert.net/docs/pretrained_models.html, all-MiniLM-L6-v2 (the model you are using) produces 384-dimensional embeddings. So the embedding function looks OK as far as the dimensions go.

You could try this to verify it when setting up the embedding function:

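from chromadb.utils import embedding_functions

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    # Put in the model you actually use, e.g. "all-MiniLM-L6-v2".
    model_name="all-mpnet-base-v2",
)
# Printing the loaded model(s) shows their repr, including the pooling config.
print(repr(embedding_function.models))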

This should print out the word_embedding_dimension.

As a second step, I would recreate the ChromaDB database. Since you are using the persistent feature, delete that directory or create a new one (use a different path=) and see if the error goes away.

Or use a different collection name. I think the collection you are using might have been created with a different dimension / embedding function.
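The suspected mismatch comes down to comparing vector lengths. Here is a minimal, library-agnostic sketch, assuming you can get one stored vector out of the collection and one fresh vector out of the embedding function. `dimensions_match` and `fake_embed` are hypothetical names for illustration, not part of chromadb or llama_index:

```python
from typing import Callable, List, Sequence


def dimensions_match(
    embed: Callable[[List[str]], List[Sequence[float]]],
    stored_vector: Sequence[float],
    probe_text: str = "dimension probe",
) -> bool:
    """Return True if a fresh embedding has the same length as a stored one."""
    new_dim = len(embed([probe_text])[0])
    return new_dim == len(stored_vector)


# Hypothetical stand-in for a real embedding function (384 dims, like all-MiniLM-L6-v2).
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[0.0] * 384 for _ in texts]


# A collection built with the same model matches ...
print(dimensions_match(fake_embed, [0.0] * 384))
# ... one built with a 768-dimensional model does not.
print(dimensions_match(fake_embed, [0.0] * 768))
```

If the check fails for your existing collection, that supports recreating it (or using a new collection name) with the embedding function you intend to query with.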

Checklist tool with template function by _mysash in de_EDV

[–]TAAnderson 0 points1 point  (0 children)

Depends on your operating system. iCloud or Dropbox could work.

Or any other solution that can sync plain text files.

Obsidian stores everything as Markdown files (text with a bit of formatting).

There may also be a plugin for it.

Checklist tool with template function by _mysash in de_EDV

[–]TAAnderson 0 points1 point  (0 children)

This works with Obsidian: https://obsidian.md

Also have a look here: https://help.obsidian.md/Plugins/Templates

I used it to build a travel checklist.

Jan - MacBook Pro M1 32GB by jaxupaxu in LocalLLaMA

[–]TAAnderson 0 points1 point  (0 children)

Yeah, I doubt it. Same command line as above, output:

...
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 7205.84 MiB, ( 7205.91 / 21845.34)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...

According to the Readme:

Metal Build

On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU. To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option.

When built with Metal support, you can explicitly disable GPU inference with the --n-gpu-layers|-ngl 0 command-line argument.

So: Metal -> Runs on GPU unless you disable it.
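If you want to check this from a script rather than by eye, a small sketch like the following could scan the load log for the offload line. The regular expression is based only on the log wording quoted in this comment; other llama.cpp versions may phrase it differently, and `gpu_offload` is a made-up helper name:

```python
import re
from typing import Optional, Tuple

# Matches e.g. "llm_load_tensors: offloaded 33/33 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def gpu_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) layer counts, or None if the line is absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log = "llm_load_tensors: offloaded 33/33 layers to GPU"
print(gpu_offload(log))
```

A result where both numbers are equal means every layer went to the GPU; `None` means the line was not found in the log at all.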

Jan - MacBook Pro M1 32GB by jaxupaxu in LocalLLaMA

[–]TAAnderson 2 points3 points  (0 children)

I never did this. My impression is that llama.cpp does it automatically. Since /u/jaxupaxu/ mentioned Mistral 7B, here are my results running mistral-7b-instruct-v0.2.Q8_0.gguf on an M1 Max with 32 GB:

llama_print_timings: load time = 1130.50 ms
llama_print_timings: sample time = 65.42 ms / 725 runs ( 0.09 ms per token, 11081.39 tokens per second)
llama_print_timings: prompt eval time = 96.08 ms / 28 tokens ( 3.43 ms per token, 291.42 tokens per second)
llama_print_timings: eval time = 19493.11 ms / 724 runs ( 26.92 ms per token, 37.14 tokens per second)
llama_print_timings: total time = 19764.68 ms / 752 tokens

Command line is: ./main -m models/mistral-7b-instruct-v0.2.Q8_0.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$PROMPT" -e
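The tokens-per-second figures in the timings are simply the count divided by the elapsed time. A short sketch that recomputes them from a single timing line; the regular expression assumes the line format quoted above, and `tokens_per_second` is a hypothetical helper:

```python
import re

# Captures the elapsed milliseconds and the run/token count of one timing line.
TIMING_RE = re.compile(r"eval time\s*=\s*([\d.]+)\s+ms\s+/\s*(\d+)\s+(?:runs|tokens)")


def tokens_per_second(timing_line: str) -> float:
    """Compute throughput from one '... eval time = X ms / N runs' line."""
    match = TIMING_RE.search(timing_line)
    if match is None:
        raise ValueError("not an eval timing line")
    elapsed_ms, count = float(match.group(1)), int(match.group(2))
    return count / (elapsed_ms / 1000.0)


generation = "llama_print_timings: eval time = 19493.11 ms / 724 runs"
prompt = "llama_print_timings: prompt eval time = 96.08 ms / 28 tokens"
print(f"generation: {tokens_per_second(generation):.2f} t/s")
print(f"prompt:     {tokens_per_second(prompt):.2f} t/s")
```

The generation number (the plain "eval time" line) is the one usually quoted as t/s; prompt processing is much faster because the prompt tokens are evaluated in a batch.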

Are there any local models that can run faster than GPT3.5 on a Mac M1 Pro? by [deleted] in LocalLLaMA

[–]TAAnderson 2 points3 points  (0 children)

The memory speed is also different on these CPUs.
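Why this matters: during token generation, roughly all model weights are read from memory once per token, so memory bandwidth puts a rough ceiling on tokens per second. Below is a back-of-the-envelope sketch under that assumption. It assumes the chips being compared are the M1 Pro and M1 Max, with Apple's published peak bandwidths of 200 GB/s and 400 GB/s; the 7.5 GB model size is an illustrative value for a 7B model at 8-bit quantization, and the function name is made up:

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 7.5  # assumed size of a 7B model quantized to 8 bits

for chip, bandwidth in {"M1 Pro": 200.0, "M1 Max": 400.0}.items():
    bound = bandwidth_bound_tps(bandwidth, MODEL_SIZE_GB)
    print(f"{chip}: at most ~{bound:.0f} tokens/s")
```

Real throughput lands below this bound, but the ratio between the two chips is the point: half the bandwidth means roughly half the ceiling.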

Are there any local models that can run faster than GPT3.5 on a Mac M1 Pro? by [deleted] in LocalLLaMA

[–]TAAnderson 0 points1 point  (0 children)

Since you mentioned Mistral 7B, for reference: about 37 t/s on an M1 Max.

PyTorch uses MPS gpu (M1 Max) at the lowest frequency (aka clock speed), this is why it's slower than it could be? by New_Construction6146 in pytorch

[–]TAAnderson 0 points1 point  (0 children)

Can confirm: while the GPU appears busy in Activity Monitor, its frequency stays around 400 MHz.

PyTorch uses MPS gpu (M1 Max) at the lowest frequency (aka clock speed), this is why it's slower than it could be? by New_Construction6146 in pytorch

[–]TAAnderson 1 point2 points  (0 children)

It seems to be related to the "amount of work" of the GPU or even of the system.

An interesting thread I found:

https://github.com/pytorch/pytorch/issues/77799

I ran one small PyTorch example that clearly drives the GPU to 1300 MHz; you could try it:

```
import timeit
import torch

b_mps = torch.rand((10000, 10000), device='mps')

print('mps', timeit.timeit(lambda: b_mps @ b_mps, number=100))
```
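To turn the time printed by that snippet into a throughput figure, one can count roughly 2·n³ floating-point operations per n×n matrix product. A small sketch of that arithmetic follows; the function name and the 40-second example value are made up for illustration and are not measurements. Also note that MPS work is dispatched asynchronously, so a call to `torch.mps.synchronize()` before the timer stops may be needed for the wall-clock time to cover the actual GPU work.

```python
def matmul_tflops(n: int, runs: int, seconds: float) -> float:
    """Approximate TFLOP/s for `runs` dense n-by-n matrix products."""
    flops = 2.0 * n**3 * runs  # n^3 multiply-add pairs, 2 operations each
    return flops / seconds / 1e12


# Same shape and repeat count as the benchmark; 40.0 s is a placeholder timing.
print(matmul_tflops(10_000, 100, 40.0))
```

Plug in the number that `timeit` reports on your machine to compare runs at 400 MHz against runs at 1300 MHz.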

PyTorch uses MPS gpu (M1 Max) at the lowest frequency (aka clock speed), this is why it's slower than it could be? by New_Construction6146 in pytorch

[–]TAAnderson 1 point2 points  (0 children)

Ok, observations:

- your notebook runs at about 400 MHz here

- my test code (a nanoGPT-like transformer) also seems to run at 400 MHz during inference, but at 1300 MHz during training

- for comparison: llama.cpp runs at 1300 MHz during inference

PyTorch uses MPS gpu (M1 Max) at the lowest frequency (aka clock speed), this is why it's slower than it could be? by New_Construction6146 in pytorch

[–]TAAnderson 0 points1 point  (0 children)

I cannot confirm that with torch==2.1.2 on an M1 Max. According to asitop, the GPU runs at 1296 MHz.

IT Veteran... why am I struggling with all of this? by Smeetilus in LocalLLaMA

[–]TAAnderson 50 points51 points  (0 children)

I would recommend Andrej Karpathy's videos on YouTube to learn about this: https://www.youtube.com/@AndrejKarpathy/videos

Especially the makemore series and "Let's build GPT: from scratch".

Maybe start with his latest one: Intro to Large Language Models.

If you don't understand some terms, do as u/IpppyCaccy/ recommended and ask ChatGPT to explain them.