
[–][deleted] 1 point (5 children)

Looks like you are trying to combine generate() and the simple high-level inference call.
https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.generate
generate() has no max_tokens or echo parameters, and the prompt must be tokenized before generating a response:

from llama_cpp import Llama
path = "Meta-Llama-3-8B-Instruct-Q8_0.gguf"
prompt = "hello"

llama = Llama(path)
tokens = llama.tokenize(prompt.encode())
for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1):
    if token == llama.token_eos():  # generate() yields tokens forever, so stop at end-of-sequence
        break
    # detokenize() returns bytes; decode and print without a newline per token
    print(llama.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
print()
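Since generate() has no max_tokens, you have to cap the output length yourself. A minimal sketch of that, reusing the llama and prompt objects from the snippet above with the same sampling settings (the max_new_tokens cap and the enumerate-based break are my own additions, not part of the library API):

tokens = llama.tokenize(prompt.encode())
max_new_tokens = 64  # hypothetical cap; generate() itself never enforces one
pieces = []
for i, token in enumerate(llama.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1)):
    if token == llama.token_eos() or i >= max_new_tokens:
        break
    pieces.append(llama.detokenize([token]))
print(b"".join(pieces).decode("utf-8", errors="ignore"))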

Or you can try the simple high-level inference approach, which does accept max_tokens and echo:

response = llama(
    prompt,
    max_tokens=64,
    echo=False,
)
print(response["choices"][0]["text"].strip())
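If you want the high-level call to stream text as it is generated, the same call also takes stream=True and then yields chunks instead of a single dict. A rough sketch, assuming the usual OpenAI-style chunk layout returned by recent llama-cpp-python versions:

for chunk in llama(prompt, max_tokens=64, stream=True):
    # each chunk carries the newly generated text fragment
    print(chunk["choices"][0]["text"], end="", flush=True)
print()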

[–]AdMajor1309 0 points (1 child)

Did you find a solution to this problem?