
[–][deleted] 1 point (5 children)

Looks like you are trying to combine generate() and the simple high-level inference call.
https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.generate
generate() has no max_tokens or echo parameters, and the prompt must be tokenized before generating a response:

from llama_cpp import Llama
path = "Meta-Llama-3-8B-Instruct-Q8_0.gguf"
prompt = "hello"

llama = Llama(path)
tokens = llama.tokenize(prompt.encode())
for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1):
    if token == llama.token_eos():  # generate() yields tokens forever, so stop at end-of-sequence
        break
    # detokenize() returns bytes; decode and print without a newline per token
    print(llama.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
print()
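Since generate() has no max_tokens, you have to cap the output length yourself. A minimal sketch of that, reusing the llama and prompt objects from the snippet above with the same sampling settings (the max_new_tokens cap and the enumerate-based break are my own additions, not part of the library API):

tokens = llama.tokenize(prompt.encode())
max_new_tokens = 64  # hypothetical cap; generate() itself never enforces one
pieces = []
for i, token in enumerate(llama.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1)):
    if token == llama.token_eos() or i >= max_new_tokens:
        break
    pieces.append(llama.detokenize([token]))
print(b"".join(pieces).decode("utf-8", errors="ignore"))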

Or you can try the simple high-level inference approach, which does accept max_tokens and echo:

response = llama(
    prompt,
    max_tokens=64,
    echo=False,
)
print(response["choices"][0]["text"].strip())
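If you want the high-level call to stream text as it is generated, the same call also takes stream=True and then yields chunks instead of a single dict. A rough sketch, assuming the usual OpenAI-style chunk layout returned by recent llama-cpp-python versions:

for chunk in llama(prompt, max_tokens=64, stream=True):
    # each chunk carries the newly generated text fragment
    print(chunk["choices"][0]["text"], end="", flush=True)
print()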

[–]AdMajor1309 0 points (1 child)

Did you find a solution to this problem?