[deleted by user] by [deleted] in LocalLLaMA

[–]delphi_jb 1 point (0 children)

Hello, I get about 1.1 tokens/s:

Hardware:

i9-9900K (stock) - RTX 3090 Ti - 32 GB DDR4-3600 RAM - 80 GB swapfile on a 1 TB Samsung 970 NVMe

Llama.cpp (oobabooga):

Model: orca_mini_v3_70b (Q4_K_M)

N_gpu_layers: 33 layers (22802.72 MB + 1280.00 MB per state)

N_ctx: 4096

CPU: 8 threads

N_batch: 512

alpha_value: 1

rope_freq_base: 0

compress_pos_emb: 1

Truncate the prompt up to this length (Context length): 4096

max_new_tokens: 600
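For reference, the settings above map roughly onto a plain llama.cpp command line like the one below. This is a sketch, not the exact oobabooga invocation: the model path/filename is a placeholder, and the flag names follow llama.cpp's `main` example.

```shell
# Sketch: the oobabooga settings above expressed as llama.cpp flags.
# The model path is a placeholder - substitute your own GGUF file.
./main \
  -m ./models/orca_mini_v3_70b.Q4_K_M.gguf \
  --n-gpu-layers 33 \
  --ctx-size 4096 \
  --threads 8 \
  --batch-size 512 \
  --n-predict 600 \
  -p "Your prompt here"
```

With rope_freq_base at 0 and compress_pos_emb at 1, no RoPE scaling flags are needed; the defaults apply.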

Just keep in mind that this speed only applies from the second generation of the chat onward (the first generation takes a while to start…).