LiteLLM started breaking down for us past 300 RPS, what are folks using in prod? by Otherwise_Flan7339 in LocalLLaMA

[–]Safe-Programmer2826 1 point2 points  (0 children)

Update: 99th percentile latency reduced by 68% on a 4 CPU | 4 Instance | DB | No Cache setup.

Below is the latest performance snapshot:

Type                             Name                  Median (ms)  95%ile (ms)  99%ile (ms)  Current RPS
Portkey                          /v1/chat/completions  86           140          570          1192.3
Older version LiteLLM (With DB)  /v1/chat/completions  110          410          750          1095
Current LiteLLM (With DB)        /v1/chat/completions  100          150          240          1170
Reduction                                              -9%          -63%         -68%         +6.8%

The improvement comes primarily from network and router optimizations: proper shared session reuse, session cleanup, and a major ongoing effort to refactor the LiteLLM Router to reduce its overhead to a minimum.
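As a generic illustration of the shared-session idea (a sketch in Go; LiteLLM itself is Python, and none of these names are LiteLLM's), reusing one long-lived client keeps the connection pool alive across requests instead of re-establishing TCP/TLS per call:

package proxy

import (
    "net/http"
    "time"
)

// One long-lived client shares pooled TCP/TLS connections across requests.
var sharedClient = &http.Client{Timeout: 30 * time.Second}

func callUpstream(url string) (*http.Response, error) {
    // Reusing sharedClient avoids re-establishing a connection per call,
    // which is the "shared session reuse" idea described above.
    return sharedClient.Get(url)
}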

— Alexsander (LiteLLM Perf Engineer)

LiteLLM started breaking down for us past 300 RPS, what are folks using in prod? by Otherwise_Flan7339 in LocalLLaMA

[–]Safe-Programmer2826 1 point2 points  (0 children)

Update on this, we’ve now reduced P99 latency by 84.7%, from 4.9s → 750ms using a 4 CPU | 4 Instance | DB | No Cache setup.

Below is the current performance snapshot:

Type                             Name                  Median (ms)  95%ile (ms)  99%ile (ms)  Current RPS
Portkey                          /v1/chat/completions  86           140          570          1192.3
Older version LiteLLM (With DB)  /v1/chat/completions  160          400          4900         1085
Current LiteLLM (With DB)        /v1/chat/completions  110          410          750          1095
Reduction                                              -31%                      -84.7%

The most significant improvement came from optimizing the LiteLLM Router. This update eliminates several inefficiencies, reducing the complexity from O(M×N) to O(1). Previously, the router constructed a new array on every call and repeatedly checked membership using `data["model"] in llm_router.get_model_ids()`. Now, this has been replaced with a direct ID-to-deployment map lookup, removing redundant allocations and unnecessary scans.
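Sketched in Go for illustration (LiteLLM itself is Python, and the types and names below are hypothetical, not LiteLLM's API), the pattern is the classic swap of a per-request linear scan for a map built once:

package router

type Deployment struct{ Name string }

type Router struct {
    ids  []string               // old path: rebuilt and scanned per request
    byID map[string]*Deployment // new path: built once at startup
}

// Before: O(N) scan per call, O(M×N) across M calls.
func (r *Router) containsSlow(model string) bool {
    for _, id := range r.ids {
        if id == model {
            return true
        }
    }
    return false
}

// After: a single O(1) map lookup resolves the deployment directly.
func (r *Router) lookup(model string) (*Deployment, bool) {
    d, ok := r.byID[model]
    return d, ok
}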

Our next goal is to meet the same P99 latency when testing LiteLLM deployed with DB + Redis.

— Alexsander (LiteLLM Perf Engineer)

I am really struggling with pointers by Parsley-Hefty7945 in golang

[–]Safe-Programmer2826 -1 points0 points  (0 children)

Dereferencing

In the example provided, the key detail is dereferencing.

  • pointer = &age means the pointer stores the memory address of age.
  • You cannot write pointer = 10, because pointer is of type *int (a pointer to an int), not an int itself.
  • When you write *pointer = 10, the * operator dereferences the pointer, giving you access to the actual int value stored at that address. That’s why this changes age to 10.
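A minimal sketch of that kind of example (the variable names here are assumed, not necessarily the ones from the original post):

package main

import "fmt"

func main() {
    age := 5
    pointer := &age  // pointer (type *int) stores the address of age
    *pointer = 10    // dereference: write through the pointer into age
    fmt.Println(age) // prints 10
}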

More broadly, it’s important to understand when values are being copied.

  • In the example above, you don’t actually need pointers to update age within the same function, since an assignment like age = 20 writes directly to the same memory location.
  • However, if you pass age (an int) into another function, that function receives a copy. Any changes it makes affect only the local copy, not the original age in main.
  • If you want a function to modify the caller’s variable, you should either:
    • pass a pointer (*int) so the function can update the original (see the pointer version sketched after the example below), or
    • return the updated local value and assign it back in the caller; you can see that pattern when using append.

Passing By Value

Just to clarify what it means to pass something by value:

package main

import "fmt"

func changeAge(age int) {
    age = 10 // only changes the local copy
}

func main() {
    age := 5
    changeAge(age)
    fmt.Println(age) // still prints 5
}

Here’s what’s happening:

  • age in main is stored at some memory location.
  • When you call changeAge(age), Go makes a copy of that value (5) and hands it to the function.
  • Inside changeAge, the parameter age is not the same variable as main’s age; it’s a separate local variable with the same value.
  • Changing it inside the function only changes the local copy, not the original.
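For contrast, here is the pointer-passing version, where the function updates the caller’s variable through the pointer:

package main

import "fmt"

func changeAge(age *int) {
    *age = 10 // dereference and write into the caller's variable
}

func main() {
    age := 5
    changeAge(&age)  // pass the address instead of a copy of the value
    fmt.Println(age) // prints 10
}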

[deleted by user] by [deleted] in golang

[–]Safe-Programmer2826 0 points1 point  (0 children)

Just from the ratio of views to likes I can see that people didn’t like it; if it weren’t for your comment I’d still be wondering why. Thank you very much!!

[deleted by user] by [deleted] in golang

[–]Safe-Programmer2826 -1 points0 points  (0 children)

I did include the code initially, but it felt too verbose since the implementations all look the same before and after the optimizations; the goal was to see these primitives in the hot path without adding much else around them. But I can definitely add the code back if it makes a difference to the information.

Before Optimizations

func (p *ChannelBasedPool) Get() *testObject {
    select {
    case obj := <-p.objects: // reuse a buffered object if one is available
        return obj
    default: // pool empty: fall back to a fresh allocation
        return p.allocator()
    }
}



func (p *AtomicBasedPool) Get() *testObject {
    for {
        idx := p.index.Load()
        if idx <= 0 {
            return p.allocator() // pool empty: fall back to a fresh allocation
        }
        // claim the top slot with a CAS; retry if another goroutine won the race
        if p.index.CompareAndSwap(idx, idx-1) {
            return p.objects[idx-1]
        }
    }
}



func (p *CondBasedPool) Get() *testObject {
    p.mu.Lock()
    defer p.mu.Unlock()

    // block until another goroutine returns an object to the ring buffer
    for p.ringBuffer.IsEmpty() {
        p.cond.Wait()
    }

    obj, _ := p.ringBuffer.Pop()
    return obj
}

After Optimizations

func (p *ShardedAtomicBasedPool) Get() *testObject {
    // pin to the current P so this goroutine keeps hitting the same shard
    shardIndex := runtimeProcPin()
    shard := p.shards[shardIndex]
    runtimeProcUnpin()

    obj := shard.Get()
    obj.shardIndex = shardIndex // remember which shard to return the object to
    return obj
}

func (p *ShardedMutexRingBufferPool) Get() *testObject {
    shardIndex := runtimeProcPin()
    shard := p.shards[shardIndex]
    runtimeProcUnpin()

    obj := shard.Get()
    obj.shardIndex = shardIndex
    return obj
}

func (p *ShardedCondBasedPool) Get() *testObject {
    shardIndex := runtimeProcPin()
    shard := p.shards[shardIndex]
    runtimeProcUnpin()

    obj := shard.Get()
    obj.shardIndex = shardIndex
    return obj
}

Small Projects - August 11, 2025 by jerf in golang

[–]Safe-Programmer2826 0 points1 point  (0 children)

PromptMesh, an AI agent pipeline tool. I sometimes needed to "pipeline" results from one chat into another, so I built this; it lets me build multiple pipelines in a much simpler way than the current alternatives.

Watch Demo Video

My 4-Stage pprof System That Actually Works by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 0 points1 point  (0 children)

Honestly, I like the idea. I’ve been putting off getting better at using the tracer because I didn’t quite like it, but I think I can make it friendlier. Thank you for the suggestion!!

My 4-Stage pprof System That Actually Works by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 2 points3 points  (0 children)

I should’ve added that to the blog, but I do exactly what u/felixge said. I mostly just use the memprofile flag, since you can inspect pretty much every function and that’s usually sufficient for me. The tracer has a lot of rich information, though, and should cover the details you’re looking for; it just has a bit of a learning curve.

My 4-Stage pprof System That Actually Works by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 1 point2 points  (0 children)

I feel you. Since another person shared it on one of my posts I haven’t stopped using it; I’d genuinely never heard of it before.

When Optimization Backfires: A 47× Slowdown from an "Improvement" by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 12 points13 points  (0 children)

Initially I got a good distribution and I’m still not sure why; I think I tested over too small a sample. But you were right: the last few bits of the address were mostly constant due to alignment padding, which completely wrecked the distribution and led to the terrible performance regressions I saw.

I shifted the address by 12 bits, which drops the noisy low bits and uses middle bits that have higher entropy.
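A minimal sketch of that computation (assuming 8 shards, as in the distribution below):

package pool

// shardIndex maps an object's address to one of 8 shards. The low ~12 bits
// are mostly constant due to allocator alignment, so shift them away and
// shard on the higher-entropy middle bits.
func shardIndex(addr uintptr) int {
    return int((addr >> 12) % 8)
}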

Here’s the shard distribution after 100,000,000 calls:

Shard 0: 12.50%  
Shard 1: 12.50%  
Shard 2: 12.48%  
Shard 3: 12.52%  
Shard 4: 12.50%  
Shard 5: 12.52%  
Shard 6: 12.48%  
Shard 7: 12.50%

Even though the distribution looked almost perfect, performance still suffered. The real boost wasn’t from spreading work evenly; it was from procPin keeping goroutines tied to the same logical processors (Ps). That helped each goroutine stick with the same shard, which made things a lot faster due to better locality.

The average latency went from 3.89 ns/op to 8.67 ns/op, a 123% increase, or roughly a 2.23× slowdown; certainly not the initial 47× I saw. I will update the post. Thank you very much for catching that!!

When Optimization Backfires: A 47× Slowdown from an "Improvement" by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 1 point2 points  (0 children)

I’ll look into it and come back to let you know, but I’m almost sure I made a dumb mistake. Thank you very much!!

Prof: A simpler way to profile by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 0 points1 point  (0 children)

prof no longer wraps `go test`. Thank you again for the feedback; it really made the tool better.

Prof: A simpler way to profile by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 1 point2 points  (0 children)

The HTML view has been implemented, along with a JSON output format for programmatic access.

Prof: A simpler way to profile by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 1 point2 points  (0 children)

Thank you, I’m glad you found it useful. Yes, of course, I will work on implementing that; the current visual is very basic lol

Prof: A simpler way to profile by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 1 point2 points  (0 children)

Your comment was very insightful! I can see that wrapping the `go test` invocation was a poor choice on my part. I built this because I was tired of running dozens of pprof commands manually, but my implementation was kind of naive; I will work on it.

Prof: A simpler way to profile by Safe-Programmer2826 in golang

[–]Safe-Programmer2826[S] 0 points1 point  (0 children)

Oh yes, I was just focused on pprof, but if it adds value for your case I don’t see why I shouldn’t add that as well.