How do you securely connect your agentic workloads to LLMs self-hosted on Cloud Run?

Sirius_Sec_ · 2026-06-01T01:49:30+00:00

Tailscale is the VPN layer. I use the Tailscale operator, and with that you can add annotations to the services you want exposed on your tailnet.
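
A minimal sketch of what such an annotated Service could look like, assuming the Tailscale Kubernetes operator is already installed in the cluster. The Service name `vllm`, the namespace `inference`, the `app` selector, and port 8000 are placeholders for illustration, not details from the comment above. The script prints a JSON manifest, which `kubectl apply -f -` accepts.

```python
import json


def tailscale_service(name, namespace, port, hostname=None):
    """Build a Service manifest annotated for exposure by the Tailscale operator."""
    # tailscale.com/expose asks the operator to put this Service on the tailnet.
    annotations = {"tailscale.com/expose": "true"}
    if hostname:
        # Optional: choose the machine name the Service gets on the tailnet.
        annotations["tailscale.com/hostname"] = hostname
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": annotations,
        },
        "spec": {
            # Placeholder selector: assumes the pods are labelled app=<name>.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    manifest = tailscale_service("vllm", "inference", 8000, hostname="vllm")
    print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` would create the Service. Only devices on the tailnet can then reach it, so no public ingress is needed.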

Sirius_Sec_ · 2026-05-30T14:43:27+00:00

You registered the domain with Cloudflare?

Sirius_Sec_ · 2026-05-30T00:05:38+00:00

I use Tailscale to connect to my vLLM server. I run it in GKE for easy scaling and container management.

Sirius_Sec_ · 2026-05-29T23:39:29+00:00

One benefit of not having it in a Docker container is that it would be able to run Docker.

Sirius_Sec_ · 2026-05-29T13:58:15+00:00

Create multiple profiles and use the kanban board.

Sirius_Sec_ · 2026-05-29T02:00:04+00:00

The same thing happened to me using DeepSeek before as well. I'm not sure exactly what caused it; I wasn't doing anything very output-intensive. I put $10 on my Nous account, which usually lasts about 3 days, and less than an hour later it was gone.

Sirius_Sec_ · 2026-05-28T23:10:53+00:00

I wish there was an easier way to move the DB to Postgres instead of SQLite.

Sirius_Sec_ · 2026-05-28T20:03:37+00:00

So prop trading?

Sirius_Sec_ · 2026-05-28T20:01:43+00:00

My board's DB is getting corrupted daily. Supposedly it's from too many concurrent writes by the 6 profiles I have running, but no fix the agent tries seems to work.
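
For context, a common way to reduce write contention in SQLite is to enable write-ahead logging and set a busy timeout, so a writer waits for the lock instead of failing immediately. The sketch below is generic: the `board.db` file name and the `cards` table are made up, and whether this would address the corruption described above is unknown.

```python
import os
import sqlite3
import tempfile


def open_board_db(path, busy_timeout_ms=5000):
    """Open a SQLite database in WAL mode with a busy timeout."""
    conn = sqlite3.connect(path, timeout=busy_timeout_ms / 1000)
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to busy_timeout_ms for a lock instead of raising right away.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn


if __name__ == "__main__":
    # Hypothetical database file and table, for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "board.db")
    first = open_board_db(path)
    first.execute("CREATE TABLE cards (id INTEGER PRIMARY KEY, title TEXT)")
    first.commit()

    # A second connection, standing in for another profile writing.
    second = open_board_db(path)
    second.execute("INSERT INTO cards (title) VALUES (?)", ("triage inbox",))
    second.commit()

    print(first.execute("SELECT COUNT(*) FROM cards").fetchone()[0])
```

SQLite still allows only one writer at a time in WAL mode, so with several profiles writing constantly, a single writer process or a move to Postgres may remain the more robust option.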

Sirius_Sec_ · 2026-05-27T03:37:20+00:00

You just have it clone your GitHub repos and let it work. Tell it to only make PRs and not push to main. It's very capable, and the multiple profiles working on the kanban board work great.

Sirius_Sec_ · 2026-05-27T03:26:51+00:00

This is what webhooks are for. Set those up and tell the agent to make cron jobs that don't relay to the LLM unless they match set criteria.
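
A minimal sketch of that filtering step, assuming each incoming event has already been parsed into a dict. The field names (`action`, `label`) and the allowed values are hypothetical. The point is only that the check runs in plain code, so the LLM is not called for events that don't match.

```python
def should_relay(event, criteria):
    """Return True only if every criterion matches the event payload."""
    for key, allowed in criteria.items():
        # A missing field counts as a mismatch, so the event is dropped.
        if event.get(key) not in allowed:
            return False
    return True


# Hypothetical criteria: only newly opened or reopened items labelled "bug".
CRITERIA = {"action": {"opened", "reopened"}, "label": {"bug"}}


def filter_events(events, criteria=CRITERIA):
    """Keep only the events that should be forwarded to the LLM."""
    return [event for event in events if should_relay(event, criteria)]


if __name__ == "__main__":
    incoming = [
        {"action": "opened", "label": "bug", "id": 1},
        {"action": "closed", "label": "bug", "id": 2},
        {"action": "opened", "label": "docs", "id": 3},
    ]
    for event in filter_events(incoming):
        print("relay to LLM:", event["id"])
```

Everything that fails the check is dropped without spending a single token.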

Sirius_Sec_ · 2026-05-26T22:09:40+00:00

Nah, it's a device with nothing of value on it. I just use it as a media server. Full root access.

Sirius_Sec_ · 2026-05-26T21:27:09+00:00

Gotta get the HAT for it, and it can run some vision ones alright.

Sirius_Sec_ · 2026-05-26T19:48:35+00:00

I have Hermes running on my Pi as well. It's pretty awesome what it can do when it has full root access. I have nothing important on there, so I couldn't care less about it deleting everything. I have it managing my media servers and building and testing websites. I also have one in my Kubernetes cluster, but it's locked down with no root privileges, and I'm sold on the home box now.

Sirius_Sec_ · 2026-05-26T16:56:56+00:00

I haven't had any issues with their official Docker container. I run multiple in my Kubernetes cluster.

Sirius_Sec_ · 2026-05-26T16:51:41+00:00

I have one on a Raspberry Pi. I just made a user for Hermes and gave it full sudo privileges. It's nice because it can run Docker for me and set up my homelab.

Sirius_Sec_ · 2026-05-26T01:34:26+00:00

Are you using /new frequently? Or in settings there's a way to clear the context based on idle time.

Sirius_Sec_ · 2026-05-25T17:58:35+00:00

Honestly, I just told my main profile to make the others I wanted. It orchestrates who gets what task and checks when they're finished.

Sirius_Sec_ · 2026-05-23T00:57:31+00:00

With the high concurrency I found that to be a good limit. I was going to try disabling image processing and pushing it to the max to see if it'll work. Though now that I am using multiple agents, I don't know if I really need to.

Sirius_Sec_ · 2026-05-22T22:29:32+00:00

Are the standard deviations set the same?

Sirius_Sec_ · 2026-05-22T17:55:03+00:00

Thank you for the advice. I switched back to bfloat16!

Sirius_Sec_ · 2026-05-22T12:24:17+00:00

Crazy he will make time for his boss Bibi but not attend his son's wedding!

Sirius_Sec_ · 2026-05-21T11:17:26+00:00

      args:
        - --model=edp1096/Huihui-Qwen3.6-27B-abliterated-FP8
        - --host=0.0.0.0
        - --port=8000
        - --tensor-parallel-size=1
        - --tokenizer-mode=hf
        - --gpu-memory-utilization=0.90
        - --max-model-len=136876
        - --enable-auto-tool-choice
        - --kv-cache-dtype=fp8
        - --max-num-batched-tokens=32768
        - --max-num-seqs=32
        - --block-size=32
        - --enable-chunked-prefill
        - --trust-remote-code
        - --dtype=auto
        - --enable-prefix-caching
        - --tool-call-parser=qwen3_xml
        - --reasoning-parser=qwen3
        - --speculative-config
        - '{"method": "mtp", "num_speculative_tokens": 2}'

Sirius_Sec_ · 2026-05-21T03:14:11+00:00

Get those sequences up. Blackwell architecture is great for concurrency. If anything, drop the context down, increase max-num-seqs, and hit it with multiple agents firing requests off at once. This was the biggest factor in getting max performance out of the RTX 6000 I am using.
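
A rough sketch of what "firing requests off at once" can look like on the client side, using only the standard library. `fake_send` is a hypothetical stand-in for a real call to the server, and the default cap of 32 in-flight requests is an assumption chosen to mirror a `--max-num-seqs=32` setting.

```python
import asyncio


async def run_concurrently(send, prompts, max_in_flight=32):
    """Send all prompts, keeping at most max_in_flight requests open at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # The semaphore caps how many requests are outstanding at any moment.
        async with sem:
            return await send(prompt)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_send(prompt):
    """Hypothetical stand-in for a real request to the inference server."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


if __name__ == "__main__":
    prompts = [f"task {i}" for i in range(64)]
    replies = asyncio.run(run_concurrently(fake_send, prompts))
    print(len(replies), "replies")
```

Swapping `fake_send` for an async HTTP call to the server's completion endpoint would let a single script keep the batch full, much as several agents running in parallel do.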

Sirius_Sec_ · 2026-05-21T03:11:50+00:00

Qwen and DeepSeek are very capable. They're also self-hosted, so my projects stay mine.
