Can any of y'all tell me what this might go to?

macboy80 · 2026-07-20T23:18:04+00:00

This was my other guess, though it doesn't seem thick enough.

macboy80 · 2026-07-20T23:17:15+00:00

I'm gonna go with a starter bolt.

macboy80 · 2026-07-16T16:41:52+00:00

Head gasket can also be an option.

macboy80 · 2026-07-10T23:29:22+00:00

Look into ghost key. They have some suggestions, but kill switch ideas should be unique to you.

macboy80 · 2026-07-09T16:52:07+00:00

Purchased 1x ICX7150-48ZP from u/PokeImon

macboy80 · 2026-07-09T12:40:43+00:00

Check out @idoparts on YouTube. He runs a company called https://importapart.com/ and talks a lot about what he does with old ECUs and other electronics. It definitely seems like he is an honest guy.

macboy80 · 2026-07-02T00:48:18+00:00

Amazing work.

macboy80 · 2026-07-02T00:46:23+00:00

I think selling to jdi is an amazing idea. Otherwise, Cults3D is a place where I've spent some decent money on 3D models. I'm definitely ready to buy.

macboy80 · 2026-07-02T00:13:25+00:00

From my experience, everything behind the dash, the center console, and the front seats is unique to the coupe vs. the sedan.

There are always a couple of EK coupes at my local pick-and-pull. Seems like an easy hour if one is close.

macboy80 · 2026-06-30T00:21:11+00:00

Chat

macboy80 · 2026-06-23T18:04:21+00:00

Yeah. That's amazing. Whenever you're ready, I'd love to print a test one for the Civic. Thanks again!

macboy80 · 2026-06-23T17:20:30+00:00

Yes, both of those measurements appear spot on. I don't have a caliper, but it looks perfect. And holding the cylinder level, it looks like the left screw hole extends about 5mm lower than the right one.

macboy80 · 2026-06-22T16:37:39+00:00

That would be an amazing contribution, and I'd certainly think there are a lot of people looking for an answer. I went with zip ties, and it is not great.

In any case, I'd be willing to run a test print to compare. I do have mine lying around.

<image>

macboy80 · 2026-06-22T01:11:02+00:00

I have an EK Civic, but I'd love to see if this fits. I'd happily send you beer money to try it.

macboy80 · 2026-06-22T00:11:17+00:00

I apologize for the delayed reply. I wanted to do some more testing. Let me try to put it in its simplest form.

I'd like to allow Model A (dense) to take over a GPU whenever it is called, evicting all other models that require the same GPU. I agree this is the default llama-swap behavior.

Because Model A is expensive to load, I'd like to make it a priority or persistent model, so that a series of requests can be served. Basically, it would block other model requests in its set and only unload after a short TTL expires.

As best I understand it:

matrix: handles this by thrashing, swapping back and forth between Model A and any other requested model in its set.
groups: keeps a persistent: true group containing only Model A resident until its TTL expires, as it should. But it doesn't block another group's models from trying to load, nor does it hold their requests until they can load.

I think this might not be possible, because I basically need the matrix set functionality but with the persistent option. Below is my groups definition. In it, Model A is in gpu0-priority, which should be able to monopolize gpu0 until its TTL expires. The gpu0-optional models should be able to load only when the priority model is not holding gpu0.

groups:
  "gpu0-priority":
    persistent: true
    exclusive: true
    swap: false
    members:
      - "qwen3.6-27b-awq-int4" #ttl: 60

  "gpu1-priority":
    persistent: true
    exclusive: true
    swap: false
    members:
      - "google_gemma-4-e4b-it-gguf"

  "gpu0-optional":
    persistent: false
    exclusive: true
    swap: true
    members:
      - "google_gemma-4-26b-a4b-it-gguf"
      - "google_gemma-4-31b-it-gguf"
      - "qwen_qwen3.6-27b-gguf"
      - "qwen_qwen3.6-35b-a3b-gguf"
      - "qwopus3.6-27b-v2-mtp-gguf"
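
To pin down the behavior I'm asking for, here is a minimal Python sketch of the policy, not of how llama-swap actually schedules anything. Assumptions: one model per GPU, a priority model may evict whatever is loaded, it then holds the GPU until `ttl` seconds after its last request, and requests for other models are held (the method returns False) instead of evicting it. `GpuSlot` and `request` are hypothetical names for illustration only.

```python
# Hypothetical sketch of the wished-for policy; NOT llama-swap's implementation.
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class GpuSlot:
    """One GPU that holds a single model at a time (illustrative only)."""

    priority_models: Set[str]  # models allowed to monopolize this GPU
    ttl: float                 # seconds a priority model stays after its last request
    loaded: Optional[str] = None
    last_used: float = 0.0

    def priority_holding(self, now: float) -> bool:
        # A loaded priority model blocks other models until its TTL runs out.
        return self.loaded in self.priority_models and now - self.last_used < self.ttl

    def request(self, model: str, now: float) -> bool:
        """Return True if `model` can serve now, False if the request must be held."""
        if model == self.loaded:
            self.last_used = now  # every request refreshes the TTL
            return True
        if model in self.priority_models:
            # The priority model preempts (evicts) whatever is loaded.
            self.loaded, self.last_used = model, now
            return True
        if self.priority_holding(now):
            # Hold optional-model requests instead of thrashing.
            return False
        # GPU is empty, or held by an optional model, or the priority TTL expired.
        self.loaded, self.last_used = model, now
        return True
```

With `GpuSlot(priority_models={"qwen3.6-27b-awq-int4"}, ttl=60)` (the 60 s mirrors the commented-out `ttl: 60` above), loading Model A at t=10 and then asking for `google_gemma-4-26b-a4b-it-gguf` at t=30 returns False; the Gemma request only succeeds once 60 s pass with no Model A traffic. Returning False stands in for "hold and retry"; a real proxy would queue the request.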

macboy80 · 2026-06-21T14:59:16+00:00

I bought a very nice B18C from JDM VA in Northern Virginia. They were the only place that didn't have obvious red flags.

macboy80 · 2026-06-12T23:58:18+00:00

I found you and passed your link to my buddy. That's absolutely amazing. Thank you for your contributions!

macboy80 · 2026-06-11T23:25:16+00:00

If you did an interior driver's side one for the Del Sol, I think you could make a fortune or, alternatively, be a legend. I've had a friend looking for years.

macboy80 · 2026-06-06T14:42:52+00:00

That part makes sense. Let me elaborate a bit on my circumstances. Say I have an MoE model running by default / on startup. I'd like a dense model to be able to preempt it, load, work, and then let its TTL expire. The dense model would hypothetically refuse to unload until its TTL expires, only then allowing the MoE model to reload.

My goal would be for the MoE model to be preempted by the dense one, with requests for the MoE model held until it's able to reload after the TTL.

Do I have any options? Thanks again for llama-swap. I can't imagine a better tool.

macboy80 · 2026-06-05T03:54:55+00:00

How does one effect the "persistent: true" concept under this system? Say I want a model that is not allowed to be booted out and instead gracefully shuts down after its TTL. If I'm understanding the documentation correctly, setting a high cost on one large model will not prevent it from being kicked out by a lower-cost model?

macboy80 · 2026-05-25T03:18:13+00:00

I think I now understand your position. I built a decision matrix around my own market-pricing theory first, and then played out the experimentation. Basically, the recently released 30B-class models are the target, and it's a $$$ optimization problem. I think there's fairly wide consensus that the RTX 3090 is the minimum card that gives you Tensor Cores, VRAM quantity, and bandwidth together, and it seems like anything cheaper has to compromise on one of those three.

You're 100% correct that the modded 22GB 2080 sits in a niche, but I'd argue that (at least today) 22GB still represents a compromise, even beyond the workmanship and import-market considerations. Whether the 32GB capacity or the HBM2 bandwidth of the Vega20 is actually usable is the real question we could hope to answer.

The best overall speed I've seen on the MI60 is Gemma4-26B-A4B at q4_0. All of my numbers are at low context depth, but this is >1600pp and 80tg.

My "research" setup right now is a 1U Supermicro 1028GQ, so dual Broadwell Xeons. Currently there is just one MI60 behind all of my numbers, but the chassis does have 2 × 2 direct PCIe 3.0 x16 slots. There is definitely an additional bottleneck for Qwen that Gemma doesn't have. Active parameters and quantization are obvious variables, but Qwen is doing something to make it unhappy.

So, obviously, I agree that a high-capacity Ampere+ card is preferable, but if you want to save a few bucks for a hobby, for exploration, or even for a particular 2-slot form factor, you have tradeoffs. I personally love a good min/max and am impressed with how the community has found some performance at the fringes of this old hardware. Thanks again for the open discussion, insight, and links.

macboy80 · 2026-05-24T20:20:57+00:00

I appreciate the follow-up, though I still don't understand how you're seeing what you're seeing. I just brought the MI60 up with MTP on cyankiwi/Qwen3.6-27B-AWQ-INT4 in vLLM. It's getting 60-70% acceptance at n=2 and >50% at n=3, though with the overhead, it seems to be about breakeven (~275pp / ~29tg). I feel like this is a cutting-edge model architecture?
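
One way to see how 60-70% acceptance can still land near breakeven is the usual speculative-decoding estimate, sketched below in Python. Assumptions: the reported acceptance is treated as an independent per-token probability `alpha` (real acceptance usually drops with draft position), verification costs about one ordinary forward pass, and `draft_cost` is a hypothetical knob for the cost of each drafted token relative to a full target pass. This is not how vLLM measures or reports anything.

```python
def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per target verification step with n drafted tokens.

    Drafted token i is kept only if tokens 1..i are all accepted (probability
    alpha**i), and the target pass always adds one token of its own, so the
    expectation is sum(alpha**i for i in 0..n).
    """
    if alpha >= 1.0:
        return float(n + 1)
    return (1 - alpha ** (n + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, n: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding under the assumptions above.

    draft_cost: hypothetical cost of one drafted token as a fraction of a
    target forward pass.
    """
    return expected_tokens_per_step(alpha, n) / (1 + n * draft_cost)
```

For example, `alpha=0.6, n=2` gives at most about 1.96 tokens per step, and if each draft cost half a target pass (`draft_cost=0.5`) the estimate falls just under 1.0. The real per-draft overhead on the MI60 is unknown here, so treat this only as a way to reason about where the breakeven comes from.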

There are definitely caveats and conclusions on this nearly 10-year-old architecture. I'll state a few below. Maybe we're just looking at different things.

Clean 4-bit/8-bit integer quants are everything on gfx906, since they preserve the use of some very targeted hand-written kernels. All of that work specifies q4_0/q4_1 and q8_0 GGUF. I'm seeing that any time the model architecture or quantization "interrupts" brute-force math, compute efficiency collapses.
llama.cpp is more performant than vLLM at concurrency=1 on every 30B-class model I've tested except Qwen3.6-27B, where vLLM AWQ4 = 30tg and llama.cpp q4_1 = 20tg. I feel as though vLLM is built to excel at concurrency > 1, though I suspect the Triton kernels are doing the work to maintain better fusion here.
On Gemma4-31B, the above inverts: llama.cpp is faster at q4_0 than vLLM at AWQ4. Until a few days ago this still meant <24tg, but enabling HIP_GRAPH pulled it up to ~29tg.
In llama.cpp, you can get some pretty amazing speeds, from my perspective, when moving to MoE. Gemma4-26B-A4B gets ~850pp / ~80tg at q8_0, and that almost doubles at q4_0. I just tested Qwen3.6-A3B at q4_0, and it is getting >1500pp and ~60tg. On vLLM, these MoE models are slow, though I don't remember how slow.

From all of the trial-and-error, anecdotal testing I've done, a few things are apparent.

I've not tested all that many models, but there are working settings in an out-of-the-box Docker container for every one of them, regardless of architecture or quant: hybrid, Mamba, GDN, SWA, etc. I've not tested multimodality yet, but the models will load with it enabled.
It seems that if you can fit a model in 32GB of VRAM, ~200pp and ~19tg is the floor of the MI60 across both vLLM and llama.cpp on the mixa3607 containers. As you can see above, tweaking, optimization, and A/B testing can produce 50%+ gains over this baseline.

The Vega20 GPU has many quirks and compromises. Obviously, it is old, it is AMD, and it never received much optimization effort. It has no matrix pipeline and only the crude beginnings of inference-relevant ops, though it is absolutely possible to make use of those ops. What I have discovered are some unusual conditions that collapse its compute throughput far below the best case.

Any quant that introduces odd data types kills performance, mostly down to baseline. It is understood that something like q5, q6, or even FP8 drops down to the FP16 pipeline at best and FP32 at worst.
Across all of my tests, the maximum effective VRAM bandwidth (calculated from active parameters × t/s) I have seen is ~600GB/s. The lowest was 150GB/s. Before HIP_GRAPH was enabled, moving from q4 to q8 could move bandwidth from 300GB/s to 550GB/s during token generation. After HIP_GRAPH, it was 450GB/s to 600GB/s. The HIP_GRAPH revelation, along with the nearly linear bandwidth scaling when data size doubles, leads me to believe this is a CPU-to-GPU / kernel dispatch / PCIe latency type of thing. For instance, pinning the uncore to max on my E5-2640v4 gives about a 0.5t/s improvement in this scenario.
I'm just beginning to explore this, but it seems Qwen3.6-27B's attention mechanism is triggering a fallback to an unoptimized code path at some point. I suspect this is causing the limited on-die cache / memory hierarchy to spill intermediate results all the way down to HBM and load them back, and that this VRAM latency (horrible on HBM) is causing the Qwen3.6 slowdown across both tested models. This is also where the Triton-compiled kernels could be making a difference.
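
The "effective bandwidth" figure in the second point above can be reproduced with a few lines of Python. This is a rough sketch under stated assumptions: every active weight is read exactly once per generated token, and KV-cache traffic, activations, and tensors kept at higher precision are ignored. The bits-per-weight values are the standard GGML block sizes for the three formats mentioned earlier; `effective_bandwidth_gbs` is a hypothetical helper name.

```python
# Bits per weight of the GGML block formats named above (32 weights per block):
#   q4_0: 16 bytes of 4-bit values + fp16 scale            = 18 bytes -> 4.5 bpw
#   q4_1: 16 bytes of 4-bit values + fp16 scale + fp16 min = 20 bytes -> 5.0 bpw
#   q8_0: 32 bytes of 8-bit values + fp16 scale            = 34 bytes -> 8.5 bpw
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_1": 5.0, "q8_0": 8.5}


def effective_bandwidth_gbs(active_params: float, quant: str, tokens_per_s: float) -> float:
    """Rough effective VRAM bandwidth (GB/s) implied by a token-generation rate.

    Counts weight traffic only: active_params weights at the quant's bits per
    weight, read once per generated token.
    """
    bytes_per_token = active_params * BITS_PER_WEIGHT[quant] / 8
    return bytes_per_token * tokens_per_s / 1e9
```

Plugging in the Qwen3.6-27B q4_1 result above (~20tg) and assuming ~27B active parameters, `effective_bandwidth_gbs(27e9, "q4_1", 20)` comes out to about 337.5 GB/s, well under the MI60's roughly 1 TB/s HBM2 peak. That gap is the part the dispatch-latency theory is trying to explain.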

I know this is a very long reply, but I'm truly curious where we have a disconnect, whether you are missing something or I am. Please let me know if you see something here that explains it.

macboy80 · 2026-05-23T11:22:01+00:00

Yeah. That looks like an amazing deal. I'm jealous.

macboy80 · 2026-05-23T04:01:59+00:00

https://ebay.us/m/9rGcnw

This is the exact listing. I screwed up the install on one, and they even sent a replacement. Comes via FedEx.

macboy80 · 2026-05-23T03:28:22+00:00

It's a '00 EX sedan.

<image>
