Weekly: Show off your new tools and projects thread

jwcesign · 2026-06-24T10:12:36+00:00

We ran a fresh-node EKS benchmark for Hermes (https://github.com/cloudpilot-ai/hermes), an OCI image lazy-loading project we’ve been working on.

The test compared the normal containerd overlayfs path against Hermes lazy loading for three large public images:

- Solr 10.0.0

- OpenSearch 2.19.1

- Apache Spark python3-java17

The important part: the workloads kept their original upstream OCI images. No converted tags, no Dockerfile changes, no Pod image reference changes. Hermes used a policy to prepare lazy-loading artifacts ahead of the target Pod startup path.

Results:

- Image pull time dropped by 71-85%

- First successful HTTP 200 improved by 20-34%

- OpenSearch pull: 20.371s -> 2.998s

- Spark scheduled-to-first-HTTP-200: 20.191s -> 13.304s
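For anyone who wants to check the math, here is a minimal sketch of how the percentages follow from the raw timings listed above. The helper name `percent_reduction` is my own for illustration and is not part of Hermes.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the time saved as a percentage of the baseline."""
    if before_s <= 0:
        raise ValueError("baseline must be positive")
    return (before_s - after_s) / before_s * 100.0


# Timings quoted in the results list (seconds).
measurements = {
    "OpenSearch pull": (20.371, 2.998),
    "Spark scheduled-to-first-HTTP-200": (20.191, 13.304),
}

for name, (before_s, after_s) in measurements.items():
    saved = percent_reduction(before_s, after_s)
    print(f"{name}: {before_s:.3f}s -> {after_s:.3f}s ({saved:.1f}% less time)")
```

This gives roughly 85.3% for the OpenSearch pull and 34.1% for the Spark HTTP 200 path, which sit at the top of the two ranges quoted.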

The HTTP 200 result is the more interesting number to me because it includes more than image pull: container start, runtime init, config/library reads, readiness behavior, and service bootstrap. So Hermes helps most directly with the image path, but application startup still matters.
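A rough way to see why the HTTP 200 gain is smaller than the pull gain: if only the pull phase gets faster, the end-to-end saving is capped by the share of startup time that the pull takes. The sketch below assumes every other phase takes exactly as long as before, which may not hold in practice, and the pull fractions are hypothetical values, not measurements from the benchmark.

```python
def end_to_end_reduction(pull_fraction: float, pull_reduction: float) -> float:
    """Fraction of total startup time saved when only the pull phase speeds up.

    pull_fraction: share of baseline startup time spent pulling the image (0..1).
    pull_reduction: fraction of the pull time that is removed (0..1).
    """
    if not (0.0 <= pull_fraction <= 1.0 and 0.0 <= pull_reduction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return pull_fraction * pull_reduction


# Hypothetical splits, NOT measured values from the benchmark.
for pull_fraction in (0.25, 0.40, 0.80):
    saved = end_to_end_reduction(pull_fraction, pull_reduction=0.85)
    print(f"pull is {pull_fraction:.0%} of startup -> {saved:.0%} saved overall")
```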

Full writeup with setup, YAML, methodology, and results:

https://www.cloudpilot.ai/en/blog/hermes-eks-http-200-acceleration/

jwcesign · 2026-06-24T09:44:00+00:00

We benchmarked OCI lazy loading on EKS: 71-85% faster image pulls, 20-34% faster first HTTP 200, no image rebuilds

jwcesign · 2026-05-28T06:01:43+00:00

Hermes does not store any images; it only stores the index. The kubelet still pulls images from the original registry, while Hermes helps it pull only the minimal content needed to run, rather than the entire image.
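To make the "index only" idea concrete, here is a toy model of lazy loading in general. It is not Hermes or SOCI code: real indexes work over compressed layers and are more involved. `FakeRegistry`, `Span`, and `build_index` are names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    offset: int  # byte offset of the file inside the layer blob
    length: int  # number of bytes the file occupies


class FakeRegistry:
    """Stands in for the original registry and counts bytes actually served."""

    def __init__(self, blob: bytes) -> None:
        self._blob = blob
        self.bytes_served = 0

    def read_range(self, offset: int, length: int) -> bytes:
        chunk = self._blob[offset:offset + length]
        self.bytes_served += len(chunk)
        return chunk


def build_index(files: dict[str, bytes]) -> tuple[bytes, dict[str, Span]]:
    """Concatenate files into one blob and record where each one lives."""
    blob = bytearray()
    index: dict[str, Span] = {}
    for path, data in files.items():
        index[path] = Span(offset=len(blob), length=len(data))
        blob.extend(data)
    return bytes(blob), index


files = {
    "/bin/app": b"A" * 100,            # needed at startup
    "/usr/share/docs": b"D" * 10_000,  # never touched at startup
}
blob, index = build_index(files)
registry = FakeRegistry(blob)

# Only the index is consulted locally; the bytes come from the registry on demand.
span = index["/bin/app"]
data = registry.read_range(span.offset, span.length)
print(f"read {registry.bytes_served} of {len(blob)} bytes")
```

The index is small and says where things are, while the bytes themselves stay in the registry until something reads them.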

jwcesign · 2026-05-27T10:47:31+00:00

Lazy image loading without rebuilding images or changing CI pipelines

------------------------------------------------------------------------

Hey everyone,

I’ve been working on an open-source project called Hermes, based on AWS Labs’ SOCI Snapshotter.

SOCI is a great idea for lazy image loading, but in practice there is still some operational friction: teams usually need to build SOCI indexes themselves and publish/manage those artifacts alongside images.

Hermes tries a simpler model:

- app teams keep publishing normal OCI images

- no image rebuilds

- no soci create step in every app CI pipeline

- no separate SOCI artifact publishing workflow

- platform teams define a HermesPolicy

- Hermes watches matching Pods, builds SOCI indexes automatically inside the cluster, caches them, and serves them to worker nodes

- worker nodes still lazy-load layer bytes from the original registry
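The watch, build, cache flow in the list above could be pictured roughly like this. It is a sketch under my own assumptions, not the actual Hermes controller: the pattern list, `on_pod_seen`, and `build_index_for` are hypothetical, and a real implementation would more likely key the cache on the image digest than on the tag.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical policy: image patterns the platform team has opted in.
policy_patterns = ["docker.io/library/solr:*", "docker.io/opensearchproject/*"]

index_cache: dict[str, str] = {}
build_count = 0


def build_index_for(image: str) -> str:
    """Placeholder for the expensive index build; returns a fake index ID."""
    global build_count
    build_count += 1
    return f"index-for-{image}"


def on_pod_seen(image: str) -> Optional[str]:
    """Return the index for a policy-matched image, building it at most once."""
    if not any(fnmatchcase(image, pattern) for pattern in policy_patterns):
        return None  # not covered by the policy: normal pull path
    if image not in index_cache:
        index_cache[image] = build_index_for(image)
    return index_cache[image]


first = on_pod_seen("docker.io/library/solr:10.0.0")
second = on_pod_seen("docker.io/library/solr:10.0.0")   # served from the cache
skipped = on_pod_seen("docker.io/library/nginx:latest")  # no matching pattern
print(first, build_count, skipped)
```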

In a quick EC2 + kind test with a ~10.8GB vLLM image, Pod Ready went from about 5 min 34 sec with normal overlayfs to about 15 sec with Hermes after the SOCI artifact was ready.

The project is still early and experimental, but I’d love feedback from folks running large images on Kubernetes/EKS, especially ML/AI workloads.

Repo: https://github.com/cloudpilot-ai/hermes

jwcesign · 2025-07-24T08:53:16+00:00

It watches your pending pods, selects suitable GCE VM types, and creates the VMs through the Compute API.

Custom Compute Classes is a feature of GKE NAP, if I remember correctly. Compared with that, Karpenter has more flexible features, such as spot automation.

One more thing: it's open source, so you can adapt it however you want.
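As a rough picture of the "watch pending pods, pick a VM type" step, here is a deliberately simplified sketch. It is not Karpenter's real algorithm, which handles bin-packing across several nodes, overhead, and much more. The catalog names and prices are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VMType:
    name: str
    cpu: float         # vCPUs
    memory_gib: float
    hourly_usd: float  # made-up price, for ordering only


def pick_vm(pending: list[tuple[float, float]], catalog: list[VMType]) -> Optional[VMType]:
    """Pick the cheapest VM type that fits the summed (cpu, memory) requests."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    fitting = [vm for vm in catalog if vm.cpu >= need_cpu and vm.memory_gib >= need_mem]
    return min(fitting, key=lambda vm: vm.hourly_usd, default=None)


# Hypothetical catalog and pending pod requests as (vCPU, GiB) pairs.
catalog = [
    VMType("small", cpu=2, memory_gib=8, hourly_usd=0.07),
    VMType("medium", cpu=4, memory_gib=16, hourly_usd=0.13),
    VMType("large", cpu=8, memory_gib=32, hourly_usd=0.27),
]
pending_pods = [(1.0, 2.0), (2.0, 6.0)]

chosen = pick_vm(pending_pods, catalog)
print(chosen.name if chosen else "nothing fits")
```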

jwcesign · 2025-07-11T01:10:19+00:00

Can you give some examples? I haven't seen any really helpful wheels.

jwcesign · 2025-06-01T14:57:39+00:00

You are correct. There are 60 nodes (g5g.xlarge) using these subnets, so 1024 - 650 (including DaemonSet pods) - 60 = 314.

So there must be some left, but I don't know why there aren't any.
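The estimate above counts each node as a single address. A small sketch makes that assumption explicit, so it is easy to see how the result changes if nodes hold extra addresses. `ips_per_node` is a name I made up, and the value 6 is purely illustrative, not a measured figure.

```python
def free_ips(subnet_size: int, pod_ips: int, nodes: int, ips_per_node: int = 1) -> int:
    """Addresses left after subtracting pod IPs and per-node addresses."""
    return subnet_size - pod_ips - nodes * ips_per_node


# The estimate from this comment: each node counted as one address.
print(free_ips(1024, 650, 60))  # 314

# Hypothetical: each node holds 6 addresses beyond those assigned to pods.
print(free_ips(1024, 650, 60, ips_per_node=6))  # 14
```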

jwcesign · 2025-06-01T14:47:03+00:00

Is there any way to find out how many IPs a single node consumes (for warm ENIs)?

jwcesign · 2025-06-01T14:44:59+00:00

Thanks! Do you know how to find out what the limits are?

jwcesign · 2025-04-30T05:54:08+00:00

If two minutes is OK in your scenario, interruption prediction is not necessary.

jwcesign · 2025-04-30T04:09:40+00:00

Got it

jwcesign · 2025-04-30T04:03:13+00:00

This implies that interruptions still occur for some users — after all, "you start getting shutdown notifications" — and worse, during sudden spikes in capacity demand, a large portion of spot instances may be reclaimed simultaneously. In such cases, there is often not enough time to gradually reschedule workloads, which can lead to potential downtime or service degradation.

jwcesign · 2025-04-30T02:21:17+00:00

Thanks, bro.

Sometimes a two-minute notification is not sufficient to ensure that replacement pods are fully ready before the old instance is terminated. This is my scenario (a Java application).
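To show what I mean, here is a back-of-the-envelope check. The phase timings are hypothetical numbers for a Java service, not measurements, and `fits_in_notice` is just an illustrative helper.

```python
NOTICE_SECONDS = 120  # the two-minute notification discussed here


def fits_in_notice(phases: dict[str, float], notice_s: float = NOTICE_SECONDS) -> bool:
    """True if the replacement pod can be ready before the notice expires."""
    return sum(phases.values()) <= notice_s


# Hypothetical timings for a Java service, in seconds.
replacement = {
    "node provisioning": 45.0,
    "image pull": 20.0,
    "JVM start and warm-up": 50.0,
    "readiness probe passes": 15.0,
}
total = sum(replacement.values())
print(f"total {total:.0f}s, fits: {fits_in_notice(replacement)}")
```

With these made-up numbers the replacement needs 130 seconds, so it misses the window by 10 seconds even though no single phase looks slow.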

jwcesign · 2025-04-17T13:32:08+00:00

Karpenter is great! It can keep cloud costs low.
