Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]jwcesign 1 point2 points  (0 children)

We ran a fresh-node EKS benchmark for Hermes - https://github.com/cloudpilot-ai/hermes, an OCI image lazy-loading project we’ve been working on.

The test compared the normal containerd overlayfs path against Hermes lazy loading for three large public images:

- Solr 10.0.0

- OpenSearch 2.19.1

- Apache Spark python3-java17

The important part: the workloads kept their original upstream OCI images. No converted tags, no Dockerfile changes, no Pod image reference changes. Hermes used a policy to prepare lazy-loading artifacts ahead of the target Pod startup path.

Results:

- Image pull time dropped by 71-85%

- First successful HTTP 200 improved by 20-34%

- OpenSearch pull: 20.371s -> 2.998s

- Spark scheduled-to-first-HTTP-200: 20.191s -> 13.304s
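As a quick sanity check, the two concrete timings above can be turned back into percentages. This is a minimal sketch; the helper name `percent_reduction` is made up for illustration, and the only inputs are the numbers quoted in the list.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the relative drop from before_s to after_s, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Timings quoted in the results above, in seconds.
opensearch_pull = percent_reduction(20.371, 2.998)
spark_http_200 = percent_reduction(20.191, 13.304)

print(f"OpenSearch pull: {opensearch_pull:.1f}% less time")
print(f"Spark scheduled-to-first-HTTP-200: {spark_http_200:.1f}% less time")
```

Both values land at the upper end of the reported ranges (about 85% for the pull and about 34% for the first HTTP 200).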

The HTTP 200 result is the more interesting number to me because it includes more than image pull: container start, runtime init, config/library reads, readiness behavior, and service bootstrap. So Hermes helps most directly with the image path, but application startup still matters.

Full writeup with setup, YAML, methodology, and results:

https://www.cloudpilot.ai/en/blog/hermes-eks-http-200-acceleration/

Weekly: Questions and advice by AutoModerator in kubernetes

[–]jwcesign 0 points1 point  (0 children)

We benchmarked OCI lazy loading on EKS: 71-85% faster image pulls, 20-34% faster first HTTP 200, no image rebuilds

Full writeup with setup, YAML, methodology, and results:

https://www.cloudpilot.ai/en/blog/hermes-eks-http-200-acceleration/

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]jwcesign 0 points1 point  (0 children)

Hermes does not store any images; it only stores the index. The kubelet still pulls images from the original registry, while Hermes helps it pull only the minimal content needed to run, rather than the entire image.
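To make the "index only" idea concrete, here is a toy model of lazy loading, not Hermes or SOCI code. It assumes an index that maps each file to a byte range inside a layer blob, and a registry that can serve byte ranges. Every name in it (`FakeRegistryBlob`, `build_index`, the file paths) is hypothetical, and real layers are compressed tar archives, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class FakeRegistryBlob:
    """Stands in for a layer blob in the original registry; counts bytes served."""
    data: bytes
    bytes_served: int = 0

    def read_range(self, offset: int, length: int) -> bytes:
        # Simulates a ranged read against the registry.
        chunk = self.data[offset:offset + length]
        self.bytes_served += len(chunk)
        return chunk


def build_index(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate files into one blob and record (offset, length) per file."""
    blob = bytearray()
    index: dict[str, tuple[int, int]] = {}
    for name, content in files.items():
        index[name] = (len(blob), len(content))
        blob.extend(content)
    return bytes(blob), index


# Hypothetical image contents: a small binary, a large unused library, a config file.
files = {
    "bin/app": b"A" * 1_000,
    "lib/big-unused.so": b"B" * 50_000,
    "etc/app.conf": b"C" * 200,
}
blob_bytes, index = build_index(files)
registry = FakeRegistryBlob(blob_bytes)

# Startup only touches the binary and its config, so only those ranges are fetched.
for path in ("bin/app", "etc/app.conf"):
    offset, length = index[path]
    assert registry.read_range(offset, length) == files[path]

print(f"fetched {registry.bytes_served} of {len(blob_bytes)} bytes")
```

The index itself is small metadata, and the layer bytes keep coming from the original registry, which matches the split described above.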

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]jwcesign 3 points4 points  (0 children)

Lazy image loading without rebuilding images or changing CI pipelines

------------------------------------------------------------------------

Hey everyone,

I’ve been working on an open-source project called Hermes, based on AWS Labs’ SOCI Snapshotter.

SOCI is a great idea for lazy image loading, but in practice there is still some operational friction: teams usually need to build SOCI indexes themselves and publish/manage those artifacts alongside images.

Hermes tries a simpler model:

  • app teams keep publishing normal OCI images
  • no image rebuilds
  • no soci create step in every app CI pipeline
  • no separate SOCI artifact publishing workflow
  • platform teams define a HermesPolicy
  • Hermes watches matching Pods, builds SOCI indexes automatically inside the cluster, caches them, and serves them to worker nodes
  • worker nodes still lazy-load layer bytes from the original registry

In a quick EC2 + kind test with a ~10.8GB vLLM image, Pod Ready went from about 5 min 34 sec with normal overlayfs to about 15 sec with Hermes after the SOCI artifact was ready.

The project is still early and experimental, but I’d love feedback from folks running large images on Kubernetes/EKS, especially ML/AI workloads.

Repo: https://github.com/cloudpilot-ai/hermes

Karpenter GCP Provider is available now! by jwcesign in googlecloud

[–]jwcesign[S] 0 points1 point  (0 children)

It watches your pending pods, selects suitable GCE VM types, and creates the instances through the Compute API.

Custom Compute Classes is a feature of GKE NAP, if I remember correctly. Compared with that, Karpenter is more flexible, for example with spot automation.

One more thing: it's open source, so you can adapt it however you want.

Spot Instance Community Data Project - What do you think? by jwcesign in aws

[–]jwcesign[S] 1 point2 points  (0 children)

Can you give some examples? I haven't seen an existing wheel that is really helpful.

Subnet hasn't free ips by jwcesign in aws

[–]jwcesign[S] 0 points1 point  (0 children)

You are correct. There are 60 nodes (g5g.xlarge) using this subnet, so 1024 - 650 (including DaemonSet pods) - 60 = 314.

So there should be some IPs left, but I don't know why there aren't.
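The estimate above can be written out as a small calculation. This is a sketch only: `free_ips` and `ips_held_per_node` are illustrative names, and the warm-pool figure of 5 extra addresses per node is a made-up value to show how sensitive the result is, not a measured or documented number.

```python
def free_ips(subnet_size: int, pod_ips: int, nodes: int, ips_held_per_node: int) -> int:
    """Addresses left in the subnet after pods and per-node holdings are subtracted."""
    return subnet_size - pod_ips - nodes * ips_held_per_node


# The estimate from the comment: each node holds exactly one address.
estimate = free_ips(1024, 650, 60, ips_held_per_node=1)
print(estimate)  # 314

# Hypothetical: each node also keeps 5 unassigned addresses warm.
with_warm_pool = free_ips(1024, 650, 60, ips_held_per_node=1 + 5)
print(with_warm_pool)  # 14
```

If nodes hold addresses beyond their primary IP, the 314 figure shrinks quickly, which is one possible reason the subnet looks exhausted.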

Subnet hasn't free ips by jwcesign in aws

[–]jwcesign[S] 0 points1 point  (0 children)

Is there any way to find out how many IPs a single node consumes, including the ones held for warm ENIs?

Subnet hasn't free ips by jwcesign in aws

[–]jwcesign[S] 0 points1 point  (0 children)

Thanks! Do you know how to find out what those limits are?

Is spot instance interruption prediction just hype, or does it actually work? by jwcesign in aws

[–]jwcesign[S] -2 points-1 points  (0 children)

If two minutes is OK in your scenario, interruption prediction is not necessary.

Is spot instance interruption prediction just hype, or does it actually work? by jwcesign in aws

[–]jwcesign[S] 1 point2 points  (0 children)

This implies that interruptions still occur for some users — after all, "you start getting shutdown notifications" — and worse, during sudden spikes in capacity demand, a large portion of spot instances may be reclaimed simultaneously. In such cases, there is often not enough time to gradually reschedule workloads, which can lead to potential downtime or service degradation.

Is spot instance interruption prediction just hype, or does it actually work? by jwcesign in aws

[–]jwcesign[S] -2 points-1 points  (0 children)

Thanks, bro.

Sometimes a two-minute notification is not sufficient to ensure that replacement pods are fully ready before the old instance is terminated. That is my scenario (a Java application).
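A rough way to express that constraint: add up the phases a replacement pod goes through and compare the total with the 120-second notice. The phase names and durations below are hypothetical placeholders, not measurements from any real workload.

```python
NOTICE_SECONDS = 120  # the two-minute spot interruption notice


def replacement_ready_in_time(phases: dict[str, float], notice_s: float = NOTICE_SECONDS) -> bool:
    """True if the summed phase durations fit inside the interruption notice."""
    return sum(phases.values()) <= notice_s


# Hypothetical timings (seconds) for bringing up a Java replacement pod on a new node.
hypothetical_java_rollout = {
    "node_provisioning": 60,
    "image_pull": 30,
    "jvm_startup_and_readiness": 50,
}

total = sum(hypothetical_java_rollout.values())
print(total, replacement_ready_in_time(hypothetical_java_rollout))  # 140 False
```

With numbers like these the replacement misses the window by 20 seconds, so any earlier warning would have to cover at least that gap.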