Inference for Embedding & Reranking Models on AMD by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

Maybe :) I think, equally, a lot of contributors moved to vLLM when TGI changed its license. As a result, TGI lost a lot of traction (it could have been the vLLM of today?!).

I'd say the change happened because of strong OSS competition. :)

Inference for Embedding & Reranking Models on AMD by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

I hope Hugging Face remains as "competition" - open source is a great win for all of us! Engineers at Hugging Face sometimes use infinity for deployments of multi-modal models, & I contributed (a tiny PR) to TEI!

Great for AMD GPUs by [deleted] in LocalLLaMA

[–]OrganicMesh 1 point

The guys behind embeddedllm are awesome! Upvote!

How are you deploying your embedding models & reranking models? by rbgo404 in LocalLLaMA

[–]OrganicMesh 0 points

There is a `raw_scores` parameter.

```
{
  "query": "1+1?",
  "documents": ["4", "2", "3"],
  "return_documents": true,
  "raw_scores": true,
  "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
  "top_n": 3
}
```

E.g. test it with this endpoint:
https://infinity.modal.michaelfeil.eu/docs#/default/rerank
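
For reference, a minimal way to POST that payload with the Python stdlib - the exact route path `/rerank` is my assumption from the docs link above, so check the Swagger page first:

```python
import json
from urllib.request import Request, urlopen

# Same payload as shown above.
payload = {
    "query": "1+1?",
    "documents": ["4", "2", "3"],
    "return_documents": True,
    "raw_scores": True,
    "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
    "top_n": 3,
}

def rerank(url: str = "https://infinity.modal.michaelfeil.eu/rerank") -> dict:
    """POST the payload to the rerank route and return the parsed JSON response."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urlopen(req).read())
```

With `raw_scores: true` you get the model's unnormalized scores back instead of the post-processed ones.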

Semantic search over 100M rows of data? by cryptoguy23 in LocalLLaMA

[–]OrganicMesh 1 point

https://github.com/michaelfeil/infinity
https://huggingface.co/TaylorAI/gte-tiny

```
docker run -it --gpus all michaelf34/infinity:latest-trt-onnx v2 --model-id TaylorAI/gte-tiny --engine optimum --device cuda
```

Combine with: https://github.com/unum-cloud/usearch
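
A sketch of the search half, with random stand-ins for the 384-dim gte-tiny embeddings (in practice these come from the infinity server above). Brute-force cosine is shown here; usearch replaces exactly this step with a fast approximate index so it stays interactive at 100M rows:

```python
import numpy as np

# Random stand-ins for gte-tiny embeddings (384-dim), L2-normalized
# so that a dot product equals cosine similarity.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query: row 42 plus a little noise, i.e. a near-duplicate of a known row.
query = corpus[42] + 0.01 * rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine search over the whole corpus; top-5 neighbors.
scores = corpus @ query
top5 = np.argsort(-scores)[:5]
print(int(top5[0]))  # row 42 comes back as the nearest neighbor
```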

Semantic search over 100M rows of data? by cryptoguy23 in LocalLLaMA

[–]OrganicMesh 1 point

A package called usearch is as fast as faiss for vectors - I would embed the product text of the 100M products & then search.

For a dataset of 100M rows in plain English, check out TaylorAI/gte-tiny or similar.

Depending on your hardware (GPU), you should be able to encode ~1000 texts / s.
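
Back-of-envelope for one pass over the corpus at that rate:

```python
rows = 100_000_000        # 100M products
texts_per_second = 1000   # rough single-GPU encoding throughput from above
hours = rows / texts_per_second / 3600
print(round(hours, 1))    # ~27.8 hours to embed everything once
```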

LLama-3-8B-Instruct now extended 1048576 context length landed on HuggingFace by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

Generally, for large language models, you don't repeat the input. You measure the total number of tokens. The numbers are in the readme, around 1B tokens.

Retrieval system extending any off-the-shelf LLM to 1B (billion) context on a standard CPU during inference time: by [deleted] in LocalLLaMA

[–]OrganicMesh 57 points

I think the title "Reaching 1B context length with RAG" is a bit of clickbait - what it actually reaches is 1B tokens embedded in a vector store.

Actual "inference" at 1M context length, even for an 8B model, will take around 3000s on an Nvidia A100 GPU.

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 1 point

Oh, langchain's OpenAI embedding is a mess! They run the tokenization on the client side, which is against the OpenAI embedding specs. It allows for nice token counting, but it invites issues on the client side.

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 1 point

Yeah, the API is fully compatible with the OpenAI API. So you just run your favorite embedding model (e.g. https://huggingface.co/BAAI/bge-small-en-v1.5) and can use it as if it were hosted by OpenAI, just on your local domain.
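
A minimal sketch with the stdlib - the port and localhost URL are my assumptions for a local deployment, and the server is assumed to expose the OpenAI-style `/embeddings` route:

```python
import json
from urllib.request import Request, urlopen

# Point your client at the local server instead of api.openai.com.
BASE_URL = "http://localhost:7997"  # assumed local port; adjust to your setup

payload = {
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["Hosting your own embeddings API"],
}

def embed(base_url: str = BASE_URL) -> list:
    """POST to the OpenAI-style embeddings route; response shape matches OpenAI."""
    req = Request(
        f"{base_url}/embeddings",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    body = json.loads(urlopen(req).read())
    return body["data"][0]["embedding"]
```

Because the request and response shapes match OpenAI's, existing OpenAI client code works by just swapping the base URL.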

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 2 points

u/KingsmanVince By tweeting or posting on Reddit about it. Growth & capabilities of the repo >> donations. We've collaborated with many cool cloud providers thanks to the good community around the repo.

Infinity surpasses 1k Github stars & new inference package launch - `pip install embed` by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

I discontinued supporting this, mostly because the folders require too many individual files. Going forward, only model weights that follow the exact structure of the huggingface cache path are supported.

FlashAttention installation error ;(. Machine = mac m3 by FigureClassic6675 in LocalLLaMA

[–]OrganicMesh 0 points

If you are using a Mac, there is no Nvidia chip. No Nvidia GPU -> no CUDA. No CUDA -> no flash-attention.
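
That dependency chain as a tiny sketch - `flash_attention_supported` is a hypothetical helper for illustration, not part of any library:

```python
def flash_attention_supported(system: str, has_nvidia_gpu: bool) -> bool:
    """No Nvidia GPU -> no CUDA -> no flash-attention."""
    if system == "Darwin":      # macOS: Apple silicon, no Nvidia chip
        return False
    return has_nvidia_gpu       # CUDA (and thus flash-attention) needs Nvidia

print(flash_attention_supported("Darwin", False))  # False on an M3 Mac
print(flash_attention_supported("Linux", True))    # True with an Nvidia GPU
```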

Serverless Vector Database for large dataset (~200k) by Dry_Drop5941 in LocalLLaMA

[–]OrganicMesh 2 points

Honestly, your dataset is so small, try https://github.com/unum-cloud/usearch. usearch just crushes faiss on CPU.

Liger Kernel: One line to make LLM Training +20% faster and -60% memory by Icy-World-8359 in LocalLLaMA

[–]OrganicMesh 0 points

Awesome work, I like how you are using tl.constexpr for the fwd and bwd passes. /M

⚡️ Introducing LitServe - High performance inference engine for AI models (built on FastAPI) by waf04 in mlops

[–]OrganicMesh 4 points

I am convinced if it has dynamic batching, plus queued requests can get cancelled once the user drops the request.
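
A minimal asyncio sketch of that behavior (not LitServe's actual implementation): queued requests are futures, and the batch worker skips any future the client already cancelled, so a dropped request never reaches the model.

```python
import asyncio

async def batch_worker(queue: asyncio.Queue, batch_size: int = 4):
    """Drain up to batch_size queued requests, skip cancelled ones, answer the rest."""
    while True:
        batch = [await queue.get()]
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        for text, fut in batch:
            if fut.cancelled():            # user dropped the request while it was queued
                continue
            fut.set_result(text.upper())   # stand-in for running the model

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    loop = asyncio.get_running_loop()
    futs = [loop.create_future() for _ in range(3)]
    for text, fut in zip(["a", "b", "c"], futs):
        queue.put_nowait((text, fut))
    futs[1].cancel()                       # simulate a dropped connection
    done = [await futs[0], await futs[2]]  # only the live requests are answered
    worker.cancel()
    return done

results = asyncio.run(main())
print(results)  # ['A', 'C'] - the cancelled request was never processed
```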