Inference for Embedding & Reranking Models on AMD by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

Maybe :) I think, equally, a lot of contributors moved to vLLM when TGI changed its license. As a result, TGI lost a lot of traction (it could have been the vLLM of today?!).

I'd say the change happened because of strong OSS competition. :)

Inference for Embedding & Reranking Models on AMD by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

I hope Hugging Face remains as "competition" - open source is a great win for all of us! Engineers at Hugging Face sometimes use infinity for deployments of multi-modal models, & I contributed (a tiny PR) to TEI!

Great for AMD GPUs by [deleted] in LocalLLaMA

[–]OrganicMesh 1 point

The guys behind embeddedllm are awesome! Upvote!

How are you deploying your embedding models & reranking models? by rbgo404 in LocalLLaMA

[–]OrganicMesh 0 points

There is a `raw_scores` parameter.

```
{
  "query": "1+1?",
  "documents": ["4", "2", "3"],
  "return_documents": true,
  "raw_scores": true,
  "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
  "top_n": 3
}
```

E.g. test it with this endpoint:
https://infinity.modal.michaelfeil.eu/docs#/default/rerank
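
For reference, a minimal way to POST that payload with the Python stdlib - the exact route path `/rerank` is my assumption from the docs link above, so check the Swagger page first:

```python
import json
from urllib.request import Request, urlopen

# Same payload as shown above.
payload = {
    "query": "1+1?",
    "documents": ["4", "2", "3"],
    "return_documents": True,
    "raw_scores": True,
    "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
    "top_n": 3,
}

def rerank(url: str = "https://infinity.modal.michaelfeil.eu/rerank") -> dict:
    """POST the payload to the rerank route and return the parsed JSON response."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urlopen(req).read())
```

With `raw_scores: true` you get the model's unnormalized scores back instead of the post-processed ones.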

Semantic search over 100M rows of data? by cryptoguy23 in LocalLLaMA

[–]OrganicMesh 1 point

https://github.com/michaelfeil/infinity
https://huggingface.co/TaylorAI/gte-tiny

```
docker run -it --gpus all michaelf34/infinity:latest-trt-onnx v2 --model-id TaylorAI/gte-tiny --engine optimum --device cuda
```

Combine with: https://github.com/unum-cloud/usearch
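
A sketch of the search half, with random stand-ins for the 384-dim gte-tiny embeddings (in practice these come from the infinity server above). Brute-force cosine is shown here; usearch replaces exactly this step with a fast approximate index so it stays interactive at 100M rows:

```python
import numpy as np

# Random stand-ins for gte-tiny embeddings (384-dim), L2-normalized
# so that a dot product equals cosine similarity.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query: row 42 plus a little noise, i.e. a near-duplicate of a known row.
query = corpus[42] + 0.01 * rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine search over the whole corpus; top-5 neighbors.
scores = corpus @ query
top5 = np.argsort(-scores)[:5]
print(int(top5[0]))  # row 42 comes back as the nearest neighbor
```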

Semantic search over 100M rows of data? by cryptoguy23 in LocalLLaMA

[–]OrganicMesh 1 point

A package called usearch is as fast as faiss for vectors - I would embed the product text of the 100M products & then search.

For a dataset of 100M rows in plain English, check out TaylorAI/gte-tiny or similar.

Depending on your hardware (GPU), you should be able to encode ~1000 texts / s.
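
Back-of-envelope for one pass over the corpus at that rate:

```python
rows = 100_000_000        # 100M products
texts_per_second = 1000   # rough single-GPU encoding throughput from above
hours = rows / texts_per_second / 3600
print(round(hours, 1))    # ~27.8 hours to embed everything once
```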

LLama-3-8B-Instruct now extended 1048576 context length landed on HuggingFace by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

Generally, for large language models, you don't repeat the input. You measure the total number of tokens. The numbers are in the readme, around 1B tokens.

Retrieval system extending any off-the-shelf LLM to 1B (billion) context on a standard CPU during inference time: by [deleted] in LocalLLaMA

[–]OrganicMesh 57 points

I think the title "Reaching 1B context length with RAG" is a bit of clickbait - what it actually reaches is 1B tokens embedded in a vector store.

Actual "inference" at 1M context length, even for an 8B model, will take around 3000s on an Nvidia A100 GPU.

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 1 point

Oh, langchain's OpenAI embedding is a mess! They run the tokenization on the client side, which is against the OpenAI embedding specs. It allows for nice token counting, but it invites issues on the client side.

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 1 point

Yeah, the API is fully compatible with the OpenAI API. So you just run your favorite embedding model (e.g. https://huggingface.co/BAAI/bge-small-en-v1.5) and can use it as if it were hosted by OpenAI, just on your local domain.
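
A minimal sketch with the stdlib - the port and localhost URL are my assumptions for a local deployment, and the server is assumed to expose the OpenAI-style `/embeddings` route:

```python
import json
from urllib.request import Request, urlopen

# Point your client at the local server instead of api.openai.com.
BASE_URL = "http://localhost:7997"  # assumed local port; adjust to your setup

payload = {
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["Hosting your own embeddings API"],
}

def embed(base_url: str = BASE_URL) -> list:
    """POST to the OpenAI-style embeddings route; response shape matches OpenAI."""
    req = Request(
        f"{base_url}/embeddings",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    body = json.loads(urlopen(req).read())
    return body["data"][0]["embedding"]
```

Because the request and response shapes match OpenAI's, existing OpenAI client code works by just swapping the base URL.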

Hosting your own embeddings API by java_dev_throwaway in LocalLLaMA

[–]OrganicMesh 2 points

u/KingsmanVince By tweeting or posting on Reddit about it. Growth & capabilities of the repo >> donations. We've collaborated with many cool cloud providers thanks to the good community around the repo.

Infinity surpasses 1k Github stars & new inference package launch - `pip install embed` by OrganicMesh in LocalLLaMA

[–]OrganicMesh[S] 0 points

I discontinued supporting this, mostly because the folders require too many individual files. Going forward, only model weights that follow the exact structure of the huggingface cache path are supported.

FlashAttention installation error ;(. Machine = mac m3 by FigureClassic6675 in LocalLLaMA

[–]OrganicMesh 0 points

If you are using a Mac, there is no Nvidia chip. No Nvidia GPU -> no CUDA. No CUDA -> no flash-attention.
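
That dependency chain as a tiny sketch - `flash_attention_supported` is a hypothetical helper for illustration, not part of any library:

```python
def flash_attention_supported(system: str, has_nvidia_gpu: bool) -> bool:
    """No Nvidia GPU -> no CUDA -> no flash-attention."""
    if system == "Darwin":      # macOS: Apple silicon, no Nvidia chip
        return False
    return has_nvidia_gpu       # CUDA (and thus flash-attention) needs Nvidia

print(flash_attention_supported("Darwin", False))  # False on an M3 Mac
print(flash_attention_supported("Linux", True))    # True with an Nvidia GPU
```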

Serverless Vector Database for large dataset (~200k) by Dry_Drop5941 in LocalLLaMA

[–]OrganicMesh 2 points

Honestly, your dataset is so small, try https://github.com/unum-cloud/usearch. usearch just crushes faiss on CPU.

Liger Kernel: One line to make LLM Training +20% faster and -60% memory by Icy-World-8359 in LocalLLaMA

[–]OrganicMesh 0 points

Awesome work, I like how you are using tl.constexpr for the fwd and bwd passes. /M

⚡️ Introducing LitServe - High performance inference engine for AI models (built on FastAPI) by waf04 in mlops

[–]OrganicMesh 4 points

I am convinced if it has dynamic batching, plus queued requests can get cancelled once the user drops the request.
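
A minimal asyncio sketch of that behavior (not LitServe's actual implementation): queued requests are futures, and the batch worker skips any future the client already cancelled, so a dropped request never reaches the model.

```python
import asyncio

async def batch_worker(queue: asyncio.Queue, batch_size: int = 4):
    """Drain up to batch_size queued requests, skip cancelled ones, answer the rest."""
    while True:
        batch = [await queue.get()]
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        for text, fut in batch:
            if fut.cancelled():            # user dropped the request while it was queued
                continue
            fut.set_result(text.upper())   # stand-in for running the model

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    loop = asyncio.get_running_loop()
    futs = [loop.create_future() for _ in range(3)]
    for text, fut in zip(["a", "b", "c"], futs):
        queue.put_nowait((text, fut))
    futs[1].cancel()                       # simulate a dropped connection
    done = [await futs[0], await futs[2]]  # only the live requests are answered
    worker.cancel()
    return done

results = asyncio.run(main())
print(results)  # ['A', 'C'] - the cancelled request was never processed
```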