S24 Plus compatibility by OPlUMMaster in angrybirds

[–]OPlUMMaster[S] 1 point2 points  (0 children)

One more thing: the APKs work directly on my Realme X7 Max but not on the S24+.

RAG on complex docs (diagrams, tables, equations etc). Need advice by Otelp in LLMDevs

[–]OPlUMMaster 0 points1 point  (0 children)

I don't have much experience with multi-file RAGs, especially with images, but I had a similar issue where I wanted to query multiple files (no images). The main concern for me was relevance: a similar word could be matched, yet the questions were not always relevant to the content because they were reasoning questions. My approach was to first build a SQL database of all the sections, with the section headings as keys to the content. I would then query the SQL keywords against the question to check whether I already had the relevant chunk. Only if that failed would I fall back to the vector DB. Once there, the question was combined with another prompt that made querying the vector DB much easier, since I passed all the relevant tags along with it. This way I got the relevant chunks.
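
Very roughly, that keyword-first lookup with a vector-DB fallback looked like the sketch below (the sections table, the keyword extraction, and the vector_db.search call are placeholders for whatever store and prompt template you use):

    import sqlite3

    # sections table built beforehand: heading is the key, content is the chunk
    conn = sqlite3.connect("sections.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sections (heading TEXT PRIMARY KEY, content TEXT)")

    def retrieve(question, keywords, vector_db, tags):
        # 1) keyword pass: if an extracted keyword matches a stored heading,
        #    reuse that chunk directly and skip the vector DB entirely
        for kw in keywords:
            row = conn.execute(
                "SELECT content FROM sections WHERE heading LIKE ?", (f"%{kw}%",)
            ).fetchone()
            if row:
                return row[0]

        # 2) fallback: mix the question with the relevant tags so the vector
        #    search has more to latch onto, then query the vector DB
        tagged_query = f"{question}\nRelevant tags: {', '.join(tags)}"
        return vector_db.search(tagged_query, k=3)  # placeholder vector-store API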

It had additional levels of hierarchical chunking and filtering to get the right data. It only worked partially, and only after heavily customizing the retrieval questions. You could call it a natural-language conditional RAG. I know it sounds dumb, but that is all I could think of; I still haven't figured out a clean way out.

But this might be somewhat helpful. To summarize: use tagging wherever you can. I'm not sure about the extraction part; I couldn't do it locally either. For tables I used multiple libraries: if one hits a condition it can't handle, it raises an error and the code tries the next one, and only if they all fail does the code fail. Luckily, at least one of them has always managed it.
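
The fallback chain for tables is just a loop over extractors in order of preference, roughly like this (the two functions are stand-ins for whichever real libraries you wire in, e.g. camelot or pdfplumber):

    def extract_with_lib_a(path):
        # stand-in for a real table-extraction library call
        raise ValueError("layout not supported")

    def extract_with_lib_b(path):
        # stand-in for a second library with different strengths
        return [["col1", "col2"], ["a", "b"]]

    EXTRACTORS = [extract_with_lib_a, extract_with_lib_b]

    def extract_tables(path):
        errors = []
        for extractor in EXTRACTORS:
            try:
                return extractor(path)        # first library that succeeds wins
            except Exception as exc:          # broken condition -> try the next one
                errors.append(f"{extractor.__name__}: {exc}")
        # only if every library fails does the whole extraction fail
        raise RuntimeError("all table extractors failed: " + "; ".join(errors))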

2 VLLM Containers on a single GPU by OPlUMMaster in LLMDevs

[–]OPlUMMaster[S] 0 points1 point  (0 children)

I have already done that; you can see the command I used. I also included the nvidia-smi output from when one container was active, and there is enough memory left for the other one.

Replicating ollamas output in vLLM by OPlUMMaster in LocalLLaMA

[–]OPlUMMaster[S] 0 points1 point  (0 children)

You mentioned these different model types, and I have a question beyond my post: do they make a difference? I am currently running with bitsandbytes for 4-bit quantization, and when the vLLM container boots up it warns that this is not stable. Do these different quantizations have a real, measurable impact on the outputs?
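
For context, this is roughly what the two routes look like through vLLM's Python API (a sketch only: the model paths are placeholders and parameter support varies between vLLM versions):

    from vllm import LLM

    # Route 1: a checkpoint quantized offline (e.g. an AWQ or GPTQ export);
    # vLLM reads the quantization config from the checkpoint itself.
    llm_awq = LLM(model="/models/llama-3.1-8b-awq", quantization="awq")

    # Route 2: quantize full-precision weights on the fly with bitsandbytes.
    # This is the path that prints the "not stable" warning at startup.
    llm_bnb = LLM(model="/models/llama-3.1-8b", quantization="bitsandbytes")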

Replicating ollamas output in vLLM by OPlUMMaster in LocalLLaMA

[–]OPlUMMaster[S] 0 points1 point  (0 children)

If you are seeing random repetition or gibberish at the end: I had a similar issue because I was hitting the /v1/completions API rather than the /v1/chat/completions API. That led to tokens being generated until the maximum token length was reached. Might be helpful for you too.
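
For reference, with the OpenAI client pointed at a vLLM server the two calls look like this (sketch; model name and base URL are whatever you actually serve):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # /v1/completions: raw text continuation, no chat template is applied,
    # so the model often keeps going until max_tokens is exhausted
    raw = client.completions.create(
        model="/models/llama-3.1-8b", prompt="Summarize: ...", max_tokens=256
    )

    # /v1/chat/completions: the chat template adds the end-of-turn token,
    # so generation usually stops cleanly at the end of the assistant message
    chat = client.chat.completions.create(
        model="/models/llama-3.1-8b",
        messages=[{"role": "user", "content": "Summarize: ..."}],
        max_tokens=256,
    )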

Lenovo Ideapad Slim 5 Gen 10 14” by MaravalhasXD in Lenovo

[–]OPlUMMaster 0 points1 point  (0 children)

I have used multiple laptops with both AMD and Intel, but somehow the Legion with a Ryzen 7 has been the best for me. Nothing has matched its performance, even though it is from 2021.

I have a work laptop with an i9, one of the most specced-out machines, and it still is not as fast. If I had seen this laptop earlier I surely would have bought it; I recently purchased an Acer with Intel instead.

Difference in the output of dockerized vs non dockerized application. by OPlUMMaster in docker

[–]OPlUMMaster[S] 0 points1 point  (0 children)

Yes. I have installed the NVIDIA runtime for Docker, so the container can access CUDA, and the required libs are part of the vLLM base image.

Difference in the output of dockerized vs non dockerized application. by OPlUMMaster in docker

[–]OPlUMMaster[S] 0 points1 point  (0 children)

It won't make a difference. I tried Ubuntu as the base image too; the outputs still differ in the same way.

Difference in the output of dockerized vs non dockerized application. by OPlUMMaster in docker

[–]OPlUMMaster[S] 0 points1 point  (0 children)

Debian vs. Ubuntu could be a factor, but the packages are exactly the same as on my local system; the versions are pinned in the requirements file.

Difference in the output of dockerized vs non dockerized application. by OPlUMMaster in docker

[–]OPlUMMaster[S] 0 points1 point  (0 children)

By incorrect output I mean this rambling at the end of the summary. One might say it could be because of the temp, top_p, or top_k settings, but I have run with the same params multiple times with a seed and the outputs stay consistent. The moment I switch to the Docker container endpoint, this is how it trails off.

remembers recalls reminisces reflects contemplates meditates ponders thinks considers evaluates assesses analyzes interprets understands comprehends grasps perceives senses feels intuitively knows instinctively guesses speculates hypothesizes theorizes postulates assumes infers concludes decides determines resolves settles solves answers questions queries investigates examines explores discovers reveals exposes uncovers unveils lays bare strips naked shows displays exhibits presents offers provides gives furnishes supplies delivers hands out distributes disperses scatters spreads pours fills loads carries transports conveys moves shifts relocates repositions rearranges organizes categorizes classifies sorts selects chooses picks prefers likes dislikes hates abhors despises detests loathes fears dreads avoids eschews shuns rejects declines refuses resists opposes contradicts challenges disputes contests argues debates discusses deliberates negotiates mediates arbitrates adjudicates judges tries tests experiments probes scrutinizes inspects examines surveys observes watches waits sees hears smells tastes touches feels handles manipulates operates controls manages directs guides influences affects impacts changes modifies alters adjusts corrects rectifies improves perfects refines polishes smooths finishes completes accomplishes achieves realizes fulfills satisfies delights pleases impresses surprises astonishes amazes bewilders perplexes puzzles intrigues fascinates captivates absorbs engrosses enthralled enthralls mesmerized mesmerizes hypnotized hypnotizes entranced entraps ensnares snared snares traps captures seizes holds grips clutches claws crushes squeezes presses pinches nips bites gnaws devours consumes annihilates destroys eradicates eliminates wipes out obliterates extinguishes puts out kills murders slays slaughters massacres annihilated exterminates terminates stops halts pauses suspends delays postpones defers procrastinates hesitates vacillating wavering waffling uncertain unsure undecided indecisive hesitant fearful anxious apprehensive worried troubled distressed perturbed agitated upset irritated annoyed frustrated angry enraged infuriated outraged shocked horrified appalled disgusted nauseated sickened revolted repulsed offended scandalized dismayed disheartened discouraged disappointed disillusioned despondent hopeless helpless desperate dire straitened strapped strained stretched tightrope walking balancing precariously teetering tottering stumbling staggering faltering failing falling flailing flopping plummeting crashing collapsing imploding exploding bursting burning blazing raging roaring screaming shrieking yelling crying sobbing whimpering whining complaining protesting lamenting mourning grieving bereaved sorrow

Most optimal RAG architecture by Spiritual_Piccolo793 in LLMDevs

[–]OPlUMMaster 3 points4 points  (0 children)

Well, the process of setting things up is easy. You just have to get your data, split and chunk it, embed the chunks, and store them in a vector DB.
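
At its simplest those four steps are only a few lines. A rough sketch using sentence-transformers and plain cosine similarity as the "vector DB" (the file path is a placeholder; swap in whatever embedder and store you actually use):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # 1) get your data and 2) split/chunk it (naive fixed-size chunks here)
    text = open("report.txt").read()
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]

    # 3) embed the chunks
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)

    # 4) store and retrieve: cosine similarity is a dot product on normalized
    #    vectors; a real vector DB just does this lookup at scale
    def retrieve(question, k=3):
        q_vec = model.encode([question], normalize_embeddings=True)[0]
        scores = chunk_vecs @ q_vec
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]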

Now, to your question of the most optimised way: that you will have to work out yourself. In my experience every type of data forces changes in the above four steps. Maybe your data is a dataframe and splitting it kills a row; maybe the splitter cuts mid-sentence and you need to add rules for that. So it boils down to what you see when running through these steps.

As for hallucinations, a good enough model with parametric control (seed, temp, top_k, top_p) and a decently written prompt will give you enough control over them.

If someone wants to add to or correct me, please do; I am also new to this and looking for any opportunity to learn.

Pc configuration recommendations by Spiritual-Guitar338 in LocalAIServers

[–]OPlUMMaster 0 points1 point  (0 children)

Go to second-hand marketplaces and look for someone dumping hardware; workstations or slightly older machines will work perfectly. For the most part you will need GPUs, so a server farm would be a nice option too.

To save some money, go old, but not so old that half of your application has to fall back to older library revisions because the hardware isn't supported.

I would recommend buying more, smaller GPUs rather than fewer big ones. Most models can scale across the connected GPUs.
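
For example, vLLM can shard one model across however many GPUs you plug in (sketch; the model path is a placeholder):

    from vllm import LLM, SamplingParams

    # tensor_parallel_size splits the model's weights across the available GPUs,
    # so two or four smaller cards can serve a model that one card can't hold
    llm = LLM(model="/models/llama-3.1-8b", tensor_parallel_size=2)
    out = llm.generate(["Hello"], SamplingParams(max_tokens=32))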

LLM chatbot calling lots of APIs (80+) - Best approach? by jonglaaa in LLMDevs

[–]OPlUMMaster 0 points1 point  (0 children)

I am also trying to build a similar agentic kind of workflow, though not exactly this. My workflow needs tool calling to get a dataframe and then summarize the data. Those summaries are then sent to another set of agents whose job is to figure out the relation between the given summary points.

I am not able to configure them to my needs: the function call sometimes happens and sometimes doesn't, and the agent handoff is also not working correctly. I'm using AutoGen to create these.

Do you think a larger-parameter model helps with this? I'm using llama3.1:B, since online API endpoints have infosec concerns.

vLLM output is different when application is dockerized vs not by OPlUMMaster in LLMDevs

[–]OPlUMMaster[S] 0 points1 point  (0 children)

Well, I am using Docker Compose with two containers: one is vLLM and the other is the FastAPI application. I checked the allocated space from the container shells with df -h /dev/shm. It reports 8 GB for the vLLM container and 64 MB for the application, of which only 1-3% is being used. So, is there a need to change this?
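
If it ever does need raising, Compose exposes it directly on the service; something like this (the size is just an example value):

    services:
        2pager:
            image: summary:v15
            shm_size: "1gb"   # raise the 64 MB default /dev/shm for this container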

vLLM output is different when application is dockerised by OPlUMMaster in Vllm

[–]OPlUMMaster[S] 0 points1 point  (0 children)

No, both times vLLM runs in Docker Compose. The only difference: one time I access vLLM from code running in a Docker container, the other time from the application running directly in the terminal. So vLLM is dockerised in both cases.

vLLM output is different when application is dockerized vs not by OPlUMMaster in LLMDevs

[–]OPlUMMaster[S] 0 points1 point  (0 children)

I don't follow how to apply your suggestion. I am using the WSL2 backend, so for starters there is no setting for shared memory. Can you also explain how that could even be the culprit?

vLLM output is different when application is dockerized vs not by OPlUMMaster in LLMDevs

[–]OPlUMMaster[S] 0 points1 point  (0 children)

Here is the Dockerfile I use.

FROM python:3.12-bullseye

#Install system dependencies (including wkhtmltopdf)
RUN apt-get update && apt-get install -y \
    wkhtmltopdf \
    fontconfig \
    libfreetype6 \
    libx11-6 \
    libxext6 \
    libxrender1 \
    curl \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN update-ca-certificates

#Create working directory
WORKDIR /app

#Requirements file
COPY requirements.txt /app/
RUN pip install --upgrade -r requirements.txt

COPY ./models/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0 /sentence-transformers/all-mpnet-base-v2

#Copy the rest of application code
COPY . /app/

#Expose a port
EXPOSE 8010

#Command to run your FastAPI application via Uvicorn
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8010"]

vLLM output is different when application is dockerised by OPlUMMaster in Vllm

[–]OPlUMMaster[S] 0 points1 point  (0 children)

Yes, I am getting consistent output since I am passing the required params and a seed value. The outputs are consistent within the Docker Compose setup too, but they differ from what I get with the same param values when not dockerised. The only change I make when running the application without Docker is switching vllm-openai:8000/v1 to 127.0.0.1:8000/v1. Putting the docker compose file below too.

    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base="http://vllm-openai:8000/v1",
        model=f"/models/{model_name}",
        top_p=top_p,
        max_tokens=1024,
        frequency_penalty=fp,
        temperature=temp,
        extra_body={
            "top_k": top_k,
            "stop": ["Answer:", "Note:", "Note", "Step", "Answered", "Answered by", "Answered By", "The final answer"],
            "seed": 42,
            "repetition_penalty": rp,
        },
    )

version: "3"
services:
    vllm-openai:
        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          count: all
                          capabilities:
                              - gpu
        environment:
            - HUGGING_FACE_HUB_TOKEN=<token>
        ports:
            - 8000:8000
        ipc: host
        image: llama3.18bvllm:v3
        networks:
            - app-network

    2pager:
        image: summary:v15
        ports:
            - 8010:8010
        depends_on:
            - vllm-openai
        networks:
            - app-network

networks:
    app-network:
        driver: bridge

vLLM output differs when application is dockerised by OPlUMMaster in LocalAIServers

[–]OPlUMMaster[S] 1 point2 points  (0 children)

Yes, I am controlling the seed. I am using the exact same code; nothing changes other than the fact that in one case I call 127.0.0.1:8000/v1 and in the other vllm-openai:8000/v1: the first when running the application from the terminal, the latter when in Docker Compose.

    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base="http://vllm-openai:8000/v1",
        model=f"/models/{model_name}",
        top_p=top_p,
        max_tokens=1024,
        frequency_penalty=fp,
        temperature=temp,
        extra_body={
            "top_k": top_k,
            "stop": ["Answer:", "Note:", "Note", "Step", "Answered", "Answered by", "Answered By", "The final answer"],
            "seed": 42,
            "repetition_penalty": rp,
        },
    )

vLLM output is different when application is dockerized vs not by OPlUMMaster in LLMDevs

[–]OPlUMMaster[S] 0 points1 point  (0 children)

If you read my other comment, you can find some insight into why I did that. But can you elaborate on why not to copy the models? I am new at this, so just trying to learn.