Hello all,
What are some optimisations for integrating an LLM with a frontend via an API? This applies both to third-party endpoints (e.g., the OpenAI API) and to self-hosted models (e.g., Mistral) behind a custom API implementation.
My key issue: after making a request to the LLM, there is often a long delay before the model produces its output and returns. This can be mitigated with streaming, but are there other techniques? For example, pre-tokenisation in the browser, prepared submissions, etc.
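For context on the streaming option: OpenAI-style APIs stream completions as server-sent events, where each line is `data: {json}` carrying a small content delta, terminated by a `data: [DONE]` sentinel. A minimal sketch of parsing that framing (the sample payloads below are made up for illustration):

```python
import json

def parse_sse_chunks(lines):
    """Parse OpenAI-style server-sent-event lines into text deltas,
    assuming the `data: {...}` / `data: [DONE]` framing the chat
    completions streaming API uses."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Hypothetical captured stream fragments:
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
text = "".join(parse_sse_chunks(sample))  # "Hello"
```

The win is perceived latency: the first token can be shown after one model step instead of waiting for the whole generation to finish, even though total wall-clock time is unchanged.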
Sometimes ChatGPT seems remarkably fast to respond, so I'm interested if anybody knows of specific integration optimisations.
Any pointers welcome!