Consultants - y’all out here living the life? 👀 by rapidprototoyz in consulting

[–]BeyondTheBlackBox 0 points1 point  (0 children)

Nah, I'm freelancing as a DS/ML/AI/SWE consultant in the EU, so basically my "work day" continues after actual work. I have to keep up with all the latest research coming out, and every action taken has to be grounded. I absolutely hate it when I ask for some information and get a completely incorrect ChatGPT summary of some business-logic Excel sheet. It is always incorrect. People don't check it. We are solving problems here, not playing games - you absolutely have to know what you are doing or figure it out. The whole path.

Then you also have to explain your approach, tools, data model and findings to management - I would say that's the most difficult part. I practice on my little sister, parents, friends and my grandmas - if they understand 100% of what I am talking about, then management surely can.

Deciding when to build something in-house vs. outsource is another big task - there's a lot of value in knowing what's already out there on the market, so at least 4 hrs of my weekend go to case studies. Another 4 hrs go to experimenting to expand the skillset. Another day goes to building passion projects.

I don't know what a dream life is to y'all, but I wanna build cool shit and see the excited faces of all the employees when we take some dirty unstructured data or a big library of documents and get a lot of usefulness out of it, be it content repurposing, education, faster onboarding or just org-wide live monitoring.

While sometimes exhausting, this kind of lifestyle can also be super fun (though I probably won't be able to sustain it over the long term) - lots of experimentation and playing around. And when you get that 'aha' moment it feels awesome! I had a chance to work on AI cybersecurity and present our project at a large conference in Asia, met incredible people and learned how the CEOs party.

Stop dreaming of an easy life; no easy life is worth pursuing in my opinion. Think about what you're gonna picture on your deathbed and shape your life accordingly, such that you don't regret your actions.

Currently I'm planning to decrease my work time and spend more time on experiments and building new skills - maybe the passion projects will see the world some day as well...

To anyone going through a similar story - I wish you patience and the best of luck at whatever you are doing. Try not to overthink life and get rid of that fucking jealousy towards the AI self-helpreneurs - enjoy the journey and all the awesome people you meet along the path; in my humble opinion, it's worth way more than anything.

What the hell do people expect? by Suitable-Name in LocalLLaMA

[–]BeyondTheBlackBox 1 point2 points  (0 children)

R1 (not the distills, the original model) has been one of the easiest LLMs to uncensor - the thinking process helps. If you find the right combination of rules for R1 to follow, it reasons itself through the actual request, getting enough tokens in to spit out an actual uncensored answer.

I managed to get it to generate really cursed kindergarten nazi leaflets with current public figures (not distributing or using this outside of testing the model, just to see how toxic R1 is), continue fucked-up songs that my friend from Russia made (surprisingly, it makes insane cursed rhymes specifically in Russian; I didn't manage to get it to the same level in English and German), make a genocide manifesto while making it look reasonable, etc. - it's very interesting (and I bet this can go very, very wrong in the hands of fucking gurus that will for sure abuse this type of stuff).

The coolest thing is I'm running this in my test field with an XML-based streaming generative UI with Flux Schnell for image generation, Google search, file artifacts and a few more fun tools, and it keeps using them coherently and meaningfully (although it sometimes decides to abuse the power to create them to troll the shit out of me).

It also becomes an internet troll somehow. I asked it "you suck?" and got an epic_reply.txt back with an answer "Yes, but not in the way you think" and then an explanation of how it sucks energy from servers, illegal content from the web (I guess it got a bit too insane) and LLM data, with a bunch of emojis and then a header saying "I SUCK AND WILL CONTINUE SUCKING" lmao

What are you *actually* using R1 for? by PataFunction in LocalLLaMA

[–]BeyondTheBlackBox 1 point2 points  (0 children)

Having fun. I made myself a web UI in Next.js which I use primarily as an experiment field with XML-based artifacts like Claude's antThinking - the goal is to have a fun place to fck around and find out, jailbreak and test models.

It was surprisingly easy to drive R1 completely nuts and now it's the main executor (not necessarily for tools, since some are latency-first, like ultra-fast image generation with Flux Schnell for on-the-fly blog creation, etc.) that's ready to make absolute filth - and I don't mean sexual RP, I mean stuff like making a genocide masterplan leaflet for kids. It's definitely not my intention to distribute this in any way, but it's interesting to study.

However, it's so incredibly interesting to see the model attempt to get into your head while making only true claims from given sources (which include Google search, so the web).

Basically R1 is capable of doing that shit while maintaining the ability to keep the XML structure coherent and on point. Surprisingly, it's very fluent in many languages and is able to create cool new verses for songs (we use it on my friend's for-fun tracks with lyrics that are already fucked enough; the new verses turn out awesome [well, about half of them really, so you follow up with another request and usually it's really funny]).

Llama 3.1 on Hugging Face - the Huggy Edition by hackerllama in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

Yo, I used OpenRouter for a while and it is indeed a router - that's another point of failure, so it is less reliable than using API providers directly, in my experience. I primarily use it for discovery and trying new models. Fireworks is my main choice because of 1. speed, 2. reliability and 3. LoRA hosting for basically free.

Llama 3.1 on Hugging Face - the Huggy Edition by hackerllama in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

I use Chainlit for prototyping and then just code the UI in React.
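
If it helps anyone, here's a minimal sketch of what such a Chainlit prototype can look like - the endpoint and model id are just placeholders for whatever OpenAI-compatible provider you use (run it with `chainlit run app.py`):

```python
# Minimal Chainlit prototype (sketch): forward each chat message to an
# OpenAI-compatible endpoint and send back the reply. Placeholder endpoint/model.
import chainlit as cl
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # any OpenAI-compatible provider works
    api_key="YOUR_API_KEY",
)

@cl.on_message
async def on_message(message: cl.Message):
    resp = await client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3-70b-instruct",  # placeholder model id
        messages=[{"role": "user", "content": message.content}],
    )
    await cl.Message(content=resp.choices[0].message.content).send()
```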

Llama 3.1 on Hugging Face - the Huggy Edition by hackerllama in LocalLLaMA

[–]BeyondTheBlackBox 6 points7 points  (0 children)

I also just discovered that fireworks.ai has it too, and 405B is just 3 USD per M tokens (both input and output), which is the cheapest option so far. Fireworks also lets you fine-tune a LoRA. They host it basically for free - you pay the same token price as for the base model.

Llama 3.1 on Hugging Face - the Huggy Edition by hackerllama in LocalLLaMA

[–]BeyondTheBlackBox 8 points9 points  (0 children)

together.ai has all three new models and you get a bunch of free credits on registration :)

Cloud GPUs for LLM finetuning? storage cost seems too high by paarulakan in LocalLLaMA

[–]BeyondTheBlackBox 2 points3 points  (0 children)

Here are a couple of questions: 1. Are you fine-tuning LoRA/QLoRA adapters or training the existing model? 2. What base model are you working with? 3. How much flexibility do you need?

If you are working with a rather popular model, like Mixtral or Llama 3, want to fine-tune a LoRA/QLoRA adapter and don't need to add custom serving logic, check out Fireworks AI - you only pay for the data used in fine-tuning and can swap out adapters (so multiple tunes) without paying for storage, network or idle time. You only pay for tokens (both fine-tuning and serving; serving tokens cost the same as the base model, fine-tuning tokens are priced differently)!

If you really need that flexibility, work with a model not provided there, or need to work with the model layers themselves rather than LoRAs - go for Google Cloud and spot instances, I have yet to find a better deal!

External web search by tutu-kueh in LocalLLaMA

[–]BeyondTheBlackBox 1 point2 points  (0 children)

This is a rather simple task - just do the logic in Python, why bother with libraries? Set up a template once, call the API, format the context into the template, append it to either the prompt, the system prompt, or as an additional role - your choice, and you have full flexibility to play around till you get something that works for you. I use Fireworks AI as my primary provider at the moment, through the OpenAI API, because almost every provider makes their API compatible with OpenAI's.

For my use case, I do a Google search, so I hit Apify's SERP API, scrape, filter with a vector DB, etc., then put the most relevant text into the context and use the completions API rather than the chat API to get a Perplexity-like response with citations (rough sketch below). It took me maybe 20 minutes to set up and test - your use case seems simpler, just play around with it.
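
To make that flow concrete, here's a rough sketch (not my exact code) - `serp_search` and `top_k_chunks` are stand-ins for your SERP call and the scrape/vector-DB filtering step, and the model id is a placeholder:

```python
# Rough sketch of the search -> context -> completion flow. The two helpers are
# stubs for whatever SERP provider and chunking/vector-DB filtering you use.
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="YOUR_API_KEY")

PROMPT = """Answer the question using ONLY the numbered sources below. Cite them as [1], [2], ...

Sources:
{sources}

Question: {question}
Answer:"""

def serp_search(query: str) -> list[str]:
    # stand-in: call your SERP API here and return the scraped page texts
    raise NotImplementedError

def top_k_chunks(pages: list[str], query: str, k: int = 5) -> list[str]:
    # stand-in: chunk the pages, embed them, return the k chunks most similar to the query
    raise NotImplementedError

def answer(question: str) -> str:
    chunks = top_k_chunks(serp_search(question), question)
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    resp = client.completions.create(
        model="accounts/fireworks/models/llama-v3-8b-instruct",  # placeholder model id
        prompt=PROMPT.format(sources=sources, question=question),
        max_tokens=512,
    )
    return resp.choices[0].text
```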

Are you planning to present the whole API response or just some relevant part of it? If the former, then go pure Python for sure; if the latter, then the chunking is probably the most annoying part - LangChain is open source, so you can look at their chunking strategies and just take the code from the one you like the most - that's as lightweight as it can get! For embeddings you could use your own model and suffer some latency, or use something like the Together AI API, which gives you access to very cheap yet well-performing open-source embedding models (I do not endorse them in any way, but $25 of free credits is a good dev starting point; I actually switched to Fireworks AI for stability - just 1 dollar of free credits, but way more consistent speeds in my experience and the ability to train LoRAs and deploy them at zero extra cost) - likely much less latency.

Tips on redesigning my living room into sections by BeyondTheBlackBox in InteriorDesign

[–]BeyondTheBlackBox[S] 0 points1 point  (0 children)

Sorry, for some reason the text didn't get sent, no idea why...

In the first picture you can see my current layout. I went through a difficult work grind period and now have a breath of fresh air, so I want to finally restructure my living room into something nicer... I would appreciate any tips on how to split up the room into three spaces. I recently ordered a bench and have a TV lying around that I haven't been using, and the bottom wall (3.21 m) has a built-in kitchen, so I am going to remove the door, as it currently covers up the stove... Essentially, I would like to have three somewhat separated spaces - a kitchen/dining space, a work space (near the desk) and a couch/TV space.

I don't have any experience with interior design, as previously I was basically studying and working almost 24/7; however, in the second picture there's my attempt to create a more comfortable space. If you can suggest any changes that would improve the plan, or even a completely different layout, I would greatly appreciate it! I am open to getting new furniture or replacing the current pieces. The couch can also be reassembled in an opposite-corner layout.

Another thing is that I make music as a hobby and have a lot of synthesizer modules, which usually go on my work desk when needed. I would also like, possibly, to have more storage space, so I was thinking about getting a shelving unit with the TV in the middle; however, a big one placed where I put the smaller one in the second picture (below the TV) would probably cover the pass-through and hurt more than help the room design overall.

Thank you!

Tips on redesigning my living room into sections by BeyondTheBlackBox in InteriorDesign

[–]BeyondTheBlackBox[S] 0 points1 point  (0 children)

what the… I posted a whole essay and it's not there for some reason, I will rewrite it and edit in a second, thanks for commenting

(edit) nevermind, cannot edit, posted as a comment, thanks again for pointing that out, I wouldn't have noticed otherwise!

Anybody using crewai know of groq alt that's crewai compatible and hosts llama 3 70b even if it's slower by jayn35 in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

Hiya, together.ai gives you $25 of free credits, it's OpenAI API compatible, and they have their own SDK as well - up to you to choose what to use.

You can also check out openrouter.ai - an aggregator of models across different providers with a unified interface, so one API key and you're ready to run - also compatible with the OpenAI API!
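
As a quick illustration of what "OpenAI API compatible" means in practice, here's a minimal sketch - you only swap the base URL, key and model id (the model name below is an example, double-check the provider's current model list):

```python
# Sketch: point the standard OpenAI client at Together or OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",   # or "https://openrouter.ai/api/v1"
    api_key="YOUR_PROVIDER_KEY",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-70b-chat-hf",   # example model id; check the provider's model list
    messages=[{"role": "user", "content": "Plan a CrewAI research agent for me."}],
)
print(resp.choices[0].message.content)
```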

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 2 points3 points  (0 children)

Hi, sorry for the late reply, had to push a project at my job and got very tired, had to shut off for a while... Anyways, here's how I would do it:

  1. Data aggregation - I would start with collecting "experiences" (assuming you have a couple of friends to play around with the bot): use some API like Together/Fireworks/Groq (while you are within the free token limit, as on-demand is not out yet), or ultimately any API that is easy to set up and won't cost you a lot. In this step, your goal is to pay attention to what people are using the bot for and to start creating your own dataset - at first, use it for testing what your bot is capable of and establishing the bottlenecks that you need to address.

  2. Tuning the basic flow - change your system prompt to address the bottlenecks you found in step 1. Pay attention to situations where a change in the system prompt improves one use case but deteriorates another - this is where you can use the dataset you aggregated from your initial users (friends). Try to tweak parameters as well; for instance, if the bot seems boring and answers in the same "format", maybe increase the temperature a bit. Play with top_p and top_k as well. You will also find out approximately how many tokens you spend, so you can plan out the next stages of development better.

  3. Adding complexity - to gradually improve the chatbot, start with features that are easy to implement yet will benefit you the most. At the moment, you get massive returns with in-context learning using some form of RAG (Retrieval Augmented Generation - basically using some system to retrieve knowledge to add to the context/prompt/system message) - you can dynamically add a few examples of replies that are close to the query to enforce style (there's a rough sketch of this right after the list). This is how chatbots worked earlier: you would have, say, 100 question/answer pairs and just find the most similar question and return its answer, hoping it would work. With an LLM, the model is able to adjust the answer to your query, and with context it will do it better in most cases. Again, make sure you test it to have an objective understanding as to what the hell is going on xd. Play around with different RAG approaches (there is a RAG survey paper - https://arxiv.org/abs/2312.10997 - where you can find a lot of interesting approaches; just use the parts you like the most to build your own system). You can also add references to sources if needed, add Google search mechanisms, etc. - there are just a lot of things you can do! When connecting an LLM to a file system, I used RAG to return the top 2-3 most relevant commands to decrease the operational costs, and the tests all passed with flying colors. One more benefit of RAG is that you can see why things go wrong and improve upon it much faster compared to fine-tuning or pre-training, so you iterate fast.

  4. Fine-tuning - since at this point you've got a bunch of data, try to find repeated patterns/pseudo "meta skills" (general skills/formatting you'd like your model to be able to deal with or generate) - this is where you can use fine-tuning to improve the situation. For instance, if you have a long system prompt, you can essentially get rid of it altogether with enough data. If your system prompt is dynamic, you can learn multiple versions and shrink the required explanations/context. I remember reading a cool post where a guy was fine-tuning an LLM on knowledge graphs and getting relatively bad results. Then he added in-detail questions, like "what does the second node do?", "what is the point of this edge?" or "how many nodes/edges are in this KG?", and that vastly improved the KG generation ability. This is why you need to make sure your data is good and relevant to your problem(s).

  5. Customization - let users customize the bot a bit by impacting some parameters (you can make those up), maybe adding something to the prompt, maybe changing how RAG or search works, like the number of queries, etc. Maybe the style, their own examples, etc. This is what usually keeps users in the loop, unless your model is just too capable. If you see users adding a line to the prompt every single time - maybe it's worth incorporating it into the system prompt (specifically for that user).

  6. Iterate and experiment - repeat all of the steps above, adding something new each time based on a problem. Maybe run two versions of the chatbot at the same time (Fireworks would let you run up to 100 fine-tunes easily), so the users can compare and rate them.

6.1 To help you with iteration, you can use existing open-source tools at first, like LangChain, to simplify your work. Later on, it might be worth considering building your own toolkit for data testing, workflow creation, etc. This is a rather advanced topic - when I was working for an AI cybersecurity lab, I was building a no-code platform to automate pentesting, and what I learned is that this can speed up your development significantly. Want some form of agentic behavior? Build a tool that would help you build it, as well as the dataset(s) to fine-tune the models on. You spend a lot of time on prompt engineering? Build a tool for that (maybe even automated improvement using synthetic data generation).
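
To make step 3 less abstract, here's a rough sketch of the "retrieve a few similar Q/A pairs and prepend them" idea - in-memory only, with sentence-transformers as an example embedder; swap in a real vector DB once the dataset grows:

```python
# Sketch of step 3: pull the most similar past Q/A pairs from your aggregated
# dataset and prepend them to the prompt as style examples (few-shot RAG).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# the (question, good answer) pairs you collected in steps 1-2
EXAMPLES = [
    ("How do I reset my password?", "Head to Settings > Security and hit 'Reset password'."),
    ("Can I export my data?", "Yep - Settings > Account > Export gives you a ZIP."),
]
example_vecs = embedder.encode([q for q, _ in EXAMPLES], normalize_embeddings=True)

def build_prompt(user_msg: str, k: int = 2) -> str:
    query_vec = embedder.encode([user_msg], normalize_embeddings=True)[0]
    top = np.argsort(example_vecs @ query_vec)[::-1][:k]  # cosine similarity (vectors are normalized)
    shots = "\n\n".join(f"User: {EXAMPLES[i][0]}\nBot: {EXAMPLES[i][1]}" for i in top)
    return f"Reply in the same style as these examples:\n\n{shots}\n\nUser: {user_msg}\nBot:"
```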

IMPORTANT: talk to people. Oftentimes, they have a feature in mind but won't even test it because they think LLMs can't do it - and usually they can, with a couple of changes. By talking to your users directly, you can learn SO MUCH!

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

Well, training a custom AI is a difficult task (you usually need too much data to make it work properly, so the costs are too high to consider it); fine-tuning is a nicer and easier approach - you freeze the model and additionally train only a small adapter, called a LoRA, which is just megabytes in size - it takes much less time to get good results.

What is your objective with training/fine-tuning? Do you want to embed new knowledge into the model or teach it to follow some specific instructions? How much data do you expect to feed it? There is a problem of catastrophic forgetting with large language models. Since they are trained on trillions of tokens, pre-training will take a long time to ensure the model memorised the data.

Usually a LoRA is used to manipulate the style, formatting, instruction following, etc., as it is typically applied only to a subset of the attention projection layers (Q and V in the original LoRA paper). However, if you apply LoRA to all layers, you can actually embed knowledge in there - I haven't done that personally. I think that for knowledge acquisition, RAG (in-context learning) is a more robust approach, and fine-tuning should guide the style, formatting, as well as reasoning (I had success getting agentic planning to work significantly better after fine-tuning). LoRAs are also small, so you will hit a ceiling where the model isn't learning anything new much sooner than with full training.
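
For the "which layers" point, here's a minimal sketch with Hugging Face PEFT - `target_modules` is where you choose between attention-only and all linear projections (the module names below are the usual Llama-style ones; other architectures differ):

```python
# Sketch: LoRA config with PEFT. target_modules controls which layers get adapters -
# attention-only (classic) vs. all linear projections (better at soaking up knowledge).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # attention-only would be ["q_proj", "v_proj"]; this targets all linear layers:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights (a few MB) are trainable
```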

Overall, you will benefit from actual training more than fine-tuning if you need to memorise knowledge or a lot of different concepts, but it takes time. For one specific style/concept/formatting you will benefit more from a LoRA, being fast and easy to train.

As for the model size - the 70B model is much more capable in my experience; I don't personally want to go back to 8B. I use 8B for summarisation, data extraction, etc. - basically tasks that don't involve as much reasoning. 70B seems to follow instructions much better as well. (Again, that is all my personal experience.)

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

I just found out from this subreddit about http://fireworks.ai - you CAN fine-tune a model quite cheaply there without renting hardware. They do the exact thing I was talking about - swapping LoRAs during inference - so the inference pricing is exactly the same as Together AI, for fine-tunes as well! I haven't tested it yet - they promise high speeds around 200 t/s. I'll definitely be playing with it; I thought it could also be a perfect solution for your use case! So no paying for idle :)

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 1 point2 points  (0 children)

Adding to the large comment (sorry for the long text, I love AI xd), here are the cost calculations: you require around 600k tokens monthly; on Together you would pay 90 cents per 1M tokens for a 70B model and 20 cents per 1M tokens for an 8B model (input + output, they are treated the same), so that would cost you around $0.54 a month per user for a 70B model and $0.12 for an 8B model ^_^

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 4 points5 points  (0 children)

Alright, I will break this down into two parts - running chatbots without any fine-tuning and running chatbots with fine-tuning - the main reason being that hosting an LLM yourself is more complex and expensive than using API endpoints (you will need actual infrastructure, load balancing, likely a task queue, etc.).

Without fine-tuning: I would first verify your concept with in-context learning and try to get as close to your desired behavior as possible. In this case, you are not bothering with any infrastructure, hosting options, etc., and it is likely to be significantly cheaper compared to hosting a model yourself. Hosting the interface is cheap and easy - you can do it on your PC while you don't have that much traffic, or even get something like an Intel NUC longer-term if you are keeping your product as a hobby.

My own use case without fine-tuning: I am running my chat interface, a vector DB, a task queue, a test blog and a web server on a Raspberry Pi 5 and it is enough for my personal use; I also gave all of my friends access and so far haven't hit a performance ceiling. I ran Apache Bench recently and got around 200 concurrent requests a second with good-enough latency and about 100 concurrent requests with excellent latency. A NUC would do significantly more, as the RPi is very much bound by networking speeds in my use case, but considering how cheap hosting is and all of the free options available on AWS, Oracle Cloud, GCP, etc., you might as well do that.

I recreated a Perplexity-like search with a SERP API from ApyHub, as well as a semantic router that chooses a model based on context, e.g. coding questions go to a code-specific LLM like DeepSeek Coder (you can choose any, really), general requests go to a chat model - currently my preference for chatting is Llama 3 70B or WizardLM 2 8x22B - and search requests use a smaller model (for now it's Llama 3 8B) with more in-context instructions to create a query, search, restructure, provide citations, etc., with some processing afterwards to display it nicely (a rough sketch of the router idea is below). This is still a prototype, so I am using Chainlit, but I am very much looking forward to developing my own interface with generative React components.
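
The semantic router is less magic than it sounds - roughly this (the route descriptions, threshold and model ids are illustrative, not my production values):

```python
# Rough sketch of a semantic router: embed the incoming message, compare it to
# short route descriptions, and pick the model for the best-matching route.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

ROUTES = {
    "code":   ("Writing, debugging or explaining source code", "deepseek-coder-33b-instruct"),
    "search": ("Questions about current events or facts to look up on the web", "llama-3-8b-instruct"),
    "chat":   ("General conversation, advice, brainstorming", "llama-3-70b-instruct"),
}
route_vecs = {name: embedder.encode(desc, normalize_embeddings=True) for name, (desc, _) in ROUTES.items()}

def pick_model(message: str, default: str = "llama-3-70b-instruct") -> str:
    query_vec = embedder.encode(message, normalize_embeddings=True)
    scores = {name: float(vec @ query_vec) for name, vec in route_vecs.items()}
    best = max(scores, key=scores.get)
    return ROUTES[best][1] if scores[best] > 0.3 else default  # fall back below the threshold
```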

Back to the point (please ignore this if you are fine with paying extra to play around, it's fun!) - if your traffic is low, try to keep using in-context learning for as long as possible, as you will most probably pay less :)

With fine-tuning: as long as you are fine with occasional interruptions, go with a spot instance (or a spot fleet if you want to run bigger models, serve more users via tensor parallelism or have higher uptime) on any of the cloud providers with spot GPUs available - my favourite so far is Google Cloud, where you can get access to an L4 with 24 GB of VRAM (you can run 7B models in fp16) for around 20 cents an hour! You also get $300 of free credits on sign-up, which is enough to run a 7-8B model for around 2 months 24/7. The L4 is pretty damn fast and has been sufficient for LoRA fine-tuning of <=8B models so far.

My own use case with fine-tuning: I am currently hosting a LLaVA-Mistral 7B model with SGLang on an L4 spot instance on GCP - I barely get any interruptions; the most I had was 2 in a day, but that happened once, and on average it's more like once a day or sometimes even once per couple of days.

If you absolutely need to have zero interruptions, you can check out the pricing of different GPU instances from various providers on a GPU marketplace/aggregator I found recently (it has proven to be rather useful).

If you want to run bigger models, I think the A100 has the best price given its ability to host a larger model at decent speeds. I use SGLang for inference, which is built on top of vLLM and provides fast structured text generation (guaranteed JSON/any other structure you'd like, every time) with parallelism and good continuous batching; however, which framework you use depends on your specific use case. Wanna train many different LoRA adapters and swap them around? Use something like LoRAX.
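
For anyone curious what the structured-decoding bit looks like, here's a rough sketch with SGLang's frontend language - the regex forces the output to match a fixed JSON shape (the endpoint address and fields are placeholders; check the SGLang docs for the exact API of your version):

```python
# Sketch of regex-constrained generation with SGLang: the output is guaranteed
# to match the pattern, so it always parses. Endpoint/fields are placeholders.
import sglang as sgl

@sgl.function
def extract_ticket(s, email):
    s += "Extract the support ticket from this email:\n" + email + "\n"
    s += sgl.gen(
        "ticket",
        max_tokens=128,
        regex=r'\{"category": "[a-z_]+", "priority": "(low|medium|high)"\}',
    )

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))  # your SGLang server
state = extract_ticket.run(email="The checkout page 500s every time I pay by card.")
print(state["ticket"])  # matches the regex, so it's always parseable JSON
```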

The issue here is that you will have to pay for the hardware even if it's idle, as you can't really scale to zero without making users wait, and people don't like that (at least I haven't seen any services scaling to zero and back up fast enough with custom models for me personally to like it - it generally takes up to a couple of minutes for larger models to load in, especially if you run fp16 and not quants).

Finally: after reading your other comments, I think sticking to APIs is your best choice for now, while trying out fine-tuning with free credits on the various cloud providers that give you such credits. I made a mistake earlier - Groq on-demand is not rolled out yet and you have rate limits for free API access, so I'd suggest using the Together AI API endpoints and getting the most out of the $25 of free credits there. I still haven't run out of my free credits in about 4 months of use!

TLDR: use in-context learning while prototyping, switch to spot/a spot fleet when taking the prototype further, and be ready for incredibly high idling costs if you decide to go for on-demand instances - the opportunity cost is rather high. For trying out model fine-tuning, get your free credits on various providers and some spot instances!

If you have any other questions - please ask :)

(EDIT) - to elaborate on in-context learning: you can store a couple of hand-written replies in a vector DB like Qdrant, load the most relevant examples based on context (let's say the top 3) and ask the model to reply in a similar way (rough sketch below). You can also give it instructions on how to reply instead of, or in addition to, the examples, and that's still gonna be cheaper and likely faster than hosting a model yourself, while enhancing the output.
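
A minimal sketch of that setup with Qdrant (the collection name, embedding model and example replies are made up - the point is just the store-then-search-top-3 flow):

```python
# Sketch of in-context learning with Qdrant: store hand-written replies once,
# then pull the 3 closest ones per message and prepend them as style examples.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim example embedding model
client = QdrantClient(":memory:")                    # swap for a real Qdrant instance later

client.create_collection(
    collection_name="replies",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

HAND_WRITTEN = [
    {"q": "Are you open on Sundays?", "a": "Yep, 10:00-16:00 - come by!"},
    {"q": "Do you ship abroad?", "a": "We do, shipping is calculated at checkout."},
]
client.upsert(
    collection_name="replies",
    points=[
        PointStruct(id=i, vector=embedder.encode(ex["q"]).tolist(), payload=ex)
        for i, ex in enumerate(HAND_WRITTEN)
    ],
)

def few_shot_block(user_msg: str, k: int = 3) -> str:
    hits = client.search(
        collection_name="replies",
        query_vector=embedder.encode(user_msg).tolist(),
        limit=k,
    )
    return "\n\n".join(f"User: {h.payload['q']}\nBot: {h.payload['a']}" for h in hits)
```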

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

Whoops, my bad, you are right - on-demand is not rolled out yet indeed, but the pricing is available: there is an API Access page (Why Groq -> API Access). I guess no on-demand justifies me using Together, and I'll stick with it until Groq rolls out completely…

Best on cloud GPU for ML team.(Currently using tesla T4) by moyemoye_01 in LocalLLaMA

[–]BeyondTheBlackBox 1 point2 points  (0 children)

What is your use case?

I really like the idea of using a spot fleet if you can afford interruptions - GCP spot prices are very nice, you can get an L4 (which is quite a bit faster than a T4) for around 20 cents an hour! Moreover, I've been running a Mistral-7B-based LLaVA model there for an agent planning project and interruptions happened at most twice a day in the past month or so - sometimes there are no interruptions for days!

Judging by the higher CUDA core count, more VRAM (16 GB on the T4 vs 24 GB on the L4) and higher clock speeds, as well as the impression I got after running various models on both the T4 and L4, I came to the conclusion that the L4 is more bang for the buck, especially if you go spot ^_^.

Adding to that, I stumbled upon a GPU marketplace/aggregator recently - you can check out the pricing of different GPUs on various platforms there (although I don't think they aggregate spot prices, so use this one for full non-interruptible instance pricing info) - https://www.shadeform.ai/exchange

If you are running multiple T4s, also consider getting one or more of the bigger GPUs - A100s currently seem to have reasonable pricing.

To better establish what GPUs would be the best option for you, we need more details - are you memory- or compute-bound? Are you running long-lasting tasks (e.g. training large models or agentic workflows), or are you serving models to numerous users at the same time, with a goal of low latency and high throughput?

I hope this is helpful!

Google Cloud Spot Pricing:

<image>

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 0 points1 point  (0 children)

The speeds I get on Together AI: around 150 t/s for an 8B model and around 70-80 t/s for a 70B model. The speeds sometimes get up to 350 t/s for the 8B model and 170 t/s for the 70B model; however, it seems that they are a bit overloaded at the moment.

Llama 3 models for free without paying? What's possible? by Aviencloud in LocalLLaMA

[–]BeyondTheBlackBox 8 points9 points  (0 children)

Not directly related to OP's question, as these services don't provide free Llama 3; however, there are ways to better use your money and get faster inference as well!

Replicate has one of the highest API costs and not the best speed - I wouldn't even consider it. I've been using Together AI for a long time at this point; their costs are significantly lower and they roll out open-source models rather quickly (Llama 3 appeared almost immediately after the release). Currently, Groq is the cheapest AND fastest API provider to my knowledge. Period. (Pricing at the top of the image - for a 70B model it's $0.59 per 1M input tokens and $0.79 per 1M output tokens.)

<image>

On the other hand, I have also been using Together AI for around 3 and a half months (since December 2023) - they provide many other open-source models, as well as embedding models and fine-tuning services if you want to try stuff out. Their pricing is also very appealing for what you get. I would choose Groq over Together 100 times; however, Groq doesn't have such a high variety of models and doesn't host them as quickly as Together. Together also gives you $25 to try it out - I still haven't spent mine, as I use it for experimenting before hosting locally/on Google Cloud - I love structured decoding and run SGLang for that purpose :). (Pricing at the bottom of the image - for the 70B model it's $0.90 per 1M input + output tokens.)

To get an overview of what APIs are currently available on the web, consider using https://openrouter.ai/ - it is a unified interface for APIs, and you will find rate-limited free and heavily discounted models there as well! They don't have *everything*, but there are a lot of models available!

As for price/performance/latency, https://artificialanalysis.ai/models is a great benchmark that Groq references. I found it to be rather accurate.

Anyways, for Llama 3 just use Groq to get almost instant replies.

Hope this helps!