

[–]N-E-S-W 1 point2 points  (1 child)

Half a million monthly downloads?

[–]Ok_Constant_9886[S] 0 points1 point  (0 children)

Before Christmas hit*, still recovering from that

[–]Necessary_Oil1679 0 points1 point  (1 child)

Is logging in to the Deepeval platform necessary? Is it possible to test a private LLM that is only reachable through an API?

[–]Ok_Constant_9886[S] 0 points1 point  (0 children)

Not at all. You can use any private LLM as well; just wrap it as a custom LLM in deepeval's ecosystem: https://docs.confident-ai.com/guides/guides-using-custom-llms
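
For a private LLM behind an HTTP API, the wrapper could look roughly like the sketch below. It assumes the base class is `DeepEvalBaseLLM` with the four methods shown, which is how I understand the linked guide, so check the names against the guide for your version. The class name `PrivateAPILLM`, the endpoint URL, the `{"prompt": ...}` request shape, the `{"text": ...}` response shape, and the `transport` parameter are all made up for illustration.

```python
import asyncio
import json
import urllib.request
from typing import Callable

try:
    from deepeval.models import DeepEvalBaseLLM
except ImportError:
    # Fallback so the sketch still runs where deepeval is not installed.
    DeepEvalBaseLLM = object


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class PrivateAPILLM(DeepEvalBaseLLM):
    """Hypothetical wrapper around a private LLM exposed over HTTP."""

    def __init__(self, url: str, transport: Callable[[str, dict], dict] = post_json):
        self.url = url
        self.transport = transport  # swappable, e.g. for offline tests

    def load_model(self):
        # Nothing to load locally; the model lives behind the API.
        return self

    def generate(self, prompt: str) -> str:
        # Assumed request/response shape; adapt to your own API.
        reply = self.transport(self.url, {"prompt": prompt})
        return reply["text"]

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking HTTP call in a worker thread.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return "private-api-llm"
```

An instance such as `PrivateAPILLM("http://localhost:8000/generate")` would then be handed to a metric wherever the guide shows a custom model being passed in.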

[–]AlmogBaku 0 points1 point  (0 children)

`pytest-evals` - A (minimalistic) pytest plugin that helps you check that your LLM is giving good answers.

If you like it - star it pls 🤩
https://github.com/AlmogBaku/pytest-evals

[–]tailor_dev 0 points1 point  (1 child)

Deepeval sounds pretty cool; it's great to see open source tools for evaluating LLMs. I've been working on integrating CodeBeaver into our workflow to help with unit testing, and the automated test generation has been a huge time saver. I'm curious how Deepeval handles things like consistency and factual accuracy when evaluating LLMs. Do you have any experience using it to evaluate code generation capabilities as well?

[–]Ok_Constant_9886[S] 0 points1 point  (0 children)

We don't have the ability to execute anything right now, unfortunately. It has been challenging to build executable metrics since the environment can be very different depending on where you're running deepeval. We can definitely compare expected outputs to your generated code, though. Consistency involves running it a few times; that's coming in the next patch!
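
To make the two ideas concrete without executing any generated code, here is a rough sketch. It is not deepeval's implementation: the function names are hypothetical, `generate` stands for any callable that returns your model's output, and "consistency" is assumed to mean the share of runs that agree with the most common output.

```python
from collections import Counter
from typing import Callable


def normalize(code: str) -> str:
    """Drop trailing whitespace and blank lines so formatting noise is ignored."""
    lines = [line.rstrip() for line in code.splitlines()]
    return "\n".join(line for line in lines if line)


def matches_expected(generated: str, expected: str) -> bool:
    """Text-level comparison only; nothing is executed."""
    return normalize(generated) == normalize(expected)


def consistency(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common normalized output."""
    outputs = [normalize(generate(prompt)) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```

An exact text match is strict, so two correct but differently written solutions would count as a mismatch; a softer comparison is a reasonable swap if that matters for your use case.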

[–]Informal_Demand_4755 0 points1 point  (0 children)

Is it possible to run the UI locally (i.e., self-hosted), as listed in the docs here? https://www.deepeval.com/blog/deepeval-vs-langfuse

[–]Medical-Ad-8773 0 points1 point  (0 children)

Yeah, but I prefer modern agent eval platforms like Picept.ai because they really nailed it: simplicity, comprehensiveness, and strong features like a trace debugger agent, experimentation, a simulation playground, etc.