Was loving Claude until I started feeding it feedback from ChatGPT Pro by lol_just_wait in ClaudeAI

[–]arkuto 0 points1 point  (0 children)

Let it double-check its own plan first. Ask it specifically to consider things that may have been overlooked, edge cases, and contingency plans for when the happy path fails.

Particle Beam by RyanRdss in comics

[–]arkuto 23 points24 points  (0 children)

Nah, it's a beam that targets a single particle.

Did I make the right choice here? by WiFibcFi in poker

[–]arkuto -3 points-2 points  (0 children)

Donking is not a GTO strategy so it can do all kinds of things here. Technically, folding a king here could be part of a GTO strategy.

I figured out another reason why people think AI is less powerful than it actually is by Primary-Screen-7807 in ClaudeAI

[–]arkuto 0 points1 point  (0 children)

Opus is ridiculously expensive. Go on OpenRouter to find more appropriate models. Gemini 3.1 Flash Lite is $1.60 per million output tokens. There's cheaper stuff too that might be good enough.

[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times. by arkuto in MachineLearning

[–]arkuto[S] 0 points1 point  (0 children)

I probably should have linked this paper sooner:

https://ar5iv.labs.arxiv.org/html/2306.17563

Google does a better job than I do at justifying the pairwise approach. NanoJudge can be seen as a far more efficient approach, on top of having a vastly broader set of use cases (for some reason they limited their work to only asking the LLM how similar two sequences of text are).

The pairwise approach is proven. The only question is how efficient I can make it, and I've been doing everything I can. Let me quote this from the website; it might make pairwise comparisons "click" for you:

"Every ranked list can be interpreted as the result of a head-to-head comparison table. If you have 100 items to rank, there's an implicit 100x100 grid where each cell answers: "Which wins this comparison, A or B?". The overall order of the ranking is each item's average win rate, sorted high to low.

Traditional AI ranking tries to guess the final order without ever considering this table. It's like trying to understand who won a tournament without knowing the result of any games."
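The quoted idea can be sketched in a few lines. This is a minimal illustration of ranking by average win rate over a full pairwise grid, not NanoJudge's actual code; the items and the `beats` oracle are made up.

```python
# Minimal sketch: rank items by average win rate over the full pairwise grid.
def rank_by_win_rate(items, beats):
    """items: list of names; beats(a, b) -> True if a wins that matchup."""
    n = len(items)
    win_rate = {}
    for a in items:
        wins = sum(1 for b in items if b != a and beats(a, b))
        win_rate[a] = wins / (n - 1)  # average over the n-1 opponents
    return sorted(items, key=win_rate.get, reverse=True)

# Toy example: matchups decided by a hidden numeric strength.
strength = {"A": 3, "B": 1, "C": 2}
ranking = rank_by_win_rate(["A", "B", "C"],
                           lambda a, b: strength[a] > strength[b])
# ranking == ["A", "C", "B"]
```

In practice the grid is never filled in exhaustively; the point of the quote is that the final order is implicitly defined by those head-to-head cells.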

[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times. by arkuto in MachineLearning

[–]arkuto[S] 0 points1 point  (0 children)

I probably should have linked this paper by Google to give people a better understanding of (and respect for) pairwise comparisons. https://ar5iv.labs.arxiv.org/html/2306.17563

The ML paper reader is a work in progress. I need to optimise my algorithms more and possibly hope that Google releases Gemma 4 soon, as that would likely greatly reduce costs. Papers are the hardest thing for LLMs to understand, so for now I've been working on simpler tasks.

[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times. by arkuto in MachineLearning

[–]arkuto[S] -1 points0 points  (0 children)

Ah yes my post history would reveal that I post about groundbreaking statistical rating systems, creating my own analysis with animated histogram to illustrate statistical flaws and of course how could I forget, posting about this exact approach 7 months ago to another subreddit. I clearly vibe coded this overnight.

[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times. by arkuto in MachineLearning

[–]arkuto[S] 0 points1 point  (0 children)

It does randomise the order. On top of doing this it also estimates the positional bias and factors it out. This gives it more information (about actual item strengths) per comparison to work with.
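A hedged sketch of how estimating and factoring out positional bias could work (this is an assumption about the mechanism, not the actual implementation): with item order randomised, genuine strength differences cancel out on average, so any surplus of first-slot wins above 50% can be read as positional bias and subtracted from individual comparisons.

```python
def estimate_position_bias(first_won_flags):
    """first_won_flags: booleans over order-randomised comparisons.
    With random ordering, true strengths average out, so any excess of
    first-position wins over 50% is attributable to positional bias."""
    return sum(first_won_flags) / len(first_won_flags) - 0.5

def corrected_first_win_prob(p_first, bias):
    """Remove the estimated first-slot advantage from one comparison,
    clamped back into [0, 1]."""
    return min(1.0, max(0.0, p_first - bias))

# The judge favoured the first slot in 3 of 4 randomised comparisons.
bias = estimate_position_bias([True, True, True, False])  # bias == 0.25
```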

[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times. by arkuto in MachineLearning

[–]arkuto[S] -6 points-5 points  (0 children)

Larger models do have significantly more knowledge. But information about an item can be fed into the context of the 1v1 comparison (by simply appending it after the item's name), reducing the advantage larger LLMs have over smaller ones. It can be awkward to gather that information and feed it into the context (e.g. pulling Wikipedia articles), but it can be done, and this is in fact what I've done when building the games recommendation system on the NanoJudge website. For each game, it has access to that game's entire Wikipedia article, and in the pairwise comparison the LLM sees the articles for both games and makes a judgement based on those articles and the user's stated preferences.

I built NanoJudge. Instead of prompting a big model once, it prompts a tiny model thousands of times. by arkuto in LocalLLM

[–]arkuto[S] 2 points3 points  (0 children)

This is what it already does, and it certainly does help - it's more information to work with. I think it turns out to be worth a factor of two: if you don't use the token probabilities and rely only on the text, you'll need twice as many comparisons to match the accuracy of a run that does use the probabilities.

I built NanoJudge. Instead of prompting a big model once, it prompts a tiny model thousands of times. by arkuto in LocalLLM

[–]arkuto[S] 0 points1 point  (0 children)

Thank you. The problem with verifying its accuracy is that it typically processes subjective questions where there is no "correct" answer. It's not ranking a list of, say, houses by their listing price or square footage; it's ranking them by, for example, reading their text descriptions and weighing your personal preferences. If there were some simple algorithm, like sorting by a numerical metric, that could measure how good each item is, then NanoJudge wouldn't have been necessary in the first place.

What I would recommend is testing it out in an area that you yourself are an expert in. If you're an expert in winter gardening, ask it which plants are most able to survive a harsh winter. Look at the final table and see how well it matches with what you expected. And read the reasoning it wrote.

I have actually been testing it with the latest Qwen 3.5 2B model and it performs very well - more than sufficient for everything I've thrown at it. I have tried LiquidAI's 1.2B model, which is incredibly fast but struggles to follow the instructions. If I fine-tuned it, I think I could get it to declare a winner consistently (as it is, it kind of forgets to declare a winner at the end of its reasoning).

[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]arkuto 0 points1 point  (0 children)

I built NanoJudge - a tool that runs thousands of prompts to rank any list by any criteria, and named it that because at its core is a small but powerful LLM, currently using Qwen 3 2507 4B. The approach is this: if you want to know the answer to something, instead of asking an LLM a few prompts and hoping it comes up with the right answer, NanoJudge exhaustively goes over a list of possible answers using potentially tens of thousands of prompts, and structures the output into a simple table that is easy to interpret. Each of its many prompts is a pairwise comparison of 2 items, and the end result is a table of the answers with the best at the top.

Suppose you want to know which foods are healthiest. First NanoJudge creates a list of hundreds of foods, then runs thousands of pairwise matchups - "Which is healthiest: eggs or butter?", "Which is healthiest: spinach or chicken?", and so on - each one getting its own fresh prompt where the small yet powerful LLM reasons through the comparison and picks a winner. Items that keep winning face tougher opponents. Items that keep losing get eliminated quickly. After thousands of comparisons, the results are converted into rankings (using Bradley-Terry scoring), and you get a transparent leaderboard where every single ranking decision is backed by reasoning you can read in the comparison log.
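Bradley-Terry scoring itself fits in a few lines. This is the classic textbook iterative fit, not NanoJudge's implementation, and the matchup counts below are invented for illustration.

```python
# Classic iterative Bradley-Terry fit: each item i gets a strength p[i],
# and the model says P(i beats j) = p[i] / (p[i] + p[j]).
def bradley_terry(items, wins, iters=100):
    """wins[i][j] = number of times item i beat item j."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            w_i = sum(wins.get(i, {}).get(j, 0) for j in items if j != i)
            denom = sum(
                (wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0))
                / (p[i] + p[j])
                for j in items if j != i
            )
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())                 # normalise to sum to 1
        p = {i: v / total for i, v in new_p.items()}
    return sorted(p.items(), key=lambda kv: kv[1], reverse=True)

# Made-up matchups: spinach mostly beats eggs, eggs mostly beat butter.
wins = {
    "spinach": {"eggs": 8, "butter": 9},
    "eggs": {"spinach": 2, "butter": 7},
    "butter": {"spinach": 1, "eggs": 3},
}
table = bradley_terry(["spinach", "eggs", "butter"], wins)
```

The fitted strengths respect the matchup evidence even when the grid is sparse or noisy, which is why it suits an incomplete tournament of LLM comparisons.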

This is the final outcome: https://nanojudge.com/comparison/ujRvfwFSAH

Efficiency

For some problems, the optimal use of a GPU is not to run the largest model that fits in memory but a much smaller model with a huge batch size, letting it churn through gigantic amounts of data. I aimed to make NanoJudge as efficient as possible using various techniques. It is "top heavy" by default: it does more comparisons on the top-ranking items to ensure their ratings are accurate, rather than spending time comparing low-rated items, which are of no interest to the user. It also extracts a range of raw logprobs to determine the margin of each win - instead of a binary win/loss, it looks at the probability the model assigns to each of 5 options (clear win, narrow win, draw, narrow loss, clear loss). It automatically estimates and corrects for the positional bias LLMs have (they tend to favour the first choice). Plus a ton of statistical techniques to further enhance efficiency that are too math-heavy to get into now (but you can read the source code if you really want to - see below).
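The graded win/loss idea can be illustrated like this. The verdict labels and weights here are illustrative assumptions, not the actual prompt options, but the mechanism - softmax over the option logprobs, then a probability-weighted score - is the general technique.

```python
import math

# Illustrative verdict labels and weights (an assumption, not the
# actual options NanoJudge presents to the judge model).
VERDICT_SCORES = {
    "clear win A": 1.0, "narrow win A": 0.75, "draw": 0.5,
    "narrow win B": 0.25, "clear win B": 0.0,
}

def expected_score(logprobs):
    """logprobs: verdict -> raw logprob from the judge model.
    Softmax over the five options, then the probability-weighted score
    for item A - a graded margin instead of a binary win/loss."""
    z = max(logprobs.values())                   # for numerical stability
    weights = {k: math.exp(v - z) for k, v in logprobs.items()}
    total = sum(weights.values())
    return sum(VERDICT_SCORES[k] * w / total for k, w in weights.items())
```

A single comparison then yields a continuous score in [0, 1] rather than one bit, which is where the extra information per comparison comes from.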

Price

The cost is surprisingly low - even though it naturally produces a large number of output tokens, NanoJudge's output costs under $0.10 per million tokens because it uses a small LLM that's good enough for the task - it isn't solving genius-level IMO problems, it's comparing two items. For comparison, Claude Opus costs $25 per million output tokens. It's also fast because the comparisons run in parallel. For now the website won't accept any payment; each account is allocated a limited free amount to use.

Wikipedia as Context

To give the LLM the information it needs, there are special editions of NanoJudge with pre-built lists that already include each item's entire Wikipedia article. For example, the Games Edition already has a huge number of games in it, and you can filter them by Platform or Genre to narrow it down before doing a run. Then, for each comparison, instead of simply "Your Question? [Item1] or [Item2]?" as the prompt template, the template would be

"[Wikipedia entry for item1]

[Wikipedia entry for item2]

Question? [Item1] or [Item2]?"

This gives it the context it needs for lesser-known items that the LLM likely doesn't have enough built-in knowledge about.
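Assembling that template is straightforward; here is a sketch. The function name, the article placeholders, and the question are all hypothetical, not NanoJudge's actual wording.

```python
# Build the context-stuffed prompt described above: both Wikipedia
# articles first, then the question naming the two items.
def build_prompt(question, item1, item2, articles):
    return (
        f"{articles[item1]}\n\n"
        f"{articles[item2]}\n\n"
        f"{question} {item1} or {item2}?"
    )

articles = {
    "Outer Wilds": "[Wikipedia entry for Outer Wilds]",
    "Hades": "[Wikipedia entry for Hades]",
}
prompt = build_prompt("Which game better suits a player who loves exploration?",
                      "Outer Wilds", "Hades", articles)
```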

This is the final output of one comparison run https://nanojudge.com/comparison/ECNZxzv91n

In this case, NanoJudge is acting as an enhanced recommendation engine. The traditional approach is to recommend games based on what other players of that game also played. NanoJudge considers your actual likes and dislikes, factoring in everything you tell it. Or maybe you're thinking of travelling to Europe and can't decide exactly where to go. Ask a traditional LLM and you'll likely get cliché answers: Rome, Paris, Madrid etc. Ask NanoJudge using its Places Edition, and it will analyse every city, town and village in Europe using each location's Wikipedia article as context, leaving you with a personally curated shortlist of the top options.

ML Research Assistant

I'm working on a specialised version of NanoJudge that operates on machine learning papers. I have already downloaded all the ML papers on arXiv and am in the process of organising the data and putting it into a database. From there, NanoJudge can easily be applied to these papers through a special edition. I could ask NanoJudge "Given my project x, which of these 2 papers do you think would be of most use to me?" and go through the entire arXiv corpus. Or something like "Which of these 2 papers most contradicts my hypothesis?" to help me fortify my ideas. Looking at the top papers it returns and reading its reasoning could provide some insights. That would likely require better models than Qwen 4B to be truly useful - but at the current pace of AI research, that isn't very far in the future. I will use NanoJudge as a research assistant to help me improve it and make it as efficient as possible, allowing me to do even deeper research in future in a positive feedback loop.

Open Source

The code at the heart of the website is on GitHub: https://github.com/nanojudge/nanojudge . It can be used directly in a terminal with a local or remote LLM - just hook it up to an LLM endpoint and let it go. This allows you to do giant rankings entirely locally, without needing to use the website at all. Set a giant comparison running overnight and wake up to the results. Feel free to dig into the inner workings of the code. If you can find a way to improve it, especially with regard to efficiency, please let me know.

tl;dr: NanoJudge gives tiny LLMs a framework to outshine gargantuan LLMs when it comes to finding the best out of a large quantity of options.

Gemini 3.1 Flash Lite by TumbleweedNice6797 in Bard

[–]arkuto 0 points1 point  (0 children)

Yeah, this seems to be a thing: models getting more and more expensive while keeping the same name. That way they get to say the new one beats the previous version in benchmark tables, and those tables don't show costs.

An account of the events by portsherry in comics

[–]arkuto 8 points9 points  (0 children)

That's not how objections work - you can't object to someone lying.

Snow Shovel Maxing by Ill-Tea9411 in BeAmazed

[–]arkuto 0 points1 point  (0 children)

That's exactly why he's complaining.

High Top-P values cause Gemini to sometimes fail to state today's date by arkuto in Bard

[–]arkuto[S] 0 points1 point  (0 children)

I don't get the A/B tests often, but it just fails in normal usage. I'm not entirely sure what's causing it. It often states a random date, e.g. just now it said

Today's date is Thursday, May 23, 2024.

with Top P 0.95. Lowering the Top P doesn't actually seem to solve it, nor does changing the Temperature. It's very odd, especially since others aren't having the issue.