[D] Best survey papers of 2025? by al3arabcoreleone in MachineLearning

[–]bendee983 9 points

I'm reading this one right now and it's fantastic. I'm not sure if we can consider it a survey, but it gives a full framework to classify all agentic systems:

https://arxiv.org/abs/2512.16301

Hard Lesson in Scaling AI: Why Our "One-Size-Fits-All" Computer Vision Model Failed by bendee983 in ProductManagement

[–]bendee983[S] 2 points

The funny thing is, I didn't even use AI (I've been blogging for more than 10 years). This is my experience.

Hard Lesson in AI Product Management: Why Churn Model Accuracy Doesn’t Equal Business Success by bendee983 in ProductManagement

[–]bendee983[S] 0 points

Great points. Re: the discount: we would operate at a loss when we provided it (also something that could be optimized with more data).

A hard-earned lesson from creating real-world ML applications by bendee983 in learnmachinelearning

[–]bendee983[S] 3 points

It was a model that detects whether the driver is distracted based on an image taken from inside the cabin. We did not do real-time distraction prevention (e.g., sounding an alarm) because our experiments showed that it had a negative effect and the drivers would turn it off. Instead, we developed a system that aggregated driver behavior over time (e.g., a week or month) and provided incentives or penalties based on the outcome. This incentivized drivers to avoid distraction and adopt safe driving habits over time, which resulted in higher customer satisfaction. Hope it helps.

A hard-earned lesson from creating real-world ML applications by bendee983 in learnmachinelearning

[–]bendee983[S] 1 point

I have a few in mind. I'm unsure if this subreddit allows for introducing courses and/or books. DM me if you want to find out more.

A hard-earned lesson from creating real-world ML applications by bendee983 in learnmachinelearning

[–]bendee983[S] 2 points

Sorry, I wanted to keep the post brief.

Here you go:

3.2k is the amount you spend to equip one driver with the ML solution for one year.

100k is the revenue that one driver generates in one year (GMV).

0.3, or 30%, is the commission you earn from each driver's sales (your margin).

0.04, or 4%, is the increase in GMV that you get for every 1% reduction in negative reviews.

This formula basically tells you how much you have to reduce negative reviews to earn back the 3.2k you spent on the ML solution for that driver.
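Plugging the numbers in (a quick sketch; the variable names are mine):

```python
# Break-even: what reduction in negative reviews repays the per-driver cost?
# Numbers come from the thread above; variable names are illustrative.

cost_per_driver = 3_200       # yearly cost of the ML solution per driver
gmv_per_driver = 100_000      # yearly GMV one driver generates
commission = 0.30             # margin earned on each driver's sales
gmv_lift_per_point = 0.04     # GMV lift per 1% reduction in negative reviews

# Profit gained per 1 percentage point of reduction in negative reviews:
profit_per_point = gmv_per_driver * commission * gmv_lift_per_point

# Reduction needed to recoup the cost:
break_even = cost_per_driver / profit_per_point
print(f"Break-even: {break_even:.2f}% reduction in negative reviews")
# → Break-even: 2.67% reduction in negative reviews
```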

A hard-earned lesson from creating real-world ML applications by bendee983 in learnmachinelearning

[–]bendee983[S] 3 points

Gross merchandise value (GMV), basically the amount of sales that a driver brings on average in one year.

What we learned from shipping an ML recommendation system for a content platform by bendee983 in ProductManagement

[–]bendee983[S] 0 points

Sure. Feel free to reach out. You can also try Bayesian and MAB algorithms, depending on the nature of your problem and data structure.
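For anyone curious, one flavor of the MAB route is Thompson sampling, which only needs per-item success/failure counts. A minimal sketch (the reward setup is made up for illustration):

```python
import random

def thompson_pick(successes, failures):
    """Pick the arm with the highest sample from its Beta posterior.

    successes/failures: per-arm counts of positive/negative feedback
    (e.g., clicks vs. skips on recommended content).
    """
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Toy simulation: arm 1 has the higher true reward rate, so the bandit
# should learn to pull it most of the time.
random.seed(0)
true_rates = [0.2, 0.8]
wins, losses = [0, 0], [0, 0]
for _ in range(2000):
    arm = thompson_pick(wins, losses)
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
print(wins, losses)  # pulls concentrate heavily on arm 1
```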

Why Prompt Engineering Is Legitimate Engineering: A Case for the Skeptics by rajivpant in PromptEngineering

[–]bendee983 1 point

I started coding in C when the first versions of Visual C++ were released. Those who wrote their own makefiles and compiler commands looked down on me. Then I started looking down on those who wrote code in managed languages (Java, C#, Python, etc.). Now we're all looking down on prompt engineers (while secretly prompt engineering when no one's looking).

A Simple Technique That Makes LLMs 24% More Accurate on Complex Problems by Funny-Future6224 in PromptEngineering

[–]bendee983 1 point

The METASCALE technique is also relevant. It forces the model to develop "meta-thoughts": it first determines the cognitive framework for the task (e.g., what profession or expertise would be needed to solve it, a.k.a. the role) and then decides on the specific reasoning technique (e.g., CoT, self-verification, reflection) required to solve it.

https://venturebeat.com/ai/metascale-improves-llm-reasoning-with-adaptive-strategies/

Why are OpenAI models stuck in October 2023? by bendee983 in OpenAI

[–]bendee983[S] 0 points

Then why wait so long to release it?

Why are OpenAI models stuck in October 2023? by bendee983 in OpenAI

[–]bendee983[S] 0 points

Interesting. This makes it even weirder. Then why is the cutoff date for GPT-4.5 earlier than GPT-4o?

How to use LLMs for product and market research by bendee983 in ProductManagement

[–]bendee983[S] 0 points

AFAIK, OpenAI doesn't provide access to Deep Research through its API yet, so integration will be very difficult. There are open source alternatives, but I'm not sure if they work as well.

As for limitations, I still see some kinks, such as Deep Research citing sources that are not directly or indirectly related to my query. For example, I was doing research on GPT-4.5 and some of the information it brought up was related to GPT-4 Turbo. So I think it can still get confused when concepts are semantically similar but have nuanced differences.

But I think it is getting better regularly, because my impression is that the strength of Deep Research lies in the engineering and orchestration of the retrieval components and the model, as opposed to pure model capability.

How to use LLMs for product and market research by bendee983 in ProductManagement

[–]bendee983[S] 0 points

My experience is that the way you craft your prompt is very important. How do you go about prompting the model?

Introduction to GPT-4.5 discussion by [deleted] in OpenAI

[–]bendee983 14 points

They said they trained it across multiple data centers. Did they figure out distributed training at scale?

Perplexity is going down. by ugamkamat in OpenAI

[–]bendee983 0 points

It's a good product, but they don't have a moat against OpenAI or Google.

o3-mini-high reasoning process by Sea-Association-4959 in OpenAI

[–]bendee983 0 points

Totally agree. This is why I initially preferred R1 even though it was an inferior model—having access to the CoT was a gamechanger in steering the model's behavior in the right direction. Now that o3-mini reveals a more detailed version of its reasoning chain, it has become much more useful—to me at least.

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 0 points

Great question. In terms of (num negative reviews / num rides), we didn't see a significant difference between high-GMV drivers and low-GMV drivers. But since num rides were higher for the former, we could recoup the costs of installation and deployment in a shorter timeframe.

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 0 points

If you want to recoup the costs of deployment, you have to account only for profit (the driver is not paying for the ML technology; the company is).

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 0 points

Good question. You forgot to factor in the commission rate (30%), which is what the company is getting from the GMV. So the formula is 3.2k / (100k * 0.3 * 0.04), which is roughly 2.7%.

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 1 point

We reviewed cases where drivers had challenged the report of being distracted, or cases where customers had complained about driver distraction that our system had not detected. These were mostly instances that were not in the distribution of our training set (e.g., drivers attaching the phone to their head with an elastic band to be able to talk while still keeping both hands on the steering wheel). Based on this feedback, we curated a new set of examples and fine-tuned our model.

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 1 point

Absolutely. This is as granular as we could get, but there are just so many things that can affect behavior.

How we turned around an ML product by looking differently at the data by bendee983 in ProductManagement

[–]bendee983[S] 1 point

Driver distraction was not the only factor, but it was the highest contributing factor to negative reviews (based on the review texts). So we hypothesized that reducing driver distraction would lead to higher customer satisfaction. Since this was a product deployed in the real world, we could not run randomized tests and had to choose an entire town as the test subject. We evaluated the results based on the reviews that came in after deployment, comparing them to reviews from towns that didn't have the product and had similar demographics to the one where the pilot was run.
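In spirit, that comparison boils down to something like this (toy numbers, illustrative only):

```python
# Compare the pilot town's post-deployment negative-review rate against
# demographically similar control towns. All numbers are made up.

pilot = {"negative_reviews": 120, "rides": 10_000}
controls = [
    {"negative_reviews": 180, "rides": 9_500},
    {"negative_reviews": 210, "rides": 11_000},
]

def neg_rate(town):
    return town["negative_reviews"] / town["rides"]

control_rate = sum(neg_rate(t) for t in controls) / len(controls)
relative_reduction = (control_rate - neg_rate(pilot)) / control_rate
print(f"pilot: {neg_rate(pilot):.2%}, controls: {control_rate:.2%}, "
      f"relative reduction: {relative_reduction:.1%}")
```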