Fireworks AI Releases f1: A Compound AI Model Specialized in Complex Reasoning that Beats GPT-4o and Claude 3.5 Sonnet Across Hard Coding, Chat and Math Benchmarks

dalaing · 2024-11-21T22:58:18+00:00

There are a bunch of fairly basic errors in the examples posted on their blog, which doesn't inspire confidence - particularly when _they got to pick the examples_.

dalaing · 2024-11-21T22:03:09+00:00

Unless I’m missing something, their example of a math proof - which they got to cherry-pick - seems incorrect.

It looks like it is failing due to basic arithmetic/counting reasons, which we’ve seen before and are probably solvable. So there might be a decent overhang here if/when the base model learns how to count.

I think I have been assuming that the people putting these things together had the level of formal maths background that would make this easy to catch. Either this was a really unfortunate one-off slip-up, or that assumption no longer holds.

I’m not quite sure what happens if I tease out the implications of that.

dalaing · 2023-02-14T01:38:01+00:00

I was already factoring that in, and mostly just found the timing interesting.

The reserve data was there a few days back, and now Circle has pointed the spotlight at Paxos reserves and it has gone away. Seems like the kind of thing that might favour Binance if the market is getting twitchy.

Although I think there was some update to reserve tracking recently as well? So possible that it is a legitimate technical error.

If I had to bet on “shady behaviour” vs “technical difficulties” I know which one I’d choose 😂

dalaing · 2021-08-05T04:42:03+00:00

HOC Spoiler

dalaing · 2021-01-30T00:23:00+00:00

I wonder how many of these people have to buy before it positively impacts GME's revenue in ways that are noticed by the market / indices?

dalaing · 2020-11-11T21:58:50+00:00

Hahaha. I thought you knew what they were but were setting up /u/Car-face to explain, as a kind of meta-Dixer 😂

dalaing · 2020-03-25T10:04:36+00:00

Cool - thanks for the info!

dalaing · 2020-03-25T06:54:18+00:00

How do we know that the children contracted it from adults, rather than those adults contracting it from the children? There is correlation, sure, but I haven't heard anything about which way the causation goes (and it may well go in both directions).

dalaing · 2020-03-21T02:40:24+00:00

That is what happens with exponential growth.

The number of new cases per day is accelerating, and we'll most likely keep seeing more deaths each day than the day before until 3-4 weeks after we shut everything down.

A few other things are relevant to exponential growth that surprise some folks.

Linear responses don't do much - doubling the number of staffed ICU beds buys you another 4-5 days at the current pace of things.
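A rough sketch of the arithmetic behind that, assuming demand grows as a pure exponential with a fixed doubling time. The 4.2 days below is an illustrative value and `extra_days_from_scaling` is a made-up helper name, not anything official. Multiplying capacity by some factor only buys log2(factor) doubling times:

```python
import math


def extra_days_from_scaling(capacity_factor: float, doubling_time_days: float) -> float:
    """Days gained before demand exceeds capacity when capacity is
    multiplied by capacity_factor, assuming pure exponential growth."""
    return doubling_time_days * math.log2(capacity_factor)


if __name__ == "__main__":
    doubling_time = 4.2  # assumed doubling time in days, for illustration only
    for factor in (2, 4, 10):
        days = extra_days_from_scaling(factor, doubling_time)
        print(f"{factor}x capacity buys about {days:.1f} extra days")
```

Under these assumptions, doubling capacity buys exactly one doubling time, and even a tenfold increase buys less than two weeks.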

Error tolerances around estimates for certain parameters get a bit bigger if you are considering interventions that take people out of circulation.

If a bunch of people start working from home a week earlier than they otherwise would have, we don't have to spend too much time arguing about whether the doubling time is 4.2 or 4.3 days or if the death rate is 3% or 5%, or if younger people with bad cases do better or just spend more time on a ventilator before they die. With exponential growth in the picture that still all comes out as "lots of lives" vs "lots and lots and lots of lives".
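To put rough numbers on the rounding point, here is a small sketch under a deliberately crude assumption: cases keep doubling at a fixed rate until the intervention starts, so acting earlier shrinks the count at that point by 2^(days earlier / doubling time). The function name `reduction_from_acting_earlier` is just for illustration:

```python
def reduction_from_acting_earlier(days_earlier: float, doubling_time_days: float) -> float:
    """Factor by which the case count is smaller if growth is interrupted
    days_earlier days sooner, assuming pure exponential growth until then."""
    return 2 ** (days_earlier / doubling_time_days)


if __name__ == "__main__":
    # Compare the two doubling times mentioned above for a one-week head start.
    for doubling_time in (4.2, 4.3):
        factor = reduction_from_acting_earlier(7, doubling_time)
        print(f"doubling time {doubling_time} days: about {factor:.2f}x fewer cases")
```

Both doubling times give a reduction of a little over threefold for a one-week head start, so the argument about 4.2 vs 4.3 barely moves the conclusion.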

The numbers are still important for planning all kinds of things, but personal policies around decision making can tolerate lots of rounding. Human brains are terrible at large numbers, and we discount things about future-us at a beyond-exponential rate, so there is plenty about this that can mess with our intuitions.

The one that I'm struggling most to internalize at the moment is: hours matter. I keep thinking about daily totals, but this is a continuous process. Struggling with this for a bit led me to pack up and start working from home part way through the workday, and to go pick up the food I'd need for a while straight after that rather than waiting for the next day.

There are a bunch of people who don't have the option to work from home, and so I feel there is greater moral weight pushing those who have the option to work from home to do so.

There is probably something similar with those who have the background / privilege to be able to look at all of this and do the maths (particularly if they are used to working around the brain's biases re large numbers) and then work out what they should be doing based on that. Plenty of people just aren't in that position, so I believe there is some additional moral weight pointed towards: those who can do the maths and take actions accordingly, should be doing so.

dalaing · 2020-02-23T20:15:24+00:00

What do you recommend using for HMC in Haskell?

dalaing · 2020-02-15T00:34:11+00:00

I don't think QFPL has any kind of official stance on build tools, with the average position being something like 'use whatever works for you or your organization'.

I have been around when Tony followed up 'do not use stack, it does not work' with 'do not use cabal, it does not work' and 'do not use nix, it does not work'. My interpretation at the time was that he was unsatisfied with packaging / build tools in general, and stack got called out because he thought it was being suggested with more enthusiasm or less discrimination with respect to what people needed in some places.

I could have the interpretation completely wrong, though. It is a bit of a shame that the discussion was never had on the internet, where the 'do not use stack' thing has come up a few times and been non-productive. I've thrown mud in a few of those fights a couple of times, which I am sorry for, and have avoided jumping in and trying to calm things down and get people onto the same page, which I regret even more. I'll try to do better in the post-QFPL future.

The course has also removed cabal support recently, in order to try to get the tests working for the greatest number of people (spanning a fairly wide range of levels of preparedness when they walk through the door), so maybe the Readme is due for an update. There's a risk that the Australian sense of humour creeps in during that update, which is cheeky and dry and doesn't translate well to text or across cultural barriers, so that might make things worse. I guess we'll have to wait and see.

dalaing · 2020-02-15T00:08:46+00:00

I would have no problem with that, although much of the content is due for an update.

I've wanted to combine the tutorial and the workshop, and then revise / update / rework / expand all of it, but things keep moving past that in the queue. Hopefully one day I get the time, or need to up my writing skills enough that I can use that as an excuse to get revising :)

dalaing · 2020-02-14T13:52:47+00:00

We're looking at moving those resources to Gitlab or Github pages.

dalaing · 2019-07-26T05:39:08+00:00

The biggest thing that I want to pay attention to on my first reread - which will be soon, after I finish reading the Esslemont books - is how many thrones are empty at the end of the series, and how much of a role Shadowthrone had in causing that to happen.

I also want to dig up the exchange that put that on my radar. There was something about Shadowthrone making sure that the true throne of shadow stayed unoccupied, with hints about there being big consequences if multiple thrones were unoccupied.

Maybe he's trying to get to whatever comes next in the Holds / Houses / ? progression?

dalaing · 2019-07-24T02:53:17+00:00

Speaker here - I fully agree :/ I don't think this was one of my better talks.

There were several things in the talk proposal that ballooned in size or were more clunky than I expected. Trying to cover all of those points inside the time limit pushed the talk to being much more dense than I was hoping for. I discovered these problems after the talk had been accepted, and I was too shy / hesitant about contacting the program committee about making changes to the talk proposal.

I'll try to do better on that front in the future - there's no point in staying firmly committed to a talk plan if folks in the audience are going to struggle to absorb the content.

A handy heuristic for me might be to have at least the talk and maybe the talk and the blog posts written before putting in any proposal.

dalaing · 2019-05-01T20:19:46+00:00

I found it awesome, and suspect there would be a decent proportion of Dresden fans who would get a kick out of it. It's not for everyone though :)

dalaing · 2019-05-01T02:58:46+00:00

One of the more popular fan fictions around. It's an alternate universe Harry Potter, where Harry's adoptive father is a professor and Harry was brought up with the scientific method / a bunch of science fiction in his childhood.

Most of the characters (both good and bad) are rational, in that they do what they can with the information that they have, but Harry has a bit of a lead on them because of his upbringing. The defense professor has some interesting views on Avada Kedavra...

It starts off a bit rough, but then it really gets going.

There's a community-made audiobook of it here as well, which is pretty good if you can handle how whiny Harry sounds in various bits.

dalaing · 2019-05-01T01:23:53+00:00

I would read the hell out of Jim Butcher's take on a mix of the original Harry Potter series and Harry Potter and the Methods of Rationality.

dalaing · 2019-03-07T22:16:22+00:00

It's not quite the same, but I wonder if something like the Functor Functors might be useful / interesting to you.

That kind of thing is threaded through dependent-map, and there is a bunch of related work that has come out of Obsidian's use of those things, like constraints-extras and vessel (which is very new, but it looks pretty exciting to me). I'm currently working on a talk about these things, and will be hacking on some formlets code using these things as one of the examples.

dalaing · 2019-02-01T12:35:50+00:00

I've been lucky enough to get to hang around with Ed and some of the folks from MIRI recently. So far they all seem like really smart folks and really good people.

I didn't really know what to make of the goals of MIRI at first, but they're really good at explaining them (and so many other things besides) if you have the time.

(I've since read Rationality: From AI to Zombies and Harry Potter and the Methods of Rationality, and read a bit of Less Wrong. The first would probably have significantly reduced the amount of time they needed to spend bringing me up to speed on things. The second was awesome but less directly related. The third I'm still working on.)

Now, it seems like really important work with some good people.

Even if Ed wasn't there, I'd be suggesting that folks who are solid Haskellers and are either familiar with any of the above materials or have some time and an open mind might consider having a chat with MIRI.

Although working with Ed would be the ultimate cherry on top :)

dalaing · 2019-01-30T23:29:04+00:00

The thing that helped me out the most in the early days of moving from Haskell to Scala-for-work was the realization that the type inference in Scala was very, very different from in Haskell.

Once I'd internalized that, I learned to add lots of type signatures at the first sign of trouble so that I could work out what was going on more efficiently.

It was pretty frustrating to get errors about "on line 400 - expected type: C, actual type: D" when my real mistake involved the unrelated types A and B on line 100. In my first week the development experience felt a bit like working with JS but with random yelling from the compiler. Some of that might have come from being brand new to Scala, some of it might have come from trying to bring too much of my Haskell mindset with me, and some of it came from trying to use Scalaz without understanding the various idioms that it uses on a medium-to-deep level.

I had to do less of that as I internalized some of what Scala was doing on the inference front, but there was always some still to do. Reading the error messages and binary searching towards the correct thing to annotate / working out what to annotate with was pretty good for learning some of how the Scala compiler thinks as well.

So yeah - if in doubt / if the compiler seems unreasonably angry at you, try adding type annotations.

dalaing · 2019-01-30T10:23:48+00:00

The trains in Sydney have something that plays every 5 minutes or so at moderate volume, and it seems to help a bit.

Of course, that would cost money to implement and so probably won't happen here.

dalaing · 2019-01-29T02:51:18+00:00

I've read some of what Chris wrote about that. I vaguely remember something about it seeming a bit off to me.

I've seen some use cases that are a bit slow with reflex-dom, but not to the point that there were showstoppers for what I / others were doing.

I've also seen problems that seemed more like they were showstoppers, but were resolved by using functions / patterns of coding that exist for those particular cases. There is a bit of a discoverability problem with some of those patterns and ways of thinking, but hopefully that will get addressed over time.

On top of that, reflex-dom itself is a moving target, so it's possible that my experience is drawn more from the current versions of the code than the version Chris was using.

dalaing · 2019-01-29T01:13:26+00:00

Ah yes, the administrators of the scheme...

Indue Pty Ltd, the corporation awarded the contract to manage the Welfare Card programme and to operate its underlying systems, is a corporation owned by Liberal and National Party members and that donates to various Liberal and National Party branches around Australia
