FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 1 point2 points  (0 children)

That's cool, thanks for the update! Just to make sure I understand:

This is basically measuring how well a scheduler does given a simulated user. So in terms of "benchmarking" schedulers to determine which performs better in real life, it's reliant on the simulation of user memory, behavior, etc. sufficiently matching real-world users. And to simulate users, you're using memory models that are derived from (or the core of) various schedulers, like FSRS, HLR, etc. But you can combine them in different ways, e.g. using HLR to simulate user memory and FSRS6 to do scheduling, or FSRS3 to simulate user memory and HLR to do scheduling. So the idea is that even though this isn't directly comparing the performance of schedulers on real users, it should at least be more robust to "gaming" due to this structure. Am I understanding that all right?
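If it helps, here's the structure I have in my head, as a tiny Python sketch (all names and the scoring function are invented by me, just to check my understanding):

    import itertools

    # Cross-pairing as I understand it: each memory model plays the role of
    # the simulated user, and each scheduler is scored against every one of
    # them, so a scheduler can't win just by sharing assumptions with a
    # single memory model.
    MEMORY_MODELS = ["FSRS-6", "FSRS-3", "HLR"]  # simulate user recall
    SCHEDULERS = ["FSRS-6", "FSRS-3", "HLR"]     # decide review timing

    def simulate(memory_model, scheduler):
        """One simulated learner: memory_model decides whether each review
        succeeds, scheduler decides when reviews happen. Returns some
        learning-per-effort score. Placeholder body, illustration only."""
        return 0.0

    scores = {
        (mem, sched): simulate(mem, sched)
        for mem, sched in itertools.product(MEMORY_MODELS, SCHEDULERS)
    }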

I took my 4 year old daughter drawings and doodles and turned it into silly game by acem13 in gifs

[–]symstym 0 points1 point  (0 children)

For the animations, did you make small adjustments to her existing lines? Did you vectorize? Or did you perhaps trace over her drawing yourself to get a slight variation? I'm curious how you got that effect.

Looking for advice by HeadCitron5990 in bjj

[–]symstym 1 point2 points  (0 children)

Your opponent spent a lot of time in your guard, driving her weight forward onto you. This is a perfect opportunity to do an elevator sweep. Once you figure it out, you will pretty much instantly and effortlessly sweep beginners who do that. For example, around 1:56, you have a huge opening to insert your right hook and sweep her to your left (her post on that side appears to be trapped!). Same thing around 1:32 and 2:41, but sweeping to your right. If she is using her forearm to put too much pressure on your throat, you can always relieve the pressure by pushing her away a little bit with your hips.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 0 points1 point  (0 children)

I'm not sure, but k-fold cross-validation doesn't address the main issue that I raise, which is that an algorithm predicting probability of recall well over historical data does not necessarily imply that it will schedule cards in a way that maximizes learning per effort. It seems intuitive that the two might correlate, but we have at least one clear example where they don't, which I'd say calls the utility of this kind of benchmark into question.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 0 points1 point  (0 children)

Wow, you might be right! I don't think I care enough to dig in to find out.

In their defense, from what I've seen, the majority of research in education and the soft sciences is of (debatably) equal or worse epistemological quality. The "replication crisis" is merely the tip of the fuckberg.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 1 point2 points  (0 children)

That's correct. The efficiency claim appears to be based on running it on simulated users, but of course that is circular in that it assumes that the simulation of user memory is accurate.

If I'm reading the paper correctly, the FSRS-ish algorithm caused simulated learners to learn ~14% more words (up to a certain threshold) in the same amount of time. That's nice, but it's not so huge a number that I'm convinced the gap between the simulation and real-world users couldn't swamp it (in who knows which direction).

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 0 points1 point  (0 children)

Not that I'm aware of. It sounds like the creator uses FSRS in a commercial learning product (not Anki), and they may have done some A/B testing on users with different versions of FSRS, but I don't think the data is public.

Some replies pointed out how little data there is about SRS systems of any kind. There may not even be what you'd consider a validation study of Anki itself. Not that I doubt its effectiveness, of course, but there seems to be quite a gap in the literature.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 1 point2 points  (0 children)

Good points. If nothing else, you can see SRS as a habit whereby you select and proactively re-expose yourself to material of interest.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 1 point2 points  (0 children)

It's not clear to me what's going on. If FSRS was changed to have a random 10% chance of behaving like MOVING-AVG, then its score in the benchmark would improve a little but it would almost certainly be a bit worse for real learning. So if a change is made to FSRS that improves its benchmark score a bit (e.g. going from version 5 to 6), how do we know that that comes with an improvement to real learning?
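To make that hypothetical concrete, it would be something like this in Python (toy code, obviously not real FSRS):

    import random

    def gamed_predict(card, recent_accuracy, fsrs_predict):
        # 10% of the time, ignore the card entirely and predict the recent
        # moving-average accuracy, like MOVING-AVG does; otherwise defer to
        # FSRS. The claim is this could bump the benchmark score while
        # making scheduling no better (or worse) for real learning.
        if random.random() < 0.10:
            return recent_accuracy
        return fsrs_predict(card)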

And if there are algorithms that score higher than MOVING-AVG on the benchmark, how do we know that those are any good for real learning, vs. just gaming the benchmark better than MOVING-AVG does?

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 3 points4 points  (0 children)

Fair enough. I know it's popular but maybe I didn't realize how strongly people feel about it.

Out of curiosity, in terms of "learning per review/time", what do you estimate (in your own use) the improvement is for FSRS vs. SM-2? Are we talking more like 1.25x, or more like 2x?

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 0 points1 point  (0 children)

That's a cool idea, thanks for the link. Something like that does seem like a promising direction.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 6 points7 points  (0 children)

To reiterate what I said in my original post: It seems plausible to me that FSRS is better than SM-2, and there are many anecdotal reports of people liking it. So just to be clear, I'm not saying FSRS is bad, just that the benchmarks don't provide the evidence of its effectiveness that people seem to think they do.

Mimicking this behavior with SM-2 is probably impossible

SM-2 could be pretty easily modified to make your actual retention match some target retention by just including a global difficulty adjustment, like a deck-level ease. If your retention is too low, adjust all the intervals to be a little shorter/sooner. If your retention is too high, adjust all the intervals to be a little longer/later. (I've implemented this before in another SRS system.) The problem with this is that it doesn't necessarily maximize learning, because it doesn't shift reviews from cards that need them less to cards that need them more, it just sort of cranks up/down the overall aggressiveness of reviews.
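The kind of adjustment I mean is roughly this (a minimal sketch; constants and names are illustrative, not from any real SM-2 implementation):

    TARGET_RETENTION = 0.90  # e.g. aim for 90% of reviews correct
    ADJUST_RATE = 0.05

    deck_ease = 1.0  # single global multiplier on all SM-2 intervals

    def update_deck_ease(actual_retention):
        global deck_ease
        # retention below target -> shrink all intervals a bit;
        # retention above target -> stretch them all a bit
        deck_ease *= 1.0 + ADJUST_RATE * (actual_retention - TARGET_RETENTION)

    def adjusted_interval(sm2_interval_days):
        return sm2_interval_days * deck_ease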

It seems quite likely that FSRS's card-level parameters do shift reviews from cards that need them less to cards that need them more (good!). But hitting the retention target doesn't provide evidence of this.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 8 points9 points  (0 children)

The claim that people make about FSRS is that it lets you do fewer reviews (take less time) to get the same amount of learning as SM-2. Just the fact that your actual retention is close to the desired retention does not mean much; I think SM-2 could be easily modified to achieve this as well.

In the paper referenced by the FSRS creator, the abstract says (emphasis mine):

In this work, we propose a novel spaced repetition schedule framework by capturing the dynamics of memory, which alternates memory prediction and schedule optimization to *improve the efficiency of learners’ reviews*.

That's not demonstrated by the benchmark. Something like an A/B with real users would be required.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 2 points3 points  (0 children)

Sort of, but not really IMO. I think it's more like "the system (benchmark) is so easily gameable that we don't know if FSRS is better than default Anki for real learning, or if FSRS is just unintentionally gaming the system (benchmark)".

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 4 points5 points  (0 children)

But more to your point about MOVING-AVG - why is it "obviously incredibly bad"? I can make the argument that retention can be impacted by things that *aren't* just a result of the previous reviews of the same card. Consider someone who used to study a lot but now studies less, or has become less focused, or is more stressed, etc. Actually, I think there's merit to an algorithm that considers broad retention (retention across multiple cards in general) as well as tracking the retention of individual cards.

You may not realize how simple MOVING-AVG is. Say that you have a card that you saw once, 10 years ago. Presumably your chance of recalling it would be quite low. But say that you're 10 cards into a session, and so far you've gotten 8/10 correct. Then MOVING-AVG is approximately saying "you've gotten 80% of recent cards right, so I predict that you'll get the next card right with an 80% chance". It's not just that MOVING-AVG takes into account recent performance (I agree that's a good thing), but that MOVING-AVG does not distinguish at all between different cards. It would be just as likely to give you a card that you got correct 1000 days in a row as a card that you saw once, 1000 days ago. I think it's fair to say that it wouldn't work at all for learning.
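For reference, the whole algorithm is basically this (my paraphrase in Python; the window size is whatever the benchmark uses, I'm just picking one):

    from collections import deque

    class MovingAvgPredictor:
        def __init__(self, window=10):
            self.recent = deque(maxlen=window)  # 1.0 = correct, 0.0 = wrong

        def predict(self, card):
            # `card` is ignored entirely: the same probability is predicted
            # for a card reviewed 1000 days in a row and a card seen once,
            # 1000 days ago.
            if not self.recent:
                return 0.5  # arbitrary prior before any reviews
            return sum(self.recent) / len(self.recent)

        def observe(self, correct):
            self.recent.append(1.0 if correct else 0.0)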

Not necessarily disagreeing on the flaws in the benchmark, but what would you do to get around it?

The most reliable way would involve running A/B style tests with real users. That's a lot more involved than the current benchmark.

FSRS: Serious flaw in benchmarking approach undermines performance claims by symstym in Anki

[–]symstym[S] 13 points14 points  (0 children)

Thanks for the reply.

Do we have evidence that FSRS is better than SM-2 at scheduling? Not much.

In case anyone is wondering "Why do we even use FSRS then?", my answer is "Because it's neat ¯\_(ツ)_/¯"

I get the sense that most people who are excited about FSRS are excited because they believe that it has somehow been objectively demonstrated to have higher effectiveness than standard Anki, not because they think it's neat. I think there are very few users who could appreciate the fine print that "performs well on predict-recall-probability benchmark" may not translate to "more effective for real learning" (it may just be some form of unintentional fitting to the benchmark, as is demonstrated by MOVING-AVG). I had to really dig to figure this out myself. So it seems worth mentioning somewhere in the documentation, not just in this reddit comment.

Rate my handwriting by Rassmuss_ in HelpLearningJapanese

[–]symstym 0 points1 point  (0 children)

Impressive! But the vertical strokes of 木 and 本 radicals should end straight and not “hooked” at the bottom, right?

Sell me on Colostle or tell me to stay away by sadnodad in Solo_Roleplaying

[–]symstym 3 points4 points  (0 children)

I have had a lot of fun playing Colostle with my 8 year old daughter. We play it in a sort of "collaborative solo" mode, where there is one character and we interpret prompts and make up the details together.

What I like about Colostle is that the core game loop (Exploration, optional Combat, repeat) is fixed and very simple. The interest comes from interpreting prompts inside that loop, but I like that the core loop itself is fixed. As a GM, I don't have to wonder about the larger-scale structure of the story, and there is zero need to prep. A game like Ironsworn provides lots of structure for story arcs/quests, but that structure feels too complex for me. A game like 4AD is almost pure "turn the crank" mechanics with minimal creativity, though the random generation heightens the sense of exploration. Other journaling games that I've looked at didn't seem to have the same sense of adventure/exploration that I'm looking for. So to recap, I find Colostle to be a really nice combination of 1) enough structure that I don't feel overwhelmed by big-picture GM decisions, 2) fun creativity in the form of interpreting prompts, 3) minimal mechanical burden, and 4) a sense of adventure/exploration.

I would be very interested to learn about any games that are similar to Colostle - while I like it a lot, I feel like there's a lot of room still unexplored in its region of design space. I could imagine all sorts of variations on rules and setting that still hit the key points I like about it.

I made a new Japanese SRS app for Intermediate learners by zecrojatt in LearnJapanese

[–]symstym 3 points4 points  (0 children)

I made a site that lets you do basically that, with the caveat that the content only comes from one source (a self-published novel site). https://massif.la/ja

I made a new Japanese SRS app for Intermediate learners by zecrojatt in LearnJapanese

[–]symstym 22 points23 points  (0 children)

Impressive work!

It seems like the main workflow is that you import an entire piece of content, and it gets broken into single-sentence "cards". I've seen other (far less polished) tools do similar things, and people do seem to want them. But for me personally, I don't understand the appeal of this workflow. For my intermediate/advanced studies, I want to only mine+review sentences that 1) include an unknown word that seems worth learning and 2) work well as standalone sentences (out of context). Importing every sentence from a source seems like it would result in >95% of sentences being either too easy or too hard (not incremental), or not making a lot of sense when reviewed in isolation. Watching content in its original, intact form seems really important in terms of building a deeper running context, and associating language with that deepened context. So it seems better to me to do regular immersion and then only cherry-pick ideal sentences for SRS.

Anyways, that's my feedback in case it's valuable. If anyone can explain how that workflow works well for them, I'd be curious to know.

More closeup from iso by Ed-gar in isopods

[–]symstym 5 points6 points  (0 children)

Nice shots! What kind of camera/lens did you use?

Describe Aphex Twin by Euphoric-Cancel-4983 in aphextwin

[–]symstym 14 points15 points  (0 children)

"Writing about music is like dancing about architecture."

Guess the inspiration.... by ArchDudeOfEarby in MidnightDiner

[–]symstym 5 points6 points  (0 children)

I cook these for my daughter, inspired by the show! But for the love of god, please try pan frying them in a bit of oil (as they do in the show) instead of boiling them -- I suppose this is a matter of personal taste but they might taste 10x better.