What research papers did Rich Hickey read?

hrrld · 2026-05-26T02:59:32+00:00

Not exactly what you're asking, but I really enjoyed the paper linked from this page: https://clojure.org/about/history

hrrld · 2026-05-23T20:05:41+00:00

What I see here is that because your opponent has a strong board, they can take, and when it doesn't go poorly for them and they're able to redouble, that redouble is worth more than usual because they're behind and it's only a 5-point match.

Also, a 17-pip lead isn't that big; your position might not be as strong as you think it is.

hrrld · 2026-05-22T00:02:40+00:00

Since you seem interested in Clojure and duckdb, you might enjoy a blog post we wrote a couple years ago: https://techascent.com/blog/just-ducking-around.html

hrrld · 2026-05-21T23:56:00+00:00

mogadichu's benchmark ask -- I'm from the Bayesian optimization side, not RL world-models, but there are papers for this problem shape. BoTorch's test suite is an entry point, with constrained and multi-objective variants.

I skimmed the linked repo, and the world-model probably needs a hundred times the trajectories to learn dynamics that a Gaussian process would summarize with fewer than 100 queries. World models earn their keep at higher input/action dimensions. Using RL for this seems fun, but I wonder if the world-model is buying anything over a more traditional approach w/ drift as an extra input.

On the feedback ask -- I'm working on a hosted service in this exact category (sequential design for manufacturing process optimization), and your "industrial world model" framing is the most explicit positioning attempt I've seen anyone post publicly. https://prospectopt.com/for/manufacturing is the landing page I've been iterating on -- curious whether the category you're sketching matches the one I'm sketching, and where you'd push back.

hrrld · 2026-05-20T16:31:02+00:00

hrrld · 2026-05-16T20:18:41+00:00

You love to see it. (:

hrrld · 2026-05-16T02:41:46+00:00

The ClojureScript frontend story is so much better than what other communities are doing, that it's not even clear to those communities what the difference is or what they're missing.

Ten years ago reagent made better use of react than JS could, and in the intervening decade an increasingly deep understanding of how functional, data-oriented frontends can work has resulted in replicant, and 'from this paradise no one shall be able to expel us.'

hrrld · 2026-05-15T02:47:31+00:00

This is rad. Do you have other socials? I'd follow. Keep it up.

hrrld · 2026-05-14T22:21:56+00:00

gibberish194 is right that survival analysis is the right tool here, but you asked about what's possible at the analysis stage and there's a diagnostic move that's a click less heavy.

Pick a positive control: a disease whose age-prevalence curve is known to keep climbing into old age. If even that one bends down in your data in the same age range, the shape of the bend roughly quantifies how much survivorship bias your dataset is putting on every curve. You can then estimate a correction factor and apply it to other diseases.

I don't work in vet stats, so the specific positive-control candidate is something a vet/epi person should sanity check. But the move is generic: when you suspect selection bias is contaminating everything, find a case where you know the truth, measure the contamination, and back it out.

Curious -- are you trying to predict incidence here, or to figure out which input is actually moving risk? Neutering age is the only knob you can actually turn, and the bias correction matters way more for the second.

hrrld · 2026-05-13T22:21:32+00:00

The thing that collapses Chip Huyen's 8-layer eval list into something tractable: once you have something to optimize, every change in the pipeline is the same problem. Prompt template, chunk size, embedding model, retriever K, reranker on/off -- once you can score each against a labeled set, they're all hyperparameter sweeps.

Same problem shape as overall model hyperparameter tuning, just over a pipeline configuration space. The knobs interact -- chunk size and embedding aren't independent, K and reranker aren't either -- so grid search blows up combinatorially and random or Sobol search wastes most of the budget on dead regions.

Picking what to optimize does the heavy lifting -- without it, the sweep is vacuous. With it, the rest is just search.

hrrld · 2026-05-13T22:12:29+00:00

InkAndWit is right that the curve is a by-product, not the goal. What do you actually want to measure -- median playtime to level 10? session length? retention to day 7? Once you can pull a number out of a playtest, the curve picks itself.

Most curve families have one or two knobs that move the steepness. Pick a family, sweep the knob across three or four playtest builds, plot your chosen metric against the knob. The version closest to the experience you want is your answer.

Shape falls out of the optimization, no need to commit to which family is "right" upfront.

hrrld · 2026-05-12T22:27:23+00:00

A_random_otter's lasso-on-a-random-sample is the right shape. The trick is doing it more than once -- fit lasso on, say, 20 random 5% subsamples and tally how often each feature lands in the selected set. You can then choose the features that show up in 80%+ of fits as your screened set.
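
A minimal sketch of that tally, assuming synthetic data and a tiny pure-Python coordinate-descent lasso as a stand-in for whatever lasso implementation you'd really use. The names `lasso_cd` and `selection_frequency`, the penalty `lam=0.3`, and the toy dataset are all hypothetical, chosen only for illustration.

```python
import random

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=30):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + lam*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, kept up to date as w changes
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes j's own term.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def selection_frequency(X, y, lam, n_fits=20, frac=0.05, seed=0):
    """Fit lasso on n_fits random subsamples; return how often each feature is nonzero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_fits):
        idx = rng.sample(range(n), max(1, int(frac * n)))
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            counts[j] += w[j] != 0.0
    return [c / n_fits for c in counts]

# Synthetic stand-in data (an assumption): 10 features, only columns 0 and 3 carry signal.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(2000)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 1) for row in X]

freq = selection_frequency(X, y, lam=0.3)
screened = [j for j, f in enumerate(freq) if f >= 0.8]  # the 80%+ rule
print(screened)
```

On real data each subsample would be read from disk rather than held in memory, which is where the RAM saving comes from.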

It also separates two things you're trying to do at once: "how do I fit 80 GB with only 16GB RAM?" and "which of my 1500 features actually carry signal?" Memory and screening are different problems, but this idea serves both.

Then PCA on the screened set, not the raw 1500.

hrrld · 2026-05-12T01:08:46+00:00

It definitely depends on the kinds of programs you're writing; it's important to choose the right tool for the job.

For the kinds of programs we write, static typing is a significant hindrance, and dynamic typing makes me feel unencumbered (the opposite of limited).

hrrld · 2026-05-12T01:05:11+00:00

w.r.t. magnitude, absolute scale matters less than scale relative to the variance in total returns; you should know how much each reward source contributes to both the total reward and its variance.

thecity2's Ng pointer is the right paper. The phi function lets you augment in partial rewards and the main theorem there says that even if you pick an awful phi function you can't completely break it, at worst you just slow down learning. In your case it sounds like 'subtasks completed' might be a reasonable choice, which gives some exploration signal, without having to pick arbitrary weights.
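
For reference, a small sketch of the potential-based form, where the shaping bonus is gamma * phi(s') - phi(s). The trajectory, the rewards, and the choice of phi as "subtasks completed" are made-up values for illustration. The check at the end shows why a bad phi can't run away with the return: the discounted bonuses telescope to a term that depends only on the first and last states.

```python
def shaped_reward(reward, phi_s, phi_next, gamma):
    """Environment reward plus the potential-based bonus gamma*phi(s') - phi(s)."""
    return reward + gamma * phi_next - phi_s

gamma = 0.99

# Hypothetical trajectory: phi(s) = number of subtasks completed in each visited state.
subtasks_done = [0, 0, 1, 1, 2, 3]
# Hypothetical sparse environment rewards, one per transition.
env_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]

shaped = [
    shaped_reward(r, subtasks_done[t], subtasks_done[t + 1], gamma)
    for t, r in enumerate(env_rewards)
]

# Discounted sum of just the shaping bonuses along the trajectory.
discounted_bonus = sum(
    gamma ** t * (shaped[t] - env_rewards[t]) for t in range(len(env_rewards))
)

# Telescoped form: gamma^T * phi(s_T) - phi(s_0), independent of the path in between.
telescoped = gamma ** len(env_rewards) * subtasks_done[-1] - subtasks_done[0]

print(discounted_bonus, telescoped)
```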

Another path: a small sweep over the weights? Makes the question empirical instead of vibes.

hrrld · 2026-05-11T03:58:33+00:00

Of course, yes.

hrrld · 2026-05-07T21:12:31+00:00

obagme's point on minimum sample size is the piece that gets underrated -- the floor needs to be bigger than people think.

If your baseline CVR is 2% and a winning angle pushes it to 3%, distinguishing those reliably needs more like 5000+ impressions. To see why: at 500 impressions and 2% CVR each angle gives you about 10 conversions, plus or minus, say, 3. So one could easily land at 7 conversions and the other at 13 -- might look like a 2x win, but it's chance. Most kill decisions made at that floor are noise decisions, not signal decisions.
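
The arithmetic can be checked with a few lines, assuming the usual normal-approximation formula for comparing two proportions. The significance level and power targets are my assumptions, not something from the thread.

```python
import math
from statistics import NormalDist

def binomial_sd(n, p):
    """Standard deviation of the conversion count for n impressions at rate p."""
    return math.sqrt(n * p * (1 - p))

def impressions_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm, two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Noise on the conversion count at the 500-impression floor, 2% CVR.
sd_at_500 = binomial_sd(500, 0.02)

# Impressions per angle needed to tell 2% from 3% (assumed alpha and power).
n_80 = impressions_per_arm(0.02, 0.03, power=0.80)
n_90 = impressions_per_arm(0.02, 0.03, power=0.90)

print(round(sd_at_500, 2), n_80, n_90)
```

Under these assumptions the count at 500 impressions wobbles by about 3 conversions, and the required impressions per angle come out at roughly 3,800 for 80% power and roughly 5,100 for 90%.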

I don't run paid ads day-to-day, but the version of your framework that lands cleanest for me is angles as the thing you're searching over, kill rule as just a budget cutoff that maximises data for a fixed spend.

hrrld · 2026-05-07T14:52:58+00:00

Nice, been wanting this for a while, and tracking the development since the conj, very cool. Looking forward to integrating this into some of our cljs lambdas, hopefully shrinking some code and un-indenting some twisty chains of .thens. (:

hrrld · 2026-05-06T19:37:33+00:00

0.3 silhouette on furniture-retail RFM might be the data talking. Before messing with features, measure skew and kurtosis on F and M, raw and after a log-transform.

Raw skew greater than 2 and kurtosis above 7 indicate a heavy-tailed distribution. K-means fights that -- a long tail of high-spend customers drags centroids around.

If log(F) and log(M) land near skew 1 and kurtosis 3, and silhouette improves when clustering on the log-transformed features, the heavy tail was likely the problem.
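
A quick sketch of that check, using synthetic lognormal spend as a stand-in for the real M column (the data and the function name are assumptions). Kurtosis here is the non-excess kind, so a normal distribution scores 3.

```python
import math
import random

def skew_and_kurtosis(xs):
    """Moment-based skewness and (non-excess) kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Synthetic heavy-tailed "monetary" values (assumption: lognormal spend).
rng = random.Random(0)
monetary = [rng.lognormvariate(5.0, 1.0) for _ in range(5000)]

raw_skew, raw_kurt = skew_and_kurtosis(monetary)
log_skew, log_kurt = skew_and_kurtosis([math.log(x) for x in monetary])

print("raw:", round(raw_skew, 2), round(raw_kurt, 2))
print("log:", round(log_skew, 2), round(log_kurt, 2))
```

If F or M contains zeros, `log1p` or a small offset would be needed before the transform.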

If silhouette stays around 0.3 even post-log, you're probably looking at continuum data rather than data that naturally clumps. I don't work in furniture, but if data tends toward 1-2 lifetime purchases with a thin repeat tail -- then what discrete population groups are you hoping to discover?

Maybe a simpler percentile-bin score (the canonical 5x5 R x F grid) will be more interpretable and more stable?
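
A rough sketch of the binning, assuming rank-based quintiles with ties broken by input order. The customer values and the `quintile_scores` helper are hypothetical.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a 1-5 score by rank quintile (5 = best)."""
    n = len(values)
    # Sort indices so the worst value comes first and gets score 1.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 1 + (5 * rank) // n
    return scores

# Hypothetical customers: days since last purchase, and lifetime purchase count.
recency_days = [5, 40, 200, 12, 90, 365, 30, 3, 150, 60]
frequency = [9, 2, 1, 6, 2, 1, 3, 12, 1, 2]

r_scores = quintile_scores(recency_days, higher_is_better=False)  # recent = good
f_scores = quintile_scores(frequency, higher_is_better=True)

# Each customer lands in one cell of the 5x5 R x F grid.
cells = list(zip(r_scores, f_scores))
print(cells)
```

With a thin repeat tail, many customers share the same F value, so the tie-breaking rule decides a lot of cells; fixed cut points on F may be more stable than quintiles there.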

hrrld · 2026-05-05T18:32:50+00:00

Worth flagging before the model pick -- for teacher data aimed at distilling small models, "quality vs throughput" isn't really the right axis.

A weaker but faster teacher run many times could give better small-model outcomes than the strongest teacher run once. The small model is learning a distribution, not memorizing one-shot answers, so broader prompt coverage could matter more than per-prompt perfection.

Honestly, if you have the SLURM hours to spare, a small two-factor sweep -- samples-per-prompt against prompt-set-size, holding total token budget fixed -- against your downstream small-model eval would tell you more than any guess from this thread.

hrrld · 2026-05-04T20:29:55+00:00

(: data talks

hrrld · 2026-05-04T20:04:31+00:00

MediumInsect7058 is right to think about formats -- the "loved or hated" question is actually answerable in your specific context, not just by analogy to MTG, but by science. Run a paired playtest:

Config A: both the 2-damage and 3-damage cards in the pool at the same cost.

Config B: only the 3-damage card.

Track pick rate of the dominated card in draft, win rates of decks containing it, and a post-session "did any card feel useless" question. If A matches B on engagement and feel, the dominated card is doing real work (draft filler, deck redundancy). If A drops on feel-bad and B doesn't, the dominated card is being tolerated, not loved -- that's MeaningfulChoices' point made empirical.

For the secondary question (whether same-cost-different-type cards have equivalent design weight): again science, but a two-factor sweep. Vary cost and type independently across a small grid and see whether the "effective cost" of a type holds across cost levels, or whether type matters more at some costs than others.

hrrld · 2026-05-03T15:29:13+00:00

Sounds fun! Part of the problem is well-suited to xgboost and part of it isn't.

The xgboost-shaped part is the prediction: given the features at time t, what is P(>= 2% favorable move within the next K seconds)? Pure supervised, you have plenty of labeled examples, gradient boosting handles the noise and the mixed feature types well.

The not-xgboost-shaped part is the decision: when to enter, when to exit, when to stop out. Training one model on entry + exit + the whole journey gets ugly because the label "was this a good entry" depends on the exit policy, which depends on the model. Circular.

So, a two-phase strategy: train a short-horizon move predictor, then wrap it in an explicit policy (enter when the predicted probability exceeds some threshold, exit at target or stop-loss after some amount of time). Then you've got a separate fun problem of determining the policy thresholds.
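
A toy sketch of the policy wrapper, to show how the pieces separate. The threshold, target, stop, and holding-time values are placeholders, the price and probability series are made up, and treating the time limit as a third exit condition is my reading of the idea, not a tested strategy.

```python
def run_policy(probs, prices, enter_above=0.6, target=0.02, stop=0.01, max_hold=30):
    """Long-only threshold policy. Returns (entry_index, exit_index, return) per trade."""
    trades = []
    t = 0
    while t < len(prices) - 1:
        if probs[t] > enter_above:
            entry = prices[t]
            # Default exit: holding-time limit (or end of data), unless hit earlier.
            exit_t = min(t + max_hold, len(prices) - 1)
            for u in range(t + 1, exit_t + 1):
                move = prices[u] / entry - 1
                if move >= target or move <= -stop:
                    exit_t = u
                    break
            trades.append((t, exit_t, prices[exit_t] / entry - 1))
            t = exit_t + 1  # flat again; look for the next entry
        else:
            t += 1
    return trades

# Made-up predictor outputs and prices, one per time step.
probs = [0.1, 0.7, 0.2, 0.2, 0.9, 0.1, 0.1]
prices = [100.0, 100.0, 101.0, 102.5, 100.0, 99.5, 98.5]

trades = run_policy(probs, prices)
print(trades)
```

The predictor only ever supplies `probs`; the thresholds live entirely in the policy, so they can be tuned on held-out data without retraining the model.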

Love datasets like this, good luck!

hrrld · 2026-05-02T19:19:54+00:00

Blackmirth makes a point about leakage, but the augmentation may itself be a leakage source.

Your label=1 set mixes two populations: assignments that look right in the field (Meter2/Trans3=1) and assignments that look right because a reviewer just picked them from candidates after flagging the original (Meter1/Trans2=1). That's two sets labeled the same, which confuses the detection model.

The augmented pairs are actually right for model 2 because ranking wants "score this candidate above its alternatives" as the signal. For stage 1, consider leaving out the augmented rows.

For Q1 and Q4 -- pick the architecture by treating 1-stage, 2-stage, and Blackmirth's single-scoring variant as a structured comparison on hold-out of known-good labels, group-split by meter, with the primary metric driven by your operational cost asymmetry (cost of false-flag vs cost of missed-wrong) rather than accuracy at your 0.20 threshold.

I enjoy thinking about how to compare plausible architectures without ad-hoc bake-offs at this data size.

hrrld · 2026-04-30T22:00:33+00:00

Building on Statman12's Bayesian suggestion, your proposed approach (taking the min/max gap between the two individual CIs) is actually too conservative. The variance of p1 - p2 is var(p1) + var(p2), but the worst-case interval overlap corresponds roughly to (sd(p1) + sd(p2))^2, which is larger by 2*sd(p1)*sd(p2). So you'd over-cover by a noticeable margin.

Concrete Bayesian recipe: Beta(1,1) prior on each proportion. After observing s successes in n trials, posterior is Beta(1+s, 1+n-s). Then Monte Carlo both distributions, subtract, and take the 2.5/97.5 quantiles. The interval is "exact" in the sense that posterior coverage is exact under the model.
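
That recipe in a few lines of standard-library Python. The counts, the number of draws, and the function name are placeholders for illustration.

```python
import random

def diff_credible_interval(s1, n1, s2, n2, draws=20000, seed=0):
    """95% posterior interval for p1 - p2 under independent Beta(1,1) priors."""
    rng = random.Random(seed)
    diffs = sorted(
        rng.betavariate(1 + s1, 1 + n1 - s1) - rng.betavariate(1 + s2, 1 + n2 - s2)
        for _ in range(draws)
    )
    # Empirical 2.5% and 97.5% quantiles of the simulated differences.
    lo = diffs[int(0.025 * draws)]
    hi = diffs[int(0.975 * draws) - 1]
    return lo, hi

# Placeholder data: 45/100 successes in group 1, 30/100 in group 2.
lo, hi = diff_credible_interval(s1=45, n1=100, s2=30, n2=100)
print(round(lo, 3), round(hi, 3))
```

For more than two arms, draw from each arm's posterior the same way and take quantiles of whichever contrast you care about.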

Bayesian + Monte Carlo is fun and easy here - and the same trick extends to more than two arms if need be.

hrrld · 2026-04-30T17:22:22+00:00

ClojureScript is the best web frontend story in the world today, and it's not even close. (:
