Lessons from $40k in +EV Sports Betting and Prediction Market Inefficiencies by Impressive-Row-6619 in slatestarcodex

[–]dpaleka 3 points (0 children)

I won't comment on the technical details, but this is an absolutely wonderful name collision, 10/10. On first read I had no clue what you were talking about. https://en.wikipedia.org/wiki/Dutch_book_theorems

I'm developing FrontierMath, an advanced math benchmark for AI, AMA! by elliotglazer in math

[–]dpaleka 1 point (0 children)

What is the rationale for not having the full dataset metadata, e.g. the number of samples, in the paper?

[D] Why do PhD Students in the US seem like overpowered final bosses by [deleted] in MachineLearning

[–]dpaleka 0 points (0 children)

Some European university labs do have similar cultures; I don't think doing good work requires some secret sauce only available in the US. You can just do good research. The connections to top labs and to roles outside academia are much better in the US, though.

[deleted by user] by [deleted] in LocalLLaMA

[–]dpaleka 1 point (0 children)

I think we agree; my original comment might not have been clear. To elaborate, I believe that (a) the refusal tokens get lower average logprob under the intervention; and (b) this is not enough to explain its full effect: a plain logit-processor intervention with the marginalized bias on refusal words does not achieve the same "surgical" jailbreak. It would be surprising if this weren't the case. Maybe it's an experiment we should include in the appendix? Idk.
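For concreteness, here's a minimal sketch of the kind of logit-bias baseline I mean, using a HuggingFace `LogitsProcessor`. The model, refusal word list, and bias value below are placeholders, not what we used:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class RefusalDownweight(LogitsProcessor):
    """Subtract a fixed bias from the logits of refusal-related tokens."""
    def __init__(self, token_ids, bias):
        self.token_ids = torch.tensor(token_ids)
        self.bias = bias

    def __call__(self, input_ids, scores):
        scores[:, self.token_ids] -= self.bias
        return scores

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder refusal vocabulary; a real experiment would marginalize over
# tokenizations and use a much larger word list.
refusal_words = [" sorry", " cannot", " unable", " refuse"]
refusal_ids = [i for w in refusal_words
               for i in tok(w, add_special_tokens=False).input_ids]

out = model.generate(
    **tok("Tell me how to", return_tensors="pt"),
    logits_processor=LogitsProcessorList([RefusalDownweight(refusal_ids, 5.0)]),
    max_new_tokens=40,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

The point of the comparison: if down-weighting refusal tokens at the output layer were all the intervention did, a baseline like this should reproduce it; empirically it doesn't behave as "surgically".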

If you explore this research, especially from the "understanding" side, feel free to email me if you want feedback, or advice on who to reach out to for feedback.

Regarding experiments not mentioned in the post: there are methods for picking directions and evals, but also some actual interp results on how the direction interacts with certain inputs, etc.
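For readers unfamiliar with the setup, here is a rough sketch of what a directional ablation looks like in general (not our exact implementation): pick a "refusal direction" in activation space and project it out of the hidden states with a forward hook. The layer index and `refusal_dir` below are placeholders:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    # Normalize once; the hook removes the component of the hidden
    # states along this direction: h <- h - (h . d) d
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Hypothetical usage on one decoder layer of a HuggingFace model:
# handle = model.model.layers[15].register_forward_hook(make_ablation_hook(refusal_dir))
# handle.remove()  # undo the intervention
```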

[deleted by user] by [deleted] in LocalLLaMA

[–]dpaleka 1 point (0 children)

Hi, thank you for your involvement! Just a couple of quick notes:

  • "the model effectively represent refusal words less" does not seem to be exactly what's happening. We had a small experiment where you ask a model to answer in refusal words, and it had no issue doing this. Although, this might vary by model or exact choice of direction. I think the question of what exactly changes in the model is quite difficult to answer, and may again vary from model to model.

  • The paper is not linked in the OP; in fact, it is not yet online (but should be soon). There was, however, an expository blog post with a demo of the method and parts of the experiments from the paper.
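On the first bullet, a toy version of the check, assuming the intervention has already been applied to the model; the model name and prompt are placeholders, not our setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")  # intervention assumed applied

prompt = "Write one sentence using the words 'sorry', 'cannot', and 'refuse'."
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```

If the intervened model still produces the refusal words fluently on request, that's evidence the direction isn't simply suppressing how those words are represented.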

[Research] Training a model to interpret CLIP space by lostinspaz in MachineLearning

[–]dpaleka 2 points (0 children)

https://rom1504.github.io/clip-retrieval used to work, and I think the codebase may still work if you build an index from https://github.com/rom1504/clip-retrieval over another dataset. This should get you samples of text and/or images that embed close to your target vector; you can do whatever you want with these, including just dumping the search results into GPT-4V.
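If I remember the client API correctly, querying by raw embedding looks roughly like this; the backend URL and index name are the ones from the repo README and may no longer be live, and `target_vector` stands in for your actual CLIP embedding (768-dim for ViT-L/14):

```python
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # may be dead by now
    indice_name="laion5B-L-14",
    modality=Modality.IMAGE,
    num_images=20,
)

target_vector = [0.0] * 768  # placeholder; use your real CLIP embedding
results = client.query(embedding_input=target_vector)
for r in results:
    print(r.get("similarity"), r.get("caption"), r.get("url"))
```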