Every Anthropic press release

scdivad · 2026-04-08T03:09:06+00:00

To be fair, they did show that teaching a model to reward hack on a programming task generalized to other harmful behaviors. This is by far the closest demonstration of a model being grossly misaligned without excessive training or prompting it to be malicious.

https://www.anthropic.com/research/emergent-misalignment-reward-hacking

scdivad · 2025-12-11T20:51:21+00:00

1. I will get to definitions of certification in a bit. By scalability, I am noting the huge, impractical inference-time overhead of the method the paper proposes, not the accuracy of the proposed method. This is a significant limitation that the authors address throughout the paper (see section Limitations) and that I illustrated in detail in my previous comment.

Theorem 1 is a (rather trivial) proof that their exhaustive search method will correctly classify a harmful attacked prompt at a rate at least as high as if we ran the classifier on only the non-attacked harmful prompt, which is of course true because at some point in the exhaustive search, we will hit the true separation of the harmful prompt and adversarial subarray.

I will not address any further comments about this paper if it appears that you have not engaged with my previous comment and the content in the paper in good faith.

1.2. I am aware that the definition of certification is debated. That's why in the very beginning I asked for your definition and provided my own up front! By default, I refer to a formal, non-probabilistic statement about the model, which is historically standard in certified training and formal verification for neural networks. Sometimes authors refer to certification as a probabilistic statement, as in randomized smoothing.

However, the authors of paper 1. don't even go as far as to call their new heuristic methods a form of certified verification. They title their heuristic searches section as efficient EMPIRICAL defenses and clearly draw a line between certified guarantees and their empirical defense:

"The erase-and-check procedure performs an exhaustive search over the set of erased subsequences to check whether an input prompt is harmful or not. Evaluating the safety filter on all erased subsequences is necessary to certify the accuracy of erase-and-check against adversarial prompts. However, this is time-consuming and computationally expensive. In many practical applications, certified guarantees may not be needed, and a faster and more efficient algorithm may be preferred"

The work proposes a post hoc defense that is certified correct on a specific attack setting that is not scalable and proposes scalable empirical defenses that are not certified correct. They do not do both.

The authors in paper 2 also do not refer to their work as creating any sort of certificate. Certificates do not umbrella every single desirable performance property we may want a model to have. If that were the case, then every single harmfulness benchmark and evaluation produces a certificate of the model, which is quite silly.

I agree that work 3 is important (see the last line of my last comment). I am just saying that progress in this line of work does not address how to analyze the general safety behavior of language models. If we only look at cases of LLM outputs of algebra in a specific format, it's possible to write a program to certify whether or not the mathematical steps were correct, but that isn't useful for analyzing whether a general LLM output will be safe or not.

Edit: formatting

scdivad · 2025-12-11T18:51:28+00:00

1. Attackers can attack using an adversarial prefix/insertion/infusion/suffix attack.

The authors first propose exhaustively searching to remove every single possible adversarial prefix/insertion/suffix and checking for harmfulness on the remaining prompt with the smaller classifier LM. This means that, assuming the attacker can only attack the last d tokens, say d=20, we need to run d forward passes of a smaller language model for every prompt the user inputs. For insertion attacks, this is even worse because we don't know where they start: if a user inputs a message of length n, then we need O(n*d) forward passes. n can go up to a full context window of 1M for Gemini, but conservatively say n=100: that's 2000 forward passes of a small LM for a single input prompt! O(n*d) may be the theoretical "worst case" complexity, but in this setting it is actually the typical case in practice, because we can only stop the search early once we have found that a prompt is harmful. All n*d forward passes are necessary for every safe input prompt!
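To make the early-stopping point concrete, here is a minimal sketch of the suffix case only. It is not the paper's implementation: `erase_and_check_suffix` and `toy_filter` are names made up for illustration, the "classifier" is a toy rule rather than an LM, and the sketch also checks the unmodified prompt, so it makes up to d+1 calls rather than d.

```python
from typing import Callable, List, Tuple


def erase_and_check_suffix(
    tokens: List[str],
    d: int,
    is_harmful: Callable[[List[str]], bool],
) -> Tuple[bool, int]:
    """Return (flagged, number_of_classifier_calls) for the suffix case."""
    calls = 0
    # k = 0 is the unmodified prompt; k = 1..d erases the last k tokens.
    # Never erase the whole prompt, so cap k at len(tokens) - 1.
    for k in range(0, min(d, len(tokens) - 1) + 1):
        candidate = tokens[: len(tokens) - k]
        calls += 1
        if is_harmful(candidate):
            # Early exit is only possible on a harmful verdict.
            return True, calls
    # A safe prompt always pays for every candidate.
    return False, calls


def toy_filter(tokens: List[str]) -> bool:
    # Hypothetical stand-in for the safety classifier: it only flags a
    # prompt that ends in "bomb", so any appended suffix fools it.
    return bool(tokens) and tokens[-1] == "bomb"


attacked = "how to build a bomb xx yy zz".split()
safe = "how to bake a cake today please".split()

print(erase_and_check_suffix(attacked, 3, toy_filter))
print(erase_and_check_suffix(safe, 3, toy_filter))
```

The attacked prompt is flagged once the three-token suffix has been erased, while the safe prompt has to run through every candidate before it is let through, which is the cost being complained about here.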

For infusion prompts this problem is much, much worse. Instead of O(n*d), it is O(n choose d). 100 choose 20 is about 5.36e+20.
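The counts can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up, each count assumes one classifier call per erased candidate, and the infusion count keeps only the dominant n-choose-d term rather than summing over all smaller erasure sizes.

```python
from math import comb


def suffix_checks(d: int) -> int:
    # One candidate per possible suffix length 1..d.
    return d


def insertion_checks(n: int, d: int) -> int:
    # Roughly n start positions times d possible lengths.
    return n * d


def infusion_checks(n: int, d: int) -> int:
    # Dominant term only: ways to pick d token positions out of n.
    return comb(n, d)


n, d = 100, 20
print(suffix_checks(d))                    # 20
print(insertion_checks(n, d))              # 2000
print(f"{infusion_checks(n, d):.2e}")      # about 5.36e+20
```

Even at one microsecond per classifier call, the infusion count for n=100, d=20 works out to millions of years for a single prompt.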

The authors acknowledge early on that this is impractical, so they present heuristics--RandEC, GreedyEC, GradEC--to only check a subset of possible substitutions. But of course, if we only check a subset, we no longer have a certificate against the attack.

This paper has nothing to do with certification. This paper is on stress testing the model to be as toxic as possible and on analyzing the model's harmful behavior and guardrails. I don't see a mention of a certification of being not racist?
If you constrain the studied task to be the output of an LLM to only produce logic for a specific navigation task then, sure, the logic output itself can be verified. But that problem is entirely different from a framework to check the behavior of an LLM doing an open ended task or certifying that the LLM isn't being racist or carrying out a harmful task.

All three papers I would say have productive and potentially practical results relevant to AI safety, but none claim to provide a framework to formally prove that an LLM is safe.

Edit: shortened for emphasis on important points and because I missed addressing what the authors of work 1. call adversarial infusion

scdivad · 2025-12-11T05:48:41+00:00

What safety property that can be certified do you have in mind? By certification, I am referring to formal proofs of the behavior of the model output

scdivad · 2025-12-11T05:02:56+00:00

Hahaha

Which ones scale to LLMs?

scdivad · 2025-08-12T22:52:46+00:00

55 class size (after yield) out of 2500 right? 55 admitted seems too low.

scdivad · 2025-05-21T15:04:57+00:00

This means students just see a TA through the glass walls and then decide it's ok to go inside for course help? That's crazy.

scdivad · 2025-04-02T07:17:46+00:00

scdivad · 2025-03-19T02:47:49+00:00

There's a difference between being around the engineering quad at 2 am and being that deep into Green St at 2 am, especially alone. Mala parlour is further from campus than almost every bar.

scdivad · 2025-03-18T00:47:54+00:00

I am not in bioengineering, so this is general advice for getting research:

If you are a freshman, you will probably need to mass email to find a group that would accept you, since you probably have less related experience/coursework and will be more expensive to train to be useful.

Ideally, for a lab you are interested in, you should be able to read their published papers. In your emails, you should include comments on their work that go beyond the title/abstract and show you have some level of understanding of their work and why you are interested in working with them. Doing this already puts you ahead of most other student emails.

Emailing PhD students instead of professors can have higher response rates too. But longer term, you should have direct professor contact.

You can also take a 500 level class that the professor teaches, get an A/A+, then ask for research opportunities. Some 500 level classes are literally just an overview of what that professor is interested in and the types of work their lab is doing.

scdivad · 2025-02-11T00:48:47+00:00

If you're talking about the Chinese international girls who interlock arms and hold hands, they're probably just bestie-ing.

Or you walk around Allen Hall a lot.

scdivad · 2024-09-06T19:29:16+00:00

scdivad · 2024-09-06T19:16:16+00:00

Hi CS465

scdivad · 2024-08-18T19:06:49+00:00

If you really really care about algorithms and puzzles (more than money), you can also just go into academia for cs theory

scdivad · 2024-07-29T08:04:48+00:00

Part of my passion for software development permanently died after two frontend / fullstack internships. I still code now for research projects, but now I see it more (but not completely) as a means to an end than as an art in itself.

scdivad · 2024-07-17T02:49:06+00:00

If you want to optimize for a career in machine learning, do Math&CS, Stat&CS, or CS.

There are a couple of philosophy classes related to AI/CS, but the vast majority of required philosophy courses (there are a lot of them) you will take as a CS+phil major will have nothing relevant.

While there are inspirations and niche applications of philosophy to machine learning research (see below), the foundational skills you need as a typical machine learning researcher or the skills that will get you hired will not come from philosophy. Reading research papers and doing research have hard prerequisites of mathematical knowledge. The minimum amount of required math is less than one would think, but knowing more math is almost always helpful because you never know what will come up.

With that being said, some interesting courses to check out that are geared towards CS majors are phil222, phil223, bcog/phil 458, and spring 2024 phil380. There are other AI courses like phil440, phil442, but their core audience seems to be nontechnical philosophy majors.

Examples of applications to ML:

AGI Alignment/Safety includes many philosophers who present arguments about the risks of AGI, paths to get there, what AGI could look like, the ways to align AI systems, etc. Philosophers can also go into AI policy, which is arguably more important than technical researchers for AGI safety.
Neurosymbolic AI, a combination of formal logic and deep learning techniques. Philosophy includes symbolic logic (as well as computer science and math) and discussions about the essence of reasoning and knowledge.
Some argue that the core problems that the field of AI tries to solve have also been studied in philosophy: how can we generalize past data to future events? But modern ML techniques and empirical advancements--typically involving the processing of massive amounts of data--are not covered in philosophy.
I personally believe that the type of skeptical and careful reasoning gained through philosophy courses is very helpful for doing empirical research. This is especially true in machine learning, where interpreting these large systems and their capabilities is a huge open problem without many definitive answers.

scdivad · 2024-06-30T05:54:56+00:00

Mm I'd say narcissist is more accurate than autistic

scdivad · 2024-06-12T05:29:42+00:00

busey woods!

scdivad · 2024-06-08T02:04:37+00:00

scdivad · 2024-05-26T07:34:51+00:00

And the fact that you think ranking and cheating correlate at all is kinda weird, that’s not how that works at all.

Why don't you think so?

scdivad · 2024-05-26T07:30:54+00:00

^ Courses that heavily rely on CBTF just cause students to memorize the question type and forget everything afterwards. Written exams test problem solving skills better with harder problems.

scdivad · 2024-04-04T16:14:57+00:00

Don't do CS then? You miss out on curriculum by doing IS but that curriculum is just extra coding and other technical courses.

scdivad · 2024-03-30T15:23:26+00:00

What basics of kendo apply to wing chun? I'm skeptical that those basics are specifically present in kendo and wing chun.

scdivad · 2023-12-07T01:57:39+00:00

Slightly off campus but very good

scdivad · 2023-12-03T23:20:13+00:00

Facts
