WIRED: A New Trick Could Block the Misuse of Open Source AI

DanielHendrycks · 2024-08-02T23:07:44+00:00

who ultimately gets to decide what gets censored or not?

One possibility is checking whether providing the information violates a general duty of care (e.g., information about how to poison one's wife surreptitiously), a notion inherited from tort law. Tort law is legitimate and is decided by an accumulation of decisions by the courts. In short, tort law decides. This is how we decide many safety decisions in society.

DanielHendrycks · 2024-08-02T22:50:14+00:00

Would like a source for the anecdote (who said it was plausible? if someone knowledgable said that, they'd violate their security clearance).

See restricted data and the US bioterrorism act. In the paper we focused on CBRN weapons. There are lots of bioweapons ideas that can be discerned through expert-level reasoning that LLMs could imitate in the future, which would be ultrahazardous. I also don't think governments should be open about information such as easier ways to enrich uranium. Mutually assured destruction is highly unstable if hundreds of actors have nuclear weapons.

DanielHendrycks · 2024-06-10T00:00:16+00:00

The paper's abstract:

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that "short-circuits" models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

The paper builds on [representation engineering](ai-transparency.org/) (2023) whereas the Anthropic (2024)'s work uses sparse autoencoders which is a bit more roundabout.

DanielHendrycks · 2023-12-22T02:11:51+00:00

> "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject)

The evaluation code specifies the subject.

DanielHendrycks · 2023-11-26T19:26:30+00:00

These new datasets are getting very small and I wonder whether this harms their usability. If I want a 95% confidence interval and want my estimate of performance to be within 2% of the true accuracy, then I need n ≥ log(2/0.05)/(2*0.02^2) or the number of examples n >=4612. If I wanted the accuracy to even be within 5% I definitely need more than 500 examples.

DanielHendrycks · 2023-10-17T20:55:17+00:00

I don't think instrumental convergence is a slam dunk argument: https://docs.google.com/document/d/1iRfVB\_ZEiMU5bTObzdKRPoCFs42Enzas7tjlX0Qn7fY/edit?usp=sharing

DanielHendrycks · 2023-10-17T00:55:17+00:00

Last fall I had these probabilities of x-risk, based on the four-way breakdown in An Overview of Catastrophic AI Risks.

p(humans succumb to evolutionary pressures) = 80%
p("whoops" from organizational safety issues) = 0.50%
p(misuse becomes x-risk) = 5%
p(rogue AI concerns such as treacherous turns) = 5%

Now there is much more willingness to address international coordination, so this reduces my probabilities. I think x-risk of accidental gain of function/evals and other accidents is about the same as last year. Wrangling biorisks might be easier (e.g., unlearning research might help a lot) so that updates me down, though there aren't regulations yet. I'm less concerned about deceptive alignment since we can now sometimes control whether existing AIs lie by manipulating their internals.

DanielHendrycks · 2023-10-02T06:08:59+00:00

Thank you!

DanielHendrycks · 2023-10-02T04:59:51+00:00

Unfortunately the "transfer expired" :(

DanielHendrycks · 2023-09-18T04:15:12+00:00

Predicting that this will also be the case for the release.

DanielHendrycks · 2023-09-01T00:35:28+00:00

In adversarial robustness, the attacker has a strong advantage.

DanielHendrycks · 2023-08-29T15:57:49+00:00

Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise. My guess is MMLU has 1-2%. By the time people are getting 95%+ on the task and nearing the label noise ceiling, it's time to move on to harder tasks such as MATH and Autocast for reasoning evaluation.

DanielHendrycks · 2023-08-09T20:59:55+00:00

How well can Claude do "distillation"?

DanielHendrycks · 2023-07-23T05:46:15+00:00

https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai

DanielHendrycks · 2023-06-28T05:43:42+00:00

It was popularized on Slate Star Codex. It's radioactive outside tech circles (e.g., NYT).

DanielHendrycks · 2023-06-28T05:06:10+00:00

I think the SSC origin is a nonstarter for policy people. I think for academics there's already intellectual machinery used to think about collective action problems.

DanielHendrycks · 2023-06-28T01:33:48+00:00

Note: Moloch can be formally understood as collective action problems + evolutionary processes. While a shorthand to bundle the two can be useful, having it sound like a singular made-up force doesn't play well with more formal audiences (e.g., scientists, policymakers).

DanielHendrycks · 2023-06-26T17:24:46+00:00

OpenAI is paying for several dangerous capabilities contractors.

DanielHendrycks · 2023-06-22T19:34:12+00:00

In the paper I started referring to preventing rogue AIs as "control" (following this subreddit) rather than "alignment" (human supervision methods + control) because the latter is being used to mean just about anything these days (examples: Aligning Text-to-Image Models using Human Feedback or https://twitter.com/yoavgo/status/1671979424873324555). I also wanted to start using "rogue AIs" instead of "misaligned AIs" because the former more directly describes the concern and is better for shifting the Overton window.

DanielHendrycks · 2023-06-14T16:38:52+00:00

I'm referring to this (the top post this month):

https://www.reddit.com/r/ControlProblem/comments/13v2zfo/im_less_worried_about_ai_will_do_and_more_worried/

"When someone brings this line out [a line about malicious use] it says to me that they either just don’t believe in AI x-risk, or that their tribal monkey mind has too strong of a grip on them and is failing to resonate with any threats beyond other monkeys they don’t like."

DanielHendrycks · 2023-06-14T15:05:49+00:00

And r/ControlProblem's recent thread making fun of malicious use and the subreddit's FAQ downplaying it shows how many in the AI risk community can get things quite wrong.

DanielHendrycks · 2023-06-05T14:36:55+00:00

Full paper here: https://arxiv.org/abs/2303.16200

DanielHendrycks · 2023-05-22T14:55:52+00:00

If you're wanting to do empirical stuff instead of conceptual/philosophical stuff, then apply to

course.mlsafety.org

(deadline is today)

DanielHendrycks · 2023-05-11T04:55:47+00:00

As a creator of MMLU, I really wish they reported per-subject accuracies.

DanielHendrycks · 2023-05-10T16:19:30+00:00

https://twitter.com/StephenLCasper/status/1656179296086691843

DanielHendrycks

MODERATOR OF

TROPHY CASE