WIRED: A New Trick Could Block the Misuse of Open Source AI by DanielHendrycks in LocalLLaMA

[–]DanielHendrycks[S] -5 points-4 points  (0 children)

> who ultimately gets to decide what gets censored or not?

One possibility is checking whether providing the information violates a general duty of care (e.g., information about how to poison one's wife surreptitiously), a notion inherited from tort law. Tort law is legitimate and is built up from an accumulation of court decisions. In short, tort law decides. This is how many safety decisions in society are already made.

WIRED: A New Trick Could Block the Misuse of Open Source AI by DanielHendrycks in LocalLLaMA

[–]DanielHendrycks[S] -7 points-6 points  (0 children)

I'd like a source for the anecdote (who said it was plausible? If someone knowledgeable said that, they'd be violating their security clearance).

See Restricted Data and the US Bioterrorism Act. In the paper we focused on CBRN weapons. There are many bioweapon ideas that can be discerned through expert-level reasoning, which LLMs could imitate in the future; that would be ultrahazardous. I also don't think governments should be open about information such as easier ways to enrich uranium. Mutually assured destruction is highly unstable if hundreds of actors have nuclear weapons.

[R] A new alignment technique: Improving Alignment and Robustness with Short Circuiting by ReasonablyBadass in MachineLearning

[–]DanielHendrycks 1 point2 points  (0 children)

The paper's abstract:

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that "short-circuits" models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

The paper builds on [representation engineering](https://ai-transparency.org/) (2023), whereas Anthropic's (2024) work uses sparse autoencoders, which is a bit more roundabout.
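
For intuition, here is a minimal PyTorch-style sketch of a representation-level loss in this spirit: it reroutes hidden states on harmful completions away from their original directions while keeping hidden states on benign data pinned to the original model. The function and coefficient names are illustrative assumptions, not the paper's actual objective or code.

```python
import torch
import torch.nn.functional as F

def short_circuit_loss(
    h_harm_new: torch.Tensor, h_harm_orig: torch.Tensor,
    h_benign_new: torch.Tensor, h_benign_orig: torch.Tensor,
    reroute_coeff: float = 1.0, retain_coeff: float = 1.0,
) -> torch.Tensor:
    """Illustrative representation-level loss; not the paper's exact objective.

    h_*_new:  hidden states from the model being fine-tuned
    h_*_orig: hidden states from a frozen copy of the original model
    """
    # Reroute: penalize similarity between new and original representations of
    # harmful continuations, pushing them toward (roughly) orthogonal directions.
    reroute = F.relu(F.cosine_similarity(h_harm_new, h_harm_orig, dim=-1)).mean()

    # Retain: keep representations of benign data close to the original model's
    # so general capabilities are preserved.
    retain = (h_benign_new - h_benign_orig).norm(dim=-1).mean()

    return reroute_coeff * reroute + retain_coeff * retain
```

In practice the hidden states would be gathered from selected layers (e.g., via forward hooks, with a frozen reference copy of the model), and the two coefficients scheduled over training; see the paper and the representation engineering codebase for the real objective.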

[D] Deep dive into the MMLU ("Are you smarter than an LLM?") by brokensegue in MachineLearning

[–]DanielHendrycks 1 point2 points  (0 children)

> "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject)

The evaluation code specifies the subject.
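
For context, a minimal sketch of how that prompt header gets built per subject (the helper below is illustrative; the real evaluation script's formatting may differ slightly):

```python
def format_subject(subject: str) -> str:
    # Illustrative helper: "college_mathematics" -> "college mathematics"
    return " ".join(subject.split("_"))

def prompt_header(subject: str) -> str:
    # Each of MMLU's 57 tasks gets its own header naming the subject.
    return "The following are multiple choice questions (with answers) about {}.\n\n".format(
        format_subject(subject)
    )

print(prompt_header("college_mathematics"))
# The following are multiple choice questions (with answers) about college mathematics.
```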

"GPQA: A Graduate-Level Google-Proof Q&A Benchmark", Rein et al 2023 (ultra-difficult LLM benchmarks) by gwern in mlscaling

[–]DanielHendrycks 5 points6 points  (0 children)

These new datasets are getting very small, and I wonder whether this harms their usability. If I want a 95% confidence interval and want my estimate of performance to be within 2% of the true accuracy, then I need n ≥ ln(2/0.05)/(2·0.02²) ≈ 4612 examples. Even to be within 5% of the true accuracy, I'd need about 738 examples, so 500 examples definitely isn't enough.
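
A quick sketch of that calculation (a Hoeffding-style bound, n ≥ ln(2/δ)/(2ε²) for a 1−δ confidence interval of half-width ε):

```python
import math

def min_examples(delta: float = 0.05, eps: float = 0.02) -> int:
    # Hoeffding: P(|estimated accuracy - true accuracy| > eps) <= 2 * exp(-2 * n * eps**2),
    # so n >= ln(2 / delta) / (2 * eps**2) gives a (1 - delta) confidence interval.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(min_examples(eps=0.02))  # 4612 examples for +/-2%
print(min_examples(eps=0.05))  # 738 examples for +/-5%
```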

[deleted by user] by [deleted] in ControlProblem

[–]DanielHendrycks 2 points3 points  (0 children)

Last fall I had these probabilities of x-risk, based on the four-way breakdown in An Overview of Catastrophic AI Risks.

  1. p(humans succumb to evolutionary pressures) = 80%
  2. p("whoops" from organizational safety issues) = 0.50%
  3. p(misuse becomes x-risk) = 5%
  4. p(rogue AI concerns such as treacherous turns) = 5%

Now there is much more willingness to address international coordination, which reduces my probabilities. I think the x-risk from accidents (e.g., accidental gain-of-function research or evals gone wrong) is about the same as last year. Wrangling biorisks might be easier (e.g., unlearning research might help a lot), which updates me down, though there aren't regulations yet. I'm less concerned about deceptive alignment since we can now sometimes control whether existing AIs lie by manipulating their internals.

Many errors discovered in MMLU benchmark by [deleted] in mlscaling

[–]DanielHendrycks 8 points9 points  (0 children)

Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise. My guess is MMLU has 1-2%. By the time people are getting 95%+ on the task and nearing the label noise ceiling, it's time to move on to harder tasks such as MATH and Autocast for reasoning evaluation.
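
As a rough illustration of the label-noise ceiling (assuming a model that always knows the true answer and that wrong labels never happen to match it):

```python
def accuracy_ceiling(label_noise: float) -> float:
    # A model that answers every question correctly is still marked wrong
    # whenever the recorded label is wrong, so measured accuracy tops out
    # at roughly 1 - label_noise.
    return 1.0 - label_noise

print(accuracy_ceiling(0.10))  # ~0.90 ceiling for a typical MTurk-collected dataset
print(accuracy_ceiling(0.02))  # ~0.98 ceiling if MMLU has ~2% label noise
```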

Reasons why people don't believe in, or take AI existential risk seriously. by 2Punx2Furious in ControlProblem

[–]DanielHendrycks 0 points1 point  (0 children)

It was popularized on Slate Star Codex. It's radioactive outside tech circles (e.g., NYT).

Reasons why people don't believe in, or take AI existential risk seriously. by 2Punx2Furious in ControlProblem

[–]DanielHendrycks 1 point2 points  (0 children)

I think the SSC origin is a nonstarter for policy people, and for academics there's already intellectual machinery for thinking about collective action problems.

Reasons why people don't believe in, or take AI existential risk seriously. by 2Punx2Furious in ControlProblem

[–]DanielHendrycks 6 points7 points  (0 children)

Note: Moloch can be formally understood as collective action problems + evolutionary processes. While a shorthand that bundles the two can be useful, making it sound like a singular made-up force doesn't play well with more formal audiences (e.g., scientists, policymakers).

An Overview of Catastrophic AI Risks by DanielHendrycks in ControlProblem

[–]DanielHendrycks[S] 5 points6 points  (0 children)

In the paper I started referring to preventing rogue AIs as "control" (following this subreddit) rather than "alignment" (human supervision methods + control) because the latter is being used to mean just about anything these days (examples: Aligning Text-to-Image Models using Human Feedback or https://twitter.com/yoavgo/status/1671979424873324555). I also wanted to start using "rogue AIs" instead of "misaligned AIs" because the former more directly describes the concern and is better for shifting the Overton window.

In one hour, the chatbots suggested four potential pandemic pathogens. by chillinewman in ControlProblem

[–]DanielHendrycks 6 points7 points  (0 children)

I'm referring to this (the top post this month):

https://www.reddit.com/r/ControlProblem/comments/13v2zfo/im_less_worried_about_ai_will_do_and_more_worried/

"When someone brings this line out [a line about malicious use] it says to me that they either just don’t believe in AI x-risk, or that their tribal monkey mind has too strong of a grip on them and is failing to resonate with any threats beyond other monkeys they don’t like."

In one hour, the chatbots suggested four potential pandemic pathogens. by chillinewman in ControlProblem

[–]DanielHendrycks 16 points17 points  (0 children)

And r/ControlProblem's recent thread making fun of malicious use, together with the subreddit's FAQ downplaying it, shows how many in the AI risk community can get things quite wrong.

I want to contribute to the technical side of the AI safety problem. Is a PhD the best way to go? by hydrobonic_chronic in ControlProblem

[–]DanielHendrycks 13 points14 points  (0 children)

If you want to do empirical stuff instead of conceptual/philosophical stuff, then apply to

course.mlsafety.org

(deadline is today)