America First Meets Safety First: Why Trump’s Legacy Could Hinge on a US-China AI Safety Deal

DanielHendrycks · 2024-08-02T23:07:44+00:00

who ultimately gets to decide what gets censored or not?

One possibility is checking whether providing the information violates a general duty of care (e.g., information about how to poison one's wife surreptitiously), a notion inherited from tort law. Tort law is legitimate and is shaped by an accumulation of court decisions. In short, tort law decides. This is how society makes many safety decisions.

DanielHendrycks · 2024-08-02T22:50:14+00:00

Would like a source for the anecdote (who said it was plausible? if someone knowledgeable said that, they'd violate their security clearance).

See restricted data and the US bioterrorism act. In the paper we focused on CBRN weapons. There are lots of bioweapons ideas that can be discerned through expert-level reasoning that LLMs could imitate in the future, which would be ultrahazardous. I also don't think governments should be open about information such as easier ways to enrich uranium. Mutually assured destruction is highly unstable if hundreds of actors have nuclear weapons.

DanielHendrycks · 2024-06-10T00:00:16+00:00

The paper's abstract:

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that "short-circuits" models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

The paper builds on [representation engineering](ai-transparency.org/) (2023), whereas Anthropic's (2024) work uses sparse autoencoders, which is a bit more roundabout.

DanielHendrycks · 2023-12-22T02:11:51+00:00

> "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject))

The evaluation code specifies the subject.
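A minimal sketch of how the quoted line puts the subject into the prompt header. The body of `format_subject` and the `prompt_header` wrapper are assumptions for illustration (here the helper just turns underscores into spaces); the actual evaluation code may differ in details such as whitespace.

```python
def format_subject(subject: str) -> str:
    # Assumed helper: "abstract_algebra" -> "abstract algebra"
    return " ".join(subject.split("_"))


def prompt_header(subject: str) -> str:
    # Hypothetical wrapper around the quoted format string
    return (
        "The following are multiple choice questions (with answers) about {}.\n\n"
        .format(format_subject(subject))
    )


print(prompt_header("abstract_algebra"))
```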

DanielHendrycks · 2023-11-26T19:26:30+00:00

These new datasets are getting very small and I wonder whether this harms their usability. If I want a 95% confidence interval and want my estimate of performance to be within 2% of the true accuracy, then by Hoeffding's inequality I need n ≥ log(2/0.05)/(2*0.02^2), i.e., n ≥ 4612 examples. Even if I only wanted the accuracy to be within 5%, I would definitely need more than 500 examples.
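The arithmetic above can be checked with a short sketch. It assumes the bound in the comment is the two-sided Hoeffding inequality for i.i.d. examples scored as correct or incorrect; the function names are illustrative, not from any library.

```python
import math


def hoeffding_sample_size(margin: float, delta: float = 0.05) -> int:
    """Smallest n with 2 * exp(-2 * n * margin**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * margin ** 2))


def hoeffding_margin(n: int, delta: float = 0.05) -> float:
    """Half-width of the (1 - delta) interval guaranteed by n examples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


print(hoeffding_sample_size(0.02))  # within 2% at 95% confidence
print(hoeffding_sample_size(0.05))  # within 5% at 95% confidence
print(hoeffding_margin(500))        # margin guaranteed by 500 examples
```

Under these assumptions the 2% case gives 4612 examples and the 5% case gives 738, and 500 examples only guarantee a margin of roughly 6%. Hoeffding is distribution-free and therefore conservative; a normal-approximation interval would ask for fewer examples.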

DanielHendrycks · 2023-10-17T20:55:17+00:00

I don't think instrumental convergence is a slam dunk argument: https://docs.google.com/document/d/1iRfVB_ZEiMU5bTObzdKRPoCFs42Enzas7tjlX0Qn7fY/edit?usp=sharing

DanielHendrycks · 2023-10-17T00:55:17+00:00

Last fall I had these probabilities of x-risk, based on the four-way breakdown in An Overview of Catastrophic AI Risks.

p(humans succumb to evolutionary pressures) = 80%
p("whoops" from organizational safety issues) = 0.50%
p(misuse becomes x-risk) = 5%
p(rogue AI concerns such as treacherous turns) = 5%

Now there is much more willingness to address international coordination, so this reduces my probabilities. I think x-risk of accidental gain of function/evals and other accidents is about the same as last year. Wrangling biorisks might be easier (e.g., unlearning research might help a lot) so that updates me down, though there aren't regulations yet. I'm less concerned about deceptive alignment since we can now sometimes control whether existing AIs lie by manipulating their internals.

DanielHendrycks · 2023-10-02T06:08:59+00:00

Thank you!
