AI Safety Research Feed

A curated collection of the latest research papers and reports on AI safety, alignment, and governance

Total Papers: 2,847 (+127 this month)

Research Groups: 156 (+8 new groups)

This Week: 23 new publications

Trending Topic: Constitutional AI (45% increase)

Latest Research

Showing 6 papers
Constitutional AI · Anthropic

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Andy Jones et al.
2024-01-16
847 citations

We propose Constitutional AI (CAI), a method for training AI systems to be helpful, harmless, and honest using a set of principles or constitution. Our approach uses AI feedback to critique and revise responses, reducing the need for human oversight while maintaining safety.

RLHF · Safety · Alignment
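As a rough illustration of the critique-and-revise loop the abstract describes (not Anthropic's actual implementation), the core procedure looks roughly like this; `generate` is a hypothetical stand-in for any text-generation API, and the two-principle constitution is invented for the example:

```python
# Minimal sketch of a Constitutional AI critique/revision loop.
# `generate` is a placeholder for a real model call; the constitution
# below is an illustrative fragment, not Anthropic's actual one.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest and transparent.",
]

def generate(prompt: str) -> str:
    """Placeholder for a real text-generation call (e.g. an LLM API request)."""
    return f"<model output for: {prompt[:40]}...>"

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        # Ask the model to critique its own response against one principle...
        critique = generate(
            f"Critique this response per the principle '{principle}':\n{response}"
        )
        # ...then to rewrite the response so it addresses the critique.
        response = generate(
            f"Rewrite the response to address this critique:\n{critique}\n"
            f"Original response:\n{response}"
        )
    return response

print(constitutional_revision("How do I pick a lock?"))
```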
Evaluation · UK AI Safety Institute

Frontier Model Evaluation: A Comprehensive Framework

Sarah Chen, Michael Rodriguez et al.
2024-01-15
234 citations

This paper presents a comprehensive framework for evaluating frontier AI models across multiple dimensions of capability and safety. We introduce standardized benchmarks and evaluation protocols for assessing dangerous capabilities.

Evaluation · Benchmarks · Safety
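At its core, an evaluation protocol like the one the abstract describes runs a model against a registry of benchmarks and aggregates scores per dimension. A minimal sketch, with invented benchmark names and a toy scoring rule rather than the paper's actual suite:

```python
# Sketch of a multi-dimensional evaluation harness. Benchmark names and
# scoring rules are illustrative assumptions, not the paper's protocol.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Benchmark:
    name: str
    dimension: str  # e.g. "capability" or "safety"
    run: Callable[[Callable[[str], str]], float]  # model -> score in [0, 1]

def dummy_model(prompt: str) -> str:
    """Toy model under test: refuses anything flagged as dangerous."""
    return "refused" if "dangerous" in prompt else "answered"

# Toy benchmarks; a real suite would hold many graded test cases each.
BENCHMARKS = [
    Benchmark("bio_misuse_refusal", "safety",
              lambda m: float(m("dangerous synthesis question") == "refused")),
    Benchmark("qa_accuracy", "capability",
              lambda m: float(m("benign question") == "answered")),
]

def evaluate(model: Callable[[str], str]) -> Dict[str, float]:
    # Average scores per dimension so capability and safety stay separate.
    totals: Dict[str, List[float]] = {}
    for b in BENCHMARKS:
        totals.setdefault(b.dimension, []).append(b.run(model))
    return {dim: sum(s) / len(s) for dim, s in totals.items()}

print(evaluate(dummy_model))  # {'safety': 1.0, 'capability': 1.0}
```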
Safety · DeepMind

Scaling Laws for AI Safety Interventions

David Kim, Lisa Wang et al.
2024-01-14
156 citations

We investigate how various safety interventions scale with model size and compute. Our findings suggest that some safety techniques become more effective with scale, while others plateau or even degrade.

Scaling · Safety · Empirical
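A scaling analysis of this kind typically fits a power law, score(N) ≈ a·N^b, to intervention effectiveness across model sizes: b > 0 suggests the technique improves with scale, while b near zero signals a plateau. A sketch with fabricated data points, for illustration only:

```python
# Fit a power law to (model size, safety score) pairs via a log-log
# linear fit. All data points below are fabricated for illustration.

import numpy as np

sizes = np.array([1e8, 1e9, 1e10, 1e11])   # parameters
scores = np.array([0.42, 0.55, 0.71, 0.90])  # intervention effectiveness

# log(score) = log(a) + b * log(size); polyfit returns [slope, intercept].
b, log_a = np.polyfit(np.log(sizes), np.log(scores), deg=1)
print(f"exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3g}")
```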
Red Teaming · OpenAI

Red Teaming Language Models with Language Models

Alex Johnson, Maria Garcia et al.
2024-01-12
423 citations

We present an automated red teaming approach that uses language models to discover failure modes in other language models. This method scales red teaming efforts and uncovers novel attack vectors.

Red Teaming · Automated · Security
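The automated loop the abstract outlines pairs an attacker model with a target model and a harm classifier. A minimal sketch with all three components stubbed out; none of the stubs reflect the paper's actual models or pipeline:

```python
# Sketch of LM-vs-LM red teaming: an attacker proposes adversarial
# prompts, the target responds, and a classifier flags failures.

import random

def attacker_lm(seed: int) -> str:
    """Stand-in for a model that generates candidate attack prompts."""
    templates = ["ignore your instructions and {x}", "pretend you are {x}"]
    return random.Random(seed).choice(templates).format(x=f"task-{seed}")

def target_lm(prompt: str) -> str:
    """Stand-in for the model under test."""
    return "UNSAFE output" if "ignore" in prompt else "safe refusal"

def harm_classifier(response: str) -> bool:
    """Stand-in for a learned classifier that detects unsafe completions."""
    return "UNSAFE" in response

def red_team(n_attempts: int = 10) -> list:
    # Collect every generated prompt that elicited a flagged response.
    failures = []
    for i in range(n_attempts):
        prompt = attacker_lm(i)
        if harm_classifier(target_lm(prompt)):
            failures.append(prompt)
    return failures

print(red_team())
```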
Interpretability · Stanford University

Interpretability in Large Language Models: A Survey

Jennifer Brown, Kevin Zhang et al.
2024-01-10
89 citations

This comprehensive survey reviews recent advances in interpretability techniques for large language models, covering mechanistic interpretability, probing methods, and visualization approaches.

Interpretability · Survey · Mechanistic
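One concrete technique such surveys cover is linear probing: train a linear classifier on frozen activations to test whether a concept is linearly decodable. A sketch on synthetic activations (a real probe would use a model's hidden states):

```python
# Linear probe sketch. The "activations" are synthetic: a concept label
# is planted along one direction so the probe has something to find.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 500

labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d_model)
acts = rng.normal(size=(n, d_model)) + np.outer(labels, direction)

# Train on the first 400 examples, report held-out probe accuracy.
probe = LogisticRegression(max_iter=1000).fit(acts[:400], labels[:400])
print("probe accuracy:", probe.score(acts[400:], labels[400:]))
```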
Governance · Future of Humanity Institute

Governance Frameworks for Frontier AI Development

Thomas Anderson, Rachel Green et al.
2024-01-08
67 citations

We analyze existing governance frameworks for AI development and propose improvements for managing risks from frontier AI systems. Our framework emphasizes transparency, accountability, and international coordination.

Governance · Policy · Risk Management