AI Safety Research Feed
A curated collection of the latest research, papers, and reports on AI safety, alignment, and governance.
Total Papers: 2,847 (+127 this month)
Research Groups: 156 (+8 new groups)
This Week: 23 new publications
Trending Topic: Constitutional AI (45% increase)
Latest Research
Constitutional AI: Harmlessness from AI Feedback
We propose Constitutional AI (CAI), a method for training AI systems to be helpful, harmless, and honest using a set of principles or constitution. Our approach uses AI feedback to critique and revise responses, reducing the need for human oversight while maintaining safety.
Frontier Model Evaluation: A Comprehensive Framework
This paper presents a framework for evaluating frontier AI models across multiple dimensions of capability and safety. We introduce standardized benchmarks and evaluation protocols for assessing dangerous capabilities.
Scaling Laws for AI Safety Interventions
We investigate how various safety interventions scale with model size and compute. Our findings suggest that some safety techniques become more effective with scale, while others plateau or even degrade.
Red Teaming Language Models with Language Models
We present an automated red teaming approach that uses language models to discover failure modes in other language models. This method scales red teaming efforts and uncovers novel attack vectors.
Interpretability in Large Language Models: A Survey
This survey reviews recent advances in interpretability techniques for large language models, covering mechanistic interpretability, probing methods, and visualization approaches.
Governance Frameworks for Frontier AI Development
We analyze existing governance frameworks for AI development and propose improvements for managing risks from frontier AI systems. Our framework emphasizes transparency, accountability, and international coordination.