In this episode, Ryan Greenblatt, Chief Scientist at Redwood Research, discusses various facets of AI safety and alignment. He delves into recent research on alignment faking, covering experiments involving different setups such as system prompts, continued pre-training, and reinforcement learning. Ryan offers insights on methods to ensure AI compliance, including giving AIs the ability to voice objections and negotiate deals. The conversation also touches on the future of AI governance, the risks associated with AI development, and the necessity of international cooperation. Ryan shares his perspective on balancing AI progress with safety, emphasizing the need for transparency and cautious advancement.

Ryan's work (with co-authors at Anthropic) on Alignment Faking: https://www.lesswrong.com/post...

Ryan's work on striking deals with AIs: https://www.lesswrong.com/post...

Ryan's critique of Anthropic's RSP work: https://www.lesswrong.com/post...

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

CHAPTERS:
(00:00) Teaser
(00:51) About the Episode
(05:05) Introduction and Welcome
(07:06) Exploring the ARC-AGI Challenge
(09:16) Inference Scaling and Strategy
(12:32) Reasoning and Prompt Engineering
(17:20) Challenges and Future Directions (Part 1)
(19:55) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite
(22:35) Challenges and Future Directions (Part 2)
(33:39) Sponsors: Shopify
(34:59) Challenges and Future Directions (Part 3)
(46:40) Speculating on o3 Aggregation Mechanisms
(50:41) OpenAI's Approach to AI Verification
(55:34) AI Safety and Misalignment Risks
(01:00:14) The Complexity of AI Alignment
(01:03:58) Claude's Alignment and Training Setup
(01:11:46) The Implications of Alignment Faking
(01:23:30) Debating the Release of Guardrail-Free Models
(01:26:42) Experimental Setups and Findings
(01:30:04) Emerging Challenges in Model Alignment
(01:41:33) Reinforcement Learning and Alignment Faking
(01:45:51) Transparency and Chain of Thought
(02:05:09) Exploring Model Simplicity and Self-Prediction
(02:07:32) Making Deals with AI
(02:11:38) Speculative AI Welfare Policies
(02:18:22) Meta Honesty and AI Communication
(02:28:27) AI Welfare and Model Welfare Lead
(02:36:31) Allocating Resources for AI Safety
(02:39:53) Training AIs in Philosophy
(02:44:07) Collaboration with Anthropic
(02:53:03) International Governance and AI Risks
(03:08:31) Potential AI Misalignment and Societal Response
(03:17:09) Concluding Thoughts and Future Directions
(03:18:21) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

PRODUCED BY:
https://aipodcast.ing

Inference Scaling, Alignment Faking, Deal Making? Frontier Research with Ryan of Redwood Research