The AI Scouting Report: Jailbreaks and Defense

Nathan Labenz synthesizes recent research in mechanistic interpretability and AI safety, looks at how top players in the space like Anthropic and OpenAI are addressing safety challenges, and discusses jailbreaks like the Calvin and Hobbes one you may have seen online.

Nathan's aim is to give listeners the equivalent of a high school AP course's understanding in 90 minutes. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive

Questions or topics you want us to review for future episodes? Email TCR@turpentine.co

SPONSORS: NetSuite | Omneky

NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

LINKS:
Scouting Report Part 1 - Fundamentals : https://www.youtube.com/watch?v=0hvtiVQ_LqQ
Scouting Report Part 2 - Impact, Fallout, and Outlook: https://www.youtube.com/watch?v=QJi0UJ_DV3E
Universal Jailbreaks with Zico Kolter, Andy Zou, Asher Trockman: https://www.youtube.com/watch?v=BwltbhR0JgU&feature=youtu.be

X/SOCIAL:
@labenz (Nathan)
@eriktorenberg (Erik)
@CogRev_Podcast

TIMESTAMPS:
(00:00:00) Episode Preview
(00:02:26) AI Engineer Survey
(00:03:53) P(Doom)
(00:07:52) Representation engineering
(00:09:20) Using contrasting prompts to understand model’s inner representations
(00:15:16) Sponsors: NetSuite | Omneky
(00:22:00) Controlling AI systems and detecting jailbreaks
(00:28:53) LLM performance and refusal rates varying by language
(00:33:13) Towards monosemanticity: decomposing language models with dictionary learning
(00:54:12) Implications of the aforementioned paper

Music license:
D5PTICTBVE63M43U
