Everything You Wanted to Know About LLM Post-Training, with Nathan Lambert of Allen Institute for AI

In this episode of The Cognitive Revolution, we dive deep into frontier post-training techniques for large language models with Nathan Lambert from the Allen Institute for AI. Nathan discusses the groundbreaking Tulu 3 release, which matches Meta's post-training performance using the Llama base model. We explore supervised fine-tuning, preference-based reinforcement learning, and the innovative reinforcement learning from verifiable rewards technique. Nathan provides unprecedented insights into the practical aspects of model development, compute requirements, and data generation strategies. This technically rich conversation illuminates previously opaque aspects of LLM development, all accomplished by a small team of 10-15 people. Join us for one of our most detailed and valuable discussions on state-of-the-art AI model development.

Check out Nathan Lambert's newsletter:
https://www.natolambert.com
https://www.interconnects.ai

Be notified early when Turpentine drops new publications: https://www.turpentine.co/excl...

SPONSORS:
Incogni: Take your personal data back with Incogni! Use code REVOLUTION at the link below and get 60% off an annual plan: https://incogni.com/revolution

Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitivere...

Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, with compute costs 50% lower and outbound networking costs 80% lower than other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive

80,000 Hours: 80,000 Hours offers free one-on-one career advising for Cognitive Revolution listeners aiming to tackle global challenges, especially in AI. They connect high-potential individuals with experts, opportunities, and personalized career plans to maximize positive impact. Apply for a free call at https://80000hours.org/cogniti... to accelerate your career and contribute to solving pressing AI-related issues.

RECOMMENDED PODCAST:
Unpack Pricing - Dive into the dark arts of SaaS pricing with Metronome CEO Scott Woody and tech leaders. Learn how strategic pricing drives explosive revenue growth in today's biggest companies like Snowflake, Cockroach Labs, Dropbox and more.
Apple: https://podcasts.apple.com/us/...
Spotify: https://open.spotify.com/show/...

CHAPTERS:
(00:00:00) Teaser
(00:00:59) Sponsors: Incogni
(00:02:20) About the Episode
(00:05:56) Introducing AI2
(00:09:56) Tulu: Deep Dive (Part 1)
(00:17:43) Sponsors: Notion | Shopify
(00:20:38) Open vs. Closed Recipes
(00:29:48) Compute & Value (Part 1)
(00:34:22) Sponsors: Oracle Cloud Infrastructure (OCI) | 80,000 Hours
(00:37:02) Compute & Value (Part 2)
(00:42:41) Model Weight Evolution
(00:53:16) DPO vs. PPO
(01:06:36) Project Trajectory
(01:20:39) Synthetic Data & LLM Judge
(01:27:39) Verifiable RL
(01:38:17) Advice for Practitioners
(01:44:01) Open Source vs. Closed
(01:49:18) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

TRANSCRIPT:

Nathan Labenz: Hello, and welcome back to the Cognitive Revolution!

Today my guest is Nathan Lambert, author of the popular interconnects.ai newsletter and Machine Learning Researcher at the Allen Institute for AI, which today is releasing Tulu 3, one of the most comprehensive open source efforts to diffuse the understanding and practice of frontier post-training techniques for large language models.

By systematically working to match Meta's post-training performance using the same Llama base model, and sharing all their findings and data publicly, Nathan and the Allen Institute team have illuminated what has historically been one of the most opaque aspects of LLM development, and this conversation represents one of the most detailed discussions of this topic available anywhere online.  

We cover the full spectrum of post-training techniques, including supervised fine-tuning, multiple flavors of preference-based reinforcement learning, and a new technique called reinforcement learning from verifiable reward, which rewards the model for accurately answering questions with objectively correct ground-truth answers.

At each step, we dig into the practical details that make these techniques work, the associated compute requirements, data generation strategies, and value derived from each, and also the experimental designs used to measure performance while exploring the vast space of possible training recipes. 

We even get into some fascinating emergent behaviors that echo the frontier reasoning behaviors we've seen from OpenAI's o1.

Nathan's frank discussion of both the technical and organizational challenges of this work, including in a few moments where he acknowledges aspects that are not yet well understood, provides a super useful window into what it takes to develop state-of-the-art language models, and the fact that they ultimately succeeded in matching Llama performance, with a team of 10-15 people, makes their approach one to study closely.  

How long this level of open development can continue into future generations remains an open question in my mind. Even with billionaire estate backing, human-generated preference data and annotations are cost prohibitive for the Allen Institute, and the synthetic data generation techniques used in this project may not be available if frontier developers follow OpenAI's lead in choosing not to release o1-style reasoning traces.  

That said, Nathan expects that the community will figure something out, and just before publishing, we've seen Chinese AGI company DeepSeek announce a new o1-style model called DeepThink, which does seem to show its work and is reportedly going to be open sourced in the near future. That development could shift the question from whether or not such open reinforcement learning projects can continue to whether or not they should, and it should definitely cause anyone who thinks that Western companies can maintain a comfortable lead relative to Chinese companies from here all the way to AGI to stop and rethink their assumptions.

In any case, this episode is one of the highest-leverage pieces we've produced on this show, full of concrete, practical insights that were previously scattered throughout the literature or simply  hidden behind closed doors, and I'm really grateful to Nathan for such an open and technically detailed conversation.

If you're finding value in the show, we appreciate it when listeners take a moment to share it online, post a review on Apple Podcasts or Spotify, or just leave a comment on Youtube.  And your feedback is always welcome, via our website, cognitiverevolution dot ai, or by DM'ing me on your favorite social network.  

With that, I hope you enjoy this super deep dive into the frontiers of LLM post-training, with Nathan Lambert of the Allen Institute for AI.


Nathan Labenz: Nathan Lambert from the Allen Institute, creator of Tulu. Welcome to the Cognitive Revolution.

Nathan Lambert: Yeah, thanks for having me. The Nathan squared pod has happened.

Nathan Labenz: I can't believe it's taken this long, but I'm excited that the day is finally here. So a lot of ground to cover. I'm just going to fire questions at you rapid fire and let's see how far we can get. For starters, tell us about the Allen Institute. You guys have put out a lot of stuff. I would love to just get a little context on kind of the philosophy behind it, the, you know, the funding, the, you know, the GPU, are you GPU rich or GPU poor? And kind of, you know, how are you looking to put a dent in the universe?

Nathan Lambert: Yeah, so the Allen Institute, I'm new blood here. I joined just over a year ago. The Institute is actually 10 years old. It's funded by Paul Allen, which is not surprising for people in the tech scene who know the Allen Institute in Seattle, given the name. And the original CEO, Oren, I don't know how to pronounce his last name, was also a professor at UW, a lot of UW ties, I think was buds with Paul Allen. And this kind of grew out of that to just kind of do AI research for the common good, make things very open, very hybrid academic industry. That is kind of the story of the first decade, roughly. It's not quite the first 10 years, and now Oren left. The process of leaving started a few years ago, and there's the new CEO, Ali Farhadi, also a professor at U-Dub, was at Apple before. The most prominent work from his group, I think, is the YOLO line of work in computer vision. And it's kind of bringing in new blood to make this transition to build open language models. I think, like a lot of institutions, we realized that in order to be credible in AI, you need to have some amount of cred in the language modeling space. And right now for AI research, it is to be like, look, we can do this too. And it is a narrowing of scope to try to release great models and build models. I think OLMo is the start of this, which was a very long story in 2023 that I was not a part of for most of. I joined late in the year and it launched in early 2024. And then 2024 has been the year of doubling down on this and scaling up projects that you can do when you have more people. I think we'll talk about the post-training side of things, and there's plenty on pre-training and stuff in evals as well, but it's really what the academic projects in open source AI look like when you can bring more people in. And I think throughout the year, a lot of our projects have been getting bigger, and we've seen like Molmo, the vision model, which is primarily on the vision team, but gets involved with the language pre-training team and other things. It's like these projects getting bigger and trying to increase the scope of research while doing as much as we can in the open. So no constraints on what we can say. At the end of the day, nonprofits are about telling a good story, and stories are what can actually change how policy happens and how people view the world. So at the end of the day, the Allen Institute needs to build good models so we can communicate why open source has benefits. And it's kind of reductionist and a little cynical to view nonprofits as storytelling operations. But when your money mostly comes from a tech billionaire's estate, you've got to say some of it as it is. But the momentum is good, so it's fun to be in the space and see things continuing for the time being.

Nathan Labenz: I think stories are honestly a super important part of life in general and the AI moment in particular. It feels to me like we are summoning this crazy new alien intelligence and really lack, in most corners, a positive vision for what we want to do with it, what impact we want it to have on life. And there's a lot of very general talk about, won't it be great when we have everything we want? And it's like, yeah, can we put a little more detail on that? So I do appreciate the importance of storytelling in this space. You mentioned pre-training, post-training. The project that you've just put out today as we record this a few days in advance, but we'll try to publish it right around the time of recording, is called Tulu. It is really a deep dive into post-training techniques. Give us a little bit of context for kind of why this project, what the goals of it are. And then I really just want to go deep on the actual methodology, what you've learned and try to develop my own intuitions for the state of the art in post-training.

Nathan Lambert: Yeah, so I think this kind of story fits well with what we were talking about with AI2. So AI2's kind of open post-training recipes have been under this Tulu brand for over a year, almost a year and a half. I think Tulu 1 was back when Open Assistant was new, and was like, what can we do by mixing all these popular datasets that people are putting out in 2023? All these datasets and small models, like how do you systematically mix instruction datasets to combine them with open resources? And Tulu 2 was around when DPO was very popular, and that was when we showed that you can scale DPO to 70 billion parameters and continue getting the benefits from preference fine-tuning and open recipes. Then we went away and did Tulu 2.5, which was kind of trying to answer the PPO versus DPO question, where the TLDR was, if you tune your parameters right, PPO is probably a bit better, but it's probably not worth the effort to spend all your time on preference tuning when you can just be making better data and better pipelines, which is what Tulu 3 is about. Which is kind of inspired by the transition we're seeing with, like, the Llama report, with Chatbot Arena, it's like turned into a hockey stick again, where we had these incremental scores and now OpenAI and Google are, like, skyrocketing their scores again. The Nemotron paper, Apple had a paper, and these post-training recipes are just much more complicated than a big SFT step, do DPO on some chat data, and be done. And the philosophy is kind of like, we know these labs are doing it. How do we try to understand what the open groups should be doing and where there are hills to climb when you're increasing the complexity substantially in post-training? Like, Llama 3 has hundreds of people on it and a large amount on post-training. We have 10 to 20 on ours. And it's like, what could academics actually do to do modern post-training and curate new data, use new algorithms, sequence things together, but kind of move beyond this paradigm of just increasing, like, AlpacaEval or vibe scores, and try to get really specific on improving math, improving IFEval. Code is something we did a bit, but it wasn't the biggest focus. I think our evals for code were mostly saturated pretty early on, but kind of shifting this post-training into a much bigger domain of academic progress. And honestly, there's been a slowdown in how often people are releasing models and datasets, and we're trying to kind of reboot that. I think the DPO era was a very fun one for seeing open post-training models. It was bonkers. Like every week there was a new DPO model that was state of the art on something, but that was like last December, last January, and it feels much slower now. It's much more big players dominating the AI narrative, and this is trying to reset that.

Nathan Labenz: Yeah. So, one of the things that I noticed in the materials that you shared is that the goal of the project was to beat Llama 3.1. And you're starting, if I understand correctly, with a couple of different base models, but one of the base models is basically Llama 3 base, right? So you're essentially saying, okay, if we take the same pre-trained version that you folks at Meta started with, can we match you, or hopefully exceed you, on a post-training to post-training, apples-to-apples basis? They obviously put out a big report and, you know, shared a lot about what they did, but I believe they don't share much or any of the post-training data. Give us a little bit more on, like, what do they share and not share?

Nathan Lambert: I would say they share an outline of their method. They might have some hyperparameters, but it's not really clear without knowing their code base and things like this. So they share a high-level outline of, here is our approach and here are the types of things that we do, but not the specific things that we do, and definitely not the data. So the biggest differentiator for our post-training is that we release the data. I think we built like three to six new datasets just for this project, and we're using those with a combination of already existing datasets, and we'll release all of them at once. But really what you're saying is right: the development model here was we had a set of evaluations we cared about that we tried to cover, things like factuality, knowledge, math, code, instruction following, safety. And the primary development target was Llama 3 8B, where we effectively trained about 1,000 8B models to try to get these scores to be higher at 8B and then validate them at 70B. And we're also adding, like, OLMo models, which is like our big emphasis, is our main thing, which is trying to make sure the recipe translates. So for that reason, it's kind of a basic goal: in order to know you're in the right ballpark, you need to beat these people that are known to be pretty good. The fun thing about Llama base models and Instruct is that the closed labs can also validate their pipelines on it. So the closed labs can take the Llama 3 base and then use their post-training and compare it to the Instruct model that Llama also released. So hearing from some people in industry, they're also like, yeah, Llama 3 Instruct, like 3.1 Instruct, the latest stuff, was pretty cracked. Like it was a pretty good post-training. I think obviously OpenAI's infrastructure isn't going to work as well on Llama base models, but the 3.1 models particularly actually were very good. I think it's clear that there were some ways that they were rushed or trying to focus on systems rather than novelty. It was mostly SFT and DPO for the Llama 3.1 Instruct. And I know they are expanding into more things now. And by being smaller and moving fast, I think we're doing some things that I bet will look pretty similar in the Llama 4 Instruct or like a Llama 3.5 Instruct type thing. But that's kind of the theme: we want to beat their numbers. We're doing similar things. We have the advantage of being smaller. We can be a little bit more clever maybe, but that was the development target. And it really is like, I can look at the dates. I don't have the dates in mind. If we started this project four months ago, it's like, oh, a few months in we passed Llama 3 Instruct. And then like a month and a half later, we passed Llama 3.1 Instruct at 8B. And then like a couple of weeks later, it's 70B, and it's just metronomic. And it's like, oh my God, every time it's like, oh wow, this is actually happening. So it's fun to see. And we're very excited to see what people build on it. And we might have a discussion later on, which is fine-tuning from base versus fine-tuning from instruct, which is another fork. And I think fine-tuning from base is very academic and very important to post-training, to understand how to do this from base models.
And fine-tuning from instruct is a very important area, especially for new companies and domain-specific things. Because if Llama is gonna put out a very strong instruct model, you don't have to do the whole Tulu 3 suite to make it good at your specific domain. So we're doing part of this, but eventually there's going to need to be research there. I think on the day of recording, Nexusflow, which is a startup where some of the people are from Berkeley NEST, the group that released the Starling models, released another fine-tune on Instruct, and we saw the Nemotron fine-tune on Llama Instruct go very viral recently. So there is some messaging that we're not doing this fine-tune on Instruct, but it is another area. I just think in terms of fundamental long-term ecosystem building, the from-base work is still the thing that needs the most transparency. So... Let's get into the real nitty gritty here.

Nathan Labenz: I have a general sense. I think most of our listeners will have a general sense of, you know, certainly pre-training, understand what that is, you know, web-scale data, next-token prediction. Okay, cool. Then you've got your typical recipe. There may be some exceptions or caveats to this, but typically next comes instruction tuning, and then comes some sort of preference tuning, and that's where, like, reinforcement learning from human feedback fits in. And then there's different algorithms, and you've mentioned PPO and DPO, that can be used to actually do that final stage of preference tuning. That's the baseline understanding. Give us the next level of understanding, you know, what's most important to understand after that?

Nathan Lambert: Yeah, I guess I'm going to jump to the real nuanced thing, which you might not even have in your questions. It's like, what are open recipes doing and what are closed companies doing? In the open, we have the benefit of being able to train on outputs from very strong models like GPT-4, which is something that we did. There's some amount of legal uncertainty there, but the wording in our datasets is like, look, you understand you're training on outputs from these models. But that definitely gives an advantage in some way. Llama, in order to get a model like GPT-4, needs to build Llama 405B. They need to do instruction tuning over their various base models and all of these things that kind of get this distillation model to work from. So they are trying to do a similar thing, which is a lot of the instructions are written by a very strong model. So a lot of math and code instruction data is now written, for Llama, by Llama 405B, and for OpenAI, by GPT models. And in that way, we are doing something similar. But in the open, you can kind of take shortcuts, which is like, we don't need to do this whole pre-training. We just go right to the best model possible for SFT. I think that stage actually looks very similar. I think we don't have as good of control over which prompts we're using. I think the closed labs will do a lot of filtering and controlling of the distribution. We do a lot of mixing, but it's mostly on taking the levels of known datasets and seeing what their performance is on downstream evals. It's largely a process of adding in a dataset to instruction tuning for a specific capability and making sure it doesn't degrade other performance. So if you have a math eval, like the all-capitals MATH eval, you can improve this, but a lot of it comes down to the formatting. So some math datasets will have different formatting in the completion that will make it harder for your model to learn from that. Or it can affect something like GSM8K or coding or stuff like this. So these are the kind of things that you're doing. And then also at the end, you're probably going to try to include some more general chat data. We included 100,000 multilingual samples from Cohere's Aya because we know Chatbot Arena has a decent amount of multilingual, even though multilingual wasn't in our eval suite. And there are some other safety things which are more borderline, which is just like, how should a model behave? And this is kind of what happens in SFT: there's some art to it in squashing second-order effects, but I think largely that's really similar to what the closed labs are doing. And then at preference tuning, it takes kind of a big turn, which is the closed labs use humans for their preference data. We do not have the money. So we use an LLM as a judge to collect our preference data. I like to quote John Schulman on this; the pithy way to say it is, human preference data is high noise but low bias, and LLM preference data is low noise but high bias. We do not know exactly what bias we are getting, but largely we can still see pipelines where we collect our own preference data from various model generations, language models label it, and the preference data is good. The biggest change here is from datasets like UltraFeedback, which exist on Hugging Face. We essentially redid their pipeline with completions from the models we had trained at SFT, and that gives a meaningful low-percentage improvement over just using random preference data on the Hugging Face Hub. So it's higher effort.
You have to go through the effort of generating completions with the LLM, making sure that's all right, getting diversity, doing LLM as a judge. But that type of thing gives us, across our experience, time and time again, this on-policy idea of preferences, even with LLM as a judge, was better. So I do think that that's kind of what, if you look at Meta's system diagram in their paper, they show this. They're like, we take a new model, we pass it in, we get new preference data and we train on it. The difference is that they do multiple iterations, which could be based on how they get data, could be based on how their timeline is and stuff like this. I do think multiple iterations is something we're seeing again and again, and we didn't look at it, but we're kind of checking the box of, okay, on-policy data does work for preference tuning. It's not a huge paradigm shift, but it's a whole different approach than what people have had to do. We've seen some of this in kind of on-policy and new online PPO algorithm variants that everyone kind of gets spiraled by. It's like, oh God, we have some other star-PPO algorithm I can't pay attention to. But the algorithms were going in that direction, where they're doing more generation from the model and labeling. The way that we did it, I think, is a bit more segmented, which is kind of closer to what Llama is doing. This at a high level is similar to what Llama 3.1 does. I've heard that they weren't doing any fancy verifiable RL, any o1 stuff, on theirs. What we kind of added to this, and this is specifically in the paper, like you can look at the acknowledgements section of the paper and you'll see one name and a bunch of normal grant gobbledygook, it's not in the draft I sent you, but there's one name there, and he told us to just do RL on verifiable outputs. So this is the stage three that we did, which is essentially taking the training sets from things like MATH and GSM8K, and actually IFEval is verifiable too, because you can check the constraints. IFEval, as a summary, is an evaluation where there are constraints on prompts, like make sure your response has X word, make sure your response has X paragraphs, and these are things that are verifiable in Python code. So across these things, like math and instruction following, we have a system where you have these prompts with constraints, and then we used RL to just give a reward if the constraint was satisfied. And we have seen on multiple models that essentially you can improve GSM8K, you can improve MATH; IFEval is a little bit trickier, there's a bunch of RL details there. But we essentially do this at the last stage to be able to, if our DPO worsened our math scores, bring them back up. And with our pipeline, where the DPO data is very tailored to our models, we actually get most of the math improvement there, and the last RL stage is pretty minor. But if you take some older model off of Hugging Face, you can apply this RL-with-verifiable-rewards thing to it and get like a 15-point boost in GSM8K without a ton of degradation on other evals. So we're really just scratching the surface there, but the winds of AI, you can see that it is going this direction. There's murmurs that a lot of big labs do stuff like this. You definitely think they're doing it for code. We haven't added code with a code interpreter yet. o1 is described as a large-scale RL system specializing in reasoning tasks.
It's like, oh, they're probably doing something like this. I think with one of our models, we just left RL running on this math for a really long time, and it started doing this, like, let me check my answer again, redoing chain of thought within the chain of thought. It was literally the thing that OpenAI was showing us, where it's like, wait, let me check that. This is definitely not o1, but I think we've seen other papers in the space coming out. There's VinePPO, which is one that's very similar; let me get the name right so it's easier for people. People have talked about Quiet-STaR. That's a bit more complicated. I think TRICE is also one, which is also a bit hard to understand, but motivated by the same thing. So the literature is going in that direction, but we kind of showed, look, you can do this in your preference tuning. You don't have to just do a math model. You can just add on these types of RL at the end and it doesn't blow the model up. So to kind of summarize, as we go, we have these three stages, SFT, DPO, and RL, and each of them was kind of taking more alpha from what industry is doing. Like SFT is known, it works. DPO, there's some new tricks that we needed to uncover, parameters, different settings. We use this, like, length-normalized setting. That's mostly known, but good information. But the RL stuff is like, look, this is real. We're beating Llama numbers. We used RL. It's still mind-blowing to me that RL is becoming more relevant, I think. Like 2023 was like, oh, where's RLHF going? But now it's not even just RLHF. Like we didn't even have a reward model in this. We can go through the technical setup later, but there's no reward model. It's just an RL value function and it's fine. So that's a summary of the training. It's probably somewhat dense if you haven't heard these terms before, but also informative.
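
To make the verifiable-rewards stage concrete, here is a minimal sketch of what reward functions of this kind might look like. The function names, regex, and constraint types are illustrative assumptions, not the actual Tulu 3 code; the point is just that the reward is a programmatic check of the output rather than a reward model's score.

```python
import re

def gsm8k_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last number in the completion matches the ground-truth answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

def constraint_reward(completion: str, constraint: dict) -> float:
    """1.0 if a simple IFEval-style constraint is satisfied (illustrative constraint types only)."""
    if constraint["type"] == "min_words":
        return 1.0 if len(completion.split()) >= constraint["value"] else 0.0
    if constraint["type"] == "must_include":
        return 1.0 if constraint["value"].lower() in completion.lower() else 0.0
    return 0.0

# During RL, each sampled completion gets one of these binary rewards
# in place of a reward-model score.
print(gsm8k_reward("... so the total is 42.", "42"))                          # 1.0
print(constraint_reward("word " * 120, {"type": "min_words", "value": 100}))  # 1.0
```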

Nathan Labenz: Yeah. Well, I like that. We're going to unpack it a little bit, but that's a great initial overview. So I'll maybe just come at this from a couple of different angles to try to understand the overall landscape better. For starters, like, how much compute and how much value goes into and comes out of each of these stages? Maybe compute and data, like, you know, just relative amounts and relative value that you get from each one.

Nathan Lambert: Yeah, so this is like a general-purpose instruction model, which I have to caveat with, because I think if you have different domains, you have much different compute. I think SFT is something we talked about a lot. The dataset largely kept growing. We didn't do a lot of subsampling results because we're like, we're searching for high numbers. Our final mix is about a million prompts. Most of them are single-turn. This model will not be as good as Llama at multi-turn. We're not a Meta AI shop. We don't need that as much. We don't have the evals for it, but it's about a million prompts at SFT. I can give really specific throughput numbers: if you're using 32 H100s, you can train an 8B model in about a day on our code. Our code is not super optimized because it relies on Transformers. I think you can get about a 40% speedup if you're using really specific code, which is something we might do in the future, to only support, like, OLMo and Llama architectures. Because if you support fewer architectures, you make it much faster. So that's like a million prompts.

Nathan Labenz: So that was 32 H100s, like a 24-hour train cycle, to train on a million prompts. Yes, for SFT. At market prices, that's what? Let's say $2. Prices have come down, as I understand. $2 per H100 per hour. So roughly $64 an hour, something like $1,500 of compute cost. I wasn't expecting this number, but yeah, yeah, yeah. I think...
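
For reference, the back-of-envelope arithmetic here, treating the $2 per H100-hour figure as an assumed market rate:

```python
# Back-of-envelope SFT cost, treating $2 per H100-hour as an assumed market rate.
gpus = 32                 # H100s
hours = 24                # roughly one day for ~1M prompts at 8B scale
price_per_gpu_hour = 2.0  # USD, assumption

gpu_hours = gpus * hours                  # 768 GPU-hours
total_cost = gpu_hours * price_per_gpu_hour
print(gpu_hours, total_cost)              # 768 GPU-hours, ~$1,536
```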

Nathan Lambert: It's reasonable. It's ballpark that, like ballpark $1,000. Because I do think we could probably get 90% of the performance with half the prompts. I think looking at our mix, we have like 30-plus percent math. We were going hard on math, trying to get these numbers across. We ended up doing the general MATH eval, and then we have subsets for specific subsets of math, like intermediate algebra, where we weren't as good. So we looked at the eval and saw which micro-subset was not as good and tried to make data specifically for that. So you don't need all of this math data. So that's a good rule of thumb. DPO and preference tuning, we haven't been able to show as strongly that scaling helps, like getting more prompts in and keeps improving evals. I think it'll be on the order of a couple hundred thousand prompts. It's not as big. DPO does use more compute, especially at 70B, I think, because you need the reference and policy model. Still, we run these jobs and they take like six to 12 hours if you're using somewhere between 16 and 32 GPUs, much faster than SFT, just because the dataset is much, much smaller. I think at 70B, we changed our forward pass from the default, like, Hugging Face DPO implementation to make it a little bit more efficient. We do caching of reference log probs so you don't have to store both 70B models in memory at the same time. Because if you don't do these optimizations, it starts to look a lot more like PPO, where you need 128 GPUs or more to do it at 70B, which is, like, quickly you can see how PPO kind of balloons what is going on. And it can take longer if you're trying to really get the best absolute scores. So I do think SFT is by far the biggest compute because we have the most tokens. DPO, I would say ballpark a quarter of what we talked about for SFT, would be a couple hundred bucks. And RL is probably, especially at 70B, could be almost similar to SFT if we really run it for a long time. But you can get most of the benefits probably in a similar amount of compute as SFT. So the RL curves look remarkably similar to kind of old-school RL tasks, where at the beginning they get the most improvement and then it's kind of leveling off and bouncing around and maybe going up a little bit. So if you do one epoch, which is this first improvement, you're going to save a lot of the money, but we're like, oh, we're trying to get the best numbers. Let's let it run for a few more days and see what we're doing.
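
A rough sketch of the reference-log-prob caching idea mentioned here, assuming a Hugging Face-style model and illustrative batch field names (the real implementation differs): one pass with the frozen reference model up front means it never has to sit in memory alongside the policy during DPO training.

```python
import torch

@torch.no_grad()
def cache_reference_logprobs(ref_model, dataloader):
    """One pass with the frozen reference model, storing each example's summed
    completion log-prob. After this, the reference model can be dropped from
    memory for the rest of DPO training. Batch keys here are illustrative."""
    ref_model.eval()
    cache = {}
    for batch in dataloader:
        out = ref_model(batch["input_ids"], attention_mask=batch["attention_mask"])
        logprobs = torch.log_softmax(out.logits[:, :-1], dim=-1)
        targets = batch["input_ids"][:, 1:].unsqueeze(-1)
        token_lp = torch.gather(logprobs, 2, targets).squeeze(-1)
        # Mask so only completion tokens (not the prompt) count toward the sum.
        seq_lp = (token_lp * batch["completion_mask"][:, 1:]).sum(dim=-1)
        for ex_id, lp in zip(batch["example_id"].tolist(), seq_lp.cpu().tolist()):
            cache[ex_id] = lp
    return cache
```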

Nathan Labenz: How about the eval improvement through that process? I guess if you have a base model, you have to basically few-shot. If you go to instruction tuning, then you could just do zero-shot or few-shot. How are you- Yeah, the dial setting is crazy.

Nathan Lambert: There's a lot of iteration on it. I would say that, for performance, roughly, I think we get 90% of our performance at SFT, and then the last 10% from a mix of DPO and RL. And then on AlpacaEval-slash-vibes things, you get the most from DPO and RL rather than SFT. But at this point, our SFT mix beats, on our eval suite, the Llama 3.1 numbers. But it's very focused on our evals. And I think the preference tuning kind of softens it a bit to be a nicer model to talk to. The eval suite is complicated. I don't know all the details off the top of my head. It's good to look at the paper. But we tried to do... There's this whole distinction between pre-training evals and post-training evals. And I do agree that most post-training evals should be, like, using a chat template. So you're generating from the model to generate tokens. A lot of them are CoT, chain of thought. And there's kind of a distribution over the number of in-context shots. I think there's some that are zero-shot, there are some that are eight-shot, kind of depending on the domain. And probably the trickiest thing to manage is answer extraction for reasoning, which is like, does the answer appear in the format that the eval expects? Particularly on math, Llama 3.1 uses what's called the Minerva format in a specific prompt. And we use what is called a flex format, which essentially means, if the default writing is like the answer is in \boxed, you also allow, like, a raw boxed and "the answer is:" and, like, one other thing. And this is mostly to be fair to, like, Qwen. So Qwen is a fun example where their 72B model gets a score of like six with the Llama 3.1 setting, but a score of 74 with our setting. So it's like, okay, we're competing with Qwen Instruct in a way, we have to have a setting that is not just what Llama does. And I mean, all the other labs are doing things like this, which is tailoring their training to evals. And it's hard to know what they trained on. We did a lot of work to decontaminate all of our training datasets on the evals that we're doing for development. We also have an unseen eval suite. So we do a few methods to check for exact match and overlap on all of the training datasets that we used throughout it, which is some of the final datasets, and we're also going to release decontaminated versions of other datasets along the way. So popular names like OpenInstruct, NVIDIA's Daring Anteater dataset, have some contamination on math. For example, the Hugging Face NuminaMath TIR, which is Tool-Integrated Reasoning, which was for a math competition, has MATH contamination, so we have to remove this. It's like they were using their model for a Kaggle competition, not for fair evaluation. And it really goes to show, it's like, one, we're releasing the data, but two, we're showing how easy it is to have contamination. So it's like, yes, we need more people to release it, because we don't know if any of the models we're comparing against trained on test, either on our development set or our unseen set. It's just like, we can't know. We found a lot of contamination out there, but we don't know what other people are doing. And I think there's probably a snarky paragraph in the paper, which is like, we don't know which of any of these models trained on the evals.
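
As an illustration of the "flex" answer-extraction idea, here is a sketch that accepts several common answer formats rather than a single one. The patterns below are assumptions for illustration, not the exact matcher used in the evaluation suite.

```python
import re

def extract_answer_flex(text: str):
    """Try several common answer formats instead of a single one (patterns are illustrative)."""
    patterns = [
        r"\\boxed\{([^}]*)\}",              # LaTeX \boxed{...}
        r"[Tt]he answer is:?\s*([^\n.]+)",  # "The answer is: ..."
        r"[Aa]nswer:\s*([^\n.]+)",          # "Answer: ..."
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            return matches[-1].strip()      # take the last occurrence in the generation
    return None

print(extract_answer_flex("... so we get \\boxed{74}"))  # 74
print(extract_answer_flex("The answer is 74"))           # 74
```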

Nathan Labenz: It's like, okay, wish for the best. When you're finding contamination, you're finding it in data sets that other people have open sourced, or you're using some sort of diagnostic to detect it in the model? I've seen some techniques like that.

Nathan Lambert: We're looking at datasets. People are working on model diagnostics, but we're looking at datasets. And there was this weird thing, we'll talk about datasets, where essentially the biggest thing we're doing is exact prompt match. So if a certain training dataset has more than like a 2% exact prompt match with one of our test sets, which is like the percentage of the test set whose prompts appear exactly, we consider that contaminated and remove it. The other thing is, can you detect if models trained on certain test sets? So earlier in this year, my big project was RewardBench, which is trying to build an ecosystem for evaluating reward models. Now it is going; there's lots of academic papers on it. But one of the weirdest things we found recently is substantial contamination on RewardBench prompts, which were taken mostly from other test sets, generated by Llama Instruct with the Magpie method. So Magpie is a synthetic data method that manipulates the chat template to get the model to generate prompts in the distribution of what it was trained on. So this is the type of weird head-scratching you need to do to figure out, is a model trained on something? You can't prove it. But if you have the weights, you can get the model to generate prompts roughly from its distribution, and then you can check those against test sets. Which is really funny, because Magpie is meant as kind of a training architecture for generating new data. But when you then realize that that might be what you need to do for evals, I told the author and he's like, oh, cool, another use case for my paper, which is really funny. But I do think that whole ecosystem is going to continue to grow a lot, because you need to have some source of truth, I think. Evals are only growing in value and cost, and you need to be able to audit the models to assess this across industry. So there's going to be some mix of new startups, academic research, regulation on decontamination. The whole thing is kind of fair game.
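
A minimal sketch of the exact-prompt-match check described here, with the normalization and the roughly 2% threshold treated as illustrative choices rather than the exact pipeline:

```python
def contamination_rate(train_prompts, test_prompts):
    """Fraction of test prompts that appear verbatim (after light normalization) in a training set."""
    normalize = lambda s: " ".join(s.lower().split())
    train_set = {normalize(p) for p in train_prompts}
    hits = sum(1 for p in test_prompts if normalize(p) in train_set)
    return hits / max(len(test_prompts), 1)

train = ["What is 2+2?", "Write a haiku about fall."]
test = ["What is 2+2?", "What is 3+3?"]
if contamination_rate(train, test) > 0.02:   # the ~2% threshold mentioned above
    print("flag this training dataset as contaminated")
```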

Nathan Labenz: Yeah, interesting. Okay. Let's go back to the post-training stages. And I want to get you to try to help me develop my intuition for how the model weights are evolving through that post-training process. I have a pretty nice little story that I tell people about the pre-training process, where I'm like, given some text, the model's job is to predict what comes next. There is a ground truth as to what comes next. And so it, you know, outputs a distribution. And then the question basically amounts to, how can you, you know, with the chain rule, how can you tweak all the little numbers that are the weights in this model to just nudge yourself so that you're a little closer to correct? And we just do that, you know, a ton of times. And obviously there's batching and other complications, but, you know, the atomic unit is you make one prediction, you get a score, and you nudge yourself to be closer to having made the right prediction. Now, that gets a little more complicated in various phases of the post-training process, right? Instruction tuning, from what I understand mostly, is similar, right? It's mostly the same.

Nathan Lambert: It's the same, but you add in new tokens. So you add the tokens to help constrain the model, which is like USER or some other EOS token. You add these things in in a specific format. So you have to learn a little bit of new tokens, but otherwise it's exactly the same.
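
As a concrete picture of "adding new tokens," here is roughly what an instruction-tuning example looks like once a chat template wraps it. The template below is a generic illustration, not any particular lab's format.

```python
# Instruction tuning is still next-token prediction; the difference is that the text
# is wrapped in role/control tokens the base model has to learn. This template is a
# generic illustration, not any particular lab's format.
TEMPLATE = "<|user|>\n{prompt}\n<|assistant|>\n{response}<|endoftext|>"

example = TEMPLATE.format(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)
# In practice, loss is usually computed only on the response tokens; prompt tokens are masked out.
print(example)
```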

Nathan Labenz: So that's kind of weird, a little bit counterintuitive in as much as at least for some sort of instruction tasks. And this is, I guess, you know, a natural segue to the preference training. You think like what you really want is to be telling it what's good and what's not good. But with the instruction tuning, you're still basically just saying this is exactly the ground truth tokens you should have predicted. And all the weight adjustments are just specifically toward that exact string.

Nathan Lambert: Yeah, it's almost like for simple things, I don't think you need a lot of SFT data, because you're just manipulating the weights to be more focused on this type of formatting. Like you need to do enough so that this is the way the model outputs. But for specific tasks like math, I think you can make a lot bigger gains, which is just like the model has not seen a lot of this and you're really trying to do the same thing that you do during pre-training, which is teach the model to do chain-of-thought math. Like you need to have that somewhere to kind of kickstart things. I think pre-training data probably has some of it, but I think that that's a balance that we'll see evolve, which is, what is the basic amount that you need to do some instruction following? And then how much more do you need for specific capabilities? A good thing for that, which I said many times during this project and is good for readers: go look at the Llama report and look at the percentage of instances per domain at instruction tuning. And at instruction tuning, they have way higher math, coding, and reasoning than preference tuning. So I think those are the domains where the model needs just more flops to understand the basic capabilities in SFT. But at preference tuning, it becomes a contrastive loss function, either through DPO, which is pairs and comparing them. If you dig into the DPO math, it's actually some weird double negative where it's decreasing the probability of the rejected response. It does that more than increasing the probability of the chosen. There's weird oddities in the actual DPO math there if you go really deep. PPO is almost more intuitive, where you generate new samples, you have a value model that assigns attribution to each of the tokens, where a higher number is good. And then it tries to increase the likelihood of things that it sees as being good. And in this case, it's guided by a reward model. In our case, it's guided by, is the answer right? It becomes much more flexible. I think DPO is somewhat restricted to the generations that you give it, but RL in that way is flexible, where I think it just can kind of change the behavior of the model more. I think when we apply math, we see that it takes different reasoning steps. We see those kinds of different changes than you see at SFT and DPO. I do think there is going to be a lot more understanding of what these stages actually do. And it's kind of funny, because SFT is where most of the performance gains come on evaluations, from chat to capabilities. But it just does really feel like, when you do these and when you look at the loss functions, that there's kind of more that you can change about the weights doing RL or doing this kind of direct preference optimization sort of thing. That's something that we've heard for a while. I think leaders in RLHF have said RL is just a more flexible loss function. There's a lot more scale you can do. I think that was the framing that somebody like Jason Wei at OpenAI gave in a talk. It's like, look, RL has a lot more legroom because the loss function is very different and we haven't explored it. Things like o1 and our own experiments make me more optimistic on this. It's very weird. You can do a lot of weird things. It changes the model. Notably, the eval scores don't go down that much, and it will take a lot more intuition, both on the individual level and the academic level, to understand these parameter changes.
Because it's so weird, it is fundamental to the fine-tuning domain: RL is doing something very different than the previous loss function. In some ways it's remarkable that you can just do SFT and pre-training and then you can just throw RL at it. And the KL regularization is enough to just hold it in place, but you just change loss functions 99% of the way there and it doesn't break everything. That type of thing makes me give a lot of sympathy to the whole Dario, whatever, mindset of "they just want to learn." It's like there's something complicated going on that I don't have deep enough intuitions of deep learning to grapple with.

Nathan Labenz: Yeah. Okay, that's really good. Let me just go back to a couple points for quick follow-ups. First, the KL regularization, that is basically a way of just... anchoring to the earlier distribution, right? So your loss function has multiple terms, and one of them is like, subject to getting better in the way that we want you to get better, which we'll come back to in a second. Also, like, don't change too much. Is that basically the right intuition?

Nathan Lambert: You essentially look at the log probs of the original model and the new model, and you make sure that the difference in probability is not too big between your RL model and your reference model. And I think the way that we phrase it is you have a KL budget. So you can only change the model so much. And once you reach your budget, you normally don't expect the model to keep changing if you're doing value updates, because it can't make substantial changes. It's just moving around in the same neighborhood. Which I think is a nice framing. I wish more... DPO is different, where it's a controlled KL distance through their beta parameter throughout the number of epochs that they're doing, whereas doing online RL is a little bit more open-ended on the KL side of things. But I do think in general in post-training, showing more plots of performance versus the amount of KL that you spend is very nice. I think historically, these just sound like total random numbers to people not looking at this. It's like, PPO KL will be like 10-to-20 scale type thing. This is when you're doing the full thing on chat, for example. Or for something like GSM8K, like really basic math, KL is like one. So we're really, really not changing the model very much doing this RL for math compared to what you would do if you're doing PPO for everything. But if you also compare that to best-of-n sampling or something, best-of-n sampling over a reward model also has way lower KL spend than doing this full PPO online RL thing. And I don't know the intuitions off the top of my head for where DPO falls, but with all of these preference methods, that's kind of a way to measure how much you're changing the model in preference tuning: what is the KL difference across a controlled set of prompts? There's a problem where, in all of these numbers that I'm saying, it's on a different set of training prompts. So it's almost like we need to have a standard, like, these hundred prompts are what we evaluate our KL distances on across different domains, to really see how much the model is moving in general. But I think that's a good way that people can look at it.
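
A sketch of how the KL "budget" typically shows up in RLHF code: a per-generation penalty, built from the log-prob difference against the reference model, subtracted from the reward. The coefficient here is an illustrative value, not a Tulu 3 hyperparameter.

```python
import torch

def penalized_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Sequence reward minus a penalty proportional to how far the policy has drifted
    from the reference on this generation. kl_coef is an illustrative value; tuning it
    (or an explicit target) is what sets the effective 'KL budget'."""
    per_token = policy_logprobs - ref_logprobs   # approximate per-token KL contribution
    kl_spend = per_token.sum(dim=-1)             # total divergence "spent" on this generation
    return reward - kl_coef * kl_spend, kl_spend

reward = torch.tensor([1.0])                      # e.g. a verifiable reward
policy_lp = torch.randn(1, 20) * 0.1
ref_lp = torch.randn(1, 20) * 0.1
penalized, spent = penalized_reward(reward, policy_lp, ref_lp)
```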

Nathan Labenz: That one that you're anchoring to is the original base model or it's the instruction tuned model? Instruction tuned. And that just stays, no matter how many like iterations you're doing, the reference model stays the same the entire time.

Nathan Lambert: There are experiments where you can do one round of PPO with your reference as the SFT model, and then you can start another one with your reference as the first PPO model, or DPO. So people do do things like that. It's just not as popular. There's definitely some papers there, which is like resetting your DPO reference to give you more ability to learn. I don't remember the names off the top of my head, but that is a sort of idea that people fiddle with.

Nathan Labenz: And these KL divergences, they are, the penalty gets, it's like a square function, right? So the penalty gets bigger, the more you sort of drift. And that's kind of where this like budget notion comes in is like, you sort of have a, a limit as to how far you can move before that term, I guess, starts to dominate basically.

Nathan Lambert: There's an approximation that you use in a lot of RL; you use an approximate rather than a full KL, where for the approximate, you're essentially just... I pulled it up. I have this silly RLHF book that I'm working on, which is mostly notes and fundamentals of RLHF. And the regularization is this. There's a difference between what a lot of people do in these implementations as an approximate KL. I think that, like, John Schulman has a blog post on this that everyone references, where the approximation is you just subtract the log probs of a generation for the policy model, you do that minus the reference log probs, which is a different function than the full KL. I don't know if it's a change. It seems like that's a change between square and not, just kind of thinking about it off the top of my head without having the exact KL equation.
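
For reference, the two estimators being contrasted, in the spirit of Schulman's "Approximating KL Divergence" note: the simple log-prob difference most implementations use, and a lower-variance alternative. This is a generic sketch, not code from the RLHF book mentioned above.

```python
import torch

def kl_naive(policy_logprobs, ref_logprobs):
    """The estimator most RLHF code uses: mean of (log p_policy - log p_ref) over sampled
    tokens. It is linear in the log-ratio, i.e. not squared."""
    return (policy_logprobs - ref_logprobs).mean()

def kl_low_variance(policy_logprobs, ref_logprobs):
    """The lower-variance estimator from Schulman's note: (r - 1) - log r with
    r = p_ref / p_policy, evaluated on samples from the policy. Always non-negative."""
    log_ratio = ref_logprobs - policy_logprobs
    return (torch.exp(log_ratio) - 1.0 - log_ratio).mean()
```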

Nathan Labenz: Interesting. Okay. So there's, I mean, presumably all of these things could work, and probably have, you know, worked to some degree, and have some pros and cons. But yeah, that sharp bottom versus kind of flat bottom is an interesting distinction. So I still want to get a little bit better intuition on how these multi-token evaluations and multi-token comparisons work, and maybe we could break this down also by PPO, DPO. As a user, you know, if we start with the human experience, I either am presented two generations and asked to say which one I like more, or I might be presented one and asked to rate it one to seven, or maybe I'm presented two and asked to rate them one to seven, or whatever. So the signal is derived from that, right? Then we, in a lot of these schemes, train a reward model on that so that we don't have to always have a human doing the evaluation. So now we have a reward model that attempts to score in a way that the human would score, but it's still giving you this sort of "this one or that one is better," or, you know, a one-to-seven kind of rating of these outputs. The critical distinction there, I think, that I'm still kind of struggling to develop a clean story for, one I can tell my friends who aren't so obsessed with this as I am, or certainly not as you are, is how does that score translate to updates when it has to go through this multi-token generation? With the pre-training and the instruction tuning, I have this simple: you predict one token, you can score that one token, you can update on that. Everything else is aggregation of that. Here, the score applies to this whole generation. You know, there could be weird forks in the road, or, you know, even if you just add an extra token, now everything is kind of off by one token. How do I compare the base generation to the current policy generation, given that I can't do it on a token-by-token basis anymore?

Nathan Lambert: So the RL math is essentially: you will give it a label based on the whole trajectory, and then, if you're doing PPO, the value model will take a generation and it will output a value per token. Then you will have the label from the reward model, which is like the reward from your environment, and that is used to update the value model to kind of have these per-token updates. In PPO, the policy is actually going to be taking attribution for every token in your batch and doing updates based on what they think will be a long-term, better generation. That's very different than DPO. I don't have a per-token understanding of what is happening in DPO, because, if you think about the loss function, essentially all of the tokens are kind of grouped into chosen or rejected, and it's trying to increase the margin between chosen and rejected log probs across a sequence. Which I think, early in the DPO days, raised questions about length issues, because what you're looking at is the sum of the log probs. They're all negative numbers, because it's a log of a probability, they're all less than one, so they all become negative numbers, and you're summing them up. And it's like, what does that do? Because you're therefore increasing the margin between these two negative numbers, which is what the loss is doing. So I don't exactly have as clear of an intuition there, but you kind of look at the losses and see how those are different. Whereas there's a bit more per-token attribution in RL versus DPO. And they're both substantially different than SFT.
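
A minimal sketch of the DPO loss as described: sequence-level summed log-probs for the chosen and rejected responses, a margin of log-ratios against the reference, pushed through a sigmoid and scaled by beta. Toy numbers only; batching and the length-normalized variant mentioned earlier are omitted.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Each argument is the summed log-prob of an entire sequence (hence the length
    sensitivity discussed above). beta controls how tightly the policy is held to the
    reference; 0.1 is just a common illustrative value."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy numbers: the policy already prefers the chosen response relative to the reference.
loss = dpo_loss(torch.tensor([-55.0]), torch.tensor([-70.0]),
                torch.tensor([-60.0]), torch.tensor([-65.0]))
```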

Nathan Labenz: In the PPO side, what's the original ground truth that translates a generation level score to token level value?

Nathan Lambert: That's often glossed over, I feel like. It depends on your training setup. So in some setups, you'll start the value model from nothing. So then it'll take a few hundred steps to take the scores from the reward model to kind of warm up the value model. You can see this if your value model is a random init: your value model loss has to converge before your policy will start changing notably. There are other ways where you can init a value model from a reward model or an SFT model, which is what we do on this verifiable-outputs work to make learning a little bit cleaner. And I actually don't know how that changes the value model exactly. I think that essentially there must be some mapping between the log probs and value that it is doing, which is why you would warm-start with the model. But I don't know exactly what that mapping does. Essentially it takes steps to learn what this represents, and it's Bellman updates; it backprops from this final reward, which is how the value model learns. So it takes time to get to the earlier tokens, which is what the warmup is doing.
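
A sketch of how a single end-of-sequence reward turns into per-token learning signal via the value model, using generalized advantage estimation with only a terminal reward; the coefficients and the all-zero value model in the example are illustrative.

```python
import torch

def per_token_advantages(values, final_reward, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for the common RLHF case where the only nonzero
    reward arrives at the final token. `values` is the value model's per-token estimate
    for one generation (shape [T]). Coefficients are illustrative."""
    T = values.shape[0]
    rewards = torch.zeros(T)
    rewards[-1] = final_reward                    # terminal, sequence-level reward
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# With an untrained (all-zero) value model, credit decays geometrically back from the end.
print(per_token_advantages(torch.zeros(6), final_reward=1.0))
```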

Nathan Labenz: Fascinating. Is there an intuition, or even just an observation, as to what sorts of tokens get what kind of value? Like, presumably something like "the answer is" would have low value, and then the actual answer would have high value.

Nathan Lambert: Or "the answer is" might have high value, because the model knows the answer normally follows it. Something like that. I think it's probably not easy for us to build intuitions on this, because the feature space is so big. How are you going to build intuitions over it? In o1's case, is "wait" a high-value token? But if it were too high value, it would just go "wait, wait, wait" and never answer. So there are definitely some weird things, and I don't know how clear the intuitions you can get out of it are.

Nathan Labenz: Okay, fascinating.

Nathan Lambert: All the tokens are conditioned on the previous tokens, so the value is conditioned on everything that has come before it. It's not just the token, it's the token in context.

Nathan Labenz: Have you looked at, this is a bit of a digression, but have you looked at the Entropix project? I haven't studied it in detail, but I do think it's a good direction. It seems like there's some connection here. I don't think it's necessarily a total game changer, but it does feel like at least the seed of something that could be a big deal.

Nathan Lambert: Inference-time compute is very related to RL, where having a good model from RL that can attribute value is very useful when you're spending more on inference. And we'll see all of these things continue to proliferate. I'd put Entropix more in the inference-time-compute bucket, but there are very similar intuitions. If you sample more within a batch, if you sample more completions from a prompt in RL, you're essentially exploring more to potentially find a high-value completion. And if you're doing inference-time compute, a lot of it comes down to how you sample from the language model and encourage it, re-weight it, interrupt it, or change the sampling to find the right completion. That's where Monte Carlo tree search type things come in, searching over reasoning steps, or all of these Q-star or whatever type of things, if you want to go down that direction. It is related, but I think you can draw a boundary between training and inference and that's okay.
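As a concrete illustration of the "sample more completions and re-weight" intuition, here is a minimal best-of-n sketch; `generate` and `score` are placeholders for whatever sampler and reward or value model you have, so treat this as a schematic rather than any particular system's method.

```python
def best_of_n(prompt, generate, score, n=16):
    """Inference-time compute sketch: sample n completions and keep the one a
    reward/value model scores highest. `generate(prompt)` returns a sampled
    completion; `score(prompt, completion)` returns a scalar quality estimate.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: score(prompt, completion))
```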

Nathan Labenz: Yeah, the thing that's bringing it to mind for me is just this sense of looking at the logits on a token-by-token level and sort of inferring from there, is this going well or not? Is that what Entropix is doing? Basically, and I should say I'm a little out of school on this because I've only studied it a little, it has multiple modes: if the model is confident for multiple tokens in a row, it just proceeds. If the logits start to indicate low confidence, then it will manually inject a follow-up question in some cases. If that doesn't work, it'll double back and try again.
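For what that logit-watching idea might look like mechanically, here is a rough sketch of branching on next-token entropy. This is a paraphrase of the general idea, not the Entropix project's actual code; the threshold and the return values are made up for illustration.

```python
import torch
import torch.nn.functional as F

def confidence_signal(logits, threshold=2.0):
    """Sketch of branching on token-level uncertainty.

    logits: (vocab_size,) next-token logits from the model.
    Returns "proceed" when the distribution is peaked (low entropy) and
    "intervene" when it is flat, at which point a sampler might inject a
    clarifying question or back up and resample. The threshold is arbitrary.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    return "proceed" if entropy.item() < threshold else "intervene"
```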

Nathan Lambert: Well, I would say that something like this is what o1 could be doing. We see how easy it is for RL to go off the rails. If you run RL for a long time on something like IFEval, which is these constraints, it literally does exactly the constraints. It's like, "please include this word in your answer and make your response 100 words or longer," and it will literally just print the word 100 times and then stop. Stupid stuff like this. And it's like, well, maybe if you can do RL plus have some more fine-grained control over the generation, then, if you squint, you just have to figure out how to put these things together.

Nathan Labenz: So one more thing on DPO versus PPO before we move on. Just to make sure I'm understanding correctly: PPO has the reward model, and there's then this process of translating the reward score.

Nathan Lambert: The right intuition is that the reward model is trained so you can do PPO, but in the RL framing, the reward model is actually the environment, in a way. The environment in RL is supposed to be what returns the reward. So the reward model is a very constrained environment: your inputs to the environment are prompts, and your actions are completions. It's a totally broken RL environment, but the reward model is not part of the training update. It's more of an isolated thing that we just happen to train.
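The "reward model is the environment" framing can be written down almost literally: one prompt per episode, one completion as the action, one scalar score as the reward. This is a schematic sketch of that framing, with `reward_model` standing in for any learned scorer or verifier, not a piece of any real training stack.

```python
class RewardModelEnv:
    """The framing described above: the reward model *is* the environment.

    One episode = one prompt; the action is a full completion; the reward
    model's scalar score is the only feedback.
    """

    def __init__(self, prompts, reward_model):
        self.prompts = prompts
        self.reward_model = reward_model
        self._i = 0

    def reset(self):
        prompt = self.prompts[self._i % len(self.prompts)]
        self._i += 1
        return prompt                      # the observation is just the prompt

    def step(self, prompt, completion):
        reward = self.reward_model(prompt, completion)
        done = True                        # single-step "episode"
        return reward, done
```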

Nathan Labenz: Gotcha. And then, to compare against DPO: in PPO there's this process of mapping or translating a score to a token-wise value, which then goes into the backprop. Whereas in DPO there is no reward model, right? It's just contrasting two outputs and some fancy math, which nobody seems to have a great intuition for. Maybe the original authors do.

Nathan Lambert: I've talked to them a lot. I feel like I had more of an intuition for this six months ago when it was really in vogue, but... I do think it's still changing the tokens individually. In that way it is doing something similar, but it's not doing it through a value model. It's only mediated through the loss, which looks at the sums of log probs. So it kind of removes that intermediate step.

Nathan Labenz: Yeah, so that's simpler, requires less compute, requires fewer GPUs, and just for whatever reason seems to consistently fall slightly short of...

Nathan Lambert: Yeah, I mean, in our setup, we tried to throw PPO in at the end and we haven't gotten PPO to beat our DPO setup again. So it's like, okay, we went through this big process where we saw PPO is better, and I still think if we kept hammering at it, we could probably make PPO better. But experimentation time and data is so much more important that we're just not going to take four times the compute in experimental time. It's not worth it. At this point, maybe it's half that we don't really know how to train a reward model, because we haven't been focusing on it as much, but also half that it's just harder. So I do think it's funny. It's such a great, silly debate. It just never goes away.

Nathan Labenz: Yeah, but that's a great zoom-out corrective for you to give me there, too, because as easy as all this stuff is to go down a rabbit hole on, data quality matters more than which algorithm you choose, and probably matters more than anything. Maybe you could give some intuitions there.

Nathan Lambert: Yeah. We baselined Tulu 2 versus Tulu 3 and Tulu 2 versus Tulu 2.5. Tulu 2 to Tulu 2.5, which is DPO to PPO, is like 1%. And then on our evals, Tulu 2 to Tulu 3 is like 14%, which is all just data curation and process and stuff like this. So there's your number. If you want to go off and be an algorithmic, academic person, in most cases you're going to be fighting over one to two percent, versus making really specific data that you care about, where you can get like 10x. I'm not surprised. That's probably what industry does too. These post-training teams are like: you're the person generating Python code data, you are a full-time Python code instruction and preferences and prompts and filtering lead, and you generate crazy good data on one really, really specific thing that may or may not be in the actual eval suite. We have people on data, but it's not like they're spending a month on this one really specific thing, with 12 people doing that. No academic lab will ever do exactly what the closed labs are doing at that level of depth.

Nathan Labenz: Yeah. I mean, it's super resource intensive. I guess you said you trained a thousand instances of 8B in this whole project, right?

Nathan Lambert: If you look at our eval leaderboard, we have like a thousand-plus models. Not all of them are 8B; some of them are tests. But I think it's the right ballpark. If you want to get this level of results, that's probably about the process you will need to go through. You could be twice as good and then it's 500, but that's not that big of a change.

Nathan Labenz: Yeah. So I just want to get a little bit of a procedural understanding of what the overall trajectory of the project is, and maybe we can contrast it against something that I do and that I've talked about in previous episodes, which is just fine-tuning a model for a specific task. I've got a whole lecture and how-to on that. I tell people to typically start with 10 gold-standard examples. The quality matters most: just staple your pants to the chair and work on those 10 examples until they're the absolute best you can make them. That's small enough that you could probably do few-shot in a lot of cases; maybe you'll want to fine-tune, depending. And then I typically tell people to expect three rounds of iteration: you do the fine-tuning, you find some weaknesses, you come back and augment the data. If it's generally not good enough, you need to 10x the data is usually what I tell people. If it's an edge case you just hadn't considered, you can maybe patch it with five or ten examples of that situation and how to handle it correctly. Three rounds, maybe more if there are a lot of edge cases you want to handle over time, typically should get you there if you're doing one task. Obviously a huge difference here is that you're doing a whole general-purpose model, so give me the fifteen-minute version, partially because I'm just reflecting on it and it's very interesting: this goes all the way up to the CEO, right? How do you even sit down and manage the fact that you have extremely excited students?

Nathan Lambert: A lot of AI2 collaborates with UW, which is total best-of-the-best, super motivated students who want to do something, but they can't do these whole projects. And then you have people like me, or roughly people like me at my same age: I'm a junior-to-mid-seniority research scientist, did a PhD, have some experience, but I'm also not a professor. I have more visibility, so I end up being effectively more senior because I have distribution and I have done things. But you also have people who are research scientists full-time; they have some projects, and they can focus on this and do a lot of it. How do you balance these incentives? It's very hard. I would say that for this project, in an academic sense I would be last author, but in an industry sense I'm first author, because I just have to do all sorts of random things. It's incredible: hold everything in your head and try to keep track of everything at once. There's admin help, and we have meeting structure to help with this, which is like the philosophy section of the paper: what are we trying to do, and are we on track going in this direction?

What it starts with is, I would say, four to seven leads who have done a substantial amount of first-author-level work in a normal academic setting: making a lot of data, running a ton of experiments, building the whole eval setup, all these things. At the beginning we kind of sit there and ask, what do people want to do? We haven't talked about this, but one of the things we really wanted to see is: Llama 2 and Llama 3 did rejection sampling, so can we make rejection sampling work? That's two people, and all they're doing is trying to make the numbers go up with rejection sampling. This is a negative result we have. We did a lot of things, and there are some interesting findings, like: if you're using an 8B model, the generations are less good than the instructions we start with. So if we generate new instructions and then do SFT on them, it kind of makes sense that it would make the model worse. So we had these two people doing instruction tuning. Some people started with comparing these DPO alternatives, DPO, norm DPO, whatever all these things are, to figure out what the settings are. And some people were asking, where are the open datasets, can we get them, let's start training on open datasets again, and where are we falling short of Llama?

So the start is this algorithmic phase, and then you get your bearings: okay, where are we not doing well against Llama? Are things just not working? Let's kill them. Then it shifts into the second phase: let's try to get really specific data, let's kill some of our long-tail projects, and we're working on specific data. And you continue to mix, and you continue to make your eval suite more stable. When we started the project, the eval suite was not that stable. It's like, huh, we're getting crushed by Llama on this thing, but the formatting is totally bizarre, esoteric, because we tried to recreate the entire Llama eval suite in our codebase. So there's some evolution there early in the project as you converge on evals, but then it becomes much more exploitative, which is: we build new data for specific capabilities.
So people are kind of heads down building math data, building IFEval data. And then there's probably a trade-off where some people are working on SFT data and SFT mixing and some people are working on DPO and DPO mixing. And then soon there's this whole "does this RL thing work?" question, which can take two people doing the RL thing kind of on their own, but as part of the project. As the weeks and months go by, you try to finalize SFT. So SFT is finalized, and the data mix was probably locked about a month ago. Then we did more decontamination, so we have to train it again; oh, our 70B hyperparameters are wrong, we have to train it again; oh, we have to try model merging, we have to train it a few more times. A quick note on model merging: it's a safe bet to merge SFT runs from multiple seeds on the same dataset, but it can often be that just by running multiple seeds, one of your seeds for SFT is actually the best. So if you want a good SFT model, you want to train it three to five times across a few random seeds, which is pretty funny. It's just even more compute.

And then the last month has been mostly final touches and then full-on DPO, which is: we have to generate, we have to do this on-policy thing, which takes a lot of API credits and a lot of hands-on work. You take your SFT model, you run completions. There's a person who just owns on-policy preference data: you give them prompts and models, they use vLLM to make generations, then they use the OpenAI API to do LLM-as-a-judge, and then they say, here's your preference dataset. And we have different models, 8B, 70B, all the models. So this is the person on the team who just owns making preference data. Then that goes to a training person, the training person does DPO, and they pass it to RL.

To zoom out, you have people who naturally converge to different areas, and this is probably 10 to 15 people who are actively involved most of the time. At this size, we did not need strict delegation and management. If you go to the Llama size, there's a 100% chance you need delegation and management and rules about who makes what decision when. Effectively, I'm softly that person, making the call on what SFT mix is final. But there's definitely an interesting transition organizationally, which is: you cannot scale it more, because it becomes a mess if you're more than 10 to 15 contributors on one of these language modeling efforts. But then there's a big cost if you add in managers, because then you have to do a lot more delegation of decision-making and stuff like this. So in some ways, as you can tell by how I was describing it, a lot of this is somewhat chaotic and freeform, relying on people being in the weeds, in the details, and very happy to communicate with the various people they know need it. And a lot of that is autonomous. I can't be over Valentina's shoulder saying, get your DPO model to Hamish to do RL on. They just do that.
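A rough sketch of the on-policy preference loop described here: sample several completions per prompt from your own checkpoint, have an LLM judge pick winners, and emit chosen/rejected pairs for DPO or PPO. The `generate` and `judge` callables are placeholders for vLLM generation and an API judge; the pairing and order-randomization details are assumptions, not the team's actual pipeline.

```python
import itertools
import random

def build_preference_pairs(prompts, generate, judge, n_samples=4):
    """On-policy preference data sketch.

    generate(prompt) -> a completion sampled from the current policy (e.g. via vLLM)
    judge(prompt, a, b) -> "a" or "b", an LLM-as-judge preference
    Returns a list of {"prompt", "chosen", "rejected"} records for DPO/PPO.
    """
    dataset = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(n_samples)]
        for a, b in itertools.combinations(completions, 2):
            # randomize order so the judge's position bias averages out
            if random.random() < 0.5:
                a, b = b, a
            winner = judge(prompt, a, b)
            chosen, rejected = (a, b) if winner == "a" else (b, a)
            dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```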
And a lot of those processes are messy. We describe our standup meetings as chaos, and it's very funny, because we just sit down, we write our updates, and then we do 50 minutes of chaotic technical updates based on what the heck people are doing. It's information overload in a lot of ways. So I don't know, I think that accurately reflects the chaos, but there is some kind of cadence to it: what is your goal, are you on target, how do you make decisions to kill sub-areas of the project, like kill rejection sampling, kill long context, kill multi-turn? These are just things that, as you get better numbers, you know you need to get the model out, so you just have to keep reducing entropy and reducing entropy. It kind of goes to this shape where you explore in the middle and then you get momentum and you collapse onto a final model. We can do that much faster than Llama can, because getting Llama out is a bigger issue for them legally, strategically, and so on. So they do a lot bigger of an investment cycle to get these models out, I think even more so than OpenAI. OpenAI and Google and others release these new API models every couple of months, whereas with Llama you get one or two Llamas a year, they drop the weights, and it's final. So it's kind of interesting to think about the distributions, and there are a lot of policy discussions on weights being final and out there, but that type of stuff is also informing the development cycle, which I haven't talked about at length.

Nathan Labenz: Yeah, that sounds fun. It is. That's a lot. Multiple follow-up questions. So when you're doing the day-to-day experiments, what sorts of... it kind of reminds me of chemistry. I was a chemistry research assistant as an undergrad in reaction development, and there were some commonalities: we'd throw a bunch of reagents in, run the reaction, then come back a couple of days later and measure how well it worked, and at times we were surprised. It was a high-dimensional space, so you could do more acid, less acid, run an experiment with varying amounts, then vary the solvent, and all these different things. It sounds kind of similar, where I'm imagining that the thousand runs were actually like a hundred experiments of ten variations each.

Nathan Lambert: Yeah, so this is the tension: how do we be scientific at the different stages? There are different things we can run ablations on, and some of them, like mixing, are just "here is our process, we did this," which is very much unacademic. But on the RL side, you can compare different initializations for the value model, you can compare different regularization, these specific hyperparameters. So there's definitely both. You need people who can operate in this chaos, which is just intuitively, where do we go? But you also need people who are very scientific: yes or no, does this work, based on clear results. So it has the full spectrum there. And I mean, I agree. My background is in microelectromechanical systems and other kind of EE stuff before shifting into AI, so there is that messy, in-the-lab nature of just trying to build the thing, or in your case running a reaction. It works or it doesn't, in some cases.

Nathan Labenz: So you're literally just... I guess maybe a better way to ask the question is: how much juice are you finding in different dimensions of exploration? You could vary the mix of the instruction fine-tuning, or you could vary some hyperparameters. What are the sort of classes of things?

Nathan Lambert: Yeah. I think most of it's data. You kind of find hyperparameters that work and they don't really change. But within that, I think there's still a very high level of juice. We're doing a general model, but I still think we could fit more evals into our mix and improve performance without substantially changing the size. Like you were saying with your recipe, the amount of data you need to target a specific eval is actually not very high, and there's a lot more you can do. Given that fact, there's a lot of post-training that is not really touched, and I think the opportunity is high. It's mostly about setting yourself up to have an eval feedback cycle. I talked about killing these different capabilities, and that's because we didn't have evals that we liked. I'm 100% sure we could improve them, but it's just much easier in a distributed environment when you have a source of truth, which is your evaluation. Because the big thing is, how do you develop character? Character is something you don't have an evaluation for, and our models, if you compare them to Claude, will not have as consistent a character. Especially for models going to many users, character is important. And I find it very fun, but I don't know how to motivate the right people to spend four months on, like, let's get the chef's-kiss emoji moments really spot on. The CEO is going to be like, dude, what the heck are you doing? What is happening here? So in that regard, that is a split: academia works with the evals, but at Anthropic, someone like Amanda Askell has a lot of final say on what Claude is supposed to be.

Nathan Labenz: Yeah, that's really fascinating in multiple respects. Okay. Let's talk about the synthetic data and LLM-as-judge situation. I get that, obviously, it's super resource intensive to create this stuff with humans doing it. Yeah. One percent typically is kind of my...

Nathan Lambert: Yeah. What did we spend on API costs? Maybe even less than that at this point. It's far enough along that it's fine. Everyone here cringes when we look at our OpenAI bills, but I would guess it's over 50 grand we spent on LLM-as-a-judge credits, roughly, for this project. If you were doing that with humans, it would be millions. It still hurts, though; it just goes to OpenAI and we're doing open-source research.

Nathan Labenz: Yeah. I mean, that's like ten billion tokens, right? Because it's like, you know, a couple...

Nathan Lambert: Yeah, the LLM-as-a-judge is weird because you throw most of it away. It generates a bunch of tokens and you take one, which is the answer. So it does chain of thought or something and then you take the token. I don't know, it's hard to attribute exactly where the tokens are going, but yeah.

Nathan Labenz: Yeah, it sounds like you may have, if my back of the envelope math is right.

Nathan Lambert: It's still way less than the GPUs, is the thing. All the people I work with, we're dealing with a world where our GPUs... I mean, you had some sense, and I didn't answer the question: it's effectively on the order of a few thousand H100s, not all H100s, there's some other compute. And if you live in a world where that is the resource your project is targeting, where we're targeting big-impact projects with $10 million plus of assets per year in GPUs, then the $50,000 API bill is not "oh, we're really pushing it big this month." It doesn't matter, which makes it very odd, because a lot of these people are students and I wish I could pay them more in that context. I'm also not paid at frontier-lab levels. This is why all the compensation stuff is so wonky in the labs when they're spending so much on GPUs: what they're paying their key researchers doesn't matter at the end of the day, which is so bizarre and hilarious. But I guess it's good for the people making the millions of bucks; sure, it doesn't hurt them.

Nathan Labenz: So just staying on the synthetic data and LLM-as-a-judge for a second: how would you describe the... I'm an AI enthusiast, but when it comes to using LLM-as-a-judge, I have a little bit of an intuitive misgiving, where I'm like, are we too quick to delegate to the AIs the decision on what is good for an AI to do?

Nathan Lambert: So that was a project that I'm not an author on, but I really liked the framing. I'll do the framing and then comment on results, which is a bigger problem than just the paper. The framing is: if you can have humans and LLMs do preference data, which do you send to humans versus LLMs? And that I think solves a lot of the problems, because there are definitely things we want humans giving the answer on, but there are a lot of mechanical tasks we can outsource to LLMs. This paper is probably one of many things that will come; I'm sure labs are doing stuff like this as well. The problem with it is that there are a lot of papers like this, and it's like, why can't we have academic papers that show that humans are important? It's the same thing here: we look at our DPO results on a controlled dataset, human preferences versus LLM preferences, and the LLM preference scores on the evals are higher. And it's like, what are we missing? I always look at the paper and everything seems fine, but I disagree with the final conclusion, because I feel like it's not going deep enough, and I don't know how to do the experiment to go deeper. So that's kind of me reiterating, in a different way, the thing you're feeling: it doesn't seem like, as an open research area, we have gotten to the bottom of what preference data is doing to the models and why we can just not use humans. So I'm like, oh, this sucks. But I do think eventually we will continue to chip away at that understanding, which goes back to this: humans are high-noise, low-bias; machines are low-noise, high-bias. Eventually we will understand what these biases and noise represent, but we do not, and it's fine. I think RLHF has shifted away from the safety, "what is a preference" area; that's not discussed as much now. When you're talking about normative questions and these sociological things in quote-unquote preference tuning, it is more important to be in touch with these types of human-versus-machine biases. If it's literally make-math-number-go-up, it's fine not to have as clear guidance there. The thing I've wanted to see is this: you give instructions to annotators, and I would like to see how well the model reflects the instructions given. If you collect preference data with different instructions, how does that change the final model? It's a very complex attribution, but it would be nice to see.

Nathan Labenz: But the sort of layman's takeaway at the moment is at least like the best frontier models do outperform humans as measured by the preference data that they create gives you downstream better eval scores than the human preference data.

Nathan Lambert: Yeah, I think the caveat is we don't have the same human preference pipelines that the labs do. This is like whatever pipelines we have access to. But yeah, that's the conclusion.

Nathan Labenz: Of course, that comes from the fact that they originally did this with humans and put a ton of time, energy, resources, blood, sweat, and tears into it, and did a good enough job that now it's really hard to replicate what they did with humans. Who knows, their human effort might have been a little better than the current AI effort, but it's just really hard to replicate the quality of the human mobilization they had to get there in the first place. I'm speculating, obviously.

Nathan Lambert: That's why I've described RLHF as having in some ways forked between academia and industry. Industry cares much more about Chatbot Arena and user retention than academia does, and that will probably widen, and that's fine. I don't think we will ever have the ability to hill-climb on Chatbot Arena, because we don't have 100 million users. It's fine. I would like to know what they're doing, but it'll take time.

Nathan Labenz: Okay. Another thing I wanted to go back to, which really caught my ear before we went a different direction, and which might in some ways be the most important observation you mentioned, was in that final stage where you're doing the, what's it called, reinforcement learning with verifiable rewards.

Nathan Lambert: It might get changed in the paper, but that's the name that we're going with right now.

Nathan Labenz: But this is essentially: did you get the math problem correct, or does your code work? I guess you didn't have code specifically, but this obviously seems like a huge trend, just for cost and scalability reasons. I wonder if you would agree with this characterization: I was telling somebody the other day that we probably should expect superhuman performance on things like code before too long, because the objective and fast feedback mean it's not capped at what a human programmer can do. Whereas superhuman poetry, first of all, is ill-defined, and second of all, the signal is going to be noisy and slow kind of indefinitely.

Nathan Lambert: Yeah. Yeah. I agree.

Nathan Labenz: Okay, so in doing that process, I believe it was within this reinforcement learning with verifiable rewards process that you started to observe human-like reasoning, where it would double back or check its results again. This is an emergent phenomenon?

Nathan Lambert: Yeah. It's in one math domain. But to be clear, we trained it for way longer than is practical, so at this point of training, normal evaluations have totally tanked. The model is not as good at normal things. It will still converge to good math answers in this one specific eval we're training on, but its whole chain-of-thought process kind of got borked. And it's just funny when you see the same keyword that went viral with OpenAI o1, where it's like, "wait, let me check that." I was just like, this is so funny. But in the same way, it's not that surprising. It seems like an RL thing to do. The other example when o1 came out was, oh look, it switched to French and then switched back. To me, models doing really funky things is not surprising, and it's not surprising that it's the RL part of it, because none of our other loss functions are encouraging such weird stuff. But also it's an n of one, so it's much more just for fun and for piecing things together over the long term than definitively saying that model is anything like what o1 is doing. I do think the same type of training approach probably applies, where they have some sort of verifier and then they do a lot of RL. They do this on many more domains at once, probably with a mix of deterministic and learned verifiers, and they probably do way more RL training than we are doing, with some tricks. But I do think it's not unreasonable to be excited about that. And there's good reason to think there's some smoke and mirrors about o1, where they made it seem more complicated than it is.
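For readers who want the "verifiable reward" idea pinned down, here is a miniature version for math: a binary reward from comparing an extracted final answer against the reference. The last-number extraction rule is an illustrative convention, not the actual parser used in this work.

```python
import re

def verifiable_math_reward(completion, ground_truth):
    """Reinforcement learning with verifiable rewards, in miniature.

    Instead of a learned reward model, the reward is 1.0 when the final
    numeric answer in the completion matches the reference answer, else 0.0.
    Taking the last number in the text is an assumed convention for the sketch.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0

print(verifiable_math_reward("... so the total is 42.", "42"))   # 1.0
print(verifiable_math_reward("I think the answer is 7.", "42"))  # 0.0
```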

It's a breakthrough, and I'm sure they did a lot of really interesting, novel things and hacks to get there, but we've seen all the labs release the same things again and again. There are going to be o1 equivalents from Anthropic and Google within five months, or whatever you want to say.

Nathan Labenz: Okay. That is really interesting, kind of informed speculation as to what they are doing. But I just want to make sure I'm understanding this "wait" observation correctly. This is a purely emergent phenomenon, in the sense that this was not something that you gave it example data to learn from?

Nathan Lambert: I find it very unlikely that it came from somewhere in our data. We didn't do SFT on o1 chains of thought or anything; we can't get o1 chains of thought. We didn't do any intermediate edits to chains of thought or anything like this. People that I think are smart have said they've seen Llama do the same behavior if you really crank the temperature up. So in that regard, it's not that different. Even the stupid Reflection 70B model is, in principle, a similar idea. So there are a lot of ways to induce this behavior. This is a very, very open-ended one that came out of the RL loss on some verifier. That's why I was like, oh, this is so... we stumbled upon o1-like behavior without even meaning to. OpenAI deployed this thing to 100 million users; it has to be a somewhat stable recipe. It's not like they had some cracked checkpoint that did this and said, we're going to deploy this to everyone. They can definitely get this behavior many times over. I'm interested in that.

Nathan Labenz: Yeah, I think that's a super fascinating little tidbit. Of course, all these things are kind of in there, or on the verge of being in there. But it does make me wonder... I think a lot about the grokking results, and it seems to make sense to me to understand behaviors as on a spectrum: on one end, the fully stochastic parrot, just raw correlations between tokens, and on the other end, a fully grokked actual algorithm that has been traced into the weights through enough time. And of course, for any given behavior, it's not at all trivial, and maybe borderline impossible, to tell which is which. But when you start to see these things from such a simple signal as "you got this problem right, or you got this problem wrong," and you start to see these qualitative behavior changes, and especially when it happens, as grokking did, in what would traditionally be considered an extreme overtraining regime, it does start to at least paint a suggestive picture that there's potentially some sort of, I won't go as far as grokking for this, but a phase change happening, where it's entering a regime where, I think it seems fair to say, it's beginning to actually reason.

Nathan Lambert: It's the weird, expressive RL behavior, I would say, where the generation fundamentally shifts to do weird stuff. Most internet text just moves forward, whereas this kind of RL behavior is much more cyclic and strange. Which is why it's sad we can't see the o1 reasoning traces for general use. Once we have something like that, making these analogies will be much more compelling. If we could run o1 on our same prompts and look at way more of them, I think it'd be a lot easier to say, yeah, this is a very similar behavior. You then have to scale the training regime to be stable, do this in every domain, and do it repeatedly.

Nathan Labenz: Yeah, which obviously is not trivial. But this is something you observed without any intent to see it.

Nathan Lambert: Yeah. It's just the two RL leads, Hamish and Costa, poking around generations. Because when you do RLHF and RL, you need to look at the generations to make sure the models are working; it's just a normal thing. And they're like, oh, this is a really funny thing. And I was like, I don't want to tweet this, this is so silly; the o1 tinfoil-hat community will love this. But in that respect, it's not like we're searching over the generations for the word "wait" or something. We found a couple of them. I'm sure there are many more if we look for them.

Nathan Labenz: Yeah, there's no substitute for digging into the data. I can't repeat that mantra enough, probably. If you were to take everything that you learned and strip away all the process of the experimentation and learning, what's the idealized final process? Is it as simple now, if I were going to redo your thing on a different base model or whatever, as basically one day of 32 H100s for supervised fine-tuning, and another day for the next phase, and another day?

Nathan Lambert: I think you need to do a few mixes. You need to do a few informed mixes based on the base model and what capabilities it needs more or less help with at each stage. But you can probably start from the same superset and then do a few experiments with more or less of various behaviors. So it's probably a few cycles per base model to make sure things look right if you really want the best performance. You can take these datasets kind of off the shelf and use them, and you'll probably get like 80 to 95% of the performance on a given base model. So depending on how much you care, it's pretty fine. And I would say it's similar at SFT, DPO, and RL. The interesting thing with RL is that we don't quite know how the ceiling is defined per base model. If you do RL on a less good SFT model, we have found that on Llama, for example on 8B, RL for GSM8K will always saturate at 87 to 88 on GSM8K. That seems to be a fundamental limit of the base model. No matter what we do at SFT or DPO, we can get GSM8K to something like 85 reliably without too much degradation. On different OLMo-based models, like the OLMo model from July, we bumped it from 60 to 75. We don't know what defines these saturation limits, but as you get better at having more training stages, a lot of this take-it-off-the-shelf-and-train approach will look different. If we know we can really boost GSM8K later, then we don't need that SFT training data, and maybe we can use that SFT budget in a different way. But that type of sophistication is something we haven't explored much, so in my mind the thing to do is reflect on this and think about what you do next. It's a very different thought process if you really know you can recover certain abilities at different times, where it's not just argmax at every stage. And that takes a lot more experimentation that we haven't done.

Nathan Labenz: So most people doing stuff in the world... I don't know this for sure, but my general understanding is that most actual industry projects are not trying to create general-purpose chatbots. They're trying to create something that fits a specific need in a specific context, kind of maxing out one or possibly a few different tasks. What advice would you give to people who are like, okay, how do I map this onto my probably much simpler situation? Should I do one model per task and just keep it simple? If I have five different tasks that are kind of related, should I try to do them all in one mixed-data fine-tune?

Nathan Lambert: It depends on the interface you use. If you can always just choose the right model, you can do one per task. I think there's a lot of interest in general interfaces right now. Some of this will change in the RL stage. I'm signed up for a NeurIPS talk on AI engineers, and I'm going to try to figure out a worldview on how to come up with these RL verifiers for different tasks. Because I do think that if you have a verifier and the distribution matches your task, this RL stuff will just kind of work. And it would be really interesting to see more engineering-y and less research-y people try to adapt this and just take it. We have the early days of what we call an LLM gym, an open-source repo where you can add different constraints and then just do RL on it and take the model. You could add these domains; you're essentially adding domains for RL on language models. So I think that's the newest frontier. Some of this SFT single-domain or few-domain stuff, you can try it and see, and it's not that interesting. But if we can unlock RL with verifiers for so many different niches, it'll really be the next moment in this fine-tuning-specific-language-models narrative.
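The "LLM gym" repo is only mentioned in passing here, so the snippet below is a guess at what "adding a domain" could look like, not its actual API: a registry of verifier functions, each mapping a prompt and completion to a scalar reward that an RL loop could consume. The example constraints echo the IFEval-style checks discussed earlier; the keyword-extraction convention is made up for illustration.

```python
from typing import Callable, Dict

VERIFIERS: Dict[str, Callable[[str, str], float]] = {}

def register_verifier(domain: str):
    """Decorator for plugging a new verifiable domain into an RL loop."""
    def wrapper(fn: Callable[[str, str], float]):
        VERIFIERS[domain] = fn
        return fn
    return wrapper

@register_verifier("word_count")
def word_count_reward(prompt: str, completion: str) -> float:
    # Example constraint in the IFEval spirit: response must be 100+ words.
    return 1.0 if len(completion.split()) >= 100 else 0.0

@register_verifier("contains_keyword")
def keyword_reward(prompt: str, completion: str) -> float:
    # Assumes the prompt embeds the required keyword after "include the word".
    parts = prompt.lower().split("include the word")
    keyword = parts[1].split()[0].strip(".,'\"") if len(parts) > 1 else ""
    return 1.0 if keyword and keyword in completion.lower() else 0.0
```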

Nathan Labenz: And I guess the general-purpose answer there would be LLM-as-judge, right? You can get more concrete where possible, but I always think about my company, Waymark. We do video creation for small business, and there is partially a ground truth, but honestly we don't have much trouble with the concrete, objectively verifiable stuff; that mostly works out of the box. That's making sure you're delivering the right amount of content and the right structure and so on. The real question is, what is a good script for a video? If I'm applying your lessons learned to that, I basically would just say: judge it, make a preference set, and go with that.

Nathan Lambert: Yeah. I don't know if I have the energy for the whole multimodal discussion, but there are definitely different guidelines as you go multimodal. Preferences are more powerful in things like images, audio, and video, because our intuitions, especially human preferences, are just much more expressive there than in text. So a lot of the things we're saying are very text-specific in that way, about capability in a very narrow sense.

Nathan Labenz: Yeah, that makes sense. I mean, I wouldn't expect... With images, I feel like you do sometimes get some pretty good feedback.

Nathan Lambert: You can probably do detectors for different types of objects, is what I'm saying. If there's a prompt, you can run a detector to make sure that certain nouns or entities in the prompt are in the image. And I bet you could... that is a way of doing precise instruction following for images, and I wouldn't be surprised if it's already done.
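A sketch of that detector-as-verifier idea for images: check what fraction of the entities requested in the prompt an off-the-shelf detector finds in the generated image. `detect_objects` is a placeholder for any detection model; this illustrates the concept rather than describing a known production setup.

```python
def image_instruction_reward(prompt_entities, image, detect_objects):
    """Precise instruction following for images, framed as a verifier.

    prompt_entities: nouns/entities extracted from the text prompt
    detect_objects(image) -> iterable of labels from any off-the-shelf detector
    Reward is the fraction of requested entities the detector finds.
    """
    detected = {label.lower() for label in detect_objects(image)}
    if not prompt_entities:
        return 0.0
    hits = sum(1 for entity in prompt_entities if entity.lower() in detected)
    return hits / len(prompt_entities)
```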

Nathan Labenz: Yeah, that's interesting. I find the Molmo model quite interesting for its pointing. It seems like Claude has now kind of got a pointing function too. That's cool; I don't need to dig into it too much today. I'm interested to see how it works for robotics.

Nathan Lambert: Because if you think of the robotics task, it bridges the gap between the VLM and the planner in a very nice way. The VLM could be asked a question like, what tool do I need to do X? And then it can just point at it rather than saying the answer, and the planner knows to go get the thing that's pointed at, or something like that. That is interesting.

Nathan Labenz: And some of my robotics friends are excited about it. Oh, one more in-the-weeds question, then one kind of zoom-out. On-policy: we've mentioned a couple of times the importance of on-policy data. My intuition for that is just that you want to be working from where the model is now, and I guess if you don't do that, it just doesn't work as well. Can you give a little more color on that?

Nathan Lambert: Yeah, it's very related to the PPO-DPO debate, where a lot of the argument from PPO proponents is that you're scoring and updating the model based on things it's generating itself rather than completions you got from elsewhere. And that's really the thing. It seems like it's a bit better of a learning signal if, within these batches and these pairwise comparisons, the tokens you're looking at are closer to the log probs of the current model. I think that makes a lot of sense, because what you do in, say, DPO is compute the log probs of the tokens in the completion, and that updates your weights. So it makes sense that it helps if the log probs are not in a weird space the model hasn't seen before. And it's nice to see that backed up by experiments.

Nathan Labenz: Yeah. Okay, cool. That makes sense. So finally, the big zoom-out, I guess: o1 is obviously out there now, and we don't get to see the reasoning traces. What do you think this implies for the future of proprietary, AKA closed, versus open? One of the striking slides in the presentation you sent me was basically very directly saying, we can't do this as an open organization if we don't have the frontier models to do all these generations and do this scoring; there's just not a substitute for that in the open world. Now we don't even have the o1 traces. So does this suggest the end of an era of open source catching up, or what do you think the trajectory is going to be?

Nathan Lambert: I think it'll end up being fine. I think it takes time and it will be different, but there's so much interest. I'm trying to hedge, but I do feel like it's pretty likely I end up doing a project in this, because there's just so much interest, and it's both exciting and new. In some ways, we have less of a disadvantage in time because we're seeing it and we're not waiting: with GPT, people entered around GPT-2 and got serious at GPT-4, whereas we're entering at o1. Time and time again, we see that people are very motivated in this area. It'll probably be trained differently than o1 was, but people will figure out how to elicit the same behavior. Once you have an existence proof, it helps. We still have Llama 405B, which is a very powerful open-weight model for reasoning and stuff like this. Mostly, I think, it just takes the iteration of building entirely new infrastructure that the community converges on. Part of the fine-tuning infrastructure will carry over, but I suspect there's going to be a whole, like, what is the Transformers library for o1-style models? There's something different there that I think is simultaneously exciting and maturing for the ecosystem: look, we need to approach these systems and this training in an entirely different way, and there will be a bigger spectrum of resources that go into doing it. Again, there's the pessimistic case, which is that we don't know all the steps along the way, but I'm pretty sure we'll get something close to it with the amount of excitement that we see.

Nathan Labenz: What do you think that looks like? Is it like elaborate prompting to generate these synthetic traces? I could imagine you take a 405B and you chain four prompts together and say, now look at it a different way, now look at it a different way, and kind of stitch those into some bootstrap bubble.

Nathan Lambert: Yeah, I feel like I'm losing the energy to go through this whole thing. This is a post I plan on writing: my plan for reproducing o1. I mean, you need to generate some seed data that looks like it. You probably have a language model look at chain-of-thought traces and modify and continue them. Then you need to figure out how to seed a model with those, and then have the right domains to do some sort of RL on it with certain verifications. So it's kind of: get some initial data that looks good, get a model that can do it a little bit, and then the feedback loop of actually having things that can verify, and having the update function reinforce that behavior, is I think the hardest part. We'll start to see efforts on generating this data and generating verifiers, and then this putting-it-together step, which is where there can be a lot of variability in performance. I think we already see people online doing open o1 reproductions, and I'm like, I don't even need to look yet. Until 2025, I don't even need to take them seriously. All the ones that are already out are not that serious, unless it's Google or Anthropic or something.

Nathan Labenz: Yeah. All right. Well, we'll check back in with you in a few months, perhaps.

Nathan Lambert: I need to wrap my head around this. I've been in a different type of post-training environment for a while, and that's kind of the next thing to explore, along with some better taxonomies of agents and stuff like this, which is fun, but it's a big mental adjustment.

Nathan Labenz: Yeah. It's a target-rich environment, that's for sure. Cool. Well, this has been fantastic. Any other closing thoughts or calls to action you want to share before we break?

Nathan Lambert: I'll post a lot of these things on my blog, Interconnects. I already have another o1 blog post in the queue; I don't know when I will send it, but that's to come. And it's fun. I'll see a lot of people around at NeurIPS; if you're listening, I'll be there, which is exciting. So thanks for having me. This was rapid fire. You exhausted me. By the end I was like, oh man, I'm cooked.

Nathan Labenz: Yeah, well, you've been working hard, and I certainly pushed you down a bunch of different little dark corners, so thanks for doing it with me. I do think there's a lot of alpha in catching up with what you've been up to, so I really appreciate the time and energy. And with that, I will say, Nathan Lambert, thank you for being part of the Cognitive Revolution.

Nathan Lambert: Yeah, thanks for having me.
