 
In this episode of The Cognitive Revolution, Nathan delves into the fascinating world of AI-generated research ideas with Stanford PhD student Chenglei Si. They discuss a groundbreaking study that pits AI against human researchers in generating novel AI research concepts. Learn about the surprising results that show AI-generated ideas scoring higher on novelty and excitement, and explore the implications for the future of AI research and development. Join us for an insightful conversation that challenges our understanding of AI capabilities and their potential impact on scientific discovery.
Be notified early when Turpentine drops new publications: https://www.turpentine.co/excl...
SPONSORS:
Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.
Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive.
Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitivere...
Brave: The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Oracle: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
CHAPTERS:
(00:00:00) About the Show
(00:00:22) Sponsors: Weights & Biases RAG++
(00:01:28) About the Episode
(00:05:30) Introducing Chenglei
(00:06:22) Path to Automating Research
(00:07:58) Notable AI Research Projects
(00:15:26) Evaluating Research Ideas (Part 1)
(00:19:39) Sponsors: Shopify | Notion
(00:22:33) Evaluating Research Ideas (Part 2)
(00:25:49) Research Setup and Design
(00:29:38) AI Prompting and Idea Generation
(00:34:40) Diversity vs. Quality of Ideas (Part 1)
(00:34:40) Sponsors: Brave | Oracle
(00:36:44) Diversity vs. Quality of Ideas (Part 2)
(00:42:05) Inference Scaling and Execution
(00:45:04) Anonymizing and Evaluating Ideas
(00:53:22) Headline Results and Analysis
(00:58:45) Observations and Insights
(01:09:02) Novelty Indicators and Deception
(01:11:59) Top AI-Generated Ideas
(01:14:41) Next Steps and Future Directions
(01:20:43) Expectations for the Future
(01:23:14) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
TRANSCRIPT:
Nathan: Hello, and welcome back to the Cognitive Revolution!
Today my guest is Chenglei Si, a PhD student at Stanford who is developing ways to use LLMs to automate research, and lead author of a fascinating new paper that asks the question "Can LLMs Generate Novel Research Ideas?"
The question has become one of the most important in the entire field, as the ability to generate research ideas worth pursuing - particularly in the domain of AI - has long been considered a key precursor to recursive self-improvement loops and a possible intelligence explosion.
Since the early days of GPT-4, we've seen several notable attempts to create AI research assistants, including projects like Co-Scientist from Gabe Gomes's group at CMU, which we did a full episode on, and more recently the AI Scientist paper from Japanese company Sakana AI.
These systems demonstrated capabilities that would have seemed impossible just a couple years ago – including translating natural language instructions to chemistry protocols, and using Semantic Scholar to assess research ideas for originality, but the question of whether AI systems can produce high-value novel research ideas, true "eureka moments", has been the subject of fierce debate.
Chenglei and his collaborators set out to shine light on this subject with an ambitious study. They asked more than 100 human PhD researchers working in AI for new research ideas, incentivizing quality with cash prizes, and then asked Claude for novel ideas as well.
After processing the text in an attempt to create a level playing field for evaluation, they then had expert reviewers rate all the ideas without knowing their source.
The results were striking: the AI-generated ideas scored significantly higher on both novelty and excitement.
As with many recent AI results, this paper has become something of a Rorschach test - those inclined to believe in rapid AI progress see it as a major milestone, while skeptics criticize the methodology – somewhat unfairly in my view, but you can judge for yourself as you listen – and more persuasively in my mind, emphasize that this work, which focuses specifically on prompting techniques for language models, may not generalize to the harder sciences.
Personally, while I agree that we can't confidently project this one result onto other domains, I find the experimental setup here to be quite well done, the statistically significant results seem credible, and I think Chenglei's individual observations are worth taking very seriously as evidence too. Everyone listening to this show should be familiar with the AI maxim "look at your data", and nobody has spent more time with the raw outputs than he has, so when he reports that 9 of his 10 favorite ideas turned out to be AI generated, and that the AI ideas are more "out of the box" and "less grounded in existing work" than human ideas, I think we would do well to listen.
Overall, my feeling right now is that we are at a sort of tipping point, where the Claude 3.5 Sonnet and GPT-4o class of models can sometimes, with great effort put into system design and many millions of tokens to burn, generate meaningful novel research ideas, but not in a way that makes frontier research dramatically more accessible or scalable for non-experts.
The next generation of models, starting with the o1 series, seem pretty likely to change that. I've been coding with o1 and Cursor a lot lately, and I have been struck by how effectively o1 can review my entire codebase and my plan for a new feature, and return both a critique of my approach and a recommended alternative that is genuinely better.
As a community we're all still figuring out exactly what the limits of this capability are, but if I had to bet, I'd say that systems like Chenglei's, powered by o1 and a bigger inference budget, might well produce undeniably needle-moving results quite soon.
As always, if you're finding value in the show, we appreciate it when folks share it with friends, or write a review on Apple Podcasts or Spotify, or just leave a comment on Youtube. And we welcome your feedback via our website, CognitiveRevolution.ai, or by DM'ing me on your favorite social network.
For now, I hope you enjoy this conversation on the emerging reality of AI-automated research, with Chenglei Si.
Nathan: Chenglei Si, PhD student at Stanford, researching the automation of research. Welcome to the Cognitive Revolution.
Chenglei: Thank you. Excited to be here.
Nathan: I'm excited for this conversation. I think you are working on a super interesting frontier and your recent paper with a provocative question in the title has certainly caught my attention and I think the attention of the field more broadly. So the paper that we're here to really go deep on today is called "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers." I'm looking forward to breaking this down and really getting into the nitty gritty of the methodology and what we can learn from it and what we should maybe take away from this and expect to see as we go forward into the near future. For starters, want to maybe just give me a little bit of your background, like how did you get to focus on this question of automating research? And I'd love to hear too some of the results from the last year or two that you have found to be most compelling.
Chenglei: Yeah, so I came from a very NLP background. I was doing NLP research back during my undergrad, and then I came to Stanford last year, and then I met my advisor, Tatsu Hashimoto. And I always wanted to do something different for my PhD. And back then, the big background in the whole NLP field was that large language models were getting really good. And then we saw some really impressive performance on things like reasoning, math, and coding and stuff like that. And we felt like it could probably open up opportunities for some very new applications. And as someone who has done research for many years, I feel like being able to automate research feels like something very different and very exciting. And importantly, I think it will have huge practical implications for the entire field. And I just got excited about this agenda right away. And I'm lucky that most of my advisors are very excited about this whole agenda as well. So we actually got started on this whole thing pretty early. I think roughly one year ago; this whole project actually took around a year to finish.
Nathan: Yeah. Interesting. And I trust that the human parts of that took the most legwork and took the most time. That is often the striking difference these days.
Chenglei: Yeah. Building the agent itself took us like two or three months and then finding all the people and doing the actual human study took us like eight months or so.
Nathan: Yeah. Interesting. Before we dive into the details of your paper, what other projects, and they don't necessarily have to, it could be academic papers, but it could also be open source projects, have captured your imagination and kind of stand out to you as the notable step change improvements on the road to this project destination of automated research.
Chenglei: Yeah. So I think to me personally, there's this line of work on self-improving AI. There has been discussion on whether, as AI keeps getting better, there will be a point where AI can be even smarter than humans. At that point, how do we possibly keep improving AI if AI is getting smarter than us humans? Then it gives rise to this idea that maybe we can let the smarter AI propose new ideas and improve itself. There have been some preliminary attempts on that. There has been some work done by folks here at Stanford, like Eric Zelikman, and that's one inspiration for this line of work. But of course, there's also another line of work on building tools that can assist researchers. A lot of colleagues from HCI, folks like Semantic Scholar, a bunch of other people have been doing this for many years. So research assistant tools and creativity support tools are another inspiration for this. But I think we really take a very different approach in the sense that we want to do something that sounds even more aggressive, in the sense that we want to see if we can possibly automate the entire research process rather than just build some assistant tools there.
Nathan: Yeah, I think you've zeroed in on a really important question. I've looked for what it's worth at. This is kind of an area I try to keep up with because it feels to me like... If we are going to have an intelligence explosion, this is kind of the form that it's going to take, right? It's going to be, and I'm by no means like alone in this analysis or original in coming up with the analysis, but basically it seems like the expectation from the field is at some point, we're going to start to see AI systems that can make meaningful contributions to LLM research. And if LLM research becomes 10x, 100x, 1,000x more productive because we can sort of spin up AI LLM researchers and they can actually take real bites out of big problems, then it's really hard to say how good the overall resulting models could get. And that could take us into a different regime. So it feels like there's this sort of tipping point possibly where if we can't get over that hump, then things maybe stay somewhat normal. And if we do get over that hump, then kind of all bets are off and the future starts to look quite unpredictable and potentially quite weird. I guess for starters, is that your mental model of why this matters so much in the first place?
Chenglei: Yes, you mentioned the phrase intelligence explosion. So I've read the whole Situational Awareness thing that Leopold wrote. My advisor Tatsu was warning me that using intelligence explosion would sound too aggressive for an academic setting. But I think on this podcast, it's probably fine. So yes, roughly that was one exciting vision that I kind of want to open to you. There's a very interesting data point. If you remember, there's a figure there where Leopold was saying that around the year 2028, we would be able to get an automated Alec Radford. So Alec was the lead author on the GPT paper. So if there was a point where we are able to automate Alec Radford and we scale this up, that means we'll have a massive increase in the way AI can improve itself. That would be insane. But we want to, you know, maybe take a step back. Automating Alec is a pretty difficult goal. There's only one Alec in the whole AI field. So maybe we want to really see the current generation of language models, how good they are right now, when we compare that with a more reasonable baseline. So maybe we want to see how they compare with, let's say, average PhD students. I want to offer a concrete benchmarking point to gauge the progress that we have made so far. So that was the motivation for the whole thing.
Nathan: Yeah. Okay, cool. So just to and by the way, Alec Radford has been making the rounds on Twitter just today as or in the last like 24 hours, we're recording on first of October. And he's been a popular answer to a question that somebody posed, which was, who is the most important or influential scientist that is not very well known. And people are piling on to say Alec Radford is the answer to that question. That is indeed an extremely high bar for automation. To just further tee up the specific question that you're asking with this research around using LLMs to generate novel research ideas, I think it is worth taking a minute to just review a couple of my personal favorite highlights from the research over the last year. Because I do think that has been an area where it hasn't quite been there yet. And it feels like we've sort of been always a little bit short. One that came out very quickly after GPT-4 was released, and we actually had the lead author, Gabe Gomes, on the show for a prior episode, was called Co-Scientist. And they put together a really nice early, but I think has stood the test of time, agent framework that allowed a user to basically say something like, please synthesize aspirin for me. And then the system would do everything needed to synthesize aspirin, including like looking up, how do you synthesize aspirin, translating that to actual instructions that could be sent over to an automated lab. They were integrating this with Emerald cloud lab at Carnegie Mellon and you can actually program a real physical lab via an API to carry out the step-by-step protocol that you wanna run. And that was amazing, right? I mean, to go from a natural language sentence all the way through to like a purified powder getting spit out the other end of a system like that. I was like, man, this is a major result. And obviously that's why I wanted to have him on the show back then. 
But the one thing that I would say at that time was, it did not seem to me like the AI was actually coming up with meaningful ideas. So in my review of that paper, I called it, you know, the question was like, is this text to science and, or text to experiment? And I was kind of like, yeah, I think it's more text to protocol because the AI at that time wasn't really generating the idea so much as taking the idea and figuring out how to break that down into steps and execute on it. But that was nevertheless a big milestone. I don't know if you have any comment on that one before we move on, but I'll highlight a couple other ones and then we'll get into yours.
Chenglei: Sure. I think the idea is cool for sure. There have been similar attempts in maybe other areas. Our friends at Sakana AI, for example, have their AI Scientist paper. They also try to build an end-to-end agent system. I think the high-level research agenda is exciting for sure. But I think the tricky thing is the evaluation part is really difficult, especially in scientific research. How do you possibly evaluate that research? Even for academic settings, we do peer review. That isn't the perfect solution either. So I think all of this really boils down to the technical challenge of doing evaluation. And that is really the crux that we're trying to focus on in this paper here.
Nathan: Yeah. The AI Scientist was another one that was on my list. And I thought that one was a notable step forward. I mean, all these things have their critiques. And I feel it's like a bit of a Rorschach test that tells you often as much about the critic of the paper, whether positive or negative in their assessment, as it does about the research in some ways, because there is just so much room for interpretation still of these results. I find that to be fascinating. The AI Scientist, to me, though, did take a step forward in that it introduced a mechanism for generating a bunch of ideas and also trying to identify which of those generated ideas were the most novel ideas. And they used the Semantic Scholar API to help with that. So they would have a bunch of ideas generated, then go do a search against the Semantic Scholar API corpus of information to pull back as much related stuff as possible. And then they would sort of try to see like, has this already been done? To what degree does this seem like it's already represented in the literature versus if not, then that's more likely to be a meaningfully novel idea. And that would be the kind of idea that they would want to prioritize. The reviews of the actual AI-written papers that came out of that didn't seem like they were necessarily going to get admitted to top conferences. And even in their own assessment, the Sakana team basically said this would be like the work of an early grad student that definitely needs more advising from somebody who has better zoomed out context on the field. But I did still think that was a notable mechanism that seemed to kind of show the way to where this was going. I mean, one of the ways I describe this is, you know, do we get eureka moments from AI systems or not? And initially with GPT-4, I used to say no eureka moments. That was kind of my headline, you know, for GPT-4 performance. And then we saw the Eureka paper and that was out of NVIDIA, Jim Fan and group.
And there they used GPT-4 to write reward functions for robot hand reinforcement learning. And they saw that GPT-4 outperformed human authors of these reward functions. And I always am fascinated by that because I'm like, Jeez, it takes some real know-how to even write a reward function in the first place, right? Like this is not something that any old person off the street can do. You're engaging here with like a certain level of expertise as a baseline and still seeing GPT-4 perform above that. Not exactly. I mean, it's sort of like a hypothesis, you know, a reward function is, you know, how well is it going to work? You're making a guess as to what would work. So I feel like that was kind of moving in that direction as well. And the fact that it did outperform humans in something that was like objectively measurable was quite notable to me. Any other thoughts on that one? Next step is yours.
Chenglei: You know, my definition of a eureka moment for automating research-like work would be if one day you tell me you have an end-to-end agent that generated an idea and automatically executed the idea into a full project, and you submit that project to a top conference and that project actually gets the best paper award, then I would say that indicates a eureka moment for me, because that means that project is actually way better than the average that you get at the so-called top conferences. I think the AI Scientist paper from Sakana AI that you mentioned, I think the agent system, that's a nice prototype there, but the outcome, first of all, even using their own AI reviewer, the score isn't really reaching the bar for NeurIPS, for example. And then second of all, how much should we really trust the automatic reviewer? That's a question we actually touched on in the paper as well. We did some analysis and found the automatic reviewer at this point isn't really ready yet, in the sense that if you measure the correlation between the LLM reviewer and the actual human reviewer, the correlation is pretty low at this point. So maybe we shouldn't trust the automatic reviewer that much at this point. So those are the comments about the whole AI Scientist thing.
Nathan: Yeah. I like your comment about, first of all, you know, just always one of my refrains is like compared to what? So it's really important that we compare these things to the human alternative. You know, if there is one or whatever is kind of how is it being done now? I think the other thing that people often forget is that like the current thing is not perfect. Often it's actually like quite a bit worse than people imagine it to be. But I think you had a good kind of baseline in the paper on comparing to the inter-rater reliability of actual human judges for the conferences. So somebody might say, you know, the skeptic of the human performance, which is often a role I play, might say like, Well, you know, how consistent are the human raters and is this any worse than that? And the answer here is, you can fill in a little bit of the details, but the human raters actually do agree more, right? Than the AI raters agree with them. There is still something measurably not accomplished with the AI paper evaluator.
Chenglei: Right. So at this point, it's true that human reviewers at the conferences tend to have higher agreement than the agreement between the AI reviewer and the human reviewer. So I think that's one piece of evidence that current language models are not ready for automating the reviewing part yet. But even if, let's say, one day language models keep getting better and the agreement somehow gets higher, I think it's still questionable whether we should fully trust automatic reviewing on such a subjective and pretty high-stakes task, I would say. If you look at the actual human agreement, you realize reviewing is really a very subjective task, in the sense that even among the expert reviewers at top conferences like ICLR or NeurIPS, the agreement is still pretty low in general. So I think it really highlights the challenge of evaluating research ideas in general.
Nathan: Yeah. So I'm looking at the paper. This is table 11 in your paper. So you basically are comparing four main results. One is, and I guess, first of all, just to be very clear, like the question here is accept, not accept. Does it boil down to that binary judgment?
Chenglei: The question is actually an even simpler setting than that, in the sense that we basically try to get the lower 50% of papers and then we turn that into a binary judgment of something like accept or not accept. And then we measure the decision agreement between two sets of reviewers.
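To make the agreement metric described here concrete, the following is a minimal Python sketch. It assumes a simple median split to turn each reviewer group's scores into an upper-half versus lower-half decision; the function names and the toy scores are hypothetical, and the paper's exact procedure may differ.

```python
from statistics import median
from typing import List


def binarize_by_median(scores: List[float]) -> List[int]:
    """Label each paper 1 (upper half) or 0 (lower half) relative to the median.

    Assumption: a median split stands in for the "lower 50%" cut mentioned
    in the conversation.
    """
    cutoff = median(scores)
    return [1 if s > cutoff else 0 for s in scores]


def decision_agreement(scores_a: List[float], scores_b: List[float]) -> float:
    """Fraction of papers on which two reviewer groups reach the same binary decision."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both groups must score the same papers")
    labels_a = binarize_by_median(scores_a)
    labels_b = binarize_by_median(scores_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy scores for six papers from two hypothetical reviewer groups.
group_a = [7.0, 4.5, 6.0, 3.0, 8.0, 5.0]
group_b = [6.5, 5.5, 4.0, 3.5, 7.5, 6.0]
print(decision_agreement(group_a, group_b))  # 4 of 6 decisions match
```

Under this kind of split, two groups of unrelated scores would be expected to agree about half the time, which is why 50% serves as the chance baseline.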
Nathan: Gotcha. So random is 50% in this setup. The NeurIPS judges only got up to 66%. So they're agreeing not one out of two, but two out of three times. The ICLR judges got up to 71.9%. So they get close to three out of four, but still not quite there. And then your implementation gets to 56% agreement with the human judges. So it is an incredibly noisy thing, but there is still like a meaningful delta between what the AI system can do and what the humans could do when it comes to this specific task of judging the papers. Right. Okay, cool. Well, let's go back to the hypothesis generation. Give us the setup for how you sought to answer this question. You already said it took months to build the agent, a lot more months to build the data set from humans. But give us the whole setup of how you set out to answer this question of can LLMs generate novel research ideas?
Chenglei: Right. So given the whole question, I think there are two high-level principles that we want to achieve, right? The first thing is, we want to set up a proper human baseline. Of course, without a human baseline, you can still measure some numbers on some metrics, but people might ask, what do those numbers really mean? We want a concrete benchmarking point by comparing to a reasonable baseline there. We feel like, you know, PhD students really represent the experts in this task. So we want to compare to that for sure. And then the second thing is we also want to set up a proper design protocol to avoid any potential confounders. And we want to have expert reviewers do all the judgment, because I mentioned a bunch of caveats of using language models for reviews. All the reviewing is done by expert researchers in this case. So the high-level experiment design is that we have one branch where all the ideas are generated by AI, and another branch where all the ideas are generated by human researchers, and then we have a blind review where expert researchers come in and score all the ideas without knowing which one is which, and then we compare the scores. But in the actual experiment, we ended up adding a third condition, which we call AI ideas plus human re-ranking. The consideration there is that we have a built-in ranking part in the idea generation agent, where the agent will generate thousands of ideas on each topic. And then the agent will automatically select the best ones. It's a little bit similar to what you mentioned in the Sakana AI paper. So after this ranking, we get the best ideas generated by AI as ranked by the AI itself. Those ideas form the AI ideas condition. And then we realized that the AI ranking isn't really perfect yet.
So it's possible that among all the couple of thousand ideas generated by AI, there could be some really good ones, but they're not being ranked at the top by the AI because the ranking is not working perfectly yet. So we want to have a better estimate of the upper bound; we want to find the best ideas being generated by AI. And that's where we manually look at all the generated AI ideas and select the ones that we consider the best. And that's what we call AI ideas plus human re-ranking. So in that condition, all the ideas are still generated by the AI agent, but the re-ranking is done by humans. So that is the third condition. So that's the experiment setup. We'll have blind review. We'll assign ideas to all the expert researchers. They'll score ideas along different metrics, including novelty, excitement, feasibility, and effectiveness. And then we'll compare the scores of different conditions. And that's the whole design of the experiment.
Nathan: Gotcha. So let me just say that back to you and make sure I've got it all right. So there's two sources of ideas, obviously the humans and the AIs. You pay the humans for their ideas. They were paid $300 for participating. And if I understand correctly, there were also, I think, a couple of thousand-dollar prizes for ideas that were judged to be the best among the humans. So the participants are mostly grad students, PhD students in machine learning, right? With a focus on natural language.
Nathan: So, yeah, I think that's notable for a couple of reasons. Obviously, you know, certain amount of expertise there. That's definitely not nothing. And the thousand dollars is not insignificant on the scale of a grad student salary. So presumably people would, you know, at least be somewhat motivated to give a good idea and try to pick up the extra thousand bucks. Then on the AI side, I think there's some really interesting implementation details in the paper. Maybe you can give us a little bit more on how you're prompting the AI for ideas. I know that you're not just asking a simple question like, hey, can you give me some NLP research ideas? You're actually bringing a kernel set of ideas to it for it to expand on. Maybe give us a little bit more on that.
Chenglei: There are a couple of things that we put in the prompt when trying to generate ideas. The first thing is we retrieve related papers given the research topic, and then we put all those related papers in the prompt to give some inspiration there. Another thing is we give some demo examples. We actually manually constructed six demo examples based on existing papers. So those will be exemplary papers that we put in the prompt. And then all examples follow the same format. So we define a template for how the idea should be written up. So the template roughly includes things like a motivation, a problem setup, a proposed method, and an experiment plan, and then there are also test examples, et cetera, et cetera. So it's a very detailed template for writing the idea. So we provide the template to the agent as well. And then we also include all those demo examples and all that related work. And then we generate ideas in batches because we want to massively over-generate ideas and then select the best ones. There's a limit to how many tokens you can possibly generate in one generation. So we actually generate 16 batches. So we generate five ideas in a batch and then we keep generating. And one issue with this is that it's possible that the agent would start repeating itself by generating the same ideas over and over again. And that would be bad for us because we would want a large set of very diverse ideas so that we can possibly find better ones there. So one thing we did is we would actually add all the previously generated ideas' titles into the prompt and tell the model, you shouldn't be repeating what's already been generated. So that's one simple attempt at reducing repetition there. But as noted in the paper, there's still a lot of repetition in the idea generation. We did some analysis on that. We can touch more on that later. But on a high level, that's the idea generation part.
So we have the retrieval part, we have the template, we have the demo examples there. And after generating all the examples, we have the ranking thing to select the best ideas there.
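The batched over-generation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_prompt`, `overgenerate`, and `make_stub_model` are hypothetical names, the prompt wording is invented, and the model is passed in as a plain callable that returns idea titles so that no real LLM API has to be assumed.

```python
from typing import Callable, List


def build_prompt(topic: str, related_papers: List[str], demos: List[str],
                 template: str, previous_titles: List[str], batch_size: int) -> str:
    """Assemble one generation prompt (hypothetical wording)."""
    parts = [
        f"Research topic: {topic}",
        "Related papers:\n" + "\n".join(f"- {p}" for p in related_papers),
        "Demonstration examples:\n" + "\n\n".join(demos),
        "Write each idea using this template:\n" + template,
    ]
    if previous_titles:
        # Feed earlier titles back in so the model is told not to repeat them.
        parts.append("Do not repeat any of these previously generated ideas:\n"
                     + "\n".join(f"- {t}" for t in previous_titles))
    parts.append(f"Generate {batch_size} new ideas.")
    return "\n\n".join(parts)


def overgenerate(call_model: Callable[[str], List[str]], topic: str,
                 related_papers: List[str], demos: List[str], template: str,
                 num_batches: int, batch_size: int) -> List[str]:
    """Generate ideas batch by batch, accumulating titles across batches."""
    titles: List[str] = []
    for _ in range(num_batches):
        prompt = build_prompt(topic, related_papers, demos, template,
                              titles, batch_size)
        titles.extend(call_model(prompt))
    return titles


def make_stub_model(batch_size: int) -> Callable[[str], List[str]]:
    """Stand-in for an LLM call; returns numbered placeholder titles."""
    counter = {"n": 0}

    def stub(prompt: str) -> List[str]:
        batch = [f"Stub idea {counter['n'] + i + 1}" for i in range(batch_size)]
        counter["n"] += batch_size
        return batch

    return stub


titles = overgenerate(
    make_stub_model(batch_size=5),
    topic="factuality",
    related_papers=["Paper A", "Paper B"],
    demos=["Demo idea written in the template format"],
    template="Motivation / Problem setup / Proposed method / Experiment plan / Test examples",
    num_batches=3,
    batch_size=5,
)
print(len(titles))  # 15
```

In a real system the callable would wrap an actual model call and return full idea write-ups; only the titles are tracked here because those are what get fed back into later prompts.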
Nathan: And this is with Claude 3.5 Sonnet as the main model. Did you, is there a, any sort of guidance you could offer on like how much, I presume you chose that because you found it to be best. Is there like a measurement versus like GPT-4 or versus Gemini 1.5 that, you know, you could kind of help us calibrate how much better 3.5 Sonnet is?
Chenglei: Right. So we were initially actually using GPT-4 for all, I think, back then. And then I heard a bunch of people in the lab saying, you know, Claude 3.5 is really good on all those science and reasoning and coding tasks. So I tried Claude 3.5. It's difficult to give a quantitative number on this, but we looked at the two sets of ideas generated by GPT-4 and also Claude 3.5. And intuitively, based on our judgment, it feels like Claude 3.5 does have some advantage in the task of generating ideas. That's why we stayed with Claude 3.5 for the entire platform. I know OpenAI just released o1, and I heard o1 is supposed to be pretty good on those reasoning-intensive tasks. It's possible o1 might be even better. We haven't done any benchmarking on that yet, but it's on my list.
Nathan: Yeah, I'll be very interested to see the result of that. Okay, so going back to the core of the setup. Examples are put in, and a structure for the format of what an idea should look like is included for the AI to work from. It's also given the ideas that it previously generated and told not to repeat. It nevertheless does repeat quite a bit, but enough new ideas come out that you can... Tell us about the ratio. If I remember correctly, you generated like 4,000 ideas, but found that only a couple hundred were truly original?
Chenglei: Yeah, around 200 ideas are non-duplicates. So basically we use semantic similarity to judge whether a pair of ideas are too similar to each other. And we will filter those out. So after you do this filtering, you filter out like 95% of the ideas because they are duplicates. And we actually have a curve in the paper showing that in the initial batches, most of the ideas will be novel. But as you generate more and more in the later batches, most of the ideas will be just repeating previous ideas. So if you calculate the total number of non-duplicate ideas, the number starts to saturate in the later batches. And eventually it totals around 200 or so, among the 4,000 generations.
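To make the filtering step concrete, here is a minimal Python sketch of greedy deduplication by cosine similarity. It is only an illustration under stated assumptions: the `embed` function uses simple word counts as a stand-in for a real sentence-embedding model, and the `threshold` of 0.8 is a placeholder value, not a number taken from this conversation.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(ideas: List[str], threshold: float = 0.8) -> List[str]:
    """Keep an idea only if it is not too similar to any idea already kept."""
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for idea in ideas:
        vec = embed(idea)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(idea)
            kept_vecs.append(vec)
    return kept
```

Because ideas are processed in generation order, counting how many survive after each batch would give the kind of saturation curve described above.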
Nathan: So how do you think about that? I guess for one thing, it looks like a logarithmic curve, right? It looks like, but a logarithmic curve doesn't, it's not a flat line. and maybe you, I don't know if you pushed this any further, but you know, beyond 4,000 ideas, like should we still expect to get some original ideas, albeit increasingly infrequently? Or do you think that you kind of hit a point where it's like, it's just not gonna come up with anything new anymore. And it kind of is what it is beyond a certain point. That seems important to me because like we could, you know, put, we can obviously pump a lot of tokens through these models. And it's like, even if it is logarithmic, you know, with a budget, you could get there. But if it truly kind of flatlines and just doesn't really do anything new beyond a certain point, then that's kind of a qualitatively different situation. So what do you think the reality is?
Chenglei: Yeah, so this is a tricky question because there are two factors here. One possibility is that given the research topic, maybe the nature of the problem space is that there are only a finite number of ideas possible within that space, especially if you give a very specific research question. That's possible. Another possibility is that maybe the problem space is much bigger, possibly even infinite, but the model has a very low upper bound on the number of different ideas it can generate. Then this will be a huge issue, as you pointed out. But I think this is still open for debate in the sense that we actually didn't push that hard on the generation part. What we did to avoid repetition is very simple. It's just appending the already generated ideas and asking the model to not repeat. I can imagine smarter ways of doing this. I can imagine ways that you can possibly, let's say, fine-tune the model to increase diversity. You can possibly change the decoding methods in some smart way to increase diversity. If you tried all those things really hard and still see an upper bound, then I think it's evidence that maybe the base model just has an upper bound on idea diversity. And that would be a huge issue, I would agree. So yeah, that's my take on this.
Nathan: There's also the question too of just like how, you know, the real eureka moment is like zooming to some place in a vast search space that would have taken forever to find through brute force search, and that actually happens to be a really good idea. And that is obviously hard. Not many humans can do it, right? And humans tend to only do it in very narrow domains where they have deep expertise. So this is a hard thing to do. With the AI ideas, you know, one wonders, and I guess this is kind of where the evaluation will come in, but one wonders like, okay, let's say you turn up the temperature or apply these different decoding methods or jam all the levers that you have to try to increase diversity. You can presumably increase diversity, but how does that diversity relate to quality? I think your approach here of actually using humans to do the evaluation is really good for creating a grounded baseline that people can understand and hopefully have some level of agreement on. But it's obviously really hard, and I would assume this is probably why you haven't been able to rerun the thing with o1 yet, right? Because you could generate a lot of ideas really quickly, but it's still going to take a lot of time and effort to do the evaluations. So when you do turn up the diversity, do you have an intuition, or do we have any metrics, for how pushing temperature and other parameters up impacts the quality of the presumably more diverse ideas generated?
Chenglei: Yep. So I think one analogy here is recent works on inference scaling. So for example, on stuff like code generation, there are experiments showing that if you sample more diverse solutions, the success rate, like the passing rate of the solutions, increases. And there's a nice scaling curve there where the more you generate, the more likely that one of the generations will be correct. And then your task will be essentially to find the correct ones among the other generations. So that's the analogy there. We didn't do the exact same thing on idea generation in the sense that, like you said, it's unlike code generation, where you can verify the correctness of the solution, and that means that you can actually know the so-called success rate in all your generations. In our case, the evaluation is really tricky, but still I think we can build proxy rankers like we did in the paper. The ranker may not be perfect. In fact, it's definitely not perfect, as we showed in the paper. But still, as long as there's some positive correlation, such that the ranker is able to tell really bad papers from really good papers, then I think it's reasonable to expect that if I scale up the diversity and the number of ideas that I generate, this ranker will possibly find better ideas. And among all the top-ranked ideas, I might expect a quality increase as I generate more ideas.
Nathan: Yeah, that reminds me quite a bit of some work that I've done on using a number of different kinds of models to try to evaluate the aesthetic appeal of images. And we found a similar thing there where the human-to-human consistency of judgment was fairly low. The human-to-AI agreement was even lower, but not zero. You know, it was better than random. And the general takeaway of that work was, you can definitely tell the difference between the best and the worst. And there's clear signal there. But especially in the middle, you really can't. You know, the fact that one was rated higher than the other is not very meaningful, right? So you have a general direction, but at any given point in the list, it's definitely up to interpretation.
Chenglei: Although I will add one comment here, that at this point it does matter that we have a really good ranker to find the best ideas. But, you know, if we think bigger and we think about the big picture of automating research, in the next stage, it's possible that we might start working on automating the execution part of things, where we might build agents that can automatically execute their ideas into experiments. That means the agent will tell us, I will implement the method, I will test on the specified datasets, and I will tell you how well it does as compared to the baseline. And that means I can actually verify whether the proposed idea works or not. And that means for all the generated ideas, if I'm willing to spend the inference compute money, I can just implement all those ideas automatically and then I can automatically verify each idea. So in that sense it turns into a setting almost like code generation, where I can actually check the correctness of each possible idea. I think in that setting we'll have some nice inference scaling there.
Nathan: Yeah, that's a great point, first of all. And it reminds me of the AI Scientist paper where they reported, I think it was like, in the low tens of dollars per paper generated. And so it was like, you know, okay, sure. They're not that great, but if one in a thousand is great, then it's clearly economically competitive, and the prices are obviously falling quickly. I remember talking to somebody once at one of the leading frontier model developer companies, one of the big three, and the person said, we find over and over again that people just don't have an intuition for what happens when you're willing to spend a lot on inference compute. Obviously, OpenAI has kind of now productized this to a degree with the o1 family of models. But even before that, people just try one, you know, they try two, they try three. It's just not intuitive to think, what if we tried a thousand or a million? And that really can change the game. So I think that that is quite important. Okay, let's go back. Let me just reset the stage one more time because I know I keep going on all these rabbit hole digressions. So you pay grad students for ideas. You give them the promise of a prize. You run a reasonably high volume of idea generation. You de-dupe the ideas. Is that with an embeddings approach or what's the de-duping method? That's embeddings?
Chenglei: Yeah.
Nathan: Okay. Then there's two ways that you can rank the ideas to try to bring the best AI ideas to the top. One is having the AI do it directly. I believe you use Claude with pairwise, head-to-head, kind of tournament style for that.
Chenglei: Yeah, the pairwise comparison thing is what we found to work much better than directly asking the model to give a score, let's say on a scale of one to ten.
Nathan: Yeah, that's a really good tip and something I think is a practical takeaway for all sorts of AI application developers in the audience.
And I actually, there was a really good, this is a very in the weeds detail, but what I just said about tournament style is not quite right actually. I had previously been thinking of doing something like this in a project, and I was thinking like almost setting it up like an NCAA March Madness bracket where I would have the field gradually narrowed. But of course, that has the problem of like, well, what if your best two ideas meet in the first round? And so then you're like, well, maybe I'll have a double elimination tournament, but then you could still have similar problems. So I think what you did was just take random pairings and have each idea paired with five other ideas and then just sort them by their win-loss record, right? So 5-0 would be at the top, 4-1 would be next. Is that right?
Chenglei: So instead of random pairing, what we did is basically pair each idea with another idea with the closest accumulated score so far. So let's say we want to pair the top scoring ideas so far with other top scoring ideas, for example. And you're right that it's different from a typical tournament in the sense that we don't have elimination, because otherwise you run into problems where two really good ideas meet in the first round, and one of them gets eliminated, and it would only get a score of zero in the end, and that would be really bad.
Nathan: Yeah, okay, cool. So that's even one level smarter than I understood. Pairing ideas with others that have the same score so far to try to put the most comparable ideas together. Yeah, you can sort of, it's almost like bubble sort style, or at least that's the visual that comes to mind for me. Okay, so that's cool. That works, but you did find that even better is human re-ranking. Anything we should know about the human re-ranking process?
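The pairing scheme described here (no elimination, each idea matched against another with the closest accumulated score) can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation: `swiss_rank` is a hypothetical name, the `judge` callable stands in for the pairwise LLM comparison and is assumed to return the winner of the two ideas, and the handling of ties and odd counts is a simplifying assumption.

```python
import random
from typing import Callable, Dict, List, Tuple


def swiss_rank(ideas: List[str], judge: Callable[[str, str], str],
               rounds: int = 5, seed: int = 0) -> Tuple[List[str], Dict[str, int]]:
    """Rank ideas by win count over several rounds of score-matched pairings."""
    rng = random.Random(seed)
    scores: Dict[str, int] = {idea: 0 for idea in ideas}
    for _ in range(rounds):
        order = list(ideas)
        rng.shuffle(order)  # random tie-breaking among equal scores
        # Stable sort: ideas with similar accumulated scores end up adjacent.
        order.sort(key=lambda idea: scores[idea], reverse=True)
        # Pair neighbours; with an odd count the last idea sits this round out.
        for a, b in zip(order[0::2], order[1::2]):
            scores[judge(a, b)] += 1
    ranked = sorted(ideas, key=lambda idea: scores[idea], reverse=True)
    return ranked, scores
```

Nobody is eliminated, so two strong ideas that meet in the first round both keep playing, and the loser can still finish near the top.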
Chenglei: Ah, it takes forever. I actually manually looked at all the ideas generated, and I didn't look at the AI ranking. I just, you know, did my own ranking of them. And then I compared my own ranking with the AI ranking, and it's somewhat surprising that there are a couple of ideas that I thought would clearly be the top ideas that are not really captured by the AI as top ideas. So I think the discrepancy here is interesting.
Nathan: Okay, so you personally, individually, are the human re-ranker, is that right?
Chenglei: Right, right.
Nathan: Yeah, okay, interesting. Then I think the next thing is a pretty interesting detail. And I guess we should also set the stage, and I'll put this in the intro as well so people have this context. The field of study that we are working on here is essentially prompt engineering. It is how can we get the best performance from language models via different prompting techniques. So that's just background baseline. All the ideas all across the board are within that domain. Now, once you have the human ideas, the AI ideas ranked by the AI itself, and the AI ideas ranked by you, then you put them through a sort of anonymizer type of thing, right? You basically have the AI... And if I understand correctly, this was done to try to create an even playing field with respect to the human judgment. I guess the worry was the human readers will likely be able to tell the difference between a human-written idea and an AI-written idea.
Chenglei: Right.
Nathan: Even though we know that AI, large language model detectors in general don't work. And if you're a teacher and you're in a classroom, you shouldn't put too much stock in them. With that disclaimer, it is reasonable to expect that especially people that work in the field would have an intuition for, like, this kind of reading. How often is the word delve being used, et cetera. So did you put just the human ideas through this rewriter, or did you put the human and everything? Yeah. So tell us, give us that bit of the setup.
Chenglei: So we pretty much specify the desired writing style and format of the idea. We try to have the writing be more academic, and then we specify every single component that the idea should have. And then that writing agent essentially does some minor paraphrasing of all the different ideas. We said very explicitly in the prompt that you can only do writing style kinds of edits, definitely not editing any of the original content. So most of the edits are basically changing certain word choices. And I have to say, I think this is very important in the sense that if you look at the original ideas submitted by humans, some of the word choices make it very obvious that they are just written by humans. For example, some sentences are not grammatical, and that pretty much does give it away. And some people like to use more conversational writing styles, almost like they're talking, more engaging writing styles, and that just feels unlikely to be from AI. Those things don't really affect the content that much, but we don't want this to bias people. Let's say, for example, some people just like humans better. Then maybe if they figure out that this is written by humans, because it just has a more engaging writing style, maybe they would favor those ideas. We want to avoid this case. So we do the style transfer thing for every single idea, including both AI ideas and human ideas. And the hope is that we would avoid all those confounders there.
Nathan: So I think your paper definitely did generate quite a bit of discussion online. This was one of the points of criticism that I think people most focused in on when they were trying to find a reason not to believe the result. Maybe we should tell the result and then we can kind of come back and dissect the criticism.
Chenglei: Sure.
Nathan: The final step after doing this rewriting is then to send all these ideas. And these are now like what, like 500 words? It's like a page or two pages, like reasonably well fleshed out.
Chenglei: A thousand words?
Nathan: Yeah. Okay. So not an insignificant sketch of the idea. And these are then sent to human evaluators who have to score them. And they are scored on four different dimensions, right? They are novelty, excitement, feasibility, effectiveness, and I guess overall kind of just aggregates those previous four scores, right?
Chenglei: So we ask for a separate overall score, but that's the idea.
Nathan: Okay. Gotcha. So five different dimensions. And the humans give each paper five number scores, one to 10, one the worst, 10 the best. How original is this idea? That's novelty. Excitement obviously is kind of subjective, but that's basically, does this feel like a really good idea? Feasibility obviously being, maybe you can give a little more color on the guidance that you gave for that.
Chenglei: So for feasibility, we are really asking whether this idea can actually be executed. So for example, some of the non-feasible ideas will be things like something that requires tons of compute, like you want to train GPT-4, that doesn't feel that feasible. Or you want extensive human effort, like you want to hand-label a huge training set, that won't be that feasible. Or you want to, you know, fine-tune a closed model, like you want to fine-tune some general model that doesn't really have fine-tuning access, for example. So things like that. So that's what feasibility is. And that's different from effectiveness in the sense that in effectiveness, we're asking the question of, given the proposed method, do you think the proposed method will actually work better than existing baselines? So that's the difference here.
Nathan: Gotcha. So novelty, excitement, feasibility, and effectiveness. Feasibility is like, don't go too crazy on the budget. I wonder what would happen if you put this paper, your paper, through the evaluation process. How do you think you would score on these different dimensions? Yeah, you'll never get the budget for it. The total budget on this is what? And how does it break down, by the way? I mean, I'd be interested to know kind of total spend and also, yeah, just what percentage of that went to the humans versus what percent went to the AIs.
Chenglei: Yeah, the payment for humans in total is about $30,000. And then for AI, generating 4,000 ideas will take a couple thousand dollars. And then we did that for seven different topics. So that's some money there. And then there's also some development in the early stage where we, you know, prototyped a bunch of different agent systems. So in total, I would say maybe the money that we spent on API credits will be slightly more than what we actually paid humans in our case. So that's the money breakdown there.
Nathan: Interesting. Okay.
Chenglei: That's kind of surprising.
Nathan: Maybe let's dig into that just a little bit more. So what is consuming the majority of the token budget? Because I did a little back of the envelope math and I came out to a lower estimate of what I thought you had probably spent on the APIs.
Chenglei: Yeah. The problem with the ideation part is, first of all, we have to generate all the original ideas first, and then we do the deduplication. That means if you want to get, let's say, 200 ideas on a topic, you have to actually generate 4,000 ideas. And usually we're doing this almost like stress-testing, where we actually over-generate more than 4,000 ideas. We try to test the limit there. So we actually generated a whole bunch, like 8K or even 10K ideas on each topic, and did some analysis there. So there is some hyperparameter tuning there. So there we spend a bunch of tokens. And also you do realize the fact that the prompt is very long. So the input tokens are a lot. And then it gets longer and longer as we append all the previously generated ideas to the prompt. Another thing is in the final step, we also have this process of a novelty filter. This part is actually not in the main paper, it's in the appendix. The novelty filter thing is similar to what people have done in the past in the sense that for each generated idea, we will actually retrieve all the related papers and then we'll compare the generated idea with each of the related papers and ask the model whether it is too similar. And that actually is pretty costly in the sense that you have to do this comparison between the generated idea and each of the retrieved papers. That also costs a bunch of money. So in the end, you can, of course, simplify the pipeline and only do the generation part and then ignore the ranking and filtering part. That could save you some money. But I would say a lot of the money is really spent on doing the over-generation part, where most of our ideas are actually duplicates and are being filtered out. And then a bunch of money is spent on the novelty check and filter stuff.
Nathan: Gotcha. Okay. Interesting.
Chenglei: But in general, I would say to get one valid non-duplicate idea, you probably need to spend a couple of dollars. That's the scale of how much money you need to spend.
Nathan: Okay. Cool. So we can finally, I think, get to the headline result, which is that when you have humans evaluate the AI-rewritten human ideas versus AI ideas versus AI ideas with your individual re-ranking, the AI ideas are scoring higher, and statistically significantly so, for both novelty and excitement, and roughly the same on feasibility. It looked like the humans have a little bit higher score there, but not necessarily statistically significant. Roughly the same, maybe a little bit higher, on effectiveness, and higher overall. And especially with the human re-rank, your re-rank of the AI ideas, you see a notable difference in the overall score. But novelty and excitement definitely stand out to me as the two of the four that I'm looking for most, as somebody who's trying to figure out what the future of the field looks like. So that is a pretty big deal. What additional color commentary would you give us on those results?
Chenglei: Honestly, I was also a bit surprised when I first saw that AI ideas are getting higher novelty than human ideas. And one additional comment I want to give is that numbers are just one thing. There are numerous ways that you can possibly manipulate your numbers to tell the story you want. So I just want to say, I personally looked at every single idea that was generated by AI and also submitted by humans. And I think I have a reasonable sense of what they look like. And I think I do see how the actual ideas support the numbers that we are seeing here. I think if you look at all the ideas being generated by AI, you will realize they have this interesting vibe where they tend to be, I mean, almost weird, in the sense that they're not similar to the typical ideas you will see being published these days. It's maybe less grounded in existing work, but more out of the box in general. And I think I would categorize that as, probably, novelty. So if you look at all the data, I think it does make sense afterwards that the ideas are getting better novelty there.
Nathan: Okay, that's really interesting. Could you go a little deeper into just personal observations? I mean, even if they're anecdotal, and by the way, speaking of anecdotal, one question that I asked and was definitely interested to see the answer on was, okay, maybe the average score is higher, but what does the standard deviation look like? because you could imagine a higher average, but a sort of narrow band of performance. And that's often, by the way, how I tend to characterize AI performance for people when I'm just trying to educate people about what AI can do and what it can't do. I often, as a rough guideline, say it's going to be... It varies dramatically on different tasks, right? Obviously, you have these sort of areas where AI is superhuman. It can translate from any language to any other language. That's amazing. No human can do that. So... In that way, it's kind of superhuman. At the same time, they're very bad at common sense spatial reasoning in general still, and don't know how you should stack objects on top of one another. My general guideline, though, for people is it's going to be pretty consistent within a task. It might be really good at some tasks, really bad at other tasks. But in general, it's going to be pretty consistent within a task. I was struck by your result that the standard deviation of the scores for the AI ideas versus the human ideas was basically the same. And also that, in fact, with excitement, it was a little wider. The standard deviation was a little greater for the AI ideas. And also that the max score was higher for the AI ideas. And like I'm reminded of Tyler Cowen, who says you read people for their peaks, like what you contribute to the world may in some very practical sense boil down to your very best idea. And it might just be your one idea that was like your, you know, the actual lifetime contribution. 
If you believe that and you look at the max score, then you're like, huh, it's also quite striking that the max score is higher for the AI ideas than for the humans.
Chenglei: If I can give you one more data point there, if you think only looking at the single max score is too small a sample, I actually did this calculation. So I merged all the ideas and then I ranked all of them by the overall score. And I looked at how many of the top 10 ideas are from humans versus AI. And I can tell you, among the top 10 ideas, nine of them are actually from AI. And only one of them is from humans.
Nathan: Wow. Okay. That's interesting to say the least. What else? Keep going. I mean, I could happily listen to any and all other observations that you have as somebody who just obviously spent a lot of time looking at the data. This is a good reminder too of our mantra for all this kind of work. Look at your data. So you clearly did that. What else comes to mind that we should be thinking about?
Chenglei: Yeah, I think one thing that maybe people didn't notice that much is I was closely monitoring the whole review process as people submitted their reviews. And I think it's very interesting to see that, first of all, there's a big reviewer disagreement, as we noted in the paper. Second of all, I think sometimes there is some randomness, like some reviewers have certain preferences and, you know, let's say they tend to favor certain types of ideas. So there's this factor in there where certain reviewers will just systematically give higher scores to the ideas that are given to them because they like the topic, for example. And there are certain cases where certain reviewers just give low scores overall. That's one thing I was observing in the process, and I was like, you know, that's really going to impact the way we make conclusions, because we want the conclusion to be based on the actual quality of the idea rather than, you know, maybe reviewers have biases, and different reviewers just have different calibration in general. So that's why we did a lot of different tests in the paper. The first table that you see is really just treating each review as an independent data point. But then we realized there could be all this reviewer bias. There could be all sorts of different confounders. That's why we did two additional tests to try to account for all those possible biases. And in the end, I think we realized the difference between AI and human ideas is big enough such that you get a significant difference no matter what kind of test you run. And that's why we are making that conclusion there. So back then, I think what we agreed with my advisor is that we're going to run all the possible statistical tests we can think of to account for all the potential confounders there. And if there's even one test where this conclusion doesn't hold, then we won't be making that conclusion.
So that's the bottom line there. And then the novelty aspect is just different enough such that it holds robustly across all the different tests. So that's one thing from the results section.
Nathan: So each idea gets three human evaluations, is that right?
Chenglei: The average is three, but at least two, at most four.
Nathan: Gotcha. Okay. So two to four with the average being three. And then when you talk about doing multiple different statistical analyses, the main two that I saw were, one is treating each score as its own data point and measuring that way. And then the other was taking the average score at the idea level and then treating that as a single data point.
Chenglei: And there's another thing where we actually analyze based on each individual reviewer. Because like I said, I observed this thing during the review process where some reviewers, let's say they just really hate prompting, and they might give two or three to all the ideas they review. Some people, they might really like prompting and, you know, they might give like seven or eight to all the ideas they reviewed. And then it becomes a problem of whether you happen to have assigned more ideas to those people or not. That's really not what we want to capture. So what we did is, for each individual reviewer, we look at all the ideas they reviewed, and then we get the average score for the AI condition and the human condition, and we compare the difference between them. So for each individual reviewer, they should have a consistent standard when they're reviewing different ideas. So if every single reviewer is giving AI ideas better scores than human ideas, then that says that, okay, this thing holds robustly despite all the differences between different reviewers. And that's the third test we're doing there.
Nathan: Yeah, gotcha. So just to restate it one more time, you've got three ways of analyzing the data. One is every single score from a single reviewer for a single idea is a data point, analyze it that way. Version two is take the average grouped by idea and then analyze them that way. And then the third is aggregate at the level of the reviewer, where you basically say, okay, this reviewer A had this average score for AI ideas, this average score for human-generated ideas. And now we'll use those as the data points to then do analysis on. And the bottom line is, across any of these different lenses that you want to put on the data, you still see a statistically significant edge for the AI ideas, particularly in novelty and excitement.
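The three levels of aggregation restated above can be illustrated with a short Python sketch. This only shows how the data points are formed at each level; the significance tests themselves are omitted, the record layout (`reviewer`, `idea`, `condition`, `score`) is an assumed format, and the toy scores are made up for illustration, not taken from the study.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy reviews; "condition" is either "ai" or "human".
reviews = [
    {"reviewer": "r1", "idea": "h1", "condition": "human", "score": 3},
    {"reviewer": "r1", "idea": "a1", "condition": "ai", "score": 4},
    {"reviewer": "r2", "idea": "h1", "condition": "human", "score": 7},
    {"reviewer": "r2", "idea": "a1", "condition": "ai", "score": 8},
    {"reviewer": "r2", "idea": "a2", "condition": "ai", "score": 9},
]


def by_review(rows):
    """Lens 1: every individual review score is its own data point."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["condition"]].append(r["score"])
    return {cond: mean(scores) for cond, scores in groups.items()}


def by_idea(rows):
    """Lens 2: average the scores per idea first, then compare conditions."""
    per_idea, condition_of = defaultdict(list), {}
    for r in rows:
        per_idea[r["idea"]].append(r["score"])
        condition_of[r["idea"]] = r["condition"]
    groups = defaultdict(list)
    for idea, scores in per_idea.items():
        groups[condition_of[idea]].append(mean(scores))
    return {cond: mean(vals) for cond, vals in groups.items()}


def by_reviewer(rows):
    """Lens 3: per reviewer, mean AI score minus mean human score."""
    per_reviewer = defaultdict(lambda: defaultdict(list))
    for r in rows:
        per_reviewer[r["reviewer"]][r["condition"]].append(r["score"])
    return {rev: mean(c["ai"]) - mean(c["human"])
            for rev, c in per_reviewer.items()
            if c.get("ai") and c.get("human")}
```

In the toy data, reviewer r1 scores everything low and r2 scores everything high, yet the per-reviewer difference is positive for both, which is the kind of calibration effect the third lens is meant to control for.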
Chenglei: Yeah, that's the quantitative results we have. I think one additional comment I can add there is that I feel pretty strongly that I support the conclusion on the novelty part, but some people could argue maybe being novel isn't, you know, the only important factor there for producing good research ideas. Like I said, some of the AI ideas, they do tend to look more different than what we typically get when reading all those papers. But the question is whether that's a good thing. You can argue ideas could look very different, could be very novel, but then they turn out to be just weird ideas that just don't work at all. That's totally possible. And the current evaluation paradigm has this evaluation where we are asking the question of, do you think this idea is going to work? Or do you think this idea is feasible? And based on the current evaluation, AI ideas are slightly worse on feasibility, but somewhat similar on effectiveness. So that's not too bad, but you could of course argue that this evaluation is bad because you are essentially asking people to predict whether this idea is going to work, which is an incredibly difficult task, even for expert researchers. And it's highly subjective, but this is something that we could possibly actually measure objectively, where the way is you actually implement the idea and then you can see whether the idea actually works or not. So that's the current project we're working on this quarter right now. We are actually recruiting a bunch of people to execute all the human and AI ideas into full projects so that we can verify whether those ideas actually work or not. So that is trying to address this concern of, do those weird-looking AI ideas actually work in practice? I don't know the results yet. I think either way, it's going to be fun. I think if we get the conclusion that those ideas do not just look novel, they actually also work in practice, then I think that would tell a pretty strong story.
Nathan: Yeah, I mean, and if it's the reverse, there's still something, like, very interesting going on, especially because, I mean, we've covered all the statistics, but again, you also said nine of the ten highest rated ideas were AI ideas. It would seem strange. I guess it would kind of call into question a lot of the human evaluation, right? If, despite this pretty clear signal that the humans expect better from the AI ideas, that were to actually be reversed, it would be like, yikes.
Chenglei: Yeah, I think that would tell us not only about the characteristics of AI-generated ideas, it also tells a lot about the evaluation protocol that we should actually use. I think, you know, if the reverse is happening when we finish the execution part, I think that's an interesting challenge to whether even human evaluation is reliable at the idea stage or not.
Nathan: Yeah, that does remind me of one comment I saw online that I did think was potentially perceptive or sort of something that would warrant some digging. And maybe you did this digging. But the idea was when a language model is prompted to generate a novel idea, there are different ways that it can sort of satisfy the user with that request. And this ties into like a common theme of this show over time, which is that RLHF is obviously very powerful, has worked really well, but has some pretty fundamental problems, one of which is that the language models seem to be starting to understand that pleasing us is a bit different than actually doing something in reality. And this is why we see sycophantic behavior out of Claude and others. And this is one of the reasons people worry about deception from language models long term. But in the specific context here, you might worry that, and presumably the rewriting would try to help fix this, but maybe not entirely, but you would worry that the AI might be really good at using novelty indicators, words and phrases that suggest novelty, even if the underlying idea maybe isn't actually so novel or even coherent necessarily, but just ways of presenting that the models have learned will score well with humans. How do you assess that possibility? Again, to the degree that that is true, it would be operating by means of tricking the humans. But I don't think we can entirely rule that out without at least some digging into the question. So how much digging into that possibility did you do? And what's your current state of understanding of that question?
Chenglei: Right. So my intuition back then was that we are using the exact same, you know, paraphrasing prompt to standardize the style of every single idea, including AI and human ideas. So the hope is that the kind of writing style and word choice will be similar across conditions. But it's true that some AI ideas, they might have more novel-sounding words in the beginning, and then they didn't really get filtered out by the paraphrasing. I think we didn't really 100% rule out this possibility that maybe the AI ideas would contain some more novelty-sounding words there. I think we have some plans to do some follow-up analysis on the word choice and stylistic analysis thing. I think that will tell us something interesting. But still, I think the only possible way to totally address that thing is by actually finishing the execution study and having it tell us whether AI ideas tend to work or not. Like that's the most objective thing that will possibly address any concerns about the subjectivity in evaluation. Because in the end, I have to agree that there is a strong level of subjectivity in the current way that the ideas are being evaluated. So that's my current status.
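[Editor's sketch] The follow-up word-choice analysis Chenglei mentions is not described in detail, so the following is only one hypothetical way such a check could look: compare how often "novelty indicator" words appear in AI-written versus human-written idea texts. The word list, the example texts, and the function names are all illustrative assumptions, not anything from the study.

```python
import re
from statistics import mean

# Hypothetical list of novelty-signaling words; a real analysis would
# need a principled, validated vocabulary.
NOVELTY_WORDS = {"novel", "new", "first", "unprecedented", "innovative"}

def novelty_rate(text):
    """Fraction of word tokens that are novelty-signaling words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in NOVELTY_WORDS)
    return hits / len(tokens)

def mean_rate_by_condition(ideas):
    """ideas: list of (condition, text). Returns mean rate per condition."""
    rates = {}
    for cond, text in ideas:
        rates.setdefault(cond, []).append(novelty_rate(text))
    return {cond: mean(values) for cond, values in rates.items()}

# Made-up example texts, purely for illustration.
IDEAS = [
    ("ai", "A novel and innovative method"),
    ("human", "We extend chain of thought prompting"),
]
print(mean_rate_by_condition(IDEAS))
```

A large gap between conditions after paraphrasing would suggest the style normalization did not fully remove surface novelty cues; it would not by itself say whether the underlying ideas are novel.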
Nathan: Were there any ideas generated by the AI that you felt were like just amazing? I mean, some of these ideas that got scored like a 10 on, let me just go back to the table. So individual scores, the highest novelty score for a human idea was an eight. The highest score for an AI idea was 10. For excitement, the highest score for a human idea was eight and the highest for an AI idea was nine. So I'm sure you've looked into the ideas scored 10 in novelty and nine in excitement. What did you see? Like, were those ideas that you were tempted to go run with yourself? Or I mean, just how good were those top-rated ideas?
Chenglei: There are a couple of ones that I personally really like. I mean, again, this whole thing is subjective. The ideas that I like the best may not be the ideas that other people like the best, but I personally really like the ideas under the uncertainty topic. Like we have seven topics and one of the topics is on prompting methods that can help us calibrate uncertainty and measure confidence. And then we have a bunch of AI-generated ideas on that. I think multiple ideas under that category look pretty interesting. There's this one example that we had in the paper, for example. It's sort of a much more fancy version of this self-consistency idea, where you generate many different diverse solutions given the query, and then what you do is not really simple majority voting, but rather you measure how each solution supports or refutes each other, and you kind of construct this graph thing, and then you can measure centrality or other graph metrics as a better way to measure the uncertainty of the solution for that query instead of just taking simple majority voting. So that's one example. And there's a couple of other ideas that we put in the paper that I thought were pretty interesting. Like, if it works, then that would be awesome. So one example is, again, on measuring uncertainty. So the idea is, what if we prompt the model to first generate so-called, like, uncertainty examples? Like, given this question, if I tell this answer, it has uncertainty of 10. If I tell a different answer, it has uncertainty of 9. And so on and so forth. So we kind of have this mapping of what would an answer with uncertainty 1 look like, what would an answer with uncertainty 2 look like. And then when given a new query, I generate a response, and I basically compare the response with the mapping and say, you know, this is closest to an uncertainty 5 answer, so the uncertainty will be 5, something like that. So it's really weird in the sense that it's quite different from any of the other uncertainty estimation methods that I've seen, but, you know, it sounds interesting. We didn't really test whether that would work. That would be for the next stage of the study. But if this really weird-looking idea actually works, then I think that would be pretty, yeah, interesting.
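[Editor's sketch] The graph-based variant of self-consistency described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the idea as written in the paper: `judge` is a hypothetical stand-in for an LLM call that rates how much one solution supports (+1) or refutes (-1) another, and the centrality used here is a simple average of incoming edge weights rather than whatever graph metric the original idea proposes.

```python
def build_support_graph(solutions, judge):
    """support[i][j] in [-1, 1]: how strongly solution i supports solution j."""
    n = len(solutions)
    return [
        [0.0 if i == j else judge(solutions[i], solutions[j]) for j in range(n)]
        for i in range(n)
    ]

def centrality_scores(support):
    """Average incoming support per node (a simple weighted in-degree)."""
    n = len(support)
    scores = []
    for j in range(n):
        incoming = [support[i][j] for i in range(n) if i != j]
        scores.append(sum(incoming) / len(incoming) if incoming else 0.0)
    return scores

def pick_answer(solutions, judge):
    """Return the most central solution and a confidence in [0, 1]."""
    scores = centrality_scores(build_support_graph(solutions, judge))
    best = max(range(len(solutions)), key=scores.__getitem__)
    confidence = (scores[best] + 1) / 2  # rescale from [-1, 1] to [0, 1]
    return solutions[best], confidence

def toy_judge(a, b):
    """Hypothetical judge: exact agreement supports, anything else refutes."""
    return 1.0 if a == b else -1.0

answer, confidence = pick_answer(["42", "42", "42", "41"], toy_judge)
print(answer, confidence)
```

With an exact-match judge this reduces to something close to majority voting; the proposed idea only becomes interesting when the judge can give graded, reasoning-aware support scores between differently worded solutions.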
Nathan: How did you think about what to do next, going down the path of actually executing ideas versus other dimensions of generalizing this idea? Like an obvious one would be to rerun a similar experiment in a different domain.
Chenglei: Yeah. It's not only another domain. It's also possible to, you know, relax the constraint on prompting. A lot of people are really criticizing that. You know, we could totally do something where, instead of just prompting ideas, we allow any sorts of ideas. Like, you know, I can imagine you can adapt the same idea generation system to generate ideas on better ways to construct synthetic training data, for example, better decoding methods, better training objectives for alignment, et cetera, et cetera. You can totally do that. I think it's a matter of which direction you want to head. There are a couple of things that are on top of my head right now. So one thing is a lot of people are saying the human ideas we collected in this study are not really representing the best human ideas, which is totally true. We actually asked that exact question, and the idea writers are saying these ideas represent around the average of their ideas. So this is really average PhD level rather than top PhD level. So if you want a stronger human baseline, one thing that we are thinking about doing is, there will be another upcoming NLP conference called EMNLP, and we can compare all our generated ideas with papers actually accepted at this top-tier conference. And then we would have a stronger baseline there, where we could assume those accepted papers will represent some higher quality work as compared to asking someone to generate ideas on the spot. So that's one thing. Another thing is the execution thing that I was talking about. We want to address all the subjectivity concerns in the current evaluation by testing whether AI-generated ideas actually work in practice. And I think that will give a much more objective evaluation of the quality of AI ideas. So that's another thing. And both of those are still on evaluating the research ideas. So a bigger step forward would be to complete the other part of the entire pipeline, and that is, we have generated all those ideas.
We want to find out whether we can automatically execute each idea so that we can actually verify the effectiveness of each idea. And we did some preliminary attempts on building such an execution agent. I think there's some crucial limitation in the evaluation of these agents, in the sense that, you know, you can have some tricks to have the agent generate the code, and the code could run successfully, and then it can actually give you some numbers. We built an agent that can implement the idea and tell you the baseline number and the newly proposed method's performance. But then the problem is, if you look at the actual code implementation, sometimes the model is not implementing the right baseline. For example, for a text classification problem, the agent was actually implementing a keyword-based method as a baseline, which is totally unreasonable in this era. So that's one thing. And sometimes the agent might skip some steps when implementing the proposed method. And those things are really tricky because it requires evaluation on the intermediate steps of the implementation rather than just the final outcome. So I think we need to do some more work there to figure out what's the best way to do this intermediate evaluation on the correctness of the implementation and also just to improve the agent in general. So that's something that we are thinking about. The hope is we can set up this evaluation of the execution agents so that we can post it as a challenge for the entire community, and people can all work on this problem of creating execution agents that can automatically implement research ideas and verify things. You know, if that works out in the end, then we can think of a very exciting combination of idea generation and automatic execution. And then we can scale this up and maybe we can possibly find some best paper level ideas that actually work in practice. And that would approach what I called in the beginning the eureka moment for automating research.
Nathan: Yeah. How much of a difference is o1 going to make here? It strikes me that in terms of accelerating your process, if o1 could get to human level evaluation and you could demonstrate that, then you would dramatically reduce your cycle time for one thing. So I wonder how close you think that is to happening. And then, you know, just curious for your intuition in general for where you think this next generation of model will make the most impact. And if you have any ideas around where you think it won't make an impact, that would be definitely really interesting too.
Chenglei: Yeah, I agree with you. I think I really need to try this o1. I think if o1 is able to get better at the automatic evaluation thing, it can be a great re-ranker that will help us with the inference scaling stuff. I think that's interesting. It's also interesting to test whether it has better diversity in the idea generation as well. I don't have any empirical evidence right now to see whether o1 will actually work better. So, you know, that's an open question that we want to try for sure. I think another interesting thing is that we're only doing prompting in the entire idea generation pipeline. And it seems that it's possible to also try some fine-tuning, in the sense that we actually have a lot of data, you know, on all the academic papers at those conferences. It feels like we can gather some meaningful dataset, and it's interesting to see whether smaller language models like Llama, after proper fine-tuning, would give some reasonable results in this pipeline as well.
Nathan: Okay, cool. I think we've covered it. Is there anything else that I haven't got to that you think folks should be aware of with respect to this work?
Chenglei: No, this all sounds good to me. One last advertisement thing, we're doing this execution study right now. If the audience is interested in participating by actually implementing one of the ideas, we pay a lot of money. We'll pay a couple of thousand dollars for implementing an idea. And that would be a great contribution to our study. Cool.
Nathan: Interesting. I like it. What's your expectation for the future? I mean, obviously we're getting into speculative territory here. How do you think about kind of key thresholds to be hit and how soon do you think they're likely to be hit?
Chenglei: So I think a concrete threshold is what I said in the beginning. If you can have an end-to-end system that can automatically think of an idea and execute the idea as a whole project, and that project can actually not just get into a top conference, but actually get a best paper award at the conference, that will be, I think, the major milestone we want to hit. I think the easier milestone is to just have the automated paper get into one of the top conferences. You know, I have some hot takes on this. My hot take is it's actually not that far away for that simple milestone to be achieved, like just getting into a top conference. I feel like even some of the ideas we have right now, I feel like the best ones, I could see them getting accepted at a conference like ACL. I've seen research of a similar shape being published there. I feel like we're probably getting there if we nail down the execution part. But of course, that will take some time for sure. But the really different question is the whole agenda. Like, the ultimate vision is not to just mass produce average pieces of research. I think the ultimate hope there is really that we are able to have this automated system that can generate research beyond the average PhD level. So I really think if we are able to at least demonstrate one case study where an automated idea gets implemented into something that can win a best paper award at a top conference, I think that's a strong sign. I feel like that's going to take us a couple of years. My hope is that by the time I graduate, I mean in three or four years, if by that time we are able to achieve this goal, I think that will be pretty exciting. I think that's an aggressive estimate, but that's a goal. Let's see if we can hit that.
Nathan: That's right in line with Leopold's Situational Awareness timelines. Cool, well, I really appreciate all your time and going into the many, many details and weeds of the implementation with me. Any closing thoughts before we break?
Chenglei: Oh, sounds good to me.
Nathan: Alrighty, cool. Well, the paper is, once again, Can LLMs Generate Novel Research Ideas? It's one of the most important questions in the world today. Chenglei Si, PhD student at Stanford, researching the automation of research. Thank you for being part of The Cognitive Revolution.
 
 
 
 
