Context by Cohere
LLM Agents and Evaluation: An Interview With Graham Neubig


Graham Neubig is an associate professor at CMU studying natural language processing and machine learning. In this interview, Graham speaks about LLM agents, evaluation, tool use, and the future of neural network architectures besides Transformers.

Here's my full conversation with Graham:

Question: How do you guide people (in both academia and industry) to think about the evaluation of LLMs or AI models in general?

Answer: Yeah, this is a really good question. I think there's certainly a place for academic datasets, and especially when people are choosing the base model they want to start out with, looking at the scores on academic datasets is probably not a bad idea. But of course, everybody has their own specific tasks that they're interested in. The thing that I definitely recommend everybody do is look at your data, look at the outputs, and try to identify errors in there. Even if you start out by looking at the scores, running the actual inputs that you care about through the model, looking at the data, iterating on that quickly, and finding error cases is really important here.

Question: Do you see academic datasets evolving towards more challenging and maybe more industry-representative use cases? Or is there a disconnect?

Answer: Yeah, I'm really hoping that's happening. I'm excited by the efforts in academia to keep up with this because, you know, if all the datasets we're working on are solved, research is no fun anyway. So I think, yeah, we're moving in that direction with things like agents answering more complex questions, very multilingual datasets, and other things like that.

Question: Agents are a big theme. There are reinforcement learning agents and there is this new rising type of LLM-backed agents. How do you think about agents and where do you see them going?

Answer: Yes. With LLM-backed agents, there are kind of two major categories of things that people are calling agents, and I actually think they're a little bit different.

The first one is just using tools to solve the task. So, for example, there are people who are trying to solve complex question-answering tasks that involve things like numerical reasoning or retrieval-augmented generation. Any time the model accesses an external tool, you could also call it an agent or something like that.

Then separately from that, there are also agents that act in the world. They actually act and make some sort of impact on the world, and that's another segment. I'm interested in both of them, but I'm maybe particularly interested in the latter, where you can actually ask an agent to go out and do something for you and it will do it for you.

Question: Are those agents the domain of reinforcement learning alone? Or how do you think about non-reinforcement-learning methods of training and improving these agents?

Answer: Just to give an example, we're presenting a poster here tomorrow on an environment called WebArena that we created. It's basically a set of four websites and a few other auxiliary websites: a shopping site, a content management site, a bulletin board site, and GitLab. We give the agent a command like "Please tell me all of the money I spent on Amazon in January," and it would go to your purchases page, filter down to the things in January, and add up the total amount. So this is an example of the type of agent that I'm thinking about.

And of course, other things go along with that, like, how do you realize this? The way we realized it was we basically just asked GPT-4. We fed it something called the accessibility tree, which is used in screen readers for blind or visually impaired people, and then we asked it to click on the next button. No reinforcement learning whatsoever.

You could do reinforcement learning, of course, but then you need a model that you can train, and it probably wouldn't work as well as GPT-4. So I think there's a lot of headroom to combine LLMs with reinforcement learning in these settings.
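To make the loop described in this answer concrete, here is a minimal Python sketch of an observe-then-act agent. Everything in it is an assumption made for illustration: `ToyShop`, `scripted_policy`, and the action strings are hypothetical and are not WebArena's actual interface, and the policy is a hard-coded stand-in for the place where a real system would send the goal and the accessibility-tree text to a model such as GPT-4.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Purchase:
    month: str
    amount: float


class ToyShop:
    """Hypothetical stand-in for a website; not WebArena's real API."""

    def __init__(self, purchases: List[Purchase]) -> None:
        self.purchases = purchases
        self.page = "home"
        self.month_filter: Optional[str] = None

    def observe(self) -> str:
        """Render the page as text, loosely modeled on an accessibility tree."""
        if self.page == "home":
            return "[1] link 'My purchases'"
        lines = [f"[2] combobox 'Filter by month' value '{self.month_filter or 'all'}'"]
        for p in self.purchases:
            if self.month_filter in (None, p.month):
                lines.append(f"row '{p.month}' amount {p.amount:.2f}")
        return "\n".join(lines)

    def step(self, action: str) -> None:
        """Apply one action string to the page state."""
        if action == "click [1]":
            self.page = "purchases"
        elif action.startswith("select [2] "):
            self.month_filter = action[len("select [2] "):]


def scripted_policy(goal: str, observation: str) -> str:
    """Placeholder for the LLM call; this toy version ignores `goal`."""
    if "link 'My purchases'" in observation:
        return "click [1]"
    if "value 'all'" in observation:
        return "select [2] January"
    rows = [ln for ln in observation.splitlines() if ln.startswith("row ")]
    total = sum(float(ln.rsplit(" ", 1)[1]) for ln in rows)
    return f"stop {total:.2f}"


def run_agent(env: ToyShop, policy: Callable[[str, str], str],
              goal: str, max_steps: int = 10) -> Optional[str]:
    """Observe, ask the policy for one action, apply it, repeat until 'stop'."""
    for _ in range(max_steps):
        action = policy(goal, env.observe())
        if action.startswith("stop "):
            return action[len("stop "):]
        env.step(action)
    return None  # step budget exhausted without an answer


shop = ToyShop([Purchase("January", 12.50), Purchase("January", 7.25),
                Purchase("February", 30.00)])
print(run_agent(shop, scripted_policy, "How much did I spend in January?"))  # prints 19.75
```

No training happens anywhere in this loop: the policy only reads text and emits the next action. In this sketch, swapping `scripted_policy` for a function that prompts a model is the only change needed, and a trainable policy would be the place where reinforcement learning could enter.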

Question: Okay. And with the mention of GPT, the T in GPT is Transformer. I'm curious what you think about architectures in general. Are transformers here to stay? Are there architectures that you think are starting to rival transformers?

Answer: Yeah, so the big news in architecture recently is Mamba, I think, which was also talked about at the keynote this morning. Basically, it's an architecture that is not a transformer. It's based on state-space models and it's faster in some ways. The results are impressive, and a lot of people are looking at that and thinking, you know, maybe this is the first strong evidence that transformers might not be all that we need.

One interesting thing also from the Mamba paper is that, even when we say "transformer," there are a lot of different varieties of transformers, and there have been a lot of architectural innovations within the transformer paradigm, such as RoPE positional embeddings, better activation functions, better training algorithms, and stuff like that.

So I think we will continue architectural innovation. I don't know if it's going to be big innovation or little innovation, but I think that's definitely still really important.

Question: What research areas are you curious about in 2024?

Answer: So one is agents. I also still think evaluation is going to be really, really important. How do we handle that? How do we make it reliable when we have models that are trained on all of the Internet and have seen a lot of our data? For example, if we build a benchmark based on Wikipedia, how much does data leakage from Wikipedia into our models affect their ability to handle it? How can we quantify that?

And then separately from that, I'm pretty interested in the small open source models that everybody's working with. How can we make small models that maybe don't rival big models on all tasks, but at least for a certain subset of tasks actually do really well, or are easily domain adaptable, or other things like that? So those are some of the things I'm personally interested in.
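On the question of quantifying leakage raised in this answer, one simple proxy is word n-gram overlap between a benchmark item and a reference corpus. The sketch below is an illustration only and not a method proposed in the interview: the function names are hypothetical, whitespace tokenization is a simplifying assumption, and it presumes you can read the corpus in question, which is often not the case for the training data of closed models.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased, whitespace-tokenized word n-grams of `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(benchmark_item: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the corpus."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item is shorter than n tokens, so nothing to compare
    corpus: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(item & corpus) / len(item)


corpus = ["the quick brown fox jumps over the lazy dog"]
rate = overlap_rate("quick brown fox jumps over a fence", corpus, n=3)
print(rate)  # 3 of the item's 5 trigrams occur in the corpus
```

A high rate suggests the item, or its source passage, may have been seen verbatim. A low rate does not rule out leakage, since paraphrased or translated copies would not match exactly.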
