Context by Cohere
Scaling Laws for AI: A Chat with MIT’s Neil Thompson

In this interview with the Director of MIT’s FutureTech Lab, we discuss the growing pains of generative AI and how to avoid the innovation trap of specialization.


Neil Thompson has been studying the impact of scaling computing power for some time. With degrees in computer science, statistics, economics, business, and public policy, Thompson offers a holistic perspective on the study of generative AI and how this emerging technology might scale.

Raised in Canada and endlessly curious, he has traveled the world in pursuit of ways to make it a better place, exploring innovations that, in his own words, “benefit and empower lots of people to do incredible things.” This journey led to his current role as the Director of the MIT FutureTech Lab, where a group of researchers and scientists explores technologies that underpin major economic and social transitions and have a big impact on our collective futures. Recently, he co-authored the paper The Grand Illusion: The Myth of Software Portability and Implications for ML Progress with Fraser Mince, Dzung Dinh, Jonas Kgomo, and Cohere For AI’s Sara Hooker.

We sat down with Thompson to talk about the paper and the broader research implications for AI. Below is a series of video and text excerpts of the conversation.


With your extensive background, what is it about AI that piques your curiosity?

Neil Thompson: I am originally from Toronto, Canada. That's where I grew up, but I lived all around the world for a while after college when I worked as a development economist. I spent quite a few years learning about government aid and private sector action and exploring the question of how we can help the world. I came to the conclusion that the biggest thing we can do to help the world is to have innovation that benefits lots of people and empowers them to do incredible things.

The more I studied innovation, the more I realized that computing in general, and now AI specifically, are areas with such enormous power for change that understanding them better and figuring out how we could harness them was particularly important.

I had been studying Moore’s Law and the increase in computing power when I attended a couple of talks discussing AI systems and the relationship between bigger models and better performance. This sparked my interest in AI. There seems to be a very interesting relationship between advances in hardware, the algorithms that are underpinning the work, and the progress that we're seeing.


You recently co-authored the paper The Grand Illusion on software portability. How did the paper come about?

Neil Thompson: The motivation for the paper came out of work that Sara Hooker and I had published independently with the Association for Computing Machinery (ACM). In both cases, we discussed the trend towards specialized computing. 

Historically, central processing units (CPUs) were designed to be very general and do lots of different things. Increasingly we have seen a move toward specializing chips by creating graphics processing units (GPUs) or tensor processing units (TPUs), which are tailored to be good at certain tasks. In essence, there is a trend of picking and choosing where to enhance performance and where to give it up.

The consequence of moving to specialized chips is that we have less flexibility. For example, GPUs allow us to do many more calculations in parallel, but at the cost of making very unpredictable workloads harder to manage. These imposed limitations can affect our ability to innovate and explore uncharted areas downstream. This can be a problem: with emerging technologies, we don't necessarily know where innovation will come from or how we may end up wanting to use the technology. For example, when personal computers were invented, who would have realized that the spreadsheet was going to be the killer app?

One of my colleagues here at MIT, Joel Emer, likes to call these unforeseen uses computational black swans. The paper sprang from these concerns about whether specialized chips and software programs hurt our ability to explore different use cases and innovations down the line. We pulled together a team of experts from the community to work on it, including Fraser Mince, Dzung Dinh, and Jonas Kgomo.


The paper reveals that software portability is indeed very hard. What are the implications for AI development?

Neil Thompson: As the trend to co-design software with hardware increases, we wanted to see whether this customization and matching, where the hardware and software are specifically designed to work well together, would make it harder to explore different solutions with either hardware or software. 

So, we tested how portable software really is when it is separated from the hardware it was co-designed with and moved to different hardware. How often does the program fail? How often does the program run slower? These are the two modes of failure. With the second mode, the program still runs, but you may end up paying such a performance penalty that it's no longer competitive.
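
To make those two failure modes concrete, here is a minimal sketch of that kind of check, written with PyTorch purely as an assumption on our part; it is not the benchmark harness the paper actually used. It tries an operation on a target device, reports an outright failure, and otherwise measures how much slower it runs than on a baseline device.

```python
# Toy check for the two failure modes described above: an operation that
# fails outright on a device, or one that runs but pays a speed penalty.
import time
import torch

def check_portability(op, make_inputs, device, baseline="cpu", repeats=10):
    """Run `op` on `device`; report failure, or its slowdown versus `baseline`."""
    def timed(dev):
        inputs = [x.to(dev) for x in make_inputs()]
        if str(dev).startswith("cuda"):
            torch.cuda.synchronize()          # make GPU timings honest
        start = time.perf_counter()
        for _ in range(repeats):
            op(*inputs)
        if str(dev).startswith("cuda"):
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / repeats

    try:
        target_time = timed(device)           # failure mode 1: op not supported
    except Exception as err:
        return {"status": "failed", "error": str(err)}
    slowdown = target_time / timed(baseline)  # failure mode 2: runs, but slower
    return {"status": "ok", "slowdown_vs_baseline": round(slowdown, 2)}

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    result = check_portability(
        torch.matmul,
        lambda: (torch.randn(512, 512), torch.randn(512, 512)),
        device,
    )
    print(device, result)
```

In the study itself this kind of comparison was run across many individual functions and device types; the sketch above only shows the shape of the test.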

We found that in many cases this portability is very hard. In particular, moving between GPUs and TPUs can be very difficult, with many individual functions failing along the way.

As an overwhelming number of people continue to choose GPUs as the platform, we will see a whole ecosystem built around them, and that's exactly the sort of co-design specialization we're studying. This raises some concerns: it may prevent the kind of future exploration we may want to do, which could matter a great deal if the best solution lies off the path we're on right now. It is incredibly hard to course correct.


Have you seen this level of specialization in other emerging technologies or is this an unusual situation?

Neil Thompson: Most technologies, as they develop, go through a life cycle, where initially you have a lot of exploration, and then people discover the best routes to go down, and then they end up pretty locked into these ecosystems. A nice example of this is motors. There was a time when people thought every home should have a home motor that was going to be used to power lots of different things. For example, you could attach it to something and it would be used for a blender. You could attach it to something else and it would be used for your vacuum cleaner or fan. Everyone would have one motor that provided utility for all of these different things.

And that's a little bit the way computers were with CPUs. It doesn't matter if you're watching Netflix or doing something else; you have this one computing capability that provides utility for all those things.

But what happened with motors? People realized that for a blender you really want a much more powerful motor than for a fan, where a very light, inexpensive one that doesn't use as much power will do. The technology then split apart, and specialized companies emerged to build custom motors for each use.

The unusual thing is that computing didn't go down this path, and that really was because of Moore's law. Even if you needed a bit of extra performance, as in the blender and fan example, Moore's law meant that all of our chips were getting better, cheaper, and more powerful at the same time, and that kept us on one unified platform that we could all work on.

The fact that computing has been so general for so long has been very useful for society. It meant that, for example, dentists, who were never considered the target market for early computing, were able to get powerful solutions anyway. If you go into a dentist's office today, they can take a digital X-ray and move it around and manipulate it on their computer. That kind of spillover into different use cases is really valuable. It's an area that many of us who study innovation and productivity think has a really important effect on the rest of society.

What's happening now is that Moore's law has slowed, and the drive toward greater efficiency that dominates the rest of innovation is kicking in: computing is breaking apart into co-designed, specialized areas. Although that's not unusual for a technology, it is worrisome. We should worry that potential use cases may be lost as more and more specialization emerges in computing.


How is specialization impacting chip makers and LLM providers, particularly for enterprise use? 

Neil Thompson: We see a growth in the manufacturing of specialized chips. At FutureTech, we're studying how many people are starting to build their own chips, and we are seeing a real change. Enterprises like car companies, which used to buy chips only from other players, are now saying they actually want to design their own.

This will likely create a landscape of chips that are even more efficient than the ones we have now. We're going to see a shift from generality toward speed or other particular things we care about. We already see hardware designers going from large 64-bit calculations to 32-bit to 16-bit. That form of specialization is great because it gives you a lot more efficiency, but it narrows what you can do with these systems. We're going to increasingly trade off generality for performance, and that's going to lock us in ever more to very specific types of calculations.
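
As a rough illustration of that 64-bit to 32-bit to 16-bit trade-off, here is a small NumPy sketch (our own toy example, not something from the interview): each drop in precision halves the memory footprint but widens the rounding error on the same values.

```python
# Toy illustration of the 64-bit -> 32-bit -> 16-bit trade-off:
# each step halves memory but makes individual values less precise.
import numpy as np

x = np.random.rand(100_000)                  # float64 by default
for dtype in (np.float64, np.float32, np.float16):
    y = x.astype(dtype)
    mem_kb = y.nbytes / 1e3
    # worst per-element rounding error relative to the float64 original
    err = np.max(np.abs(y.astype(np.float64) - x))
    print(f"{np.dtype(dtype).name:8s} {mem_kb:7.1f} KB   max rounding error {err:.2e}")
```

The same logic drives reduced-precision arithmetic on specialized chips: you give up some numerical generality in exchange for doing much more work per watt and per dollar.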

In parallel, there's a real question hanging over AI models: does the future look like one big, very general model, or a world with lots of specialized models? For example, ChatGPT may be really good at one overall thing, but in lots of other areas, where you have particular data or particular characteristics, it may be better to train a specialized model. If we think about the application of AI, we're going to be faced with this question of how much specialization versus how much generality we want.

For many companies, there's going to be not only a desire, but in fact, a need to build specialized models or models that are trained on their own data.

A lot of the research happening right now compares the value of data against the value of big models. If data ends up being the dominant factor in performance, then we'll see lots more separate models; if size is the most important factor, we'll see a few bigger models. Though there's still quite a good possibility that we'll see a lot of models being built either way.


If we expect a proliferation of AI models, how will these models differentiate and scale in the market?

Neil Thompson: I think probably the most interesting area of research right now in AI is around scaling laws – the idea that as you increase the size of the network, the number of parameters in the network, and the amount of data that you're feeding it, you get this very predictable increase in the performance of these systems. As it turns out, that predictability and that scaling law exist across many different areas. They can apply to text, images, and video. 
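
To give a sense of what these laws look like, the sketch below writes down one common power-law form from the scaling-law literature, with placeholder constants chosen only for illustration; it is not a formula from this interview or from FutureTech's work.

```python
# Generic sketch of a scaling law: loss falls predictably as a power law in
# parameter count N and training-data size D. Constants are placeholders.
def predicted_loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """One common form from the literature: L(N, D) = E + A/N**alpha + B/D**beta."""
    return E + A / N**alpha + B / D**beta

for N, D in [(1e8, 1e10), (1e9, 1e11), (1e10, 1e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss {predicted_loss(N, D):.2f}")
```

The striking empirical finding is how well simple curves like this fit across text, images, and video, even though no one fully understands why the fit is so good.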

That raises some very interesting questions about what it is about the match between the underlying problem you're trying to model and the system you're using to model it that produces this behavior. A lot of research has explored this phenomenon, but none of it definitively tells us how specific scenarios will scale.

Understanding this is so important because if you look across some areas of deep learning progress, and in particular image recognition, an area my lab has focused on, you can see that 70% of all of the change in performance in these systems has come just from being able to harness computing power better. There are mainly two ways to improve performance: harness more computing power, or improve the models themselves without adding computing power. The dominant form of improvement has come from increasing the size of these models and therefore the computing power they use. For example, OpenAI made a very clear decision that scaling was going to be how they improved their systems.

But we know that comes with costs. Costs of increasing the amount of compute needed. Costs in terms of the amount of energy that you're using and, depending on the energy source, the amount of carbon dioxide that you're producing. All of these things come with real economic and environmental costs.

We spend a lot of time thinking about what kind of functionality we are going to get out of these systems as we scale them up. How much better will AI systems get at math, computer vision, summarization, and so on as you give them more computing power? How much more productive are people going to get? Which tasks in the real economy can be automated? How much job replacement are we going to have?

And then, when you think about those tasks, you want to think not just about whether you could replace the human doing the task, but whether it would be economically advantageous to do so. As the models get bigger, that question becomes more relevant. My lab is doing a lot of work right now trying to make these connections and to understand the economic feasibility of those solutions. Because as AI gets bigger, the question is: are we going to see big labor effects? Are we going to see huge increases in the functionality of these systems? Or are we going to run into barriers where the systems get so expensive that we stop pushing them forward as much?


Do you have any advice for executives in the face of this emerging AI ecosystem? 

Neil Thompson: I have three recommendations for senior leaders trying to understand this area. 

First, when you think about deep learning, think about it as a continuum. At the start are simple models you might typically have built with a regression or some other simpler technique, and as you progress along the continuum, you add more and more flexibility to that model.

When do you need that flexibility? When you have a problem that is nonlinear and has lots of interactions between different parts and different kinds of data. That's when deep learning is very useful. This is a very helpful framework for understanding deep learning and moving it out of the realm of magic AI pixie dust and into the existing world of data, analysis, and statistics.
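
As a concrete version of that continuum, here is a short scikit-learn sketch (our own illustration, on a made-up dataset): a plain linear regression and a small neural network are fit to a toy problem with exactly the kind of nonlinearity and interactions described above.

```python
# A plain linear regression versus a small neural network on a toy problem
# with an interaction term and a nonlinearity -- the continuum in miniature.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5000, 3))
# target mixes an interaction (x0 * x1) with a nonlinearity (sin of x2)
y = X[:, 0] * X[:, 1] + np.sin(3 * X[:, 2]) + rng.normal(0, 0.1, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X_train, y_train)

print("linear regression R^2:", round(linear.score(X_test, y_test), 3))  # low
print("small neural net  R^2:", round(mlp.score(X_test, y_test), 3))     # much higher
```

When the relationship really is close to linear, the simpler end of the continuum is usually the better business choice; the extra flexibility only pays for itself on problems like the one above.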

The second piece of advice I would give to leaders is to think about the scale of these models. There's a very natural relationship between how big a model is and how well it does, and also between how big a model is and how much it costs to run. From a business point of view, thinking in that context lets you reason through the implications of model size, cost, and performance: does it make sense economically? It's a way to think about AI without having to think about all of the details of these models, which can be very complicated.

The last piece of advice I would give people, and this is a little more speculative, is around data. We see the incredible importance of high-quality, proprietary data that allows models to be trained much more efficiently. Increasingly, the focus will be on who has the really good data in particular areas, because that will allow them to build better AI systems. That's going to be a real shift that executives have to think about when they're thinking about the future of AI.


About the Contributors

Neil Thompson is the Director of the FutureTech research project at MIT's Computer Science and Artificial Intelligence Lab.

Sara Hooker is the Vice President of the independent research lab Cohere For AI.

Astrid Sandoval is the Executive Editor of Content Marketing and Thought Leadership at Cohere. 
