Over the last few months, speculation has been rife about the exact configuration of GPT-4. GPT-4 is widely regarded as the most capable LLM available, but OpenAI has never revealed the model’s size or parameter count. Some people, however, appear to know how many parameters GPT-4 has.
TinyGrad founder George Hotz has said on a podcast that GPT-4 is a 220 billion parameter 8-way mixture model. “Yeah, yeah, we could build (a similar model). So like the biggest training clusters today, I know less about how GPT-4 was trained. I know some rough numbers on the weights and stuff,” he said on the Latent Space podcast.
“A trillion parameters,” prompted the host. “Well, okay, so GPT-4 is 220 billion in each head,” Hotz replied. “And then it’s an eight-way mixture model. So mixture models are what you do when you’re out of ideas. So, you know, it’s a mixture model. They just train the same model eight times, and then they have some little trick. They actually do 16 inferences,” he said.
A Mixture of Experts model routes each input to different sub-models, known as experts. Conventional dense models reuse the same parameters for every input, but Mixture of Experts models use different parameters depending on the input.
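To make the idea concrete, here is a minimal sketch of how a Mixture of Experts layer might pick experts for an input. It is a generic top-k gating scheme and not a description of GPT-4, whose routing has not been disclosed. The ToyMoE class, its weights and the choice of two active experts out of eight are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


class ToyMoE:
    """Hypothetical toy layer: each 'expert' is a single weight vector."""

    def __init__(self, num_experts, dim, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # One parameter set per expert (a stand-in for a full network).
        self.experts = [[rng.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(num_experts)]
        # The gate scores every expert for a given input.
        self.gate = [[rng.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top_k experts."""
        probs = softmax([dot(g, x) for g in self.gate])
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        """Only the chosen experts' parameters are used for this input."""
        return sum(w * dot(self.experts[i], x) for i, w in self.route(x))


moe = ToyMoE(num_experts=8, dim=4, top_k=2)
x = [0.5, -1.0, 0.25, 2.0]
routing = moe.route(x)
y = moe.forward(x)
print("experts used:", [i for i, _ in routing])
print("output:", y)
```

In this sketch a different input can activate a different pair of experts, so only a fraction of the total parameters is used for any one input.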
Hotz wasn’t the only one with this bit of information. “I might have heard the same,” tweeted Meta AI’s Soumith Chintala, who leads PyTorch. “I guess info like this is passed around but no one wants to say it out loud. GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference. Glad that Geohot said it out loud. Though, at this point, GPT-4 is probably distilled to be more efficient,” he added.
It’s impossible to confirm whether this is true. OpenAI, paradoxically, has been very closed and tight-lipped about its newest models, and will likely not comment on the latest rumours. But OpenAI has nearly 350 employees, and it’s very possible that broad details of the model have leaked and are being passed around in the AI community. At 220 billion parameters, each expert is far smaller than the trillion parameters GPT-4 had long been rumoured to have, although eight such experts would add up to roughly 1.76 trillion parameters if none are shared between them. If an 8-way mixture of 220-billion-parameter models is the secret sauce, it’s quite likely that open-source models will soon try to adopt similar approaches.