MIT might be one of the hardest colleges to get into, and Mathematics and Computer Science might be among the most rigorous degrees to attain from its campus, but they seem to be no match for GPT-4.
GPT-4 is able to solve as many as 90% of questions from MIT's Mathematics, Electrical Engineering, and Computer Science curriculum, a study has found. This number rose to as high as 97% after using some specific techniques to improve GPT-4's performance, and the model was even able to obtain a perfect score with additional prompt engineering. Other models didn't fare quite as well — LLaMA was able to solve only 39 percent of the problems, while Stable Vicuna was able to solve 48%.
“We curate a comprehensive dataset of 4,550 questions and solutions from problem sets, midterm exams, and final exams across all MIT Mathematics and Electrical Engineering and Computer Science (EECS) courses required for obtaining a degree,” the paper by MIT and Harvard researchers says. “We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS. Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images,” it adds.
The researchers discovered that GPT-4's performance improved after employing some specific techniques. Few-shot prompting, in which GPT-4 was given a few examples of similar problems before being asked to solve a particular problem, got the accuracy up from 90% to 93%. Combined with Chain of Thought, in which LLMs are made to reason through problems step by step, accuracy climbed to 95%. When self-critique, or asking an LLM why its answer might be wrong, was added to the mix, the accuracy rose to 97%. And finally, when the LLM was asked to name an expert in the field, and then told to pretend to be that expert and answer the question, the researchers were able to attain 100% accuracy.
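The four techniques above can be sketched as simple prompt templates that stack on top of one another. This is a minimal illustration of the general idea, not the paper's actual prompts; all function names and prompt wording here are hypothetical:

```python
# Illustrative sketch of the four prompting techniques described above.
# The prompt text and helper names are assumptions, not the paper's wording.

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend a few solved examples of similar problems to the question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def chain_of_thought(prompt: str) -> str:
    """Nudge the model to reason through the problem step by step."""
    return prompt + " Let's think step by step."

def self_critique(prompt: str, draft_answer: str) -> str:
    """Show the model its own draft and ask why it might be wrong."""
    return (f"{prompt}\nDraft answer: {draft_answer}\n"
            "Explain where this draft might be wrong, "
            "then give a corrected final answer.")

def expert_prompt(question: str, field: str) -> str:
    """Ask the model to name an expert, then answer as that expert."""
    return (f"Name an expert in {field}. Now, answering as that expert, "
            f"solve the following problem.\nQ: {question}\nA:")

# Stacking few-shot with chain-of-thought, as in the 93% -> 95% step:
examples = [("What is 2 + 2?", "4")]
prompt = chain_of_thought(few_shot("What is 3 + 5?", examples))
```

Each technique only rewrites the text sent to the model, which is why they compose so easily; the study's accuracy gains came from layering them in exactly this cumulative fashion.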
It’s a pretty interesting result. For starters, it highlights how much daylight there is between GPT-4 and other models — while GPT-4 was able to solve 90% of questions, LLaMA managed just 39%, and Stable Vicuna-13B managed 48%. But more importantly, the result shows how a general model is now able to do better than some of the smartest humans on domain-specific tests: GPT-4 performs a wide variety of tasks, but also happens to be really good at some of the toughest questions that are taken by some of the smartest humans on the planet. These are still early days, but results like these indicate that an intelligence that’s general, or AGI, might not be such a distant dream after all.