Thus far, the conventional wisdom has been that larger models and larger fine-tuning datasets lead to more capable models, but that assumption is increasingly being put to the test.
A new model named LIMA claims performance comparable to GPT-4 after being fine-tuned on only 1,000 curated examples. LIMA, which stands for “Less Is More for Alignment”, was developed by researchers from Meta AI, Carnegie Mellon University, and other institutions. The model suggests that relatively small datasets may be enough to fine-tune foundation models for specific use cases.
“Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences,” the paper says in its abstract. “We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling,” it continues.
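The “standard supervised loss” the abstract refers to is ordinary next-token cross-entropy, i.e. the average negative log-likelihood of the target tokens. A minimal sketch in Python, assuming we already have per-token log-probabilities from the model, and assuming (a common convention, not something the abstract spells out) that only the response tokens count toward the loss:

```python
def sft_loss(token_logprobs, prompt_len):
    """Average negative log-likelihood over the response tokens only.

    token_logprobs: the model's log-probability for each token of a
    (prompt + response) sequence; the first `prompt_len` entries belong
    to the prompt. Masking prompt tokens out of the loss is a common
    fine-tuning convention, assumed here for illustration.
    """
    response = token_logprobs[prompt_len:]
    return -sum(response) / len(response)

# Toy example with made-up log-probabilities for a 4-token sequence
# (2 prompt tokens, 2 response tokens):
loss = sft_loss([-0.5, -1.0, -2.0, -1.0], prompt_len=2)
print(loss)  # 1.5
```

In a real fine-tuning run this quantity would be computed per batch and minimized with gradient descent over the 1,000 curated examples; the sketch only illustrates what the objective measures.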
Researchers found that LIMA’s performance was comparable to that of larger models, which had been fine-tuned on bigger datasets and further improved with RLHF. “In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output,” the paper adds.
“LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data,” the paper says.
LIMA was trained on examples drawn from three sources: Stack Exchange, wikiHow, and the Pushshift Reddit Dataset, supplemented with 200 prompts and answers carefully written by the authors themselves. The set also included some examples from natural language generation tasks such as summarization, paraphrasing, and style transfer.
The authors say that LIMA performed well with this relatively small training set. The implications could be significant: starting from an open-source foundation model, it appears possible to create a more specialized fine-tuned version with relatively few examples. Individual developers and companies could plausibly assemble a thousand examples for their own use case and fine-tune a model to suit their needs. The results still require independent evaluation, and some on Twitter are skeptical of the human evaluations that were done, but models like LIMA suggest it could soon be possible for small entities to build their own LLMs for their own use cases relatively cheaply and easily.