QLoRA Uses 4-Bit Quantization To Efficiently Fine-Tune Large Language Models, Dramatically Reducing Memory Requirements

Even as companies like OpenAI and Anthropic resolutely keep their models behind closed doors, rapid advances are being made in fine-tuning the growing number of open-source models quickly and cheaply.

A new technique named QLoRA makes it possible to fine-tune LLMs far faster and more cheaply than training them from scratch. QLoRA builds on LoRA (low-rank adaptation), which freezes most of an LLM's parameters and trains only a small number of additional ones.
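The LoRA trick can be illustrated in a few lines. The sketch below is not the paper's code; it simply shows a linear layer whose pretrained weight is frozen while a small pair of low-rank matrices is trained instead (the names LoRALinear, lora_A, and lora_B are purely illustrative).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weight plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Stand-in for the pretrained weight: frozen, so it receives no gradients.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: only rank * (in_features + out_features) parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero, so the adapter initially changes nothing
        self.scaling = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8,192 trainable values vs. 262,144 frozen ones
```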

“QLORA reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline,” the paper says. “This marks a significant shift in accessibility of LLM finetuning: now the largest publicly available models to date finetunable on a single GPU,” it adds.

QLoRA achieves its efficiency in three ways. First, it uses block-wise k-bit quantization, which compresses the frozen model weights from 16-bit to 4-bit precision with little loss of information, saving a large amount of memory. Second, it uses low-rank adapters, which keep most of the model's weights frozen so that only a small fraction of parameters needs to be trained.
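To make the quantization idea concrete, here is a toy sketch of block-wise absmax quantization in PyTorch. It is not the NF4 data type or the double quantization described in the QLoRA paper, and the function names are made up for illustration; it only shows how splitting a tensor into blocks, each with its own scale, keeps the error small while storing values in a handful of bits.

```python
import torch

def blockwise_quantize_4bit(x, block_size=64):
    """Quantize a tensor block by block to 15 symmetric levels (a stand-in for 4-bit storage)."""
    flat = x.flatten()
    pad = (-flat.numel()) % block_size          # pad so the length divides evenly into blocks
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    # One scale per block: the largest absolute value inside that block.
    scales = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)
    # Map each value onto the integer grid -7..7.
    q = torch.clamp(torch.round(blocks / scales * 7), -7, 7).to(torch.int8)
    return q, scales

def blockwise_dequantize(q, scales, shape, numel):
    """Reconstruct an approximation of the original tensor from the quantized blocks."""
    blocks = q.to(torch.float32) / 7 * scales
    return blocks.flatten()[:numel].view(shape)

w = torch.randn(4, 96)
q, scales = blockwise_quantize_4bit(w)
w_hat = blockwise_dequantize(q, scales, w.shape, w.numel())
print(float((w - w_hat).abs().max()))  # small per-element reconstruction error
```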

Third, QLoRA uses paged optimizers, which prevent the memory spikes that occur during gradient checkpointing from causing the out-of-memory errors that have traditionally made it difficult to fine-tune large models on a single machine.
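In practice, these pieces are often wired together through the Hugging Face ecosystem (transformers, peft, and bitsandbytes). The sketch below is one plausible configuration rather than the paper's exact recipe; the model id is a placeholder and argument names may differ across library versions.

```python
# Hedged sketch of a QLoRA-style setup; adjust names and values for your own use case.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # the NormalFloat data type from the paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants as well
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable low-rank adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",              # paged optimizer to absorb memory spikes
)
```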

And QLoRA seems to have the results to show that it works. “Using QLORA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark. When deployed, our smallest Guanaco model (7B parameters) requires just 5 GB of memory and outperforms a 26 GB Alpaca model by more than 20 percentage points on the Vicuna benchmark,” the paper says.

This seems very promising, and could democratize LLM fine-tuning. Right now, fine-tuning large models is cumbersome and takes a lot of time and computational resources. But if techniques like QLoRA can make the process quicker and cheaper, they could pave the way for many open-source LLMs that are personalized for specific use cases: users would be able to train LLMs to meet their own needs and avoid large companies like OpenAI and Google entirely. It's still early days for such technologies, but there are some very encouraging signs that open-source models could end up being quite competitive with massive corporate models in the years to come.