GPT-3.5 More “Creative” Than 63% Of Humans, GPT-4 More “Creative” Than 95.6%

There are already indications that LLMs can produce creative output: they can write poems and compose short stories, for instance. But a Twitter user has reported that GPT-4 beats nearly all humans on a test designed to measure creativity.

GPT-3.5 and GPT-4 have achieved impressive scores on the Divergent Association Task, a test that aims to measure verbal creativity, a Twitter user says. The test involves thinking of 10 unrelated words. People who are more creative tend to think of words with greater “distances” between them. For example, the words cat and dog are similar, but the words cat and book are not. These distances are inferred by examining how often the words are used together in similar contexts.
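To make the idea of “distance” concrete, here is a minimal Python sketch of how a score of this kind could be computed. It is an illustration under stated assumptions, not the test's actual implementation: it assumes each word is represented by a vector, that the distance between two words is the cosine distance between their vectors, and that the score is the average distance over all word pairs multiplied by 100 (which would be consistent with a theoretical range of 0 to 200). The function names and the tiny three-dimensional vectors are hypothetical and chosen only so the example runs on its own.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity, a value between 0 and 2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_style_score(words, embeddings):
    """Average pairwise cosine distance between the words, times 100.

    Assumption: this mirrors the general shape of the scoring described
    in the article; the real test may filter or weight words differently.
    """
    vectors = [embeddings[w] for w in words]
    pairs = list(combinations(vectors, 2))
    mean_distance = sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
    return 100.0 * mean_distance


# Hypothetical toy vectors, NOT real word embeddings.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "book": [0.0, 0.1, 0.9],
}

similar = dat_style_score(["cat", "dog"], TOY_EMBEDDINGS)
unrelated = dat_style_score(["cat", "book"], TOY_EMBEDDINGS)
print(f"cat/dog: {similar:.1f}  cat/book: {unrelated:.1f}")
```

With these toy vectors, the cat and dog pair scores far lower than the cat and book pair, matching the intuition in the example above. A real scorer would replace the toy dictionary with vectors learned from a large text corpus.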

According to Twitter user @antrupad, GPT-3.5 got a score of 80.19 on the test, which was higher than 62.86% of humans who had taken the test. This was also higher than the average score of 78. “Most people score between 74 and 82,” the test says. “The lowest score was 24 and the highest was 96 in our published sample. Although the scores can theoretically range from 0 to 200, in practice they range from 6 to around 110 after millions of responses online,” the test adds.

This would be impressive by itself, since GPT-3.5 did better on a creativity test than the average human, but GPT-4 blows that result out of the water. GPT-4 scored 89.39 on the test, which was higher than an astonishing 95.6% of all people who’d taken the test.

These are pretty astonishing results. While there are several approaches to measuring creativity, and this one test doesn’t cover all kinds of creative output, it is still a pointer in the direction things are heading. It was already clear that LLMs were “creative”, but it turns out that, on this test at least, they are already more creative than all but 5% of humans. And the rate of progress is also remarkable: OpenAI’s models went from being better than about 63% of humans to being better than 95% of humans in a single iteration. These are still early days, but all trends indicate that we might already be on the cusp of creating an AGI.