OpenAI has already built the world’s best AI products by training them on large parts of the internet, but it has now revealed how website owners can prevent their data from being used to train future iterations of its models.
OpenAI has launched a web crawler named GPTBot that will collect data from the internet to train its AI models. “Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI said.
But OpenAI has also given website owners a way to opt out of having the bot collect data from their sites. “To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt,” the company said:
User-agent: GPTBot
Disallow: /
The robots.txt file tells automated crawlers how to interact with a website. It’s commonly used to keep search engines away from an entire website, or from certain parts of it. For GPTBot, webmasters will be able to allow or disallow certain parts of a website by updating their robots.txt as follows:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
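For webmasters who want to check what such rules permit before publishing them, the sketch below feeds the two snippets above to Python’s standard-library robots.txt parser. The helper name `is_allowed` and the `example.com` URLs are illustrative assumptions, not anything published by OpenAI, and the result only shows how Python’s parser reads the rules; it does not show how GPTBot itself interprets them.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the article: block GPTBot everywhere.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rules copied from the article: allow one directory, block another.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain used only for illustration.
    for path in ("/directory-1/page.html", "/directory-2/page.html"):
        url = "https://example.com" + path
        print(path, is_allowed(PARTIAL, "GPTBot", url))
    print("/", is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/"))
```

Python’s parser applies the first rule that matches a path, which may differ from how a particular crawler resolves overlapping Allow and Disallow lines, so treat this as a sanity check rather than a guarantee.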
The announcement of GPTBot gives website owners some control over how their data is used by OpenAI to train its AI programs such as ChatGPT. There has been some controversy over programs such as ChatGPT being trained on large amounts of text without permission. It’s still not clear what sort of data ChatGPT was trained on, but users have found that it’s able to reproduce large portions of the Harry Potter books sentence by sentence, indicating that copyrighted works might’ve been included in the training set. The legality of this is unclear: OpenAI and other companies haven’t compensated any authors or websites for using their data, but have made money from the products that were created by using that data. It remains to be seen how these issues are litigated in the future, but for now, OpenAI seems to be giving website owners an option to opt out of having their data used to train AI models.