Data protection advocate Max Schrems has lodged a complaint with the Austrian Data Protection Authority against OpenAI, the company behind the popular AI assistant ChatGPT. The proceedings could be a milestone in applying data protection law to large language models that process personal data.
The name Max Schrems is most often associated with litigation against Facebook in the wider context of personal data protection in the European Union. In the Schrems I and Schrems II cases, the privacy advocate brought down successive data transfer frameworks between the EU and the United States. The first case was brought on the basis of Mr Schrems’ complaint about his personal data being transferred to the US by Facebook Ireland, and its timing coincided with Edward Snowden’s 2013 revelations of the US’s surveillance programmes. In the end, the Court of Justice of the European Union acknowledged Mr Schrems’ concern that US authorities could access his personal data and declared the adequacy decision underpinning the “Safe Harbour Framework” invalid. Its successor, the “Privacy Shield”, was yet again challenged by Mr Schrems and ultimately invalidated in 2020.
Earlier this year, in April, the Austrian activist and lawyer, now at the helm of Noyb (short for “none of your business”), his Vienna-based data protection organisation, lodged an official complaint against OpenAI with the Austrian data protection authority. The complaint centres on the fact that ChatGPT, the company’s massively popular artificial-intelligence-powered chatbot, could not accurately provide Schrems’ date of birth when prompted. Because the correct date is not disclosed anywhere online, ChatGPT repeatedly asserted false dates in its responses. To Noyb’s request to rectify the mistake, OpenAI responded that “factual accuracy in large language models remains an area of active research”.
Noyb alleges two GDPR violations in their complaint: first, a breach of Article 5(1)(d) requiring that processed personal data be accurate and up-to-date, erasing or rectifying it where this is not the case; and second, a failure to observe transparency obligations and provide data subjects with access to their data under Articles 12(3) and 15.
“AI can do anything” – but can it comply with data protection law?
Presently, correcting small factual inaccuracies such as an individual birthdate in large language models (LLMs) such as ChatGPT is difficult. To illustrate the technical challenge, a brief explanation of the nature of LLMs is in order. They are a popular application of machine learning technology (commonly referred to as artificial intelligence) that produces natural-language text responses to a user’s prompt. AI chatbots are therefore widely used as an alternative to search engines. At first glance they closely resemble the Google search bar: an interface that takes a query and promptly returns a structured, plausible answer. But while both are common use cases of AI, there are important differences between search engines (another application of AI) and chatbots based on LLMs. It is much closer to reality to think of an LLM as a “calculator for words”. LLMs do nothing more than convert the words in a text into numeric values and then estimate which words are most likely to appear next, based on an enormous database of texts available on the Internet. They are very good at word prediction, and because prediction is all they do, this explains not only the humanlike quality of the responses they generate, but also their lack of understanding of complex topics from a wider perspective. Genuine understanding therefore remains far beyond AI chatbots, no matter how smart or confident their responses may sound.
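To make the “calculator for words” idea concrete, the sketch below builds a deliberately simplified next-word predictor from a handful of invented sentences: it merely counts which word most often follows another and returns the most likely continuation. This is not how ChatGPT actually works internally, but the underlying idea of estimating the most probable next word is the same.

```python
# Toy illustration of next-word prediction: a bigram model counts which word
# most often follows another in a tiny corpus, then picks the likeliest
# continuation. Real LLMs use neural networks over vast corpora, but the
# basic principle of "estimate the most likely next word" is the same.
from collections import Counter, defaultdict

corpus = (
    "max schrems lodged a complaint against openai . "
    "max schrems lodged a complaint against facebook . "
    "the authority examined a complaint against openai ."
).split()

# Count how often each word follows each preceding word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word, or None if unseen."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("complaint"))  # -> "against"
print(predict_next("against"))    # -> "openai" (2 of its 3 occurrences)
```

The model has no notion of who Max Schrems is or whether a statement is true; it simply reproduces whatever continuation is statistically most plausible in its training data.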
The problem with using a database consisting of a myriad of texts from the web is its size. This massive wealth of data takes up so much storage space that it can only be kept after compression, that is, after reducing its size by simplifying it or erasing redundant information. Anyone who has ever sent an image through an instant messaging app may have noticed that its quality on the receiving end tends to decrease noticeably due to compression. For example, compressing a photo often reduces dark parts of the image to blobs of the most frequent hue in that area, containing far fewer colours and therefore less information than the original file. This reduction of complexity through compression is what is referred to as “loss”. It is much the same with lossy text compression: the compressed file contains much less information than the original. This also means that an exact piece of text can never be retrieved, only an AI-generated replica of it, built from the words most likely to occur in a given context. Lossy compression is therefore the cause of the “subtle mistruths” in a chatbot’s responses. In sum, full factual accuracy remains a major technical problem for LLM providers to solve. Noyb’s case against OpenAI is bringing the issue into the spotlight, and they predict that it might only be resolved before the CJEU.
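The effect of lossy compression can be shown with a deliberately simplified, hypothetical example (the brightness values below are invented for illustration): quantising a list of 8-bit pixel values down to four levels shrinks what has to be stored, but only an approximation of the original can ever be reconstructed.

```python
# Minimal sketch of lossy compression: quantising brightness values to a
# handful of levels saves space, but the original can never be recovered
# exactly; only an approximation remains, which is the "loss" described above.
levels = 4                 # keep only 4 distinct values instead of 256
step = 256 // levels       # width of each quantisation bucket (64)

original = [12, 13, 140, 142, 200, 203, 255]           # 8-bit pixel values
compressed = [value // step for value in original]      # store tiny codes 0..3
reconstructed = [code * step + step // 2 for code in compressed]

print(compressed)      # [0, 0, 2, 2, 3, 3, 3]  (far less information)
print(reconstructed)   # [32, 32, 160, 160, 224, 224, 224]  (close, but not the original)
```

The reconstructed values are plausible but wrong in the details, which is, by analogy, what happens when an LLM “decompresses” its training data into an answer about someone’s date of birth.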
In fact, the notorious problem of “AI hallucinations” seems at first glance irreconcilable with EU data protection law as it currently stands. Article 5(1)(d) GDPR in particular requires personal data to be accurate and up-to-date and imposes an obligation on controllers to take “every reasonable step” to erase or rectify inaccurate personal data without delay. While this is certainly feasible for controllers using other technologies, the nature of LLMs is at odds with the level of control required to erase or change small pieces of data individually. The processes within an AI system are not as transparent as those of other technologies, because they are not restricted to the human-written code at their core; instead, such systems learn autonomously from large datasets, as described above. Machine learning has therefore often been described as a “black box”, since no one knows for certain what exactly goes on inside the “machine”.
The same problem arises with regard to Article 15 GDPR, which grants data subjects a “right of access” not only to the data itself but also to the purpose of processing, the categories of data processed, potential third-party recipients of the data and various other information. Once again, the architecture typical of LLM systems makes it exceedingly difficult to manipulate, erase or give access to data points in isolation. Programming an AI chatbot not to display inaccurate information therefore still falls short of controllers’ obligations regarding accuracy and the right of access, because the falsehood remains in the system, merely hidden from users.
What Noyb hopes to do about it
While this article was being written, Noyb were asked via email to comment on the legal problems outlined above. In their response, they stated that they were in no position to comment on what is and is not possible when designing LLMs, and instead pointed to the position of some companies that claim making such systems GDPR-compliant is downright impossible. For Noyb, it follows that if this is indeed the case, such technologies should simply not be used until they can be made compliant, or until legislation is adapted to cover every possibility. Noyb also warned against a narrative commonly deployed around emerging technologies (they listed big data, cryptocurrencies and AI as examples), according to which new technologies fall outside the law simply because legislators have not reacted to them quickly enough; as they pointed out, it is a legal fallacy to claim that a technology is not covered by law just because it is new.
The NGO was also asked about the potential impact of the case now being brought against OpenAI. As far as they know, the Austrian DPA plans to send the complaint to the Irish Data Protection Commission, where, as Noyb rather pessimistically predict, “it will inevitably die like most of the data protection cases that reach that authority”. From a wider perspective, they are aware of several similar complaints brought by individuals across Europe at the same time. Finally, Noyb hope that the European Data Protection Board will in due course adopt a “more helpful position” than that taken in the recent report drawn up by the Board’s ChatGPT taskforce.
Conclusion
The existence of this case illustrates yet again that there is still a considerable amount of work to be done by both legislators and innovators. Small factual inaccuracies in LLMs are difficult to correct, largely due to the very nature of the technology. The complaint lodged by Noyb throws into sharp relief the areas of both data protection legislation and AI systems where adjustments are needed to ensure the protection of data subjects’ rights. While Noyb’s complaint in particular is not expected to turn the world of AI upside down, it may nevertheless prove an important step on what will no doubt be a long and winding road towards data protection in the face of emerging technologies.