And what does OpenAI do with my data?
The simple answer to the question ‘Who owns the inputs of LLMs?’ is: nearly all of us do, to some extent. Training an LLM tends to involve feeding it vast amounts of information, including whole chunks of personal information and personal data. This use of personal details is governed by legislation, although that doesn’t seem to have received much serious consideration so far.
According to various data protection laws (such as the Privacy Act and the GDPR), there needs to be some legal justification for personal data/information to be used. Consent is an important example of such a justification, but we can safely assume that the individuals in question won’t all have given valid consent to LLM training. In Australia, the Privacy Act also allows personal information to be used where this would be reasonably expected, provided it’s not too far removed from the purpose for which the data was originally collected – and that can be hard to identify or prove. Under the GDPR, there’s an alternative in the form of ‘legitimate interests’, but this involves balancing the interests of the LLM’s operator against those of the individual.
Not all laws are equal, of course. The USA has a fairly laissez-faire attitude to data protection, stemming from its First Amendment right of free speech (and let’s remember that most currently available LLMs are based in the US). The UK and EU, on the other hand, have probably the strongest data protection laws in the world in the form of the GDPR, and the EU seems especially keen to enforce them. China has its own privacy and cybersecurity laws, modelled closely on the GDPR. Australia’s laws are actively being revised and updated.
So most LLMs might not be fully compliant with all relevant worldwide laws. This isn’t just a matter of principle – it can have serious practical consequences. In 2022, French authorities fined Clearview AI €20 million and banned it from collecting and processing personal data without a legal basis. Clearview was also ordered to delete some data from its model, and was then fined a further €5.2 million after it failed to comply. In 2023, Italy temporarily banned ChatGPT completely for similar reasons.
No action has been taken in Australia yet, but it’s not just a question of justification – there are also serious concerns about the accuracy of the data. LLMs are notorious for ‘hallucinating’, and it might not be clear whether any particular ‘fact’ is correct or simply made up.
As far as privacy and data protection are concerned, it doesn’t matter how big or complicated your dataset is: you’re still required to comply with the law and to respect the rights of individuals as regards their data. And the bigger you are, the bigger your problems get. But it’s no excuse to hide behind the complexity of AI and say it’s all just too difficult. In theory, OpenAI, for instance, need to be set up to handle millions of data subject access requests. Can they explain to each and every one of us what data of ours they hold and how they justify processing it? And what might it cost them to keep cleansing and retraining ChatGPT to implement our ‘right to be forgotten’ claims?