On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
In March this year, the ACM FAccT conference published the paper titled above, authored by Emily Bender, Timnit Gebru, Angelina McMillan-Major and “Shmargaret Shmitchell”. The paper attracted intense controversy and led to the departure of Gebru and Margaret Mitchell (“Shmargaret Shmitchell” in the paper) from Google’s AI Ethics team. I discuss here the ideas put forward in the paper, which are especially relevant at a time when R&D in AI continues at a breakneck pace.
“Stochastic parrots” refers to language models (LMs) trained on enormous amounts of data. Language models are, at their core, statistical models of language: they predict the likelihood of the next token given some context, whether preceding or surrounding. The paper analyzes how such models are used and details the limitations of training them on unimaginably large datasets and deploying them in real-world applications. It also critiques the present direction of research, which leans heavily on numeric metrics as proof of competence, and suggests more ethical, carefully considered approaches to the tasks that employ LMs.
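To make the phrase “predict the likelihood of the next token” concrete, here is a minimal sketch of next-token prediction. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, which are my choices for illustration; the paper itself is not tied to any particular model or toolkit.

```python
# Minimal sketch: ask a pretrained causal LM for the most likely next tokens.
# Assumes `torch` and `transformers` are installed and the "gpt2" checkpoint
# can be downloaded; this is an illustrative choice, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The paper argues that language models can be too"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```

The printed probabilities are nothing more than statistics of the training corpus, which is exactly the sense in which the authors describe these models as “stochastic parrots”.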
Many researchers and companies are now building larger and larger models with billions of parameters, such as GPT-3, MegatronLM and Switch-C, the last of which has a whopping 1.6 trillion parameters. These models are trained on massive amounts of (mostly English-language) data obtained by crawling the internet over many years. Transformer-based models have been shown to perform better as datasets and parameter counts grow, so the trend is to collect ever more data and keep scaling the models up. This is expected to continue for the foreseeable future.
The authors, however, identify several problems with this approach. Some of these are:
- Environmental and financial costs: As larger models continue to be developed, the environmental and financial costs behind them rise as well. The authors cite an estimate that training a single BERT base model on GPUs requires as much energy as a trans-American flight. Moreover, the burden of the environmental consequences falls disproportionately on marginalized communities. Discussion of these factors is slowly coming to the fore, with efficiency being proposed as an evaluation metric, online tools emerging to analyze energy usage (a minimal sketch of such tracking appears after this list), and growing pressure on companies to adopt sustainable training practices.
- Enormous size of training data: The vast amounts of data available for training today have many unforeseen consequences, especially for people who will not be on the receiving end of the benefits. Some of these are listed below.
- Just because a dataset is large and covers much of the internet does not mean it is representative of our collective humanity. For one, most datasets are in English, and a significant fraction of the world does not speak English. For another, in several regions only the privileged have access to the internet and can voice their opinions. Datasets like Common Crawl therefore mostly contain data sourced from a privileged, English-speaking population, which perpetuates hegemonic viewpoints.
- Large datasets are mostly static: they were collected at one point in time and stored away. Social norms across the world, however, are in a constant state of flux, so such static data cannot be considered an accurate representation of our times. Yet, given the sheer size involved, it is not feasible even for large corporations to fully re-train these models on new data at frequent intervals.
- LMs trained on such datasets exhibit a wide variety of biases, because they simply ingest whatever is given to them and reproduce it. This is extremely problematic, especially considering that such models are deployed at scale and reach people across the world. Training on noxious data reinforces biases (racial, sexist, ageist and so on) and perpetuates injustice towards commonly targeted and marginalized groups. There have been instances where language models or bots released on social media soon started emitting racist and otherwise harmful language. Such technology is especially concerning in the hands of bad actors, for example to spread extremist content.
- The people from whom the data is sourced are nowhere in the pipeline of language model construction, yet anyone who sees the output may interpret it as something a person actually said. This is a particular problem in machine translation: if a sentence in one language is translated incorrectly, a reader of the translation may take it to be the original speaker’s own statement. This has caused, and will continue to cause, unintended problems in many situations.
- Language models don’t really understand language. They simply analyze what limited data is given as input, and use it to output text. So while the sentences may make sense grammatically and semantically, there is no actual human meaning to them, because LM comprehension is not rooted in humans, it is rooted in whatever data it is fed. It cannot, and must not, be treated like an authority or truthful entity in any circumstances.
Finally, after elaborating on these issues and the risks associated with massive LMs, the authors suggest several paths forward. These include an emphasis on data documentation, ethical and fair data curation, and a reassessment of whether a large language model is really needed for each and every task. Furthermore, they advocate a re-alignment of research goals: instead of chasing the highest accuracy on benchmark metrics, researchers should examine how a model behaves in real-life systems, how it actually accomplishes the task, and how efficient it is.