On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

In March this year, the ACM FAccT conference published the paper titled above, authored by Emily Bender, Timnit Gebru, Angelina McMillan-Major and “Shmargaret Shmitchell”. The paper attracted intense controversy even before its publication, and led to the exits of Gebru and Margaret Mitchell (the “Shmargaret Shmitchell” of the byline) from Google's AI Ethics team. Here, I discuss the ideas put forward in the paper, which are critical at a time when R&D in AI continues at a skyrocketing pace.

“Stochastic parrots” refers to language models (LMs) trained on enormous amounts of data. A language model is, at its core, a statistical model of language: it predicts the likelihood of the next token given some context, whether preceding or surrounding. The paper analyzes how such models are used and enumerates the limitations of training them on unimaginably large datasets and deploying them in real-life applications. It also examines the present direction of research, which is highly focused on numeric metrics as proof of competence, and suggests alternative, more ethical and carefully considered approaches to the tasks that employ LMs.
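
To make the “statistical model” part concrete, below is a toy bigram model in Python. This is a minimal sketch of my own, not anything from the paper: it estimates next-token probabilities from raw counts, whereas a modern LM learns those probabilities with millions or billions of parameters.

    # A toy language model: estimate P(next token | previous token)
    # from raw counts in a tiny corpus. Modern LMs replace these counts
    # with billions of learned parameters, but the task is the same.
    from collections import Counter, defaultdict

    corpus = "the parrot repeats the phrase and the parrot repeats it".split()

    # Count how often each token follows each context token.
    follow_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follow_counts[prev][nxt] += 1

    def next_token_probs(context):
        """P(next token | context token), estimated from counts."""
        counts = follow_counts[context]
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}

    print(next_token_probs("the"))  # roughly {'parrot': 0.67, 'phrase': 0.33}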

Many researchers and companies are now building larger and larger models with billions of parameters, such as GPT-3, MegatronLM and Switch-C, the last of which has a whopping 1.6 trillion parameters. These models are trained on massive amounts of (mostly English-language) data obtained by crawling the internet over years. Transformer-based models have been shown to perform better as dataset size increases, so the trend is to collect ever more data and keep growing the models, and this is expected to continue in the near future.
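
Some quick back-of-the-envelope arithmetic (my own, not from the paper) shows why these parameter counts matter: merely storing the weights of such models runs into hundreds or thousands of gigabytes, before accounting for the gradients and optimizer state that training also requires.

    # Rough memory cost of just storing model weights, using the parameter
    # counts mentioned above. Assumes standard 4-byte (fp32) and 2-byte
    # (fp16) floats; training needs several times more for optimizer state.
    def weight_memory_gb(n_params, bytes_per_param=4):
        return n_params * bytes_per_param / 1e9

    for name, n_params in [("GPT-3", 175e9), ("Switch-C", 1.6e12)]:
        print(f"{name}: {weight_memory_gb(n_params):,.0f} GB (fp32), "
              f"{weight_memory_gb(n_params, 2):,.0f} GB (fp16)")
    # GPT-3: 700 GB (fp32), 350 GB (fp16)
    # Switch-C: 6,400 GB (fp32), 3,200 GB (fp16)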

The authors, however, point out several problems with this approach. Some of these are:

  1. Environmental and financial costs: As ever-larger models are developed, the environmental and financial costs behind them rise as well. According to the authors, training a single BERT base model on GPUs was estimated to require as much energy as a trans-American flight. Moreover, the burden of the environmental consequences is borne disproportionately by marginalized communities. Discussion of these factors is slowly coming to the fore, with efficiency being proposed as an evaluation metric, online tools emerging to analyze energy usage (the kind of accounting sketched below), and pressure mounting on companies to devise sustainable training practices.
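
For a sense of how such energy estimates are produced, here is a rough sketch following the accounting style of Strubell et al. (2019), the study behind the BERT-versus-flight comparison. The PUE factor (1.58) and the US-average grid intensity (0.954 lbs of CO2 per kWh) are figures used in that paper; the GPU numbers in the example are hypothetical placeholders, not measurements of any real run.

    # Estimate CO2 emissions of a training run from average GPU power draw.
    # PUE (power usage effectiveness) accounts for datacenter overhead
    # such as cooling, on top of what the hardware itself consumes.
    def training_co2_lbs(avg_power_watts, n_gpus, hours,
                         pue=1.58, lbs_co2_per_kwh=0.954):
        kwh = avg_power_watts * n_gpus * hours / 1000 * pue
        return kwh * lbs_co2_per_kwh

    # Hypothetical run: 64 GPUs drawing ~300 W on average for 80 hours.
    print(f"{training_co2_lbs(300, 64, 80):,.0f} lbs CO2")  # ~2,315 lbs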

Finally, after elaborating on these issues and the risks associated with massive LMs, the authors suggest several paths forward. These include an emphasis on data documentation (illustrated below), ethical and fair data curation, and a reassessment of whether a large language model is really needed for each and every task. They also advocate a realignment of research goals: instead of chasing the highest benchmark score for each model, researchers should focus on how a model behaves in real-life systems, how it achieves the task, and how efficient it is.
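
As an illustration of what machine-readable data documentation could look like, here is a small sketch. The fields loosely follow the spirit of the “datasheets for datasets” and “data statements” proposals the authors draw on, but the exact schema here is my own invention, not a standard.

    # A hypothetical documentation record for a text corpus. The field
    # names are illustrative; real proposals (datasheets for datasets,
    # data statements) specify considerably more detail.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetDoc:
        name: str
        sources: list[str]          # where the text was collected from
        languages: list[str]        # language varieties covered
        collection_period: str      # when the data was gathered
        curation_notes: str         # filtering choices and who made them
        known_gaps: list[str] = field(default_factory=list)

    doc = DatasetDoc(
        name="example-web-corpus",
        sources=["public web crawl"],
        languages=["en"],
        collection_period="2016-2020",
        curation_notes="Deduplicated; keyword-based content filter applied.",
        known_gaps=["non-English text", "communities with little web presence"],
    )
    print(doc.name, doc.known_gaps)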
