I don’t often write long reviews of single papers. Maybe I should.
Stochastic Parrots have finally launched into mid-air. The paper at the heart of the huge brouhaha involving Google’s ‘resignating’ of Timnit Gebru back in December is now available, and will appear at FAccT 2021.
Reading papers in this space is always a tricky business. The world of algorithmic fairness is much broader than either algorithms or fairness (and stay tuned for my next post on this). Contributions come in many forms, from many different disciplinary and methodological traditions, and are situated in different contexts. Identifying the key contributions of a paper and how they broaden the overall discussion in the community can be tricky, especially if we define contributions based on our own traditions. And then we have to critique a paper on its own terms, rather than in terms of things we want to see. And distinguishing those two is apparently hard, as evidenced by the hours of argument I’ve been having*.
So what are the contributions of this paper? I view it as making two big arguments (containing smaller arguments) about large language models. Firstly, that the costs of building large language models systematically ignore large externalities which if factored in would probably change the cost equation. And secondly that the benefits associated with these models are based on claims of efficacy that don’t incorporate many known issues in the model building process. There’s a third related point about the way in which research in LLMs itself runs the risk of distracting from NLP research.
Let’s take these one by one. The first externality is clear: the now well-documented environmental cost associated with building larger and larger models. The paper doesn’t necessarily add new information to this cost, but it points out that in the context of language models, it is unclear whether the added accuracy improvements that might come from large models are worth the environmental cost. And especially since one of the arguments made for language models is the ‘democratization of NLP tech’, the authors also remind us (Bender Rule alert!) that much of the benefit of work on language models will accrue to English-speaking (and therefore already well-resourced) populations while the environmental harms will accrue to people in small island nations that are unlikely to benefit at all.
Are language models the biggest contributor to climate change? No, but I don’t think that’s the point here. The point is that pursuing size as a value in and of itself (see the gushing headlines over the new trillion parameter model that just came out) as a proxy for better conveniently misses out on the harms associated with it, something that everyone who thinks about the calculus of harms and benefits in society knows they shouldn’t be doing. Further, in the spirit of ‘constraints breeding creativity’, why NOT initiate more research into how to build small and effective language models?
The second externality is what the authors partly refer to as the ‘documentation debt’ of large and complex data sets. This is not directly linked to trillion-state TMs per se, but it is clear that the collection of and access to huge text corpora enables (and justifies) the building of LLMs. Many of the problems with large data sets are well documented and are explored in detail here in an NLP context: the shifting nature of context, the problems with bias, the self-selection and non-uniformity in how these sets are collected and curated, and so on.
So much for the costs. What about the benefits? The section that gives its name to the title of the paper goes into great detail on the ways in which the perceived benefits of language models come from a deep tension between the weak and strong AI standpoints: does a language model actually understand language or not? The authors go into a number of cases where LLMs have demonstrated what appears to be an understanding of language, particularly in the case of question-answering. They argue that the apparent understanding is illusory and subjective, and this is a real problem because of our human tendency to ascribe meaning and intent to things that look like communication. They document in excruciating detail (yes with lots of references, ahem) cases where this perception of meaning and intent can be and has been weaponized.
A more subtle danger of the focus on LLMs, one that I am ill equipped to comment on since I’m not an NLP researcher, is the argument that the focus on large language models is distorting and takes away from efforts to truly understand and build general language systems. One might very well respond with “well I can chew gum and walk at the same time”, but I think it’s fair game to be concerned about the viability of research that doesn’t solely focus on LLMs, especially as DL continues its Big Chungus-ifying march.
CS papers are considered incomplete if they raise problems and don’t present solutions. It’s beyond the scope of this post to address the many problems with that take. However, the authors don’t want to fight another battle and dutifully present ways to move forward with LLMs, amassing a thoughtful collection of process suggestions that synthesize a lot of thinking in the community and apply them to language model development.
So. How does this paper add to the overall discussion?
The NLP context is very important. It’s one thing to write a general articulation of harms and risks. And indeed in a very broad sense the harms and risks identified here have been noted in other contexts. But the contextualization to NLP is where the work is being done. And while the field does have a robust ongoing discussion around ethical issues, thanks in no small part to the authors of this paper and their prior work, this systematic spelling out of the problems is important to do, is not something just anyone can do (I couldn’t have done it!), and is valuable as a reference for future work.
Does the paper make sound arguments? Yoav Goldberg has posted a few of his concerns, and I want to address one of them, which comes up a lot in this area. Namely, the idea that taking a “one-sided” position weakens the paper.
In my view, the paper is taking a “position” only in the sense of calling out the implicit positioning of the background discussions. There’s a term in media studies coined by Barthes called ‘exnomination’. It’s a complex notion that I can’t do full justice to here, but in its simplest form it describes the act of “setting the default” or “ex-nominating” as an act of power that forces any (now) non-default position to justify itself. In the context of language, you see this in ‘person’ vs ‘black person’, or (my go-to example) in Indian food of “food” vs “non-vegetarian food”.
What has happened is that the values of largeness, scale, and “accuracy” have been set as the baseline, and now any other concern (diversity, the need to worry about marginalized populations, concern for low-resource language users etc) becomes a “political position” to be justified.
Does this paper indulge in advocacy? Of course it does. Does that make it less “scientific”? Setting aside the normative loadedness of such a question, we should recognize that papers advocate for specific research agendas all the time. The authors have indeed made the case (and you’d have to be hiding under a rock to not know this as well) that the focus on size obscures the needs of populations that aren’t represented by the majority as seen through biased data collection. You might disagree with that case and that’s fine. But it’s well within scope of a paper in this area to make the case for such a focus without being called ‘political’ in a pejorative manner.
And this brings me back to the point I started with. In the space of topics that broadly speak to the social impact of machine learning, we don’t just have a collection of well defined CS problems. We have a bewildering, complex and contested collection of frames with which we think about this landscape. A paper contributes by adding a framing discussion as much as by specific technical contributions. In other words, it remains true that “what is the right problem to solve” is one of the core questions in any research area. And in that respect, this paper provides a strong corrective to the idea of LLMs as an unquestionably valuable object of study.
* My initial knee-jerk reaction to this paper was that it was nice but unexceptional. Many arguments with Sorelle and Carlos have helped move me away from that knee-jerk response towards a more careful articulation of its contributions.