A group of machine learning researchers published a paper earlier this month surveying research on dataset collection. According to the survey, the fields of computer vision and natural language processing have a data culture problem.
The authors argue for a shift away from reliance on poorly curated datasets for training new ML models. Instead, the study recommends a culture that cares for the people represented in datasets and respects their privacy and property rights.
For now, however, the authors say that in today's ML environment "anything goes."
University of Washington linguists Amandalynne Paullada and Emily Bender, Mozilla Foundation fellow Inioluwa Deborah Raji, and Google research scientists Emily Denton and Alex Hanna wrote "Data and its (dis)contents: A survey of dataset development and use in machine learning."
The paper concludes that traditional language models can perpetuate prejudice and bias against marginalized communities, and identifies poorly annotated datasets as part of the problem.
The authors call for more rigorous data documentation and management practices. Datasets built this way will require more time, effort, and money, they acknowledge, but will encourage work on approaches to ML that go beyond the current paradigm.
The paper reads, “We argue that fixes that focus narrowly on improving datasets by making them more representative or more challenging might miss the more general point raised by these critiques, and we’ll be trapped in a game of dataset whack-a-mole rather than making progress, so long as notions of ‘progress’ are largely defined by performance on datasets.”
“Should this come to pass, we predict that machine learning as a field will be better positioned to understand how its technology impacts people and to design solutions that work with fidelity and equity in their deployment contexts.”
In the past few years, a series of events has brought to light shortcomings in the ML community's practices that can harm people from marginalized groups.
On Wednesday, Reuters reported that Google has begun carrying out reviews of research papers on sensitive topics and, on at least three occasions, has asked authors not to cast Google technology in a negative light, according to internal communications and people familiar with the matter.
This came after Google fired Timnit Gebru in an incident she describes as a case of unprecedented research censorship. A Washington Post profile of Gebru revealed that Jeff Dean had asked for an investigation into the negative impact of language models.
The decision to censor AI researchers carries policy implications, as Google, MIT, and Stanford are among the most influential producers of AI research published at academic conferences.
Earlier this month, "Data and its (Dis)contents" received an award from organizers of the ML Retrospectives, Surveys and Meta-analyses workshop at NeurIPS, an AI research conference that attracted 22,000 attendees.
This year alone, NeurIPS published nearly 2,000 papers, including work on methods for faster, more efficient backpropagation; failure detection for safety-critical systems; and the beginnings of a project that treats climate change as a machine learning grand challenge.