Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.
Want data for your NLP model? Get 600 million words from Spotify podcasts:
...ML-transcribed data could help people train better language models, but worth remembering scale dwarfed by spoken language…
Spotify has released a dataset of speech from ~100,000 podcasts on the streaming service. The data consists of the audio streams as well as their accompanying text, which was created by transcribing the audio with Google's speech-to-text API (so this isn't gold-standard 'clean' text, but rather slightly fuzzy and heterogeneous, owing to errors in the API's output). The dataset consists of 50,000 hours of audio and 600 million words. Spotify built the dataset by randomly sampling 105,360 podcast episodes published between January 2019 and March 2020, then filtering for English (rather than multilingual) data, for length (cutting out 'non-professionally published episodes' longer than 90 minutes), and for speech (optimizing for podcasts where there's a lot of talking).
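To make that construction process concrete, here's a minimal sketch of the kind of filtering step described above; the field names (`language`, `duration_minutes`, `speech_ratio`) and the speech threshold are assumptions for illustration, not Spotify's actual pipeline.
```python
# Hypothetical sketch of the episode-filtering step described above.
# Field names and the speech-ratio threshold are assumptions, not Spotify's code.

def filter_episodes(episodes):
    """Keep English, reasonably short, speech-heavy podcast episodes."""
    kept = []
    for ep in episodes:
        if ep["language"] != "en":          # English-only data
            continue
        if ep["duration_minutes"] > 90:     # drop very long episodes
            continue
        if ep["speech_ratio"] < 0.5:        # favour talk-heavy shows (assumed threshold)
            continue
        kept.append(ep)
    return kept

episodes = [
    {"id": "ep1", "language": "en", "duration_minutes": 45, "speech_ratio": 0.9},
    {"id": "ep2", "language": "de", "duration_minutes": 30, "speech_ratio": 0.8},
    {"id": "ep3", "language": "en", "duration_minutes": 120, "speech_ratio": 0.7},
]
print([ep["id"] for ep in filter_episodes(episodes)])  # -> ['ep1']
```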
Why this matters: There's a lot of text data in the world, but that text data is absolutely dwarfed by the amount of verbal data. Corpuses like this could help us figure out how to harness fuzzily transcribed audio data to train better models, and may provide a path to creating more representative models (as this lets you capture people who don't write words on the internet).
Spotify versus New York City: verbal versus text scale: To get an intuition for how large the space of verbal speech is, we can do some napkin math: one study says that the average person speaks about 16,000 words a day, and we know the population of New York City is around 8.5 million. Let's take a million off of that to account for non-verbal young children, old people who don't have many conversations, and some general conservative padding. Now let's multiply 7.5 million by 16,000: 120,000,000,000. Therefore, though Spotify's 600 million words is cool, it's only 0.5% of the words spoken in New York in a given day. Imagine what happens if we start being able to automatically transcribe all the words people say in major cities - what kind of models could we make?
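For the spreadsheet-averse, the napkin math above looks like this (the per-person word count and population figures are the rough estimates quoted above, not precise measurements):
```python
# Napkin math: how big is a day of New York speech versus the Spotify corpus?
words_per_person_per_day = 16_000                  # rough estimate from the study cited above
speaking_population = 8_500_000 - 1_000_000        # NYC population minus non-speakers / padding

nyc_words_per_day = speaking_population * words_per_person_per_day
spotify_corpus_words = 600_000_000

print(f"NYC words per day: {nyc_words_per_day:,}")                                    # 120,000,000,000
print(f"Spotify corpus as a share: {spotify_corpus_words / nyc_words_per_day:.1%}")   # 0.5%
```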
Find out more about the dataset: 100,000 Podcasts: A Spoken English Document Corpus (ACL Anthology)
Get the data by requesting it via a form here (replies may take up to two weeks): Spotify Podcast Dataset (Spotify).
###################################################
Ousted facial recognition CEO returns to Kairos to work on bias:
...Brian Brackeen returns for "Kairos 3.0"…
Brian Brackeen, former CEO of facial recognition company Kairos, has returned to the company that let him go in 2018, to lead an advisory council focused on AI bias.
An ethical stance that led to an ousting: Back in mid-2018, Brackeen said he thought the use of facial recognition in law enforcement and government surveillance "is wrong - and that it opens the door for gross conduct by the morally corrupt" (Import AI 101). Brackeen backed up his comments by saying Kairos wouldn't sell to these entities. By October of that year, Kairos had fired Brackeen and also sued him (Miami Herald). Now, the lawsuits have been settled in Brackeen's favor, the board members and employees who fired him have left, and he is back to work on issues of AI bias.
A "Bias API": Brackeen will help the company develop a "Bias API" which companies can use to understand and intervene on racial biases present in their algorithms. "This is Kairos 3.0", Brackeen said.
Read more: 'This is Kairos 3.0': Brian Brackeen returns to company to continue work on AI bias (Refresh Miami).
###################################################
Multilingual datasets have major problems and need to be inspected before being used to train something - researchers:
...Giant team looks at five major datasets, finds a range of errors with knock-on effects on translation and cultural relations writ large...
An interdisciplinary team of researchers has analyzed 230 languages across five massive multilingual datasets. The results? High-resource languages - that is, widely spoken and digitized languages such as English and German - tend to be of good quality, but low-resource languages tend to do poorly. Specifically, they find the poorest quality for African languages. They also find a lot of errors in datasets which consist of romanized script from languages commonly written in other scripts (e.g., Urdu, Hindi, Chinese, Bulgarian).
What they did: The researchers looked at five massive datasets - CCAligned, ParaCrawl v7.1, WikiMatrix, OSCAR, and mC4 - then had 51 participants from the NLP community go through each dataset, sampling some sentences from the languages and grading the data on quality.
An error taxonomy for multilingual data: They encountered a few different error types, like Incorrect Translation (but the correct language), Wrong Language (where the source or target is mislabeled, e.g., English is tagged as German), and Non-Linguistic Content (where there's non-linguistic content in either the source or target text).
How bad are the errors: Across the datasets, the proportion of correct samples ranges from 24% (WikiMatrix) to 87% (OSCAR). Some of the errors get worse when you zoom in - CCAligned, for instance, contains 7 languages where 0% of the sampled sentences were labeled as correct, and 44 languages where less than 50% of them were labeled as such.
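As a rough illustration of the kind of audit being run here - sample sentences, have annotators grade them, then report the share judged correct per language - here's a minimal sketch; the labels follow the taxonomy above, but the sample records and numbers are invented, not the paper's data.
```python
# Minimal sketch of a per-language audit: count how many graded samples were
# marked "correct". Records below are illustrative, not from the actual study.
from collections import defaultdict

graded = [
    {"lang": "de", "label": "correct"},
    {"lang": "de", "label": "wrong_language"},
    {"lang": "ur", "label": "non_linguistic"},
    {"lang": "ur", "label": "incorrect_translation"},
    {"lang": "de", "label": "correct"},
]

counts, correct = defaultdict(int), defaultdict(int)
for sample in graded:
    counts[sample["lang"]] += 1
    correct[sample["lang"]] += sample["label"] == "correct"

for lang in counts:
    print(f"{lang}: {correct[lang] / counts[lang]:.0%} correct "
          f"({correct[lang]}/{counts[lang]} samples)")
```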
Porn: >10% of the samples for 11 languages in CCAligned were labelled as porn (this problem didn't really show up elsewhere).
Standards and codes: There are other errors and inconsistencies across the datasets, which mostly come from them using wrong or inconsistent codes for their language pairs, sometimes using sign language codes for high-resource languages (this was very puzzling to the researchers), or using a multitude of codes for the same language (e.g., Serbian, Croatian, Bosnian, and Serbo-Croatian all getting individual codes in the same dataset).
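One way a dataset consumer might defend against this kind of code inconsistency is to normalize deprecated or legacy codes before use; the tiny mapping below is illustrative only (it covers the Serbo-Croatian case mentioned above plus two well-known legacy codes), not a complete or official solution.
```python
# Illustrative normalization of legacy/deprecated language codes before using a dataset.
# Whether you fold 'sr'/'hr'/'bs' together is an application decision, not made here.
CODE_FIXES = {
    "sh": "hbs",   # deprecated ISO 639-1 Serbo-Croatian -> ISO 639-3 macrolanguage code
    "iw": "he",    # legacy code for Hebrew
    "in": "id",    # legacy code for Indonesian
}

def normalize(code: str) -> str:
    """Map a possibly legacy language code to a canonical lowercase code."""
    return CODE_FIXES.get(code.lower(), code.lower())

print(normalize("SH"), normalize("iw"), normalize("de"))  # -> hbs he de
```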
Why this matters: Multilingual datasets are going to be key inputs into translation systems and other AI tools that let us cross linguistic and cultural divides - so if these multilingual datasets have a bunch of problems with the more obscure and/or low-resource languages, there will be knock-on effects relating to communication, cultural representation, and more.
"We encourage the community to continue to conduct such evaluations and audits of public datasets – similar to system comparison papers – which would help anyone interested in using these datasets for practical purposes," the researchers write.
Read more: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (arXiv).
###################################################
Supervillains, rejoice - you now have data to help you make a robotic cheetah:
...Finally, a solution to a problem everyone encounters…
For decades, AI researchers have looked to the natural world for inspiration. This is particularly true of locomotion, where our planet is full of creatures that hop, skip, jump, and sprint in ways we'd like our machines to emulate. Now, researchers with the South African National Research Foundation, University of Cape Town, University of Tsukuba in Japan, and École Polytechnique Fédérale de Lausanne in Switzerland have recorded ten cheetahs running, so they can build a dataset of cheetah movement.
Why cheetahs are useful: Cheetahs are the fastest land mammal, so it could be useful to study how they run. Here, the researchers create a large-scale annotated dataset, consisting of ~120,000 frames of multi-camera-view, high-speed video footage of cheetahs sprinting, as well as 7,588 hand-annotated images. Each of these images is annotated with 20 key points on the cheetah (e.g., the location of the tip of the cheetah's tail, its eyes, knees, spine, shoulders, etc.). Combined, the dataset should make it easier for researchers to train models that can predict, capture, or simulate cheetah motion.
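To give a sense of what a 20-keypoint annotation looks like in practice, here's a toy data structure; the keypoint names, file path, and pixel values are assumptions for illustration, not the AcinoSet schema (see the paper for that).
```python
# Toy representation of a single annotated frame: 20 named 2D keypoints per cheetah.
# Names and (x, y) pixel values are illustrative, not the dataset's actual format.
KEYPOINT_NAMES = [
    "nose", "l_eye", "r_eye", "neck", "spine_mid", "spine_rear",
    "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_front_paw", "r_front_paw",
    "l_hip", "r_hip", "l_knee", "r_knee", "l_hind_paw", "r_hind_paw",
    "tail_base", "tail_tip",
]

annotation = {
    "frame": "run42/cam3/frame_000117.png",                       # hypothetical path
    "keypoints": {name: (0.0, 0.0) for name in KEYPOINT_NAMES},   # (x, y) in pixels
}
annotation["keypoints"]["tail_tip"] = (1532.5, 418.0)

assert len(annotation["keypoints"]) == 20
```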
Read more: AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild (arXiv).
Get the data from here when it's available (African Robotics Unit).
###################################################
Testing robots by putting them in a dreamworld:
...ThreeDWorld asks simulated robots to play virtual cleanup…
MIT and Stanford researchers have built ThreeDWorld, a software environment for testing out virtually embodied AI agents. They're also hosting a challenge at this year's CVPR conference to figure out how close - or far - we are from building AI systems that can autonomously navigate around simulated houses to find objects and bring them to a predetermined place. This is the kind of task that our AI systems will have to be able to solve, if we want to eventually get a home robot butler.
What's ThreeDWorld like? You wake up in one of 15 houses. You're a simulated robot with two complex arms capable of 9-DOF each. You can move yourself around and you have a mission: find a vase, two bottles, and a jug, and bring them to bed. Now you explore the house, using your first-person view to map out the rooms, identify objects, collect them, and move them to the bedroom. If you succeed, you get a point. If you fail, you don't. At the end of your task, you disappear.
^ the above is a lightly dramatized robot-POV description of ThreeDWorld and the associated challenge. The simulation contains complex physics, including collisions, and the software provides an API for AI agents. ThreeDWorld differs from other embodied robot challenges (like AI2's 'Thor' (#73) and VirtualHome) by modelling physics to a higher degree of fidelity, which makes the learning problem more challenging.
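Stripped down, an agent tackling a transport task like this runs a perceive-plan-act loop along the lines of the sketch below. The `ToyTransportEnv` and `RandomAgent` classes are hypothetical stand-ins written for this newsletter, NOT the real ThreeDWorld API (get that from the GitHub repo linked below).
```python
# Generic sketch of the loop an embodied agent runs in a transport task.
# The environment and agent here are toy stand-ins, not ThreeDWorld's API.
import random

class ToyTransportEnv:
    """Stub environment: 3 target objects, episode ends when all are 'transported'."""
    def reset(self):
        self.remaining = 3
        return {"view": None, "holding": None}
    def step(self, action):
        if action == "drop_at_goal" and self.remaining > 0:
            self.remaining -= 1
        done = self.remaining == 0
        return ({"view": None, "holding": None}, done,
                {"objects_transported": 3 - self.remaining})

class RandomAgent:
    def act(self, obs):
        return random.choice(["move", "turn", "grasp", "drop_at_goal"])

def run_episode(env, agent, max_steps=1000):
    """Explore, find target objects, and carry them to the goal location."""
    obs = env.reset()                  # first-person observation of the house
    for _ in range(max_steps):
        action = agent.act(obs)        # e.g. move, turn, reach with an arm, grasp, drop
        obs, done, info = env.step(action)
        if done:
            break
    return info["objects_transported"]

print(run_episode(ToyTransportEnv(), RandomAgent()))
```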
Reassuringly hard: Pure RL systems trained via PPO can't easily solve this task. The authors develop a few other baselines that play around with different exploration policies, as well as a hierarchical AI system. Their results show that "there are no agents that can successfully transport all the target objects to the goal locations", they write. Researchers, start your computers - it's challenge time!
Read more: The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI (arXiv).
More information at the CVPR 2021 challenge website.
Get the code for ThreeDWorld and the data for the challenge from here (GitHub).
###################################################
What can we learn from 209 robot delivery drone flights?
...We're about to live in the era of low-flying robots, so we better understand them…
Right now, hundreds (and probably thousands) of different companies are using drones around the world to do increasingly complicated tasks. Many companies are working on package delivery - e.g., 7 of the 10 companies working with the US FAA to gain expanded drone licenses are working on some form of delivery (Import AI #225). So it'd be helpful to have more data about delivery drones and how they work in the (air) field.
Enter researchers from Carnegie Mellon, the University of Pennsylvania, and Baden-Wuerttemberg Cooperative State University, who have recorded the location and electricity consumption of a DJI Matrice 100 quadcopter during 209 delivery flights, carried out in 2019.
What's the data useful for? "The data available can be used to model the energy consumption of a small quadcopter drone, empirically fitting the results found or validating theoretical models. These data can also be used to assess the impacts and correlations among the variables presented and/or the estimation of non-measured parameters, such as drag coefficients", the researchers write.
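As a concrete example of the "empirically fitting" use case the researchers mention, one could regress power draw against flight variables; the column names and numbers below are made up for illustration, not values taken from the released dataset.
```python
# Toy example: least-squares fit of quadcopter power draw vs. airspeed and payload.
# All numbers are illustrative placeholders, not measurements from the dataset.
import numpy as np

# columns: airspeed (m/s), payload (kg); target: power draw (W)
X = np.array([[4.0, 0.0], [6.0, 0.0], [8.0, 0.5], [10.0, 0.5], [12.0, 1.0]])
y = np.array([310.0, 335.0, 390.0, 420.0, 495.0])

# fit: power ~ b0 + b1*airspeed + b2*payload
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs
print(f"power ~ {b0:.1f} + {b1:.1f}*airspeed + {b2:.1f}*payload")
```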
Read more: In-flight positional and energy use data set of a DJI Matrice 100 quadcopter for small package delivery (arXiv).
Get the drone flight telemetry data from here (Carnegie Mellon University).
###################################################
Tech Tales:
[Database on an archival asteroid, 3200 AD ]
Energy is never cheap,
It always costs a little.
Thinking costs energy,
So does talking.
^ Translated poem, told from one computational monolith to a (most translators agree there's no decent English analog for the term) 'child monolith'. Collected from REDACTED sector.
Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarkSF