The enormous computing resources required to train state-of-the-art artificial intelligence systems mean well-heeled tech firms are leaving academic teams in the dust. But a new approach could help balance the scales, allowing scientists to tackle cutting-edge AI problems on a single computer.
A 2018 report from OpenAI found the processing power used to train the most powerful AI is increasing at an incredibly fast pace, doubling every 3.4 months. One of the most data-hungry approaches is deep reinforcement learning, where AI learns through trial and error by iterating through millions of simulations. Impressive recent advances on videogames like StarCraft and Dota 2 have relied on servers packed with hundreds of CPUs and GPUs.
Specialized hardware such as Cerebras Systems’ Wafer Scale Engine promises to replace these racks of processors with a single large chip perfectly optimized for training AI. But with a price tag running into the millions, it’s not much solace for under-funded researchers.
Now a team from the University of Southern California and Intel Labs has created a way to train deep reinforcement learning (RL) algorithms on hardware commonly available in academic labs. In a paper presented at the 2020 International Conference on Machine Learning (ICML) this week, they describe how they were able to use a single high-end workstation to train AI with state-of-the-art performance on the first-person shooter videogame Doom. They also tackle a suite of 30 diverse 3D challenges created by DeepMind using a fraction of the normal computing power.
“Inventing ways to do deep RL on commodity hardware is a fantastic research goal,” says Peter Stone, a professor at the University of Texas at Austin who specializes in deep RL. As well as leaving smaller research groups behind, the computing resources normally required to carry out this kind of research have a significant carbon footprint, he adds. “Any progress towards democratizing RL and reducing the energy needs for doing research is a step in the right direction,” he says.
The inspiration for the project was a classic case of necessity being the mother of invention, says lead author Aleksei Petrenko, a graduate student at USC. As a summer internship at Intel came to an end, Petrenko lost access to the company’s supercomputing cluster, putting unfinished deep RL projects in jeopardy. So he and colleagues decided to find a way to continue the work on simpler systems.
“From my experience, a lot of researchers don’t have access to cutting-edge, fancy hardware,” says Petrenko. “We realized that just by rethinking in terms of maximizing the hardware utilization you can actually approach the performance you will usually squeeze out of a big cluster even on a single workstation.”
The leading approach to deep RL places an AI agent in a simulated environment that provides rewards for achieving certain goals, which the agent uses as feedback to work out the best strategy. This involves three main computational jobs: simulating the environment and the agent; deciding what to do next based on learned rules called a policy; and using the results of those actions to update the policy.
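To make the three jobs concrete, here is a minimal sketch in Python. It is not the researchers' code: the two-action environment, the payoff probabilities, and the function names `simulate`, `choose_action`, and `update_policy` are all invented for illustration, and the "policy" is a simple table of value estimates rather than a deep neural network.

```python
import random

def simulate(action, rng):
    """Job 1: a toy environment. Action 1 pays off more often than action 0.
    The payoff probabilities are made up for this example."""
    payoff_prob = [0.2, 0.8][action]
    return 1.0 if rng.random() < payoff_prob else 0.0

def choose_action(values, rng, epsilon=0.1):
    """Job 2: decide what to do next. Usually pick the action with the
    highest estimated value, but explore at random a fraction of the time."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def update_policy(values, counts, action, reward):
    """Job 3: use the observed reward to refine the value estimate
    (a running average of rewards seen for that action)."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

def train(steps=5000, seed=0):
    rng = random.Random(seed)
    values = [0.0, 0.0]  # estimated reward per action
    counts = [0, 0]      # how often each action was tried
    for _ in range(steps):
        action = choose_action(values, rng)
        reward = simulate(action, rng)
        update_policy(values, counts, action, reward)
    return values, counts

if __name__ == "__main__":
    values, counts = train()
    print("estimated values:", values)
    print("times each action was chosen:", counts)
```

In this toy version the three jobs run one after another inside a single loop, so each has to wait for the others to finish before it can run again.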
Training is always limited by the slowest process, says Petrenko, but these three jobs are often intertwined in standard deep RL approaches, making it hard to optimize them individually. The researchers’ new approach, dubbed Sample Factory, splits them up so resources can be dedicated to getting them all running at peak speeds.
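The bottleneck argument can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each job's speed scales linearly with the number of workers assigned to it. The per-worker rates and the 12-worker budget are invented numbers, not measurements from the paper.

```python
def pipeline_throughput(rate_per_worker, workers):
    """Frames per second of the whole pipeline: the slowest stage sets the pace."""
    return min(rate_per_worker[stage] * workers[stage] for stage in rate_per_worker)

# Hypothetical frames per second that a single worker achieves on each job.
RATES = {"simulate": 1000, "act": 8000, "learn": 6000}

# Splitting 12 workers evenly leaves the simulators as the bottleneck.
even_split = {"simulate": 4, "act": 4, "learn": 4}

# Giving most workers to the slowest job raises overall throughput.
rebalanced = {"simulate": 10, "act": 1, "learn": 1}

if __name__ == "__main__":
    print(pipeline_throughput(RATES, even_split))   # limited by simulation
    print(pipeline_throughput(RATES, rebalanced))   # limited by learning
```

With these made-up numbers, the even split is capped at 4,000 frames per second by simulation, while the rebalanced split reaches 6,000 with the same number of workers. That kind of tuning is only possible once the jobs are separated.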
Piping data between processes is another major bottleneck as these can often be spread across multiple machines, Petrenko explains. His group took advantage of working on a single machine by simply cramming all the data into shared memory where all processes can access it instantaneously.
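The article does not describe how Sample Factory implements this, but the general idea can be sketched with the `multiprocessing.shared_memory` module in Python's standard library (Python 3.8 and later). The helper names `write_frame` and `read_frame` are hypothetical, and for simplicity both handles are opened in a single process here. In a real system the writer and the reader would be separate processes attaching to the same block by name.

```python
from multiprocessing import shared_memory

def write_frame(name, frame):
    """Attach to an existing shared block by name and write bytes into it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        shm.buf[:len(frame)] = frame
    finally:
        shm.close()

def read_frame(name, size):
    """Attach to the same block by name and read the bytes back out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        data = bytes(shm.buf[:size])
    finally:
        shm.close()
    return data

def demo():
    frame = bytes(range(16))  # stand-in for one frame of observation data
    block = shared_memory.SharedMemory(create=True, size=len(frame))
    try:
        write_frame(block.name, frame)
        return read_frame(block.name, len(frame))
    finally:
        block.close()
        block.unlink()  # free the block once nothing needs it

if __name__ == "__main__":
    print(demo() == bytes(range(16)))
```

Only the short block name has to be passed between workers. The data itself stays in one place in memory instead of being serialized and sent through a pipe or over a network.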
This resulted in significant speed-ups compared to leading deep RL approaches. Using a single machine equipped with a 36-core CPU and one GPU, the researchers were able to process roughly 140,000 frames per second while training on Atari videogames and Doom, or double the next best approach. On the 3D training environment DeepMind Lab, they clocked 40,000 frames per second—about 15 percent better than second place.
To check how frame rate translated into training time, the team pitted Sample Factory against an algorithm Google Brain open-sourced in March that is designed to dramatically increase deep RL efficiency. Sample Factory trained on two simple tasks in Doom in a quarter of the time it took the other algorithm. The team also tested their approach on a collection of 30 challenges in DeepMind Lab using a more powerful 36-core 4-GPU machine. The resulting AI significantly outperformed the original AI that DeepMind used to tackle the challenge, which was trained on a large computing cluster.
Edward Beeching, a graduate student working on deep RL at the Institut National des Sciences Appliquées de Lyon, in France, says the approach might struggle with memory-intensive challenges like the photo-realistic 3D simulator Habitat released by Facebook last year.
But he adds that these kinds of efficient training approaches are vitally important for smaller research teams. “A four-fold increase compared to the state of the art implementation is huge,” he says. “This means in the same time you can run four times as many experiments.”
While the computers used in the paper are still high-end workstations designed for AI research, Petrenko says he and his collaborators have also been using Sample Factory on much simpler devices. He’s even been able to run some advanced deep RL experiments on his mid-range gaming laptop, he says. “This is unheard of.”