Together with EPFL scientists, our IBM Research team has developed a scheme for training machine learning models on big data sets quickly. It can process a 30 Gigabyte training dataset in less than one minute using a single graphics processing unit (GPU), a 10× speedup over existing methods for limited memory training. The results, which make efficient use of the GPU's full potential, are being presented at the 2017 NIPS Conference in Long Beach, California.
Training a machine learning model on a terabyte-scale dataset is a common, difficult problem. If you’re lucky, you may have a server with enough memory to fit all the data, but the training will still take a very long time. This may be a matter of a few hours, a few days or even weeks.
Specialized hardware devices such as GPUs have been gaining traction in many fields for accelerating compute-intensive workloads, but it’s difficult to extend this to very data-intensive workloads.
To take advantage of the massive compute power of GPUs, we need to store the data in GPU memory so that it can be accessed and processed. However, GPUs have limited memory capacity (currently up to 16 GB), so this is not practical for very large data sets.
One straightforward solution is to process the data on the GPU sequentially in batches. That is, we partition the data into 16 GB chunks and load these chunks into the GPU memory sequentially.
Unfortunately, it is expensive to move data to and from the GPU, and the time it takes to transfer each batch from the CPU to the GPU can become a significant overhead. In fact, this overhead is so severe that it may completely outweigh the benefit of using a GPU in the first place.
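To make the cost of sequential batching concrete, here is a minimal Python sketch. The function names `chunk_ranges` and `gb_transferred` and the epoch count are hypothetical and chosen only for illustration; the 30 GB and 8 GB figures match the experiment described later in this post. The sketch ignores the memory needed for the model itself and simply counts how much data must cross the CPU-GPU link.

```python
def chunk_ranges(total_gb, capacity_gb):
    """Split a dataset of total_gb into consecutive chunks of at most capacity_gb.

    Returns a list of (start, end) offsets in GB. Illustrative only: a real
    loader would also reserve accelerator memory for the model and buffers.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return [
        (start, min(start + capacity_gb, total_gb))
        for start in range(0, total_gb, capacity_gb)
    ]


def gb_transferred(total_gb, capacity_gb, epochs):
    """Data moved to the accelerator over a number of passes (epochs).

    If everything fits, the data is loaded once. Otherwise every chunk has
    to be re-sent on every pass, so the traffic grows with the epoch count.
    """
    if total_gb <= capacity_gb:
        return total_gb
    return total_gb * epochs


chunks = chunk_ranges(30, 8)       # 30 GB dataset, 8 GB of GPU memory
traffic = gb_transferred(30, 8, 10)  # 10 passes is an assumed value
print(chunks, traffic)
```

With these numbers the data splits into four chunks, and ten passes move 300 GB across the link, compared with 30 GB if the data could stay resident on the GPU.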
Our team set out to create a technique that determines which smaller part of the data is most important to the training algorithm at any given time. For most datasets of interest, the importance of each data-point to the training algorithm is highly non-uniform, and also changes during the training process. By processing the data-points in the right order, we can learn our model more quickly.
For example, imagine the algorithm were being trained to distinguish between photos of cats and dogs. Once the algorithm can distinguish that a cat’s ears are typically smaller than a dog’s, it retains this information and skips reviewing this feature, eventually becoming faster and faster.
Dünner (right) writes the scheme she will present with Parnell at NIPS 2017.
This is why the variability of the data set is so critical: each data point must reveal additional features that are not yet reflected in our model for it to learn. If a child only looks outside when the sky is blue, he or she will never learn that it gets dark at night or that clouds create shades of gray. It’s the same here.
This is achieved by deriving novel theoretical insights on how much information individual training samples can contribute to the progress of the learning algorithm. This measure relies heavily on the concept of duality gap certificates and adapts on-the-fly to the current state of the training algorithm. In other words, the importance of each data point changes as the algorithm progresses. For more details about the theoretical background, see our current paper.
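As a rough illustration of a per-sample duality gap, the following Python sketch scores training samples for a hinge-loss Support Vector Machine. It assumes the standard dual formulation used in stochastic dual coordinate ascent, with one dual variable `alpha[i]` in [0, 1] per sample and regularization parameter `lam`. This may differ from the exact measure derived in the paper, and all function names here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def model_from_alpha(X, y, alpha, lam):
    """Primal model implied by the dual variables: w = sum(alpha_i*y_i*x_i) / (lam*n)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for xi, yi, ai in zip(X, y, alpha):
        for j in range(d):
            w[j] += ai * yi * xi[j] / (lam * n)
    return w


def per_sample_gaps(X, y, alpha, lam):
    """Contribution of each sample to the duality gap (assumed hinge-loss SVM).

    gap_i = max(0, 1 - m_i) + alpha_i * (m_i - 1), where m_i = y_i * <x_i, w>.
    Each term is non-negative when alpha_i is in [0, 1], and the mean of the
    terms equals the primal objective minus the dual objective.
    """
    w = model_from_alpha(X, y, alpha, lam)
    gaps = []
    for xi, yi, ai in zip(X, y, alpha):
        margin = yi * dot(xi, w)
        gaps.append(max(0.0, 1.0 - margin) + ai * (margin - 1.0))
    return gaps


def select_working_set(gaps, budget):
    """Indices of the `budget` samples with the largest gap, largest first."""
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return order[:budget]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
alpha = [1.0, 0.0, 0.5]
gaps = per_sample_gaps(X, y, alpha, lam=1.0)
print(select_working_set(gaps, budget=2))
```

Because the scores depend on the current model, re-running `per_sample_gaps` after the model changes can produce a different ranking, which is the sense in which importance adapts during training.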
Taking this theory and putting it into practice, we have developed a new, reusable component for training machine learning models on heterogeneous compute platforms. We call it DuHL for Duality-gap based Heterogeneous Learning. In addition to an application involving GPUs, the scheme can be applied to other limited memory accelerators (for example systems that use FPGAs instead of GPUs) and has many applications, including large data sets from social media and online marketing, which can be used to predict which ads to show users. Additional applications include finding patterns in telecom data and detecting fraud.
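The overall idea of alternating between selecting important samples and training on them can be sketched in a few lines of Python. This is a single-process toy, not the DuHL implementation: the selection step stands in for the work done on the CPU, and a plain coordinate-ascent update for a hinge-loss SVM stands in for the solver that would run on the accelerator. The function name `train_with_working_set`, the parameter `budget` (how many samples fit in accelerator memory) and the per-sample gap formula are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_gap(x, label, a, w):
    # Assumed per-sample duality gap for a hinge-loss SVM with dual variable a in [0, 1].
    margin = label * dot(x, w)
    return max(0.0, 1.0 - margin) + a * (margin - 1.0)


def train_with_working_set(X, y, lam, budget, rounds):
    """Toy loop: pick the `budget` highest-gap samples, then update only those."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    scale = 1.0 / (lam * n)
    for _ in range(rounds):
        # Selection step: score every sample against the current model.
        gaps = [sample_gap(X[i], y[i], alpha[i], w) for i in range(n)]
        working = sorted(range(n), key=lambda i: gaps[i], reverse=True)[:budget]
        # Training step: closed-form dual coordinate ascent on the working set only.
        for i in working:
            sq_norm = dot(X[i], X[i])
            if sq_norm == 0.0:
                continue
            margin = y[i] * dot(X[i], w)
            new_a = min(1.0, max(0.0, alpha[i] + (1.0 - margin) / (sq_norm * scale)))
            step = (new_a - alpha[i]) * y[i] * scale
            w = [wj + step * xj for wj, xj in zip(w, X[i])]
            alpha[i] = new_a
    total_gap = sum(sample_gap(X[i], y[i], alpha[i], w) for i in range(n)) / n
    return w, alpha, total_gap


X = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
y = [1, 1, -1, -1]
w, alpha, gap = train_with_working_set(X, y, lam=0.25, budget=2, rounds=5)
print(w, alpha, gap)
```

In this toy example only two of the four samples are ever loaded into the working set, yet the resulting model classifies all four correctly, because the remaining two samples add no information the model does not already have.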
In the figure at left, we show DuHL in action for the application of training large-scale Support Vector Machines on an extended, 30 GB version of the ImageNet database. For these experiments, we used an NVIDIA Quadro M4000 GPU with 8 GB of memory. We can see that the scheme that uses sequential batching actually performs worse than the CPU alone, whereas the new approach using DuHL achieves a 10× speed-up over the CPU.
The next goal for this work is to offer DuHL as a service in the cloud. In a cloud environment, resources such as GPUs are typically billed on an hourly basis. Therefore, if one can train a machine learning model in one hour rather than 10 hours, this translates directly into a very large cost saving. We expect this to be of significant value to researchers, developers and data scientists who need to train large-scale machine learning models.