Researchers dealing with big data often need a supercomputer to help them make sense of it.
That’s why Niall Gaffney, a former Hubble Space Telescope scientist who now heads the Data Intensive Computing group at the Texas Advanced Computing Center (TACC), led a team to build a new kind of supercomputer called Wrangler.
The machine's name evokes the cowboys of the American West, the wranglers who tamed wild horses: Wrangler is built to tame big data, problems that require thousands of files to be quickly opened, examined, and cross-correlated.
Wrangler was designed to work with the Stampede supercomputer, the tenth most powerful in the world according to the twice-yearly Top500 list and the flagship of TACC at The University of Texas at Austin (UT Austin). Since it came online in 2013, Stampede has run more than six million jobs for open science.
"We kept a lot of what was good with systems like Stampede," said Gaffney, "but added new things to it like a very large flash storage system, a very large distributed spinning disc storage system, and high- speed network access. This allows people who have data problems that weren't being fulfilled by systems like Stampede and Lonestar to be able to do those in ways that they never could before."
Wrangler is equipped with 600 terabytes of flash memory shared via PCI interconnect across its 3,000 Haswell compute cores. The flash storage comes from DSSD, a startup co-founded by Andy Bechtolsheim of Sun Microsystems.
"All parts of the system can access the same storage," said Gaffney. "They can work in parallel together on the data that are stored inside this high-speed storage system to get larger results they couldn't get otherwise."
DSSD's design creates a shortcut between the CPUs and the data, with no translation layers in between. Researchers can compute directly against the storage, eliminating the usual I/O bottleneck.
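To give a feel for that access pattern, here is a minimal Python sketch, assuming a newline-delimited data file on a shared filesystem (the path, record format, and worker count are hypothetical, and this is not TACC code). Several processes scan disjoint byte ranges of the same file at once, the way Wrangler's nodes can work in parallel on data in its shared flash store:

```python
# Minimal sketch: many workers scanning one file on shared storage in parallel.
# Hypothetical path and record format; a real Wrangler job would use its
# flash-backed shared filesystem and a batch scheduler, not multiprocessing.
import os
from multiprocessing import Pool

SHARED_FILE = "/shared/flash/sequences.dat"  # hypothetical shared-storage path
NUM_WORKERS = 8

def scan_slice(args):
    """Count newline-delimited records in one byte range of the shared file."""
    offset, length = args
    with open(SHARED_FILE, "rb") as f:
        f.seek(offset)
        return f.read(length).count(b"\n")

if __name__ == "__main__":
    size = os.path.getsize(SHARED_FILE)
    step = size // NUM_WORKERS + 1          # slice size, rounded up
    slices = [(i * step, step) for i in range(NUM_WORKERS)]
    with Pool(NUM_WORKERS) as pool:
        # Every worker opens and reads the same file concurrently; shared
        # storage lets them all see identical data without copying it around.
        total = sum(pool.map(scan_slice, slices))
    print(f"{total} records scanned by {NUM_WORKERS} workers")
```

On ordinary spinning disks these workers would contend for the same read heads; on a fast shared flash store like Wrangler's, such a scan tends to be limited by compute rather than by I/O.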
Applications in gene analysis
One example of how Wrangler is helping the medical field involves OrthoMCL, code that combs through DNA sequences to find genes shared across distantly related species. The problem was that OrthoMCL builds a very large database and then runs its computational analyses outside of it, so the computation could not interact with the database.
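As a rough illustration of the kind of grouping involved (a toy sketch, not OrthoMCL's actual method, which clusters all-vs-all BLAST similarity scores with Markov clustering), the following Python snippet groups hypothetical genes into cross-species families by treating pairwise similarity hits as links:

```python
# Toy sketch of ortholog grouping: genes from different species fall into one
# family if they are linked by pairwise similarity hits. OrthoMCL itself uses
# BLAST scores plus Markov clustering (MCL); this only shows the basic idea.
from collections import defaultdict

# Hypothetical (gene_a, gene_b) similarity hits, e.g. from all-vs-all BLAST.
hits = [
    ("human:HOXA1", "mouse:Hoxa1"),
    ("mouse:Hoxa1", "zebrafish:hoxa1a"),
    ("human:TP53", "mouse:Trp53"),
]

# Union-find over genes: connected components become candidate gene families.
parent = {}

def find(g):
    parent.setdefault(g, g)
    while parent[g] != g:
        parent[g] = parent[parent[g]]  # path halving keeps lookups fast
        g = parent[g]
    return g

def union(a, b):
    parent[find(a)] = find(b)

for a, b in hits:
    union(a, b)

families = defaultdict(list)
for gene in parent:
    families[find(gene)].append(gene)

for members in families.values():
    print(sorted(members))
```

Real analyses operate on millions of such pairs, which is why the database behind them grows so large.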
At first, when biologist Rebecca Young of the Department of Integrative Biology and the Center for Computational Biology and Bioinformatics at UT Austin ran OrthoMCL using publicly available online resources, she was able to pull out only 350 comparable genes across 10 species.
"When I run OrthoMCL on Wrangler, I'm able to get almost 2,000 genes that are comparable across the species," said Young. "This is an enormous improvement from what is already available. What we're looking to do with OrthoMCL is to allow us to make an increasing number of comparisons across species when we're looking at these very divergent, these very ancient species, separated by 450 million years of evolution."
Wrangler completed these runs in anywhere from 15 minutes to six hours, allowing scientists to explore new questions with larger collections of data and pursue insights that were previously out of reach.
The supercomputer is also being used to study human origins hidden in fossil data and to tune the energy efficiency of buildings.