Carolina is at “the epicenter of computer science and data science” because of the work of the Renaissance Computing Institute, Executive Vice Chancellor and Provost Robert A. Blouin said before a RENCI presentation to the University Board of Trustees earlier this year.
RENCI, a collaborative data science lab launched by Carolina, Duke and NC State in 2004, has emerged as one of the country’s leading institutions for collecting, managing and analyzing a wide variety of data from scientists, research institutions, universities and businesses.
For example, RENCI plays a key role in the Biomedical Data Translator Consortium, a project funded by the National Center for Advancing Translational Sciences that aims to overcome challenges in discovering insights from the plethora of biomedical datasets available today. The work of the Consortium involves over 200 members spanning 11 teams and 28 institutions across the globe. RENCI director Stanley Ahalt serves as the lead PI on the UNC Translator team, but points out that the team is a cross-campus initiative, including researchers from the NC TraCS Institute at the School of Medicine and the Institute for the Environment.
In one particular Translator use case, RENCI’s job is to create a knowledge network about asthma connecting data from all these sources: electronic health records of asthma patients, data about exposure to various environmental factors, studies about how each factor affects different human genes, and information linking genes to the development of asthma.
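To make the idea of a knowledge network concrete, here is a minimal sketch in Python. It is not RENCI's or the Translator Consortium's actual software; every identifier and record in it (`patient:P1`, `exposure:E1`, `gene:G1`, the relationship names) is a hypothetical placeholder. It only shows how facts from the four kinds of sources could be stored as links and then followed from an environmental exposure to asthma.

```python
from collections import defaultdict, deque

# Hypothetical placeholder records, one per kind of source named in the article.
# None of these are real patients, exposures, genes or findings.
EDGES = [
    ("patient:P1", "has_diagnosis", "condition:asthma"),  # health record
    ("patient:P1", "exposed_to", "exposure:E1"),          # environmental data
    ("exposure:E1", "affects", "gene:G1"),                # exposure-gene study
    ("gene:G1", "linked_to", "condition:asthma"),         # gene-disease link
]


def build_graph(edges):
    """Index the links by their starting point so they can be followed."""
    graph = defaultdict(list)
    for subject, relation, target in edges:
        graph[subject].append((relation, target))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of links, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


graph = build_graph(EDGES)
path = find_path(graph, "exposure:E1", "condition:asthma")
for subject, relation, target in path:
    print(subject, relation, target)
```

In this toy network the search passes through the gene, which is the kind of connection that is hard to see while the four datasets sit in separate places.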
RENCI’s work is even more critical in today’s knowledge economy and in a world awash in data, Ahalt told the trustees. In his presentation, he said that 90% of all data existing today was created in the last two years and that people create 2.5 quintillion bytes of data per day. (To visualize this number, according to a post from Yappn Corp, imagine covering the surface of the Earth with pennies — five times.)
“The amount of information that exists, and the speed it continues to be generated, is too vast,” he said. And for the information to be truly useful, it must be connected, across disciplines and around the world, and arranged in a way that helps humans navigate it with the help of algorithms. An algorithm is a procedure that describes the exact steps needed for the computer to solve a problem or reach a goal. A simple example would be collecting users’ email addresses on a website.
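Written out as exact steps, the email example might look like the short Python sketch below. The function name `collect_email` and the specific checks are assumptions made for illustration; a real website would validate addresses more carefully.

```python
def collect_email(raw, collected):
    """Add one address to the list; return True if it was stored.

    Hypothetical helper for illustration. The validity check is
    deliberately crude: one "@" with text before it and a dot after it.
    """
    # Step 1: tidy the input so " Ada@Example.org " and "ada@example.org" match.
    email = raw.strip().lower()

    # Step 2: reject anything that does not look like an address.
    local, at_sign, domain = email.partition("@")
    if not at_sign or not local or "." not in domain or "@" in domain:
        return False

    # Step 3: skip addresses that were already collected.
    if email in collected:
        return False

    # Step 4: store the address.
    collected.append(email)
    return True


addresses = []
collect_email("  Ada@Example.org ", addresses)
collect_email("ada@example.org", addresses)   # duplicate, ignored
collect_email("not-an-email", addresses)      # malformed, ignored
print(addresses)
```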
But algorithms for the work RENCI is doing are a bit more complicated. That’s where data scientists come in.
“You are seeing a movement, particularly among elite universities at the top of their game, to take full advantage of the data that their scientists, clinicians and researchers create and put the data together with other people’s data, not just locally, but in a global fashion,” Ahalt said.
Ahalt calls this serendipity, but this kind of data science doesn’t happen by accident. Call it serendipity by design, much as librarians organize books so that browsers find what they need and also stumble upon related information housed on nearby shelves.
‘From paper to digital’
Now more information is stored digitally online than on library shelves. “Between 2002 and 2003, we shifted from collecting data and wisdom from paper and into a digital format,” Ahalt said. “And it has changed everything we do.”
A key challenge in the digital age is to find ways to foster intentional connections among research scientists across different disciplines and universities. At Carolina, RENCI’s powerful and advanced information technology systems make this serendipity by design possible. Data science tools, search engines like Google, internet analytics and cloud services are part of this system.
Each source of information has its own way of collecting and analyzing data, and “it takes a lot of mundane work to clean things up because data doesn’t come into the world tidy and neat or usable,” Ahalt said. “Arranging data in a way that researchers can make subsequent use of it is an important part of what we do.”
RENCI helps to arrange the data so that algorithms can be created to make it easier for researchers to spot important connections in what once seemed random data. The information can be used by researchers, doctors and policymakers — if the algorithms are done well.
“Not everybody is going to trust the arrangement of knowledge by algorithms, nor should they,” Ahalt said. “As academicians, we have to be aware that the algorithms and the way we do computations have the inherent bias both of the data that is used to train the algorithms and the biases of the people who created them.”