In today’s computer chips, memory management is based on what computer scientists call the principle of locality: If a program needs a chunk of data stored at some memory location, it probably needs the neighboring chunks as well.
But that assumption breaks down in the age of big data, now that computer programs more frequently act on just a few data items scattered arbitrarily across huge data sets. Since fetching data from their main memory banks is the major performance bottleneck in today’s chips, having to fetch it more frequently can dramatically slow program execution.
This week, at the International Conference on Parallel Architectures and Compilation Techniques, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new programming language, called Milk, that lets application developers manage memory more efficiently in programs that deal with scattered data points in large data sets.
In tests on several common algorithms, programs written in the new language were four times as fast as those written in existing languages. But the researchers believe that further work will yield even larger gains.
The reason that today’s big data sets pose problems for existing memory management techniques, explains Saman Amarasinghe, a professor of electrical engineering and computer science, is not so much that they are large as that they are what computer scientists call “sparse.” That is, with big data, the scale of the solution does not necessarily increase proportionally with the scale of the problem.
“In social settings, we used to look at smaller problems,” Amarasinghe says. “If you look at the people in this [CSAIL] building, we’re all connected. But if you look at the planet scale, I don’t scale my number of friends. The planet has billions of people, but I still have only hundreds of friends. Suddenly you have a very sparse problem.”