
Memory Locality

How can Java code be 85x slower than C++ code solving the same problem? This post tries to answer that question.

Why am I asking it in the first place? It all started with a seemingly simple exercise. We were working on a large scale analytics (aka big data) project and had trouble agreeing on what results a particular analysis should return. I decided to write a C++ implementation, and a colleague decided to use Java. The goal was to have two completely independent implementations for cross validation: if they agreed on the result, then the likelihood of both implementations being correct would be quite high. So we went with our implementations, and we were happy to see that they agreed on results. We had nailed down the problem we were facing, end of story.

Wait a minute, if that was it, why am I blogging? Because, to our surprise, my C++ code was running 85 times faster than my colleague's Java code. And the running times were large enough (minutes) to offset any warmup required in Java (when the byte code is first compiled). We cannot explain the difference by a general speed difference between C++ and Java: Java is not 85x slower than C++. We can debate the relative speed of both languages, but it is nowhere near a factor of 85. All I can say is that the web is full of small code examples where Java is as fast as C++. Other comparisons using larger test programs, like this Google paper and this comparison, say that Java is between 2x and 4x slower than C++, depending on the program used for testing. This is consistent with my own experience. So we were facing something else that was causing a much larger speed difference.
Before diving into our specific project, let me be clear: I don't think that C++ should be the language of choice in general. As a matter of fact, I am quite in agreement with Why
It so happens that my analytics use case fits the first use case. Note that another good choice in that case is the good old C language, if you don't care for object oriented programming. The point I'd like to make now is that the first and last use cases are deeply related. Indeed, one way to achieve better performance is to think about how data is organized in memory.

Why is data layout relevant to performance? Because modern CPUs manage data via various caches. When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory. More on this topic can be found on wikipedia, and here. The latter contains this nice description:

Having caches only helps if, when the processor needs to get some data, it is already in the cache. Thus, the first time the processor accesses memory, it must wait for the data to arrive. On subsequent reads from the same location, there is a good chance that the cache will be able to serve the memory request without involving main memory. Of course, since the cache is much smaller than main memory, it can't store all of main memory. The cache is constantly throwing out information about memory locations in order to make space for new data. The processor only gets a speedup from the cache if the data fetched from memory is still in the cache when it is needed again. When the cache has the data that is needed by the processor, it is called a cache hit. If not, it is a cache miss. The ratio of the number of hits to misses is called the cache hit ratio.

Because memory is so much slower than the processor, the cache hit ratio is critical to overall performance. For example, if a cache miss is a hundred times slower than a cache hit, then a cache miss ratio of 1% means that half the time is spent on cache misses. Of course, the real situation is more complex, because secondary and tertiary caches often handle misses in the primary cache.

The cache records cached memory locations in units of cache lines containing multiple words of memory. A typical cache line might contain 4-32 words of memory. On a cache miss, the cache line is filled from main memory. So a series of memory reads to nearby memory locations are likely to mostly hit in the cache: when there is a cache miss, a whole sequence of memory words is requested from main memory at once. This works well because memory chips are designed to make reading a whole series of contiguous locations cheap.

Therefore, caches improve performance when memory accesses exhibit locality: accesses are clustered in time and space, so that reads from memory tend to request the same locations repeatedly, or memory locations near previous requests. Caches are designed to work well with computations that exhibit locality; such computations will have a high cache hit ratio.
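To see locality at work in a few lines, here is a minimal, self-contained C++ sketch (my own, not from the project). It sums the same matrix twice: once walking memory contiguously, once with a large stride. Both loops perform exactly the same additions, yet on typical hardware the strided version is several times slower, purely because of cache misses.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<double> m(n * n, 1.0); // one matrix, stored row-major

    auto time_sum = [&](bool by_row) {
        const auto start = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                sum += by_row ? m[i * n + j]  // contiguous: next read is on the same cache line
                              : m[j * n + i]; // strided: almost every read misses the cache
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("%s: sum=%.0f, %.3fs\n",
                    by_row ? "row-wise" : "column-wise", sum, elapsed.count());
    };

    time_sum(true);   // cache friendly
    time_sum(false);  // same work, poor locality
    return 0;
}
```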
Let me now come back to our project. It requires the implementation of various graph analytics algorithms on very large graphs (graphs having more than a trillion edges). We are using shar

Using Java, one is tempted to use objects all over the place. Before you know it, you have a node class when you deal with graphs. Given we wanted to scale to large graphs, representing edges as objects was not a good idea. Moreover, our graphs are sparse: the number of edges in the graph is small compared to the number of node pairs. In order to store information on edges, my colleague used a sparse data structure based on a hash map. His algorithm was then quite similar to mine. The issue is that such a high-level approach is unlikely to lead to memory locality. Indeed, there is little control over how Java allocates objects in memory. As a result, data representing neighbor nodes and their edges may be scattered in memory, resulting in a low cache hit ratio.
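To make the contrast concrete, here is a minimal C++ sketch of that hash-map approach (my colleague's code was Java; the key packing and the EdgeData payload here are my own illustration):

```cpp
#include <cstdint>
#include <unordered_map>

struct EdgeData { double weight; }; // hypothetical per-edge payload

// Pack a (from, to) node pair into one 64-bit key.
static inline std::uint64_t edge_key(std::uint32_t from, std::uint32_t to) {
    return (static_cast<std::uint64_t>(from) << 32) | to;
}

int main() {
    std::unordered_map<std::uint64_t, EdgeData> edges; // sparse edge map
    edges[edge_key(0, 2)] = {1.5};
    edges[edge_key(0, 3)] = {0.5};
    // Both edges leave node 0, yet they are *not* adjacent in memory: the
    // map places entries wherever their hashes fall, so scanning one node's
    // edges touches many unrelated cache lines.
    return 0;
}
```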
I used a very different approach in my C++ code, designing a data structure that favors locality: data representing edges incident to a given node is stored in consecutive memory locations. This data structure is a derivative of the way CPLEX stores the LP matrix. It can be seen as a way to represent sparse graphs using an adjacency list, albeit more compact and efficient than well-known implementations. Edges are sorted by their from node, and three arrays are used (indexing starts at 0):
- EdgeFrom, where EdgeFrom[i] is the node edge i starts from;
- EdgesTo, where EdgesTo[i] is the node edge i points to;
- a third array holding the information attached to edge i.
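In C++, a minimal sketch of this layout could look as follows; the field names and the EdgeData payload (a double here) are assumptions for illustration, not the project's actual declarations:

```cpp
#include <cstdint>
#include <vector>

struct Graph {
    // Edges are sorted by their from node, so all edges leaving a given
    // node occupy one contiguous range in each array.
    std::vector<std::uint32_t> EdgeFrom; // EdgeFrom[i]: node edge i starts from
    std::vector<std::uint32_t> EdgesTo;  // EdgesTo[i]:  node edge i points to
    std::vector<double>        EdgeData; // EdgeData[i]: information attached to edge i
};
```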
This data structure can be made more compact: we can get rid of the EdgeFrom array. Instead, we store for each node n the index at which the first edge incident to n appears in the EdgesTo array. This information is stored in the EdgesBegin array:
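A sketch of the compact form; the sentinel entry at the end of EdgesBegin is an assumed (and common) convention that makes the edge range of the last node uniform with the others:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CompactGraph {
    // One entry per node, plus a final sentinel equal to the number of
    // edges, so the edges leaving node n are at indices
    // EdgesBegin[n] .. EdgesBegin[n + 1] - 1 of the edge arrays.
    std::vector<std::size_t>   EdgesBegin;
    std::vector<std::uint32_t> EdgesTo;  // target node of each edge
    std::vector<double>        EdgeData; // illustrative edge payload
};
```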
Example.
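As an illustration (this small instance is mine): take a directed graph with 4 nodes and 5 edges 0→1, 0→2, 1→3, 2→3, 3→0, with edges sorted by their from node. The arrays then contain:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint32_t> EdgeFrom   = {0, 0, 1, 2, 3}; // dropped in the compact form
std::vector<std::uint32_t> EdgesTo    = {1, 2, 3, 3, 0};
std::vector<std::size_t>   EdgesBegin = {0, 2, 3, 4, 5}; // one entry per node, plus a sentinel
// The edges leaving node 0 occupy indices EdgesBegin[0]..EdgesBegin[1]-1,
// i.e. entries 0 and 1 of EdgesTo. EdgeFrom is fully redundant once
// EdgesBegin is available, which is exactly why the compact form drops it.
```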
This representation has many interesting properties, including:
- the edges incident to a given node occupy consecutive memory locations, so iterating over them exhibits exactly the locality described above;
- it is compact: memory consumption amounts to a few machine words per node plus a few per edge;
- locating the edges of a node is a constant-time lookup in EdgesBegin, followed by a linear scan.
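For instance, a typical read-only traversal over the compact layout looks like this (a sketch reusing the hypothetical arrays from the examples above):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Print the neighbors of node n: one constant-time lookup in EdgesBegin
// (which includes a final sentinel entry), then a linear scan over a
// contiguous slice of EdgesTo -- consecutive reads that mostly hit in cache.
void print_neighbors(const std::vector<std::size_t>& EdgesBegin,
                     const std::vector<std::uint32_t>& EdgesTo,
                     std::uint32_t n) {
    for (std::size_t e = EdgesBegin[n]; e < EdgesBegin[n + 1]; ++e)
        std::printf("%u -> %u\n", static_cast<unsigned>(n),
                    static_cast<unsigned>(EdgesTo[e]));
}
```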
It also has some drawbacks. For instance, updates like adding an edge require lots of value copying. We clearly designed for a "write once, read often" kind of computation.

The use of the above data structure is what explains most of the difference in running time between C++ and Java. We could now implement a similar data structure in Java to check that it too gets much faster running times. However, given we have a correct and fast implementation in C++, we have not bothered to do it yet. Indeed, we are now able to process graphs having 1 billion edges in about 8 GB of memory.

What can we conclude from this discussion? I'd say that if one cares about performance, then one should think about memory locality more than about programming language selection.