Subversion (SVN) is an open source version control system that facilitates storage, access, and parallel development of source, scripts, binaries, and other file types. While Subversion is very popular, many of its users have experienced unacceptably long wait times when importing or checking in binaries, as well as when exporting them or checking them out. Fortunately, once you understand what causes this particular performance problem, it is possible to avoid it in your system.
In this article, I relay my experience investigating binary-related performance degradation in a real-world Subversion file system. I explain the basic problems encountered by the system's users and administrators and then show the results of specific investigations into the cause of those problems. I conclude with an overview of the findings and suggestions for optimizing Subversion for shorter access times and/or less space consumed on the server.
This article is primarily intended for system administrators who use Subversion for version control and would like to improve its performance when storing binary files. It may also interest anyone wanting to set up a Subversion system that will store binary files. See Resources if you need an introduction to Subversion.
Version control systems are typically used for file backup, parallel development, and change management. They are most often used by development teams to manage application source files. They are also sometimes used to manage tools and, occasionally, to store binaries. The downside of using a system like Subversion to store binaries is access time: Fetching a binary file from a version control system is usually much slower than simply copying or FTPing it from another machine or a shared drive. On the other hand, version control systems typically require less storage space than other types of file systems.
Subversion and other version control systems can save files using an algorithm that stores only the differences between a new version and the previous one, not the entire file. The saved differential data is known as a delta, or, more loosely, as the file's deltas. Because a version control system does not store each new version in its entirety, it requires less disk space for data storage than a standard file system.
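To make the idea of a delta concrete, here is a toy sketch in Python. It is not Subversion's actual delta algorithm; it simply uses the standard library's `difflib` to describe a new version as "copy" and "insert" instructions against the previous one. The function names `make_delta`, `apply_delta`, and `delta_payload_size` are hypothetical and chosen only for this illustration.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copy/insert instructions against `old` (toy example)."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes already present in the old version: store only the range.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Bytes that are new in this version must be stored in full.
            delta.append(("insert", new[j1:j2]))
        # "delete" needs no instruction: those old bytes are just not copied.
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version plus the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def delta_payload_size(delta: list) -> int:
    """Count the literal bytes the delta has to carry."""
    return sum(len(op[1]) for op in delta if op[0] == "insert")
```

When only a small part of a file changes between versions, the literal bytes carried by the delta are a small fraction of the whole file, which is where the disk-space saving comes from.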
Most version control systems cannot store binary files as deltas, but Subversion can. Many system administrators like the idea of saving disk space while also keeping source and binary files together and in sync in the same system. If only Subversion's binary storage worked as well in the real world as it does in theory.
I recently investigated Subversion's binary-storage-related performance problems in a real-world development system. The system had been up and running for a few months at the time of the investigation. The development team universally accepted the benefits of having source, scripts, and binaries together and in sync in the same version-controlled file system. Being able to fetch an entire development environment with a single command made for a significantly less error-prone environment. It also helped minimise the barrier to entry for new developers.
The team had two growing concerns, however. The first problem, experienced by all users of the system, was the time required to check out or export the binaries. Using Subversion for this purpose was orders of magnitude slower than simply copying the items from another machine or a large shared disk. Only the system's administrators were aware of the second problem, which was the amount of server space consumed by the binaries.
Once these two issues were identified, we began to investigate the situation. We hoped to continue storing the binaries in our Subversion file system, but first we needed to find a workaround for the time and space issues involved.
Binary storage formats compared
Our first step in the investigation was to examine how various storage methods impacted the time required to place a binary file into Subversion, fetch it out, and put it back in its original form. We tried four methods: putting the binaries into Subversion as a large directory structure, creating a single file containing the directory structure and then putting that into Subversion, compressing the single file, and saving the binaries as deltas rather than putting an entire new version into Subversion each time.
Table 1 shows a number of alternative methods of storing a binary in Subversion. It also shows the time consumed by moving the binaries in and out of Subversion. Details of the storage methods are as follows:
- Compressed tar - import - export: The binary directories were combined into a single compressed tar file (a tar.gz file) and then put into Subversion using the import command. The file was then fetched from Subversion using the export command, and then the original directories were retrieved from the compressed tar file (that is, the file was untarred).
- Tar - import - export: Almost the same as above, but in this case the file was not compressed, so it was a tar file rather than a tar.gz file.
- Import - export: The binary directories were put into Subversion as they were, using the import command. They were then retrieved using the export command.
- Efficient check-in: An efficient check-in script was used to put the binary directories into Subversion. The binaries were retrieved using the export command. (See below for more about the efficient check-in script.)
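The first two methods in the list can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tooling used in the investigation: the helper names (`pack`, `unpack`, `import_command`, `export_command`) are hypothetical, and any repository URL you pass in is your own. The `svn import` and `svn export` command lines are only built here, not executed.

```python
import tarfile
from pathlib import Path


def pack(src_dir, archive, compress=True):
    """Combine a directory tree into one tar file, optionally gzip-compressed."""
    mode = "w:gz" if compress else "w"
    with tarfile.open(str(archive), mode) as tar:
        tar.add(str(src_dir), arcname=Path(src_dir).name)
    return archive


def unpack(archive, dest_dir):
    """Restore the original directory tree from a tar or tar.gz file."""
    with tarfile.open(str(archive), "r:*") as tar:
        tar.extractall(str(dest_dir))


def import_command(path, url, message):
    """Build the command line that puts a file or directory into Subversion."""
    return ["svn", "import", str(path), url, "-m", message]


def export_command(url, dest):
    """Build the command line that fetches an item back out of Subversion."""
    return ["svn", "export", url, str(dest)]
```

A caller would run the returned command lists with something like `subprocess.run(cmd, check=True)`. For the plain "Import - export" method, `pack` and `unpack` are skipped and the directory itself is passed to `import_command`.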
We gathered numerous results to establish proven findings. Table 1 shows a single representative example:
Table 1. Storage formats compared for time
| Method | Input time | Output time |
|---|---|---|
| Compressed tar - import - export | 1m 28s | 0m 30s |
| Tar - import - export | 1m 51s | 0m 47s |
| Import - export | 28m 0s | 9m 30s |
| Efficient check-in - export | 2h 15s | 9m 30s |
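
Timings in the style of Table 1 could be collected with a small harness like the one below. This is a sketch only; the investigation's actual measurement method is not described, and the names `time_command` and `format_duration` are hypothetical. It takes the median of several runs, in the spirit of gathering numerous results and reporting a representative one.

```python
import statistics
import subprocess
import time


def time_command(cmd, repeats=3):
    """Run a command several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def format_duration(seconds):
    """Format seconds in the 'Xm Ys' style used in the tables."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"
```

Note that repeating an import changes the repository each time, so repeated input measurements would need a fresh target path or repository per run.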
Note that whenever an item is put into Subversion using the import command, a whole new copy of the item is stored,
with no attempt made to save it as deltas. As a result, the import command is quick but not space efficient. Subversion
comes with a script that attempts a space-efficient check-in. The efficient
check-in script compares the version to be put into Subversion with a version of
the item already there. The new version is then stored as the deltas between the two.
The results displayed in Table 1 clearly demonstrate that the binary storage format used significantly affects the time required to move binaries into and out of Subversion. The most time-efficient method is to create a single, compressed file containing the binary. Even creating a single, uncompressed file containing the binary takes less than one tenth of the time required to import the binary in its initial structure.
These conclusions make sense because much of Subversion's import processing time is spent recursing the directories to be processed, so creating a single file leads to dramatic savings. Using Subversion's efficient check-in script with binaries resulted in unacceptable wait times. The script takes so much time because it actually involves exporting a full copy of the binary to the local disk for comparison.
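The comparison step that makes efficient check-in slow can be illustrated with a short sketch. This is not the script that ships with Subversion; it is a hypothetical Python outline of the same idea, assuming the previous version has already been exported to local disk. The names `file_digests` and `plan_checkin` are invented for this example.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's relative path to a digest of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha1(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def plan_checkin(previous_export, new_tree):
    """Work out which files must be added, deleted, or modified."""
    old = file_digests(previous_export)
    new = file_digests(new_tree)
    return {
        "add": sorted(new.keys() - old.keys()),
        "delete": sorted(old.keys() - new.keys()),
        "modify": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }
```

Every file in both trees has to be read in full before anything is sent to the server, on top of the time needed to export the previous version in the first place. For large binaries, that cost dominates.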
These findings only account for how different storage methods impact the time it takes to store, access, and retrieve binary files in Subversion. We still needed to investigate the amount of server disk space used to store binaries in different storage formats.
Storage formats and space consumption
Table 2 shows a number of alternative methods of storing a binary in
Subversion. It also shows the space used on the Subversion server when using the
import command versus the efficient check-in
script. The first column in Table 2 describes how the binaries were stored by the
Subversion user when they were put into the server. The second column shows
the size of the binaries on the local system. The final two columns show the
size of the binaries on the server, first using the Subversion import command, and then using the Subversion
efficient check-in script.
Once again, numerous results were gathered to establish proven findings, but only a single representative example is shown.
Table 2. Storage formats compared for space
| Stored locally as | Size locally (MB) | Size on server, imported (MB) | Size on server, efficient check-in (MB) |
|---|---|---|---|
| Directories | 285 | 128 | 61 |
| Tar file | 219 | 103 | 102 |
| Compressed tar file | 75 | 75 | 75 |
Table 2 shows that in terms of server space, when using the import command, the most space-efficient storage method
is to store the binaries in a single, compressed file. This consumes roughly
75 percent of the space consumed by a single non-compressed file, and roughly
60 percent of the space used when importing the binary as a normal directory
structure. Using efficient check-in yields even better results, however.
Efficiently checking in a directory structure uses less than 50 percent
of the space required when importing a normal directory structure.
That said, efficiently checking in the binaries as a single uncompressed
file gains very little over importing, and efficiently checking in the
binaries as a compressed tar file gains nothing at all.
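
The percentages quoted above follow directly from the server sizes in Table 2. The short calculation below reproduces them; the dictionary layout and the `percent` helper are just a convenient way to hold the table's numbers.

```python
# Server-side sizes in MB, copied from Table 2.
server_size = {
    "Directories": {"imported": 128, "efficient": 61},
    "Tar file": {"imported": 103, "efficient": 102},
    "Compressed tar file": {"imported": 75, "efficient": 75},
}


def percent(part, whole):
    """Express part as a whole-number percentage of whole."""
    return round(100 * part / whole)


compressed = server_size["Compressed tar file"]["imported"]

# Compressed tar versus uncompressed tar, both imported: 73 percent.
vs_tar = percent(compressed, server_size["Tar file"]["imported"])

# Compressed tar versus imported directory structure: 59 percent.
vs_dirs = percent(compressed, server_size["Directories"]["imported"])

# Efficient check-in versus import, for each local format.
efficient_vs_import = {
    name: percent(sizes["efficient"], sizes["imported"])
    for name, sizes in server_size.items()
}
# Directories: 48, Tar file: 99, Compressed tar file: 100
```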
These results indicate that Subversion's own compression algorithm is
slightly better at compressing binaries than the gzip command that was used to compress the files
locally. It's also clear that Subversion cannot compress an already
compressed file. Perhaps most interestingly, the most space-efficient method of storing a set of binaries in Subversion is to use efficient check-in on a regular directory structure.
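
The observation that an already compressed file cannot be compressed further is a general property of compression, and it is easy to check outside Subversion. The sketch below uses Python's `zlib` module as a stand-in; that choice is an assumption made for illustration and says nothing about how Subversion compresses data internally. The sample data and the name `compress_twice` are likewise hypothetical.

```python
import random
import zlib


def compress_twice(data: bytes):
    """Return the raw size, the size after one pass, and after a second pass."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)
    return len(data), len(once), len(twice)


# Build repeatable sample data with plenty of redundancy to remove.
rng = random.Random(0)
words = [b"alpha", b"bravo", b"charlie", b"delta", b"echo", b"foxtrot"]
sample = b" ".join(rng.choice(words) for _ in range(20000))

raw_size, once_size, twice_size = compress_twice(sample)
```

The first pass shrinks the sample considerably, while the second pass has almost no redundancy left to work with, so its output is about the same size as its input. This matches the compressed tar file row in Table 2, where the size is unchanged at every stage.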
Authentication and performance
Next, we investigated the effect of various server authentication methods on the time required to move binaries in and out of Subversion. Table 3 shows a number of alternative methods of authenticating with the Subversion server. For each method, we measured the time required to import the binaries as a full directory structure, as an uncompressed tar file, and as a compressed tar file. The authentication methods are as follows (note that these are incremental, so "ldap_group" includes the settings for "no_path_auth disabled" and "Basic"):
- No auth uses file-system authentication only, accessed locally.
- Basic means Subversion was accessed via the Apache Web server using HTTP.
- no_path_auth disabled means a large amount of path processing was turned off.
- ldap_group means LDAP groups were set up and used.
- ssl uses the HTTPS protocol.
A single representative example of our findings is provided in Table 3.
Table 3. Input time for various authentication methods
| Authentication method | Directories | Tar file | Compressed tar file |
|---|---|---|---|
| No auth | 29m 25s | 2m 20s | 1m 17s |
| Basic | 44m 23s | 2m 51s | 1m 25s |
| no_path_auth disabled | 44m 28s | 2m 53s | 1m 24s |
| ldap_group | 45m 21s | 2m 53s | 1m 24s |
| ssl | 45m 27s | 2m 52s | 1m 25s |
As Table 3 shows, we achieved the fastest import times by having no authentication on the Subversion server. In the majority of cases, however, some form of authentication is required. We found that the method of authentication used does little to affect the time required to move binaries in and out of Subversion.
We wanted to determine whether hardware should be a major consideration when addressing Subversion performance issues. For this test, we hosted the Subversion server on a variety of machines and compared the time required to import binaries into each one. With a near-infinite range of machine types available, we limited our investigation to more general considerations, as shown in Table 4. Note that Desktop 1 was a shared machine, meaning that it was being utilised for other tasks in addition to hosting the Subversion repository. Desktop 2 was a dedicated machine, used only to serve the Subversion repositories.
Table 4. Input times for various machines
| Machine type | Machine specification | Time |
|---|---|---|
| Desktop 1, shared | CPU: 2x500MHz PIII, RAM: 500MB | 28m |
| Desktop 2, dedicated | CPU: 1x3200MHz P4, RAM: 2000MB | 2m |
| Server, shared | 0.2 CPUs | 28m 0s |
| Server, shared | 1 CPU | 13m 19s |
| Server, shared | 2 CPUs | 13m 19s |
The results in Table 4 demonstrate that the machine used to run the Subversion server hugely affects the time required to import binaries. Comparing the two desktop machines, the dedicated, powerful Desktop 2 was 14 times faster than the shared, less powerful Desktop 1. In fact, the dedicated desktop was much faster than even the large server machine, though increasing the server's CPU power initially doubled the speed of the import.
Clearly, choosing the right machine is a very important part of minimising the time taken to move binaries into and out of Subversion.
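The speedups quoted for Table 4 can be checked with a few lines of Python. The `seconds` and `speedup` helpers are hypothetical names introduced for this calculation; the durations are copied from the table.

```python
import re


def seconds(text):
    """Convert a duration such as '13m 19s' or '28m' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


def speedup(slow, fast):
    """How many times faster the second duration is than the first."""
    return seconds(slow) / seconds(fast)


desktop_gain = speedup("28m", "2m")             # dedicated vs shared desktop
first_cpu_gain = speedup("28m 0s", "13m 19s")   # server: 0.2 CPUs to 1 CPU
second_cpu_gain = speedup("13m 19s", "13m 19s") # server: 1 CPU to 2 CPUs
```

The dedicated desktop comes out 14 times faster than the shared one. Going from 0.2 CPUs to 1 CPU on the server gives a factor of about 2.1, and adding a second CPU gives no further gain, which is why the doubling is described as an initial effect.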
The findings of this investigation are clearly specific to the system being investigated, so it is unlikely that the actual values shown are of much significance to other systems. The patterns are more important because they will be replicated in any Subversion system. According to our findings, when storing a set of binaries in Subversion:
- The most time-efficient method is to create a single, compressed file containing the binary.
- The most space-efficient method is to use Subversion's efficient check-in script on a regular directory structure.
- Using any form of authentication on the Subversion server will result in performance loss.
- A dedicated, powerful machine is optimal for running Subversion.
Generally speaking, these findings serve as a framework for optimizing Subversion performance when storing and retrieving binary files. The findings are complicated by the following considerations, however:
- The investigation into hardware was particularly specific to the system being investigated. The binaries used in the test system are very large compared to those used by many projects. The patterns identified do still apply, but perhaps to a lesser extent. Because some of the differences found span orders of magnitude, it is still to your advantage to base your decisions on the findings of the investigation.
- The investigation into authentication methods was carried out by comparing the authentication method against the time taken to move the binaries into Subversion. This measure was the most important to the system under test, but moving files into Subversion is actually the process least affected by the authentication method. Commands such as svn log, which access many different paths and revisions, are much more dependent on the authentication method. It is therefore worth noting that while the choice of authentication method is not particularly important in improving performance with binary files, it may be important in other areas.
- The benefits of using the efficient check-in script will also be determined by the nature of the project. The amount of space required on the Subversion server when using efficient check-in is highly dependent on the amount of change that occurs in the binaries between each version checked in. The space efficiency afforded by efficient check-in is greatly increased in cases where the amount of change between versions is relatively low (as was the case in the development system under test).
- It is questionable whether combining a directory structure into a single file, compressed or not, is a viable option for every system. One drawback of this setup is that it prevents the binary from being browsed in most circumstances. It would also prevent you from directly processing the binary in Subversion. You would have to remove the binary from Subversion and revert it to its original form before altering it. In some cases, these issues will rule out the single-file approach or render it irrelevant to the system being considered.
The performance conclusions in this article are broadly applicable to most any system using Subversion to store and retrieve binary files. The decision of how to optimize Subversion must, however, be based on the circumstances and resources of a given system. For example, in the case of the system under test, time was a much higher priority than server space. It made sense, therefore, to store binaries as compressed tar files before importing them into and exporting them out of Subversion. Compressing the binaries yields the shortest possible wait times, thus meeting the requirements of the project. The repercussions for server space usage are negligible to the system, and in any case compressed tar files take up only a little more space than using efficient check-in.
In cases where time is the only consideration, combining a fast, dedicated machine with a compressed tar file of the smallest binary possible is the ideal optimization. Disabling authentication would save even more time, though most systems require authentication. The choice of authentication method does not much affect the time required to store and retrieve binaries.
If server space is the only motivator, then using efficient check-in on regular directory structures would use the least amount of space. This method becomes more beneficial if binaries change very little between versions. In systems with large binaries, time would have to be of very little concern for this to be the best solution.
In many systems, a compromise may be possible. When time is of the essence, try importing a single compressed file. This will allow you to put the file into Subversion as quickly as possible but also allow others to export it from Subversion quickly. However, at times when speed is not an issue, you could employ efficient check-in. This would be sensible in cases where the binaries are unlikely to be required but must be stored in case they are ever needed.
Knowing how to effectively store binaries in Subversion can save hundreds of hours of team members' time and gigabytes of server space. Making an educated decision based on the details and requirements of the system in question is the most sensible approach to take. This article should help users and system administrators alike do just that.
Learn
- "Introducing Subversion" (Elliotte Rusty Harold, developerWorks, June 2006): Includes a brief history of version control and an exercise in checking files into and out of Subversion.
- "How to use Subversion with Eclipse" (Chris Herboth, developerWorks, July 2006): Easy instructions for switching Eclipse from CVS to Subversion.
- "Version control for Linux" (M. Tim Jones, developerWorks, October 2006): An overview of software configuration management systems for Linux.
- "Create a blog from scratch with PHP and Subversion" (Tyler Anderson, developerWorks, February 2006): A tutorial introduction to using Subversion in a simple Web development project.
- developerWorks Java technology zone: Hundreds of articles about every aspect of Java programming.
Get products and technologies
- Download Subversion 1.4.3: The latest and best version of Subversion to date.
Discuss
- developerWorks blogs: Get involved in the developerWorks community.
David Bell has worked for the IBM Java Technology Centre for the last five years. He has performed build, test, and development roles, has owned the Net and Nio Java components, and has designed and developed numerous development processes and automated tools, some of which have made use of Subversion. He is currently a senior Java developer for the latest release of the IBM JDK.