Performance tuning Subversion

Store and handle binaries without the performance drag

David Bell (david.bell@uk.ibm.com), Senior Java Developer, IBM
David Bell has worked for the IBM Java Technology Centre for the last five years. He has performed build, test, and development roles, has owned the Net and Nio Java components, and has designed and developed numerous development processes and automated tools, some of which have made use of Subversion. He is currently a senior Java developer for the latest release of the IBM JDK.

Summary:  Subversion is one of the few version control systems that can store binary files using a delta algorithm. Unfortunately, users have discovered that doing so results in a significant performance hit. In this article, senior developer David Bell explains why Subversion's performance suffers when handling binaries and suggests several ways to work around the problem.

Date:  22 May 2007
Level:  Introductory

Subversion (SVN) is an open source version control system that facilitates storage, access, and parallel development of source, scripts, binaries, and other file types. While Subversion is very popular, many of its users have experienced unacceptably long wait times when importing or checking in binaries, as well as when exporting or checking them out. Fortunately, once you understand what causes this particular performance problem, it is possible to avoid it in your system.

In this article, I relay my experience investigating binary-related performance degradation in a real-world Subversion file system. I explain the basic problems encountered by the system's users and administrators and then show the results of specific investigations into the cause of those problems. I conclude with an overview of the findings and suggestions for optimizing Subversion for shorter access times and/or less space consumed on the server.

This article is primarily intended for system administrators who use Subversion for version control and would like to improve its performance when storing binary files. It may also interest anyone wanting to set up a Subversion system that will store binary files. See Resources if you need an introduction to Subversion.

Why store binaries?

What is a binary file?

In the context of this article, a binary file is a file that has been compiled. You can also think of a binary file as one that is not stored as readable text. Executables, native libraries, and Java class files are all binary files. A collection of files being moved in and out of Subversion is also sometimes referred to as a "binary." Binary files cannot be diff'ed in the same way readable files can, and therefore have different characteristics from standard files when stored in a version control system.

Version control systems are typically used for file backup, parallel development, and change management. They are most often used by development teams to manage application source files. They are also sometimes used to manage tools and, occasionally, to store binaries. The downside of using a system like Subversion to store binaries is access time: Fetching a binary file from a version control system is usually much slower than simply copying or FTPing it from another machine or a shared drive. On the other hand, version control systems typically require less storage space than other types of file systems.

Subversion and other version control systems can save files using an algorithm that stores only the differences between a new version and the previous one, not the entire file. The saved differential data is known as a delta, or, more loosely, as the file's deltas. Because a version control system does not store each new version in its entirety, it requires less disk space for data storage than a standard file system.
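To make the idea of a delta concrete, here is a minimal sketch in Python. It is not Subversion's own delta algorithm; it uses the standard library's difflib purely as an illustration, and the names make_delta, apply_delta, and stored_bytes are hypothetical. A new version is described as ranges copied from the old version plus only those bytes that are actually new.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus any bytes that are genuinely new."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the old version: record only the range.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only these bytes have to be stored for the new version.
            delta.append(("data", new[j1:j2]))
        # A "delete" contributes nothing to the new version.
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def stored_bytes(delta: list) -> int:
    """Count the literal bytes the delta carries (copy ranges cost almost nothing)."""
    return sum(len(op[1]) for op in delta if op[0] == "data")
```

When two versions differ by only a few bytes, stored_bytes is a small fraction of the full file size, which is where the disk-space saving comes from.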

Most version control systems cannot store binary files as deltas, but Subversion can. Many system administrators like the idea of saving disk space while also keeping source and binary files together and in sync in the same system. If only Subversion's binary storage worked as well in the real world as it does in theory.


The real-world situation

I recently investigated Subversion's binary-storage-related performance problems in a real-world development system. The system had been up and running for a few months at the time of the investigation. The development team universally accepted the benefits of having source, scripts, and binaries together and in sync in the same version-controlled file system. Being able to fetch an entire development environment with a single command made for a significantly less error-prone environment. It also helped minimise the barrier to entry for new developers.

The team had two growing concerns, however. The first problem, experienced by all users of the system, was the time required to check out or export the binaries. Using Subversion for this purpose was orders of magnitude slower than simply copying the items from another machine or a large shared disk. Only the system's administrators were aware of the second problem, which was the amount of space consumed by the binaries.

Once these two issues were identified, we began to investigate the situation. We hoped to continue storing the binaries in our Subversion file system, but first we needed to find a workaround for the time and space issues involved.

Details of the investigation

The development system was a heterogeneous environment consisting of multiple platforms and operating systems, but no part of the system was immune to performance issues related to Subversion's treatment of binary files. Accordingly, we started our investigations by moving various binaries into and out of Subversion in various ways, stored in various formats, on various machines, and using various forms of authentication. We did all of this in a controlled environment so that we could assess the impact of each variable. In this article, I do not describe the inner workings of the investigations, but I do explain the results and conclusions.

Binary storage formats compared

Our first step in the investigation was to examine how various storage methods impacted the time required to place a binary file into Subversion, fetch it out, and put it back in its original form. We tried four methods: putting the binaries into Subversion as a large directory structure, creating a single file containing the directory structure and then putting that into Subversion, compressing the single file, and saving the binaries as deltas rather than putting an entire new version into Subversion each time.
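The measurement itself can be sketched as a small timing harness. This is an assumption about how such timings might be collected, not a description of the tooling actually used in the investigation; time_command, time_steps, and format_duration are hypothetical names, and the example step is a harmless placeholder command.

```python
import subprocess
import sys
import time


def time_command(cmd: list) -> float:
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def time_steps(steps: dict) -> dict:
    """Time each named step in order; `steps` maps a label to a command list."""
    return {label: time_command(cmd) for label, cmd in steps.items()}


def format_duration(seconds: float) -> str:
    """Render seconds in the 'Xm Ys' style used in the tables of this article."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m {secs}s"


if __name__ == "__main__":
    # Placeholder step so the sketch runs anywhere; replace it with real commands.
    results = time_steps({"placeholder": [sys.executable, "-c", "pass"]})
    for label, elapsed in results.items():
        print(label, format_duration(elapsed))
```

In a real test, each step would be a command such as creating the tar file, running svn import, or running svn export, timed separately so that the cost of each stage can be compared.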

Table 1 shows a number of alternative methods of storing a binary in Subversion. It also shows the time consumed by moving the binaries in and out of Subversion. Details of the storage methods are as follows:

  • Compressed tar - import - export: The binary directories were combined into a single compressed tar file (a tar.gz file) and then put into Subversion using the import command. The file was then fetched from Subversion using the export command, and then the original directories were retrieved from the compressed tar file (that is, the file was untarred).
  • Tar - import - export: Almost the same as above, but in this case, the file was not compressed so it was a tar file, rather than a tar.gz file.
  • Import - export: The binary directories were put into Subversion as they were, using the import command. They were then retrieved using the export command.
  • Efficient check-in: An efficient check-in script was used to put the binary directories into Subversion. The binaries were retrieved using the export command. (See below for more about the efficient check-in script.)
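The first two methods above can be sketched as follows. This is a minimal illustration using Python's tarfile module rather than the tar command line; pack_tree, unpack_tree, svn_import_cmd, and svn_export_cmd are hypothetical helper names, and the two svn helpers only build the argument lists for the standard import and export subcommands without running them.

```python
import tarfile
from pathlib import Path


def pack_tree(src_dir: str, archive_path: str, compress: bool = True) -> str:
    """Combine a directory tree into one tar file (gzip-compressed by default)."""
    mode = "w:gz" if compress else "w"
    with tarfile.open(archive_path, mode) as tar:
        # Store paths relative to the top-level directory name.
        tar.add(src_dir, arcname=Path(src_dir).name)
    return archive_path


def unpack_tree(archive_path: str, dest_dir: str) -> None:
    """Restore the original directory tree from a tar or tar.gz file."""
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)


def svn_import_cmd(path: str, repo_url: str, message: str) -> list:
    """Argument list for putting an unversioned file or tree into the repository."""
    return ["svn", "import", path, repo_url, "-m", message]


def svn_export_cmd(repo_url: str, dest: str) -> list:
    """Argument list for fetching a clean copy out of the repository."""
    return ["svn", "export", repo_url, dest]
```

The round trip is then: pack_tree, run the import command, run the export command on another machine, and unpack_tree. Passing compress=False gives the plain tar variant.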

We gathered numerous results to confirm that the findings were repeatable. Table 1 shows a single representative example:


Table 1. Storage formats compared for time
Method                              Input time   Output time
Compressed tar - import - export    1m 28s       0m 30s
Tar - import - export               1m 51s       0m 47s
Import - export                     28m 0s       9m 30s
Efficient check-in - export         2h 15s       9m 30s

Note that whenever an item is put into Subversion using the import command, a whole new copy of the item is stored, with no attempt made to save it as deltas. As a result, the import command is quick but not space efficient. Subversion comes with a script that attempts a space-efficient check-in. The efficient check-in script compares the version to be put into Subversion with a version of the item already there. The new version is then stored as the deltas between the two.
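The comparison step can be pictured with the following sketch. It is not the script that ships with Subversion; it is a hypothetical illustration, with assumed names tree_digest and compare_trees, of what any delta-friendly check-in must work out: which files were added, removed, or changed between the stored version and the new one.

```python
import hashlib
from pathlib import Path


def tree_digest(root: str) -> dict:
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def compare_trees(stored_root: str, new_root: str) -> dict:
    """Classify files as added, removed, or changed between two directory trees."""
    stored = tree_digest(stored_root)
    new = tree_digest(new_root)
    return {
        "added": sorted(set(new) - set(stored)),
        "removed": sorted(set(stored) - set(new)),
        "changed": sorted(
            name for name in set(new) & set(stored) if new[name] != stored[name]
        ),
    }
```

Note that this comparison needs a full local copy of the stored version, and every file in both trees has to be read.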

Time test results

The results displayed in Table 1 clearly demonstrate that the binary storage format used significantly affects the time required to move binaries into and out of Subversion. The most time-efficient method is to create a single, compressed file containing the binary. Even creating a single, uncompressed file containing the binary takes less than one tenth of the time required to import the binary in its initial structure.
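The "less than one tenth" figure can be checked directly from the input times in Table 1. The short sketch below converts the table's time strings to seconds and computes each method's input time as a fraction of the plain import; to_seconds and fraction_of_plain_import are hypothetical helper names.

```python
import re

# Input times copied from Table 1.
TABLE_1_INPUT = {
    "Compressed tar - import - export": "1m 28s",
    "Tar - import - export": "1m 51s",
    "Import - export": "28m 0s",
    "Efficient check-in - export": "2h 15s",
}


def to_seconds(text: str) -> int:
    """Convert a string such as '1m 51s' or '2h 15s' to a number of seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


def fraction_of_plain_import(method: str) -> float:
    """Input time of `method` as a fraction of the plain import's input time."""
    baseline = to_seconds(TABLE_1_INPUT["Import - export"])
    return to_seconds(TABLE_1_INPUT[method]) / baseline


if __name__ == "__main__":
    for name in TABLE_1_INPUT:
        print(f"{name}: {fraction_of_plain_import(name):.3f}")
```

The uncompressed tar file takes 111 seconds against 1,680 seconds for the plain import, or roughly 7 percent; the compressed tar file takes 88 seconds, or roughly 5 percent.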

These conclusions make sense because much of Subversion's import processing time is spent recursing through the directories to be processed, so creating a single file leads to dramatic savings. Using Subversion's efficient check-in script with binaries resulted in unacceptable wait times. The script takes so much time because it exports a full copy of the stored binary to the local disk for comparison.

These findings only account for how different storage methods impact the time it takes to store, access, and retrieve binary files in Subversion. We still needed to investigate the amount of server disk space used to store binaries in different storage formats.


Storage formats and space consumption

Table 2 shows a number of alternative methods of storing a binary in Subversion. It also shows the space used on the Subversion server when using the import command versus the efficient check-in script. The first column in Table 2 describes how the binaries were stored by the Subversion user when they were put into the server. The second column shows the size of the binaries on the local system. The final two columns show the size of the binaries on the server, first using the Subversion import command, and then using the Subversion efficient check-in script.

Once again, we gathered numerous results to confirm that the findings were repeatable, but only a single representative example is shown.


Table 2. Storage formats compared for space
Stored locally as      Size locally (MB)   Size on server (MB)
                                           Imported   Efficient check-in
Directories            285                 128        61
Tar file               219                 103        102
Compressed tar file    75                  75         75

Space test results

Table 2 shows that in terms of server space, when using the import command, the most space-efficient storage method is to store the binaries in a single, compressed file. This consumes roughly 75 percent of the space consumed by a single non-compressed file, and roughly 60 percent of the space used when importing the binary as a normal directory structure. Using efficient check-in yields even better results, however. Efficiently checking in a directory structure uses less than 50 percent of the space required when importing a normal directory structure. That said, efficiently checking in the binaries as a single uncompressed file gains very little over importing, and efficiently checking in the binaries as a compressed tar file gains nothing at all.

These results indicate that Subversion's own compression algorithm is slightly better at compressing binaries than the gzip command that was used to compress the files locally. It's also clear that Subversion cannot shrink an already compressed file any further. Perhaps most interestingly, the most space-efficient method of storing a set of binaries in Subversion is to use efficient check-in on a regular directory structure.
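The percentages quoted for Table 2 follow from simple ratios of the server-side sizes. The sketch below reproduces them; SERVER_MB holds the values from the table, and percent is a hypothetical helper.

```python
# Server-side sizes in MB, copied from Table 2.
SERVER_MB = {
    "directories": {"imported": 128, "efficient": 61},
    "tar": {"imported": 103, "efficient": 102},
    "compressed tar": {"imported": 75, "efficient": 75},
}


def percent(part: float, whole: float) -> float:
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Imported compressed tar versus imported plain tar ("roughly 75 percent").
compressed_vs_tar = percent(
    SERVER_MB["compressed tar"]["imported"], SERVER_MB["tar"]["imported"]
)

# Imported compressed tar versus imported directories ("roughly 60 percent").
compressed_vs_dirs = percent(
    SERVER_MB["compressed tar"]["imported"], SERVER_MB["directories"]["imported"]
)

# Efficient check-in of directories versus import of directories ("less than 50 percent").
efficient_vs_import_dirs = percent(
    SERVER_MB["directories"]["efficient"], SERVER_MB["directories"]["imported"]
)

# Efficient check-in of a tar file versus import of the same tar file ("gains very little").
efficient_vs_import_tar = percent(
    SERVER_MB["tar"]["efficient"], SERVER_MB["tar"]["imported"]
)
```

These work out to about 73, 59, 48, and 99 percent respectively, which matches the rounded figures in the text.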


Authentication and performance

Next, we investigated the effect of various server authentication methods on the time required to move binaries in and out of Subversion. Table 3 shows a number of alternative methods of authenticating with the Subversion server. For each method, we measured the time required to import the binaries as a full directory structure, as an uncompressed tar file, and as a compressed tar file. The authentication methods are as follows (note that these are incremental, so "ldap_group" includes the settings for "no_path_auth disabled" and "Basic"):

  • No auth uses file-system authentication only, accessed locally.
  • Basic means Subversion was accessed via the Apache Web server using HTTP.
  • no_path_auth disabled means a large amount of path processing was turned off.
  • ldap_group means LDAP groups were set up and used.
  • ssl uses the HTTPS protocol.

A single representative example of our findings is provided in Table 3.


Table 3. Input time for various authentication methods
Authentication method    Directories   Tar file   Compressed tar file
No auth                  29m 25s       2m 20s     1m 17s
Basic                    44m 23s       2m 51s     1m 25s
no_path_auth disabled    44m 28s       2m 53s     1m 24s
ldap_group               45m 21s       2m 53s     1m 24s
ssl                      45m 27s       2m 52s     1m 25s

Authentication test results

As Table 3 shows, we achieved the fastest import times by having no authentication on the Subversion server. In the majority of cases, however, some form of authentication is required. We found that the method of authentication used does little to affect the time required to move binaries in and out of Subversion.
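Both observations can be quantified from the directory-import column of Table 3. The sketch below computes the overhead of each authenticated setup relative to no authentication, and the spread between the authenticated setups themselves; the function names are hypothetical.

```python
# Directory-import times from Table 3, as (minutes, seconds).
DIRECTORY_IMPORT = {
    "No auth": (29, 25),
    "Basic": (44, 23),
    "no_path_auth disabled": (44, 28),
    "ldap_group": (45, 21),
    "ssl": (45, 27),
}


def seconds(minutes_seconds: tuple) -> int:
    """Convert a (minutes, seconds) pair to seconds."""
    minutes, secs = minutes_seconds
    return minutes * 60 + secs


def overhead_vs_no_auth(table: dict) -> dict:
    """Extra time of each authenticated setup, as a fraction of the no-auth time."""
    base = seconds(table["No auth"])
    return {
        name: seconds(t) / base - 1
        for name, t in table.items()
        if name != "No auth"
    }


def spread_between_methods(table: dict) -> float:
    """Relative gap between the slowest and fastest authenticated setups."""
    authed = [seconds(t) for name, t in table.items() if name != "No auth"]
    return max(authed) / min(authed) - 1
```

On these figures, every authenticated directory import takes a little over 50 percent longer than the unauthenticated one, while the authenticated setups differ from one another by less than 3 percent.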


Hardware and performance

We wanted to determine whether hardware should be a major consideration when addressing Subversion performance issues. For this test, we hosted the Subversion server on a variety of machines and compared the time required to import binaries into each one. With a near-infinite range of machine types available, we limited our investigation to more general considerations, as shown in Table 4. Note that Desktop 1 was a shared machine, meaning that it was being utilised for other tasks in addition to hosting the Subversion repository. Desktop 2 was a dedicated machine, used only to serve the Subversion repositories.


Table 4. Input times for various machines
Machine type            Machine specification                Time
Desktop 1, shared       CPU: 2x500MHz PIII, RAM: 500MB       28m
Desktop 2, dedicated    CPU: 1x3200MHz P4, RAM: 2000MB       2m
Server, shared          0.2 CPUs                             28m 0s
Server, shared          1 CPU                                13m 19s
Server, shared          2 CPUs                               13m 19s

Hardware test results

The results in Table 4 demonstrate that the machine used to run the Subversion server hugely affects the time required to import binaries. Comparing the two desktop machines, the dedicated, powerful Desktop 2 was 14 times faster than the shared, less powerful Desktop 1. In fact, the dedicated desktop was much faster than even the large server machine, though increasing the server's CPU power initially doubled the speed of the import.

Clearly, choosing the right machine is a very important part of minimising the time taken to move binaries into and out of Subversion.


Evaluating the results

The findings of this investigation are specific to the system being investigated, so the absolute values shown are unlikely to carry over to other systems. The patterns matter more, because they are likely to recur in other Subversion systems. According to our findings, when storing a set of binaries in Subversion:

  • The most time-efficient method is to create a single, compressed file containing the binary.
  • The most space-efficient method is to use Subversion's efficient check-in script on a regular directory structure.
  • Using any form of authentication on the Subversion server will result in performance loss.
  • A dedicated, powerful machine is optimal for running Subversion.

Generally speaking, these findings serve as a framework for optimizing Subversion performance when storing and retrieving binary files. The findings are complicated by the following considerations, however:

  • The investigation into hardware was highly specific to the system being investigated. The binaries used in the test system are very large compared to those used by many projects, so the patterns identified still apply, but perhaps to a lesser extent. Given that some of the differences found are an order of magnitude or more, it is still worth weighing these findings when making decisions about your own system.

  • The investigation into authentication methods was carried out by comparing the authentication method against the time taken to move the binaries into Subversion. This measure was the most important to the system under test, but moving files into Subversion is actually the process least affected by the authentication method. Commands such as svn log, which access many different paths and revisions, are much more dependent on the authentication method. It is therefore worth noting that while the choice of authentication method is not particularly important in improving performance with binary files, it may be important in other areas.

  • The benefits of using the efficient check-in script will also be determined by the nature of the project. The amount of space required on the Subversion server when using efficient check-in is highly dependent on the amount of change that occurs in the binaries between each version checked in. The space efficiency afforded by efficient check-in is greatly increased in cases where the amount of change between versions is relatively low (as was the case in the development system under test).

  • It is questionable whether combining a directory structure into a single file, compressed or not, is a viable option for every system. One drawback of this setup is that it prevents the binary from being browsed in most circumstances. It would also prevent you from directly processing the binary in Subversion. You would have to remove the binary from Subversion and revert it to its original form before altering it. In some cases, these issues will rule out the single-file approach or render it irrelevant to the system being considered.

Optimizing Subversion

The performance conclusions in this article are broadly applicable to almost any system using Subversion to store and retrieve binary files. The decision of how to optimize Subversion must, however, be based on the circumstances and resources of a given system. For example, in the case of the system under test, time was a much higher priority than server space. It made sense, therefore, to store binaries as compressed tar files before importing them into and exporting them out of Subversion. Compressing the binaries yields the shortest possible wait times, thus meeting the requirements of the project. The repercussions for server space usage are negligible to the system, and in any case compressed tar files take up only a little more space than using efficient check-in.

In cases where time is the only consideration, combining a fast, dedicated machine with a compressed tar file of the smallest binary possible is the ideal optimization. Disabling authentication would save even more time, though most systems require authentication. The choice of authentication method does not much affect the time required to store and retrieve binaries.

If server space is the only motivator, then using efficient check-in on regular directory structures would use the least amount of space. This method becomes more beneficial if binaries change very little between versions. In systems with large binaries, time would have to be of very little concern for this to be the best solution.

In many systems, a compromise may be possible. When time is of the essence, try importing a single compressed file. This lets you put the file into Subversion as quickly as possible and also lets others export it quickly. At times when speed is not an issue, you could employ efficient check-in instead. This would be sensible in cases where the binaries are unlikely to be required but must be stored in case they are ever needed.
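The trade-offs described in this section can be summarised as a small decision helper. This is only a sketch of the article's recommendations, not a tool that exists anywhere; recommend_storage and its parameters are hypothetical, and the fallback for repositories that must remain browsable is an assumption drawn from the drawbacks of the single-file approach noted earlier.

```python
def recommend_storage(priority: str, must_browse_in_repo: bool = False) -> str:
    """Suggest a storage method for binaries based on the main constraint.

    priority: "time" when short import/export times matter most,
              "space" when server disk usage matters most.
    must_browse_in_repo: True when the binaries have to stay browsable as
              individual files inside Subversion (assumption: this rules out
              packing them into a single file).
    """
    if priority not in ("time", "space"):
        raise ValueError("priority must be 'time' or 'space'")
    if priority == "space":
        # Least server space in Table 2: efficient check-in of directories.
        return "efficient check-in of the directory structure"
    if must_browse_in_repo:
        # A single archive cannot be browsed file by file in the repository.
        return "import of the directory structure"
    # Shortest input and output times in Table 1: one compressed tar file.
    return "import of a single compressed tar file"
```

A team might call recommend_storage("time") for release binaries that everyone fetches daily, and recommend_storage("space") for archived builds that are rarely retrieved.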


In conclusion

Knowing how to effectively store binaries in Subversion can save hundreds of hours of team members' time and gigabytes of server space. Making an educated decision based on the details and requirements of the system in question is the most sensible approach to take. This article should help users and system administrators alike do just that.


Resources

Learn

Get products and technologies

Discuss
