Skip to main content

Winning Database Configurations: An IBM Informix Database Survey

Marty LurieIBM, Software Group
Marty Lurie started his computer career generating chads while attempting to write Fortran on an IBM 1130. His day job is in Systems Engineering for IBM's Informix organization, but if pressed he will admit he mostly plays with computers. His favorite program is the one he wrote to connect his Nordic Track to his laptop (the laptop lost two pounds, and lowered its cholesterol by 20%). Marty is an IBM-certified DB2 DBA, IBM-certified Business Intelligence Solutions Professional, and an Informix-certified Professional. He can be reached at lurie@us.ibm.com .

Summary:  This article presents the results of a survey that was performed to learn from a number of database deployments. Information is included about system size, capacity planning, partitioning, indexing, backup procedures, normalization, and more.

Date:  10 Jan 2002
Level:  Introductory
Activity:  564 views

© 2002 International Business Machines Corporation. All rights reserved.

Introduction

How are database servers used and configured? How much hardware does it take to process a large amount of data? The answers often lie with the database administrators (DBAs), who, through many hard-earned lessons, have found configurations that work well. In this article, I present the results of a survey on more than 60 IBM Informix servers deployed at over 40 organizations.

Why do a database survey?

There are two primary reasons for performing the survey:

  • To develop sizing metrics to define how many CPUs, how much memory, and how many disks are needed for a given workload.

  • To find out what people are actually doing with the database, including which features are being used, such as mirroring.

One of the most common questions that database administrators ask is "How many CPUs do I need for this server?" Another question is "What is the best way to back up the server?" Consultants and pre-sales tech support staff base their responses on what has succeeded at other accounts, with a liberal dose of what the product design team recommends.

By examining the existing configurations we can get a good idea of what a system is capable of handling. Using statistical methods (don't worry, this won't hurt a bit), I'll give you a formula that you can use to determine how many CPUs you need, based on the amount of data you have and the version of the IBM Informix © server you are using.

Database servers have a wide range of complex features, and DBAs sometimes use features in ways that were certainly not one of the intended uses the Informix developers had in mind. Features are typically added to products based on customer requests or programmer inspiration. Once features are added, there is no control over how they are deployed. A great example of this is a table I encountered that had over 100 columns and an index on every column. This is a legal configuration, but not a recommended one.

Of course, I'm not naming names about who had a sub-optimal configuration. There is always a reason why a server has a particular configuration, and it may be that the DBA has an additional set of responsibilities that take up the other 60 hours in his or her work week.


Survey Methodology

To have a survey, you have to have participants. The International Informix Users Group (IIUG) , was very helpful in promoting and encouraging participation in the survey. As with any survey, without a minimal level of participation the results are not meaningful. A statistician calls this "statistically significant". Many thanks to the IIUG for their help with the survey.

The survey was a shell script that ran queries and commands on the system being examined. No modifications could be made to any of the servers under test; it was a query-only process with no temp tables. The output was a series of files created in the /tmp directory. You can download the script if you are interested. The results were tar'ed into a single file and e-mailed to me.

With over 60 database instances sampled, there was quite a lot of data. Clearly this called for a database (I trust you're not surprised). It also called for a parser to extract the nuggets of information from the files. To celebrate the purchase of Informix by IBM I loaded the survey results into DB2 © . The parser scripts were written in Python. This is an up and coming challenger to Perl. It is worth experimenting with Python-- don't be put off by the use of spaces to delimit blocks of code; it actually is a good way to force better indentation style.

The survey sample was not random. The servers included were from organizations I've worked with, organizations that my colleagues have worked with, or IIUG members, I won't even try to guess how this sample has skewed the results, but in the interest of full disclosure for the statisticians in the audience, this is not a random sampling.


Theory versus practice

As I mentioned before, software is not always used as it was designed or intended to be used. I use a graphical representation to highlight when I'm presenting examples of theory versus the results of the survey, which are examples of current practices.

  • When you see the "professor" I'm describing the recommended usage of the database technology.

Professor icon
  • When you see the "multi-tasking dba", it means that these are results gathered in the survey.

Multitasking icon

Please understand I hold professors and DBAs in the highest regard. I'm not taking sides on the discussions, but you should know that I've never met a "work-around" that I didn't like. Also, the results presented here are observations, not criticisms.

Before diving into the details, let's take a look at a high level view of the demographics of the survey participants.


Demographics

Multitasking icon

Who is in the survey, and what are they running? Over 40 organizations participated in the survey and they submitted over 60 different database instances for analysis. Let's look at some interesting results based on workload, backup procedures, which server versions are used, and the up-time since last restart.

Workload

The first surprise comes from looking at workload as shown in Figure 1 . Mixed OLTP and DSS processing is a hard problem. Server resources are allocated differently for intensive decision support queries than for simple but high volume OLTP workloads. The survey finds that 39% of participants have already found an acceptable resource balance for their servers. The practitioners have found a solution to a problem that the IBM Informix product developers thought was still a theoretical discussion! It is not fair to oversimplify - to their credit, the product developers are working on new algorithms to process mixed workloads that avoid the compromises currently required for concurrent hosting of DSS and OLTP.

Workload distribution

Figure 1. Workload distribution

Backup procedures

All those transactions and queries have to be backed up, and the survey results showed an interesting pattern here as well. The robust full-featured backup utility is onbar , which is highly recommended. Surprisingly ontape , the basic, simple backup utility, is used by almost half of all servers. A poll at the IBM Informix Users Conference in the session on backup revealed an even larger percentage of ontape users. A full 5% of systems had no backup! Shocking? No, not really. For large warehouses a viable strategy for recovery is to rebuild the system from the source data files. This strategy is most effective when the data refresh methodology is to drop the database and repopulate all tables. It effectively tests the backup process every time the warehouse is refreshed. A question from the "professor": When was the last time you tested your backup?

Servers used

The sample included version 7.x (IBM Informix © Dynamic Server tm ), 8.x (IBM Informix © Extended Parallel Server tm ), and 9.x (IBM Informix Dynamic Server) products. Version 7.x is the OLTP workhorse. Version 9.x has extensibility and all of version 7.x features. Version 8.x is the data warehousing champion.

Server participation in percent

Figure 2. Server participation in percent

Data volume by server type

Figure 3. Data volume by server type

The demographics again shattered some rules of thumb. As shown in Figure 2 , of the 60+ servers in the survey, only 15% were XPS servers. But Figure 3 reveals that these handful of servers accounted for a full 60% of the data in the survey, or 8.4 terabytes. Remember, the survey is not a random sample - both larger and smaller database servers exist in the server population. Drilling down on the specific servers, we find two 7.x OLTP instances that are well over 1 terabyte in size. So much for the myth that version 7 is limited to approximately 500GB. The average server size for version 7.x and 9.x was 110GB. The average XPS instance was 900GB.

Up-time since last restart

How long has your database been up? The top ranking databases were 79, 78, and 53 days on AIX © , Solaris, and AIX, respectively. A well-managed instance can clearly run for months without a restart, even significantly longer than the servers in this survey.


System Sizing - How much computer do I need?

Professor icon

Figuring out how much hardware is needed for a particular database size is a real challenge. The classic answer from theoreticians is, "It depends. How fast did you want to go?" Fortunately, theoreticians can model the data samples from a survey to develop a formula to compute both the CPU count and the memory needed for a given amount of data. The estimates for resources are dependent on which database server you use. There really is a difference between servers, even though some would argue that databases are a commodity.

This section includes:

Average sizes: the mean and median

The simplest way to size a system is to multiply the amount of the data you have by the ratio of CPU/data to arrive at the number of CPUs you need. This approach gives a basic sizing but is subject to skew from large systems in the sample. The data is interesting because it shows a big difference between the server versions, and also a difference between the values in the survey and those used for running a competitive benchmark (see http://www.tpc.org ).

The amount of random access memory (RAM) can be modeled as the ratio of data in megabytes to RAM in megabytes. Because this is a ratio of megabytes to megabytes the value is a number, and doesn't have units (like miles per gallon, or even mHz/shekel etc).

The mean value or average is calculated by adding up all the samples and then dividing by the number of samples. The median value is calculated when the mean value didn't turn out the way you wanted. Let me restate that definition. The median value is the value such that half of the sample values are greater than the median and half of the observations are less than the median. The following table shows the average number of bytes per CPU results by server version, for production systems in the survey:

Server Version

Average GB/CPU

Median GB/CPU

Sample Size

7.x and 9.x

17.6

8.5

36

8.x (XPS)

58.3

62.2

10

Interestingly, the XPS server clearly has larger data volume capabilities. If we look at the distribution of observed values: the 7.x/9.x mean was raised by a few large server instances. The XPS mean was lowered by a few small instances.

The memory ratio, comparing the RAM to the data volume is shown in the table below:

Server Version

Average RAM in GB/Data in GB

Median RAM in GB/Data in GB

7.x and 9.x

0.069

0.041

8.x (XPS)

0.022

0.013

TPC-H XPS Benchmark

0.128

In a benchmark environment, more RAM is used to enhance performance. There is a significant difference between the RAM ratios deployed in customer sites as compared to that used in the TPC-H benchmark. When a database server runs out of RAM, it mimics demand paging and uses disk as a temporary storage area. This carries a performance penalty.

One could easily size a system using the above average values, but we would miss the opportunity to provide a much better estimating tool.

A better system sizer: regression analysis

Using a straight-line approximation of the relationship between the amount of data and the number of CPUs gives a very good estimating tool for system sizing. This is called regression analysis. (See Related information: What is regression analysis?) For the programmers and quality testers in the audience, this regression analysis has nothing to do with "programming regression," which is when an old bug finds a new home in the latest release.

Figure 4 shows the regression model for XPS:

The regression model for XPS

Figure 4: The regression model for XPS

Let's examine this graph. As the legend shows, the actual CPUs/data points from the survey are plotted using a diamond on the graph in Figure 1. The "predicted values", shown in Figure 1 as square symbols, are described by the following formula:

XPS_CPUs = 0.01102 * Data_in_GB + 4.7

If this is confusing please go and read the side bar on regression. It really will help you understand this model.

Along the X axis of the graph is the amount of data in gigabytes. The Y axis shows the number of CPUs. The largest value (labeled point 1 ) is an example of a surveyed customer using more CPUs than the model predicts. For 2.3 terabytes (2300 gigabytes), the model suggests using about 30 CPUs. The actual number of CPUs used by the customer is 40. This particular site has very complex SQL, and they run many queries at the same time, so it makes sense for them to have more CPUs than predicted.

For the next smallest server, at 1.8 terabytes, the 24 CPUs suggested by the model is equal to the actual 24 CPUs in use. The square and diamond fall on the same place on the graph.

How good is the model? The R 2 for this model is 64%. This means that if you use the above model to choose the number of CPUs, 64% of the differences in the values among XPS sites can be explained by the model. There is a 64% probability that you are choosing the correct value, where "correct" is defined by what other sites are using.

What about the other 36% of differences between the sites? Different sites have different requirements for response time, and the complexity of queries varies greatly between sites. For some strange reason, budget allocations also play a large role in defining the number of CPUs.

The same regression technique applied to the 7.x and 9.x servers reveals the model shown in Figure 5 . But this model is overly optimistic, and we'll see why in the next section.

The regression model for Versions 7.x and 9.x

Figure 5: The regression model for Versions 7.x and 9.x

The regression formula is:

7.x/9.x_CPUs = 0.0106059 * Data_in_GB + 3.9

This formula is misleading because it calls for fewer CPUs than XPS. The very large servers have skewed the regression analysis to the wrong conclusion.

We now have to ask the question: Is there any statistical relationship in the 7.x and 9.x data, or are we looking at a scatter plot that could just as well be a shotgun blast?

Another look at 7.x and 9.x servers with ANOVA

The regression model for 7.x and 9.x has a problem that is illustrated in the graph: There is a heavy cluster of results under 300 gGigabytes of data, and the R 2 is not robust. Let's use another analysis tool to look for any statistical relationship. The analysis of variance (ANOVA) test asks the basic question: Is there a statistical difference in the data on a server based on looking at the number of CPUs? If the ANOVA analysis has a significant F value, then there is a difference in the data volumes; if the F value is not significant, then the average value of volume of data for different CPU configurations is the same. We sure hope not!

Looking at the ANOVA summary helps us understand why the regression analysis is too optimistic for CPU capacity. The average gigabyte value of 137.7 for a system with 4 CPUs has a significant spike in comparison to its neighboring values, as does the average value for 7 CPUs and 24 CPUs. The 4 CPU spike is very aggressive-- these systems are driving more data, on average, than some 10 CPU deployments. This highly skews the regression analysis curve.

Anova: Single Factor

SUMMARY

Groups

Count

Sum

Average

Variance

2 CPUs

10

132.3

13.2

63.3

3 CPUs

8

262.4

32.8

755.5

4 CPUs

5

688.4

137.7

10848.0

5 CPUs

3

145.2

48.4

5100.9

6 CPUs

1

36.5

36.5

N/A

7 CPUs

3

956.9

319.0

233203.9

8 CPUs

2

98.0

49.0

702.2

10 CPUs

2

159.2

79.6

2432.5

16 CPUs>

2

805.5

402.7

198101.1

24 CPUs

1

1798.3

1798.3

N/A

The ANOVA test indicates that there is a difference between the volume of data that different numbers of CPUs can handle. This seems pretty obvious, but it is nice when we have a value of F=13.6 compared to a Fcritical=2.3, which is the ANOVA calculation validating our instinct. If the ANOVA failed, it would indicate a lack of scalability.

What's a DBA to do?

What can you do to best model 7.x and 9.x CPU requirements? The regression model will give a very lean CPU count, and you may need to add additional processors to the configuration if performance is not adequate. The ANOVA summary chart gives more granular averages. If you have 50 gigabytes of data, you could choose from a 4-way SMP up to an 8-way SMP based on the averages.

These formulas are models. They are not carved in stone. Your mileage will vary, and you may have a set of good reasons not to match the average ratio of CPUs to data. The models do not control for a number of factors, including CPU speed. If you are getting the latest processors with over 1 GigaHz clock speeds, then clearly fewer CPUs are required. Because system sizing isn't an easy problem to solve, DBAs who can exercise good judgment are in high demand.


Some principles of database design

Professor icon

We hold some truths of database design to be self-evident:

  • Data is stored in tables
  • Tables should be normalized for OLTP processing.
  • For performance some tables should be denormalized.
  • Indexes should be created to enhance performance.
  • Partition or fragment tables for performance.
Multitasking icon

The survey showed some interesting results in the database design area.

Data is stored in tables

If data is stored in tables, one would expect that tables would contain data. Surprisingly, in our survey, 37,000 tables contained data, while 120,000 existing customer tables were empty! This mystery is easily solved. Several servers run third-party Materials Requirements Planning (MRP) applications. An apparel business will not have many rows in the tires inventory table. In many cases, vendors standardize the schema for their MRP package across several industries, so many tables are empty.

Normalization and indexing

As expected, the OLTP databases had significantly more tables than decision support databases-- an average of 616 for OLTP compared to 106 for DSS. As tables get denormalized for decision support, the table count goes down.

The big surprise in database physical design is the number of indexes used. The OLTP applications averaged 1.5 indexes per table. The decision support environments averaged 1 index for every 5 tables! This equates to 0.2 indexes per table. Heresy!

The answer to this anomaly is the IBM Informix hash-join technology. Table scans and hash joins are so fast that creating an index is a waste of time and disk space unless you have simple lookup queries that return only a handful of rows.

Partitioning

To avoid being limited by disk input/output, also known as I/O-bound , IBM recommends using three partitions for each CPU for IBM Informix servers. This allows three disks to feed a CPU and to avoid making the CPU wait for the disks to finally fetch the data it requested.

Surprisingly, only 20% of the tables were partitioned (also know as fragmented, where "fragmented" means partitioned which is good, as opposed to disk fragmentation which is bad). The systems in the survey can deliver better performance, more efficiently, by using data fragmentation more extensively.

A good thing can go too far. One table in the survey had 3669 fragments. A DBA somewhere is working really hard! Why the effort? If the fragments are small, and the optimizer knows which fragment it needs, small lookups can be very fast. This particular server can deliver response times of less than one minute for queries against 300GB of data with no indexes ! The tradeoff is the additional workload of creating all these partitions, but it does avoid the entire overhead and disk space of indexes.

The process of scanning only the disk fragments needed for the query is known as "fragment elimination". The example above should be called "ultra-fragment elimination".


Some additional tuning parameters

Multitasking icon

The survey script queried system tables and found a number of settings that were interesting. Internal estimates for how many of these features are really used by customers vary widely, and the survey, while not a random sample, provides much better insights.

  • IBM Informix mirroring is enabled on 40% of servers: The IBM Informix server can mirror disks internally, so that a disk failure doesn't bring the server down. Most modern UNIX operating systems have their own mirroring capabilities. It is surprising that this large a percentage of servers have the feature enabled.

  • NOAGE is used on 65% of servers: The NOAGE parameter directs the operating system not to reduce the database server priority even if it has used more CPU cycles than other processes. It is a popular way of keeping the database at a more privileged level in the operating system scheduler.

  • Forced residency is enabled on 58% of servers. When forced residency is enabled, the database cannot be swapped out of RAM. Bad things generally happen when memory is over-committed. Having the database pinned in memory provides better performance, but it requires system level discipline to avoid other applications using significant memory and thus getting frequently swapped. Other applications will have less real memory to work with since the database can't be swapped, and they will suffer accordingly. If the database owns a significant part of the real memory, it may be better to host other applications on a different platform, or to be sure that the other applications will never ask for more memory than remains available.

  • A non-default fill factor is used on 24% of servers. The index fill factor parameter, FILLFACTOR in the onconfig file, specifies what percent of the B-Tree index space should be saved for new entries in the index. Some servers have reserved more space than the default in the index B-Trees for new entries; one server wanted to save on disk and reserved zero space for new inserts.


Conclusions

The survey provided interesting insights on how IBM Informix servers are configured and used. The regression model provides a very useful sizing tool for defining the number of CPUs needed for a given volume of data.

The survey only provides guidelines, not rules. I hope you find this information helpful, and I am interested in your feedback.

Top of page


Related information

What is regression analysis?

Regression analysis provides a formula to estimate a relationship - in our case, the number of CPUs for a given amount of data. For simple linear, or straight-line, regression the formula is:

Y = m*X + B

Or in our case:

CPUs = m*amount_of_data + B

Where m is the relationship, and B is a constant. Why would you have a non-zero constant? Computers for production environments are typically multi-processor machines (SMP). So even with small amounts of data, many environments have a 4-CPU or 8-CPU minimum server configuration.

How good is your regression model? This is measured by a value called the R 2 (pronounced "are-square"). An R 2 of 75% means that 75% of the differences in the number of CPUs can be explained by the "m" and "B" model we have created.

Return to previous section | Top of page


About the author

Marty Lurie started his computer career generating chads while attempting to write Fortran on an IBM 1130. His day job is in Systems Engineering for IBM's Informix organization, but if pressed he will admit he mostly plays with computers. His favorite program is the one he wrote to connect his Nordic Track to his laptop (the laptop lost two pounds, and lowered its cholesterol by 20%). Marty is an IBM-certified DB2 DBA, IBM-certified Business Intelligence Solutions Professional, and an Informix-certified Professional. He can be reached at lurie@us.ibm.com .

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=14076
ArticleTitle=Winning Database Configurations: An IBM Informix Database Survey
publish-date=01102002
author1-email=
author1-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers