Read the article, or skip right to the installation directions below.
What is MPP and what does it have to do with my Linux computer?
Massively parallel processing (MPP) holds the key to making many inexpensive computers work together to solve a large problem. Some problems are very difficult to break into pieces and farm out to a cluster of machines. We are fortunate that relational databases lend themselves to parallelization.
Partitioning and clustering are the keys to database scalability. You can get started now with low-cost technologies and refute any argument that MPP means massive purchase price. This article shows you how to get DB2® Extended Enterprise Edition (EEE) Version 7.2 running on Linux on your laptop. Then we"ll explore how to set up two database partitions on a single computer (known as multiple logical nodes) and create tables partitioned across these nodes.
These nodes are how a multi-million dollar cluster processes data in DB2 EEE. But a single CPU machine can run multiple nodes and simulate having computers connected together in a cluster.
Parallel cluster multiprocessing is a very exciting area of computing, and learning about partitioning and clusters now is a great way to get ready for the future.
The history and technology of parallel database
A very brief history of MPP database technology
One of the first successful designs of parallel databases dates back to 1986, see [Dewitt 1986]. It took some time for these databases to become a commercial reality - with products on UNIX® platforms in the mid 1990s. MPP databases and Linux combined to conquer a commercial data warehousing benchmark for the first time in 2001. The combination of low cost Intel based hardware, IBM's DB2 EEE, and Linux were used to complete the TPC-H data warehousing benchmark.
How does a parallel database work?
There are two main approaches to making multiple computers work on the same problem at the same time.
Shared nothing parallel architecture
A 'shared nothing' architecture means that each computer has its own CPU, memory, and disk. The computers are connected together with a high speed interconnect as shown in Figure 1. When processing a query, each node works on the rows in a table that it has locally, and then passes the node"s partial result back to the coordinator node. The coordinator combines all the results from all the nodes into the final result set. The nodes do not have to be independent computers - multiple partitions can live on a single computer.
Figure 1. Shared nothing architecture
The high speed interconnect is critical to the performance of any computer cluster (database or not) See my previous DB2 Developer Domain article on this topic to understand the performance impact on a clustered relational database when changing from a 100 megabit switch to a 10 megabit hub. Current interconnect technology, Infiniband, is vastly faster than 100 megabits and offers performance between 500 megabytes to 6 gigabytes per second, which is ample bandwidth.
Distributed lock (shared disk) parallel architecture
A shared disk architecture (Figure 2) uses a lock manager to arbitrate among the computers, which all access a common disk pool. This architecture is fundamentally flawed because as the number of nodes increases, the cluster drowns in lock requests. Faster hardware hides a multitude of sins, but even with lock caching at each node, this approach can"t defy the laws of physics and will not scale without specific hardware to do the job as is implemented on DB2 for z/OSÂ with Parallel Sysplex® technology.
Figure 2: A shared disk approach with a lock manager cannot scale
Database access in this environment requires that each node request a block of data on the shared disk. If another node already has this data locked for update, the requesting node must wait for the updating node to finish. This is a ripe environment for the lock condition known as a 'deadly embrace'. Node 1 locks resource A and asks for resource B. Node 2 has a lock on resource B and requests a lock on resource A. Neither node releases their lock, so both sit and wait forever.
Data partitioning in a shared nothing environment
You already know that each node in the cluster processes its local data and passes its work back to the query coordinator node. But how did each node get assigned a subset of the data in the relational table? This is what hash partitioning does in an MPP cluster.
To distribute a table across multiple nodes, a partitioning key is selected. If a primary key exists this is a very good choice for a partitioning key. A column with very low cardinality, for example gender with only two values, "M" and "F", is a poor partitioning key.
The table partitioning is specified in the CREATE TABLE statement. Here is an example:
CREATE TABLE marty ( foo INTEGER ) PARTITIONING KEY(foo) USING HASHING;
The "marty" table will be spread across all the nodes in the MPP cluster (or virtual cluster) based on calculating the hash value of the column named foo. When the hash value is calculated, the database computes which node (or partition, in the case of a virtual cluster) owns the rows for that particular hash value and the rows are stored on the appropriate computer. You"ve probably determined that the hash algorithm must be very fast.
A low cardinality column is a poor choice for partitioning a table because it will result in very few unique values. If there are 10 nodes in an MPP cluster and only 5 unique values in the partitioning column, only 5 nodes will have data, and possibly less! But don"t worry, multiple columns can be used for a partitioning key.
Let's try it! A step-by-step installation guide
Let"s try simulating the shared nothing approach on a single computer using DB2 EEE, which is a shared-nothing relational database. A uni-processor computer is capable of providing a test environment that fully simulates a shared-nothing cluster.
Each of the installation steps below begins with a link to descriptive information found in the step-by-step commentary section that follows. (A reciprocal link at the beginning of the step-by-step commentary will return you to the installation step.) If you run into problems, see Troubleshooting. After you have finished installing DB2, you can partition data between the two partitions (simulated nodes) in your database server. You can use the schema and program available in the Download section to populate the table if you want something quick and easy to start with.
- Install Linux on your Intel architecture machine. It can be a laptop or any other PC. The tests for this article were run on SuSE 7.2 and Red Hat 7.0, 7.1, and 7.2.
- Change inetd/xinetd to accept rsh.
- On SuSE, edit the /etc/inetd.conf and uncomment the rsh lines. Make sure to restart inetd after making the changes to the configuration file.
- On Red Hat edit the file /etc/xinetd.d/rsh and change "disable = yes" to "disable = no".
As root, restart xinetd (your exact file names may vary) using the following sequence of commands:
# /etc/rc.d/rc3.d/S56xinetd stop # /etc/rc.d/rc3.d/S56xinetd start
- Change the kernel parameter for SuSE 7.2, or for Red Hat 7.1 and 7.2, as follows:
sysctl -w kernel.shmmax=1073741824 sysctl -w kernel.msgmni=1024 sysctl -w fs.file-max=8192 sysctl -w kernel.sem='250 32000 32 1024'
- Install public domain ksh (pdksh):
rpm -i pdksh-5.1.14-8.i386.rpm
Your release of pdksh may vary.
- Install DB2 EEE v7.2.
- Logged in as root, install the DB2 EEE v7.2 product using the following command:
- Select the option to create a sample database and an administrative server. Use the defaults for user IDs, because these instructions are based on the defaults. I prefer not to take the 'auto-start at reboot' option.
- Logged in as root, install the DB2 EEE v7.2 product using the following command:
- Edit the /etc/services file and add the line below shown in bold:
DB2_db2inst1 60000/tcp # Connection port for DB2 instance db2inst1 # line below added to provide ports for MLN DB2_db2inst1_END 60004/tcp # end of range of MLN ports # make sure the next port is greater than the 60004 above # no overlaps are allowed
- Log in as user db2inst1, and set up rsh for this user. Create a file .rhosts with permissions 600 and the following entries:
Test the rsh setup using the following command:
db2_all echo hi
The output will be the string "hi" echoed to standard out (stdout).
- Test your installation by entering the following sequence of DB2 commands:
db2start db2 connect to sample db2 connect reset db2stop
These commands start, connect you to the sample database, and then stop the database.
- With the database stopped, edit the file ./sqllib/db2nodes.cfg and add another virtual node. My file on a host with TCP/IP name "mikado" with two logical nodes looks like this:
0 mikado 0 1 mikado 1
- Use the db2start command to start your multiple logical nodes. Your output will look like this:
[db2inst1@mikado db2inst1]$ db2start 01-01-2002 20:30:38 1 0 SQL1063N DB2START processing was successful. 01-01-2002 20:30:39 0 0 SQL1063N DB2START processing was successful. SQL1063N DB2START processing was successful. [db2inst1@mikado db2inst1]$
- Create a table that is partitioned across the logical nodes. If you need a refresher on partitioning concepts see Data partitioning in a shared nothing environment.
connect to sample; CREATE TABLE marty ( foo INTEGER ) PARTITIONING KEY(foo) USING HASHING;
- Insert some data into the table. You can use the Java sample program to do this or do it yourself.
- Look at your partition distribution using the following SQL statement:
select partition ( foo) as Partition_Hash, count(*) as Hash_Count from marty group by partition ( foo) order by count(*) desc fetch first 10 rows only ;
- (The last step.) Congratulations! Enjoy the libation of your choice!
- Install Linux on your Intel architecture machine.
The installation process for Linux is vastly improved over earlier editions. I find it much easier than Windows®, with only one reboot required. A complete development environment is installed in one process rather than lots of add-ons.
Pick a release of SuSE (www.suse.com) or Red Hat (www.redhat.com), or whatever, but stick with it and learn how to administer it well. The Red Hat version 7.2 has updated lilo, the boot manager, to recognize a Windows 2000® partition for dual boot. This is much less work than using the Windows boot manager, which requires taking the first 512 bytes of the boot sector and making it an option in the boot.ini file. Red Hat 7.1 also works with EEE, but the PCMCIA drivers in the standard distribution caused a kernel panic on one machine. Both 7.1 and 7.2 come with the new Linux kernel, version 2.4. This has lots of significant improvements, and it is worth the effort of getting the new distribution.
I prefer the custom install option, which permits file system allocations. It also provides screens to select which software packages you want installed. Plan on 1.3 GB of space for a very comprehensive Linux image. This includes a complete development environment, Web server, Perl, Python, etc.
I was pleasantly surprised to find IP version 6 entries in the /etc/hosts file on a default installation of SuSE. As we run out of addresses on the Internet, getting people going on ipv6 is a good thing since it offers several orders of magnitude more addresses. I also tested SuSE"s boot manager against a Windows XP partition and it boots to either Linux or Windows without any issues.
See www.linuxdoc.org for an excellent collection of Howto"s on most configuration questions.
- Change xinetd to accept rsh.
The current Linux releases have done a good job of bolting down the network access to a new installation. But this means you do have to go and explicitly enable the features you want to use. You may think this is overly paranoid, but you"d be wrong. I left an old slow 486 PC hooked to my cable modem with telnet and ftp available. Sure enough, someone broke into the machine. Even an old tired box is not immune to hackers" attention.
The /etc/rc.d/rc3.d directory contains all the scripts that are run at run-level 3. Each script that starts with "S" is run with the parameter "start". You can browse these scripts to get a better understanding of how all those messages appear at startup and shutdown and what they really mean.
- Change the kernel parameter for SuSE or Red Hat 7.1 and 7.2.
This is luxury! You can make changes without having to recompile the kernel. The troubleshooting section describes the error messages you"ll get if you don"t make this change. You will be able to start the engine, but attempts to connect to the database will fail without increasing the parameter msgmni.
- Install public domain ksh.
The db2_all script is dependent on public domain ksh. You can work around this by linking the bash executable and calling it ksh, but this subterfuge only works for some of the commands. Go ahead and install pdksh. Stay in bash for everything you do if that is what you prefer.
- Logged in as root, install DB2 EEE v7.2.
The best document to work with is the quick start - see:
DB2 for UNIX Quick Beginnings GC09-2970, which has good step-by-step installation information.
The installation screen looks like Figure 3:
Figure 3. DB2 installation screen on Linux
Please take all the defaults for user IDs, because the remaining instructions use the defaults. Figure 4 shows the default user for both the database server and for the admin server.
Figure 4. Default user IDs
- Edit the /etc/services file and add the line in bold.
Each logical node needs a communications port to listen for connection requests and to talk to the other nodes. Remember that when a node processes its local data, it needs to communicate with the other nodes to pass back the result set. The range of ports used by the logical nodes are reserved with the _END syntax in /etc/services.
You can"t start more than one logical node without the additional listener ports!
- Log in as user db2inst1 and set up rsh for this user.
For the nodes to talk they have to trust each other. You can use a .rhosts file or have the system admin (that would be you on your own Linux box) put entries in /etc/hosts.equiv.
- Test your installation.
At this point you have connected to a single node instance of DB2 EEE. Your next step is to shut down the instance and add another logical node. If you have trouble connecting, make sure the database called 'sample' exists.
- With the database stopped edit the file ./sqllib/db2nodes.cfg.
This is how you define an additional node. The first number is the node number, the next entry is the TCP/IP node name, and the third entry is the port that the node will use to communicate. This is an offset to the port number in /etc/services for the server instance. Don"t worry, in the example file shown I"m not trying to use the real port 0 and the real port 1 for DB2.
- Start up your multiple logical nodes.
Congratulations! You"ve done it. You are now simulating two different computers running in a shared nothing cluster. As the database comes up it will create a copy of the data for the second node.
- Create a table that is partitioned across the nodes. Notice the new syntax:
PARTITIONING KEY(foo) USING HASHING;
This is how you specify which column is used to partition the data. Remember the values in the column foo should have high cardinality (a fancy way of saying not a lot of duplicate values).
- Insert some data into the table.
If you"d rather use another method for inserting rows, please feel free. This program inserts an incrementing integer into the column foo.
- Look at your partition distribution.
The query shown in the step shows you how the rows are distributed among the 4096 hash buckets. The hash bucket grows as more key values hash to the particular bucket. This means that some hash buckets can have more values than others. This is inherent in any hash algorithm and doesn"t have a negative impact unless there is dramatic hash skew. In the event of significant skew, you"ll probably want to look at the cardinality of your partitioning key and pick something else that has more distinct values.
select partition ( foo) as Partition_Hash, count(*) as Hash_Count from marty group by partition ( foo) order by count(*) desc fetch first 10 rows only ; PARTITION_HASH HASH_COUNT -------------- ----------- 2626 17 3464 15 307 13 329 13 746 13 1913 13 107 12 492 12 847 12 1021 12 10 record(s) selected.
- (The last step.)
Congratulations! Enjoy the libation of your choice! If you got here with no errors I"m thrilled. If not, I hope the section that follows minimizes your pain.
Just in case you miss a step, here is a reference for various error messages, and the steps required to fix them.
SQL6031N Error in the db2nodes.cfg file at line number "2". Reason code "12".
Diagnosis: To look up the text of this error use the db2 command environment
$ db2 ? SQL031N
Reason code '12' states: (12) Ensure that you only use port values that are specified in the services file (/etc/services file on UNIX-based systems) for your instance. Solution: To fix the problem update the /etc/services per Step 6 described above.
This message can occur during start and stop processing as described in Step 10. The source of the problem is back in Step 2. The other possible problem that can cause this message is pdksh problem which is discussed below.
[db2inst1@poohbah db2inst1]$ db2start 12-31-2001 09:29:32 0 0 SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing. 12-31-2001 09:29:32 1 0 SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing. SQL1032N No start database manager command was issued. SQLSTATE=57019 SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
The nodes can"t talk to each other. The nodes must communicate in order transmit the SQL queries and to receive back the result sets.
The full text of the error message appears below:
Explanation: A TCP/IP communication error occurred while the START or STOP DATABASE MANAGER command was trying to establish connection with all the nodes defined in the sqllib/db2nodes.cfg file.
User Response: Do the following:: Ensure that the node has the proper authorization defined in the .rhosts or the host.equiv files.
If using multiple logical nodes, ensure the DB2NODE environment variable is set correctly. The DB2NODE environment variable indicates the node the application will attempt to connect to. DB2NODE must be set to the node number of one of the nodes defined on the same host as the application.
Add a .rhosts file in the db2inst home directory and verify rsh is running by entering the following command:
$ rsh localhost echo hi
The text "hi" will appear on stdout. You can further verify the correct function with
$ db2_all echo hi
You should get the string "hi" on stdout, and the successful operation of db2_all indicates that you have also properly installed pdksh. If db2_all fails, read the next error messageÂ
We"ve seen this error message as possibly due to a problem with the rsh setup in Step 2. This error can also occur if pdksh has not been installed as described in Step 4.
[db2inst1@poohbah db2inst1]$ db2start
12-31-2001 10:21:42 0 0 SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
12-31-2001 10:21:42 1 0 SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
SQL1032N No start database manager command was issued. SQLSTATE=57019
These errors are due to missing /bin/ksh.
Install public domain Korn shell per below:
rpm -i /mnt/cdrom/RedHat/RPMS/pdksh-5.2.14-13.i386.rpm
The database starts up, but when connecting an error occurs. The errors differ between the Red Hat 7.1 and Red Hat 7.2 platforms.
Kernel parameter error on Red Hat Linux 7.2:
[db2inst1@poohbah db2inst1]$ db2start 01-03-2002 10:41:45 1 0 SQL1063N DB2START processing was successful. 01-03-2002 10:41:46 0 0 SQL1063N DB2START processing was successful. SQL1063N DB2START processing was successful. [db2inst1@poohbah db2inst1]$ db2 connect to sample SQL0902C A system error (reason code = "") occurred. Subsequent SQL statements cannot be processed. SQLSTATE=58005 [db2inst1@poohbah db2inst1]$
Kernel parameter error on Red Hat Linux 7.1:
[db2inst1@mikado db2inst1]$ db2start 01-03-2002 11:19:14 1 0 SQL1063N DB2START processing was successful. 01-03-2002 11:19:16 0 0 SQL1063N DB2START processing was successful. SQL1063N DB2START processing was successful. [db2inst1@mikado db2inst1]$ db2 connect to sample SQL1084C Shared memory segments cannot be allocated. SQLSTATE=57019 [db2inst1@mikado db2inst1]$
The Linux kernel parameters are set incorrectly.
This is fixed by changing the kernel parameters as described in Step 3.
You can automate this change for reboot by updating the /etc/sysctl.conf file.
Verify successful operation as follows:
[db2inst1@poohbah db2inst1]$ db2 connect to sample Database Connection Information Database server = DB2/LINUX 7.2.0 SQL authorization ID = DB2INST1 Local database alias = SAMPLE
On both Linux and on Windows 2000 you may get the following errors:
C:\>db2 connect to sample C:\>db2 connect to sample SQL30081N A communication error has been detected. Communication protocol being used: "TCP/IP". Communication API being used: "SOCKETS". Location where the error was detected: "". Communication function detecting the error: "connect". Protocol specific error code(s): "10060", "*", "*". SQLSTATE=08001 C:\>db2 connect to grommit SQL1013N The database alias name or database name "GROMMIT" could not be found. SQLSTATE=42705
The database you are trying to connect to doesn"t exist.
For both messages, use the following commands to list all the names of the databases in the instance:
db2 attach to yourinstancenamehere db2 list database directory
Connect to one that exists, or create a new one as follows:
db2 create database sample
Be sure you are connected to the correct instance. If you have any doubts, enter the following command to see all the instance names:
I hope this has sparked your interest in MPP databases. It is an exciting area of database computing, and the Linux environment makes it very accessible to everyone.
If you have more than one Linux machine and want to build a cluster please see Infiniband opens the throttle, which describes the use of the Infiniband interconnect technology, but you can run your cluster over Ethernet to get started.
I"m very interested in your feedback and hope to write another article soon on some of the key issues of MPP database processing.
|Schema and program Java file||InsertRows.java||4KB|
- [Dewitt 1986] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High Performance Dataflow Database Machine. VLDB 1986: 228-237 at http://dblp.uni-trier.de/db/conf/vldb/DeWittGGHKM86.html.