Developing bulk loader programs

This topic explains how to develop a bulk loader program that can be run from IBM® SPSS® Modeler to load data from a text file into a database. Note that this is not covered under a standard Technical Support agreement, and you should contact an IBM Services representative for assistance.

Using Python to build bulk loader programs

By default, IBM SPSS Modeler searches for a default bulk loader program based on the database type. See Table 1.

You can use the script test_loader.py to assist in developing batch loader programs. See the topic Testing bulk loader programs for more information.

Objects passed to the bulk loader program

IBM SPSS Modeler writes two files that are passed to the bulk loader program.

  • Data file. This contains the data to be loaded, in text format.
  • Schema file. This is an XML file that describes the names and types of the columns, and provides information on how the data file is formatted (for example, what character is used as the delimiter between fields).

In addition, IBM SPSS Modeler passes other information such as table name, user name and password as arguments when invoking the bulk load program.

Note: To signal successful completion to IBM SPSS Modeler, the bulk loader program should delete the schema file.

Arguments passed to the bulk loader program

The arguments passed to the program are listed in the following table.

Table 1. Arguments passed to bulk loader
Argument Description
schemafile Path of the schema file.
data file Path of the data file.
servername Name of the DBMS server; may be blank.
databasename Name of the database within the DBMS server; may be blank.
username User name to log into database.
password Password to log into database.
tablename Name of the table to load.
ownername Name of the table owner (also known as schema name).
logfilename Name of the logfile (if blank, no log file is created).
rowcount Number of rows in the dataset.

Any options specified in the Extra loader options field on the DB Export Advanced Options dialog box are passed to the bulk loader program after these standard arguments.

Format of the data file

Data is written to the data file in text format, with each field being separated by a delimiter character that is specified on the DB Export Advanced Options dialog box. Here is an example of how a tab-delimited data file might appear.

48  F  HIGH    NORMAL  0.692623  0.055369  drugA
15  M  NORMAL  HIGH    0.678247  0.040851  drugY
37  M  HIGH    NORMAL  0.538192  0.069780  drugA
35  F  HIGH    HIGH    0.635680  0.068481  drugA

The file is written in the local encoding used by IBM SPSS Modeler Server (or IBM SPSS Modeler if not connected to IBM SPSS Modeler Server). Some formatting is controlled via the IBM SPSS Modeler stream settings.

Format of the schema file

The schema file is an XML file which describes the data file. Here is an example to accompany the preceding data file.

<?xml version="1.0" encoding="UTF-8"?>
<DBSCHEMA version="1.0">
	<table delimiter="\t" commit_every="10000" date_format="YYYY-MM-DD" time_format="HH:MM:SS" 
append_existing="false" delete_datafile="false">
		<column name="Age" encoded_name="416765" type="integer"/>
		<column name="Sex" encoded_name="536578" type="char" size="1"/>
		<column name="BP" encoded_name="4250" type="char" size="6"/>
		<column name="Cholesterol" encoded_name="43686F6C65737465726F6C" type="char" size="6"/>
		<column name="Na" encoded_name="4E61" type="real"/>
		<column name="K" encoded_name="4B" type="real"/>
		<column name="Drug" encoded_name="44727567" type="char" size="5"/>
	</table>
</DBSCHEMA>

The following two tables list the attributes of the <table> and <column> elements of the schema file.

Table 2. Attributes of the <table> element
Attribute Description
delimiter The field delimiter character (TAB is represented as \t).
commit_every The batch size interval (as on the DB Export Advanced Options dialog box).
date_format The format used for representing dates.
time_format The format used for representing times.
append_existing true if the table to be loaded already contains data; false otherwise.
delete_datafile true if the bulk load program should delete the data file on completion of loading.
Table 3. Attributes of the <column> element
Attribute Description
name The column name.
encoded_name The column name converted to the same encoding as the data file and output as a series of two-digit hexadecimal numbers.
type The data type of the column: one of integer, real, char, time, date, and datetime.
size For the char data type, the maximum width of the column in characters.