Developing bulk loader programs
This topic explains how to develop a bulk loader program that can be run from IBM® SPSS® Modeler to load data from a text file into a database. Note that this is not covered under a standard Technical Support agreement, and you should contact an IBM Services representative for assistance.
Using Python to build bulk loader programs
You can use the script test_loader.py to assist in developing batch loader programs. See the topic Testing bulk loader programs for more information.
Objects passed to the bulk loader program
IBM SPSS Modeler writes two files that are passed to the bulk loader program.
- Data file. This contains the data to be loaded, in text format.
- Schema file. This is an XML file that describes the names and types of the columns, and provides information on how the data file is formatted (for example, what character is used as the delimiter between fields).
In addition, IBM SPSS Modeler passes other information such as table name, user name and password as arguments when invoking the bulk load program.
Note: To signal successful completion to IBM SPSS Modeler, the bulk loader program should delete the schema file.
Arguments passed to the bulk loader program
The arguments passed to the program are listed in the following table.
Argument | Description |
---|---|
schemafile
|
Path of the schema file. |
data file
|
Path of the data file. |
servername
|
Name of the DBMS server; may be blank. |
databasename
|
Name of the database within the DBMS server; may be blank. |
username
|
User name to log into database. |
password
|
Password to log into database. |
tablename
|
Name of the table to load. |
ownername
|
Name of the table owner (also known as schema name). |
logfilename
|
Name of the logfile (if blank, no log file is created). |
rowcount
|
Number of rows in the dataset. |
Any options specified in the Extra loader options field on the DB Export Advanced Options dialog box are passed to the bulk loader program after these standard arguments.
Format of the data file
Data is written to the data file in text format, with each field being separated by a delimiter character that is specified on the DB Export Advanced Options dialog box. Here is an example of how a tab-delimited data file might appear.
48 F HIGH NORMAL 0.692623 0.055369 drugA
15 M NORMAL HIGH 0.678247 0.040851 drugY
37 M HIGH NORMAL 0.538192 0.069780 drugA
35 F HIGH HIGH 0.635680 0.068481 drugA
The file is written in the local encoding used by IBM SPSS Modeler Server (or IBM SPSS Modeler if not connected to IBM SPSS Modeler Server). Some formatting is controlled via the IBM SPSS Modeler stream settings.
Format of the schema file
The schema file is an XML file which describes the data file. Here is an example to accompany the preceding data file.
<?xml version="1.0" encoding="UTF-8"?>
<DBSCHEMA version="1.0">
<table delimiter="\t" commit_every="10000" date_format="YYYY-MM-DD" time_format="HH:MM:SS"
append_existing="false" delete_datafile="false">
<column name="Age" encoded_name="416765" type="integer"/>
<column name="Sex" encoded_name="536578" type="char" size="1"/>
<column name="BP" encoded_name="4250" type="char" size="6"/>
<column name="Cholesterol" encoded_name="43686F6C65737465726F6C" type="char" size="6"/>
<column name="Na" encoded_name="4E61" type="real"/>
<column name="K" encoded_name="4B" type="real"/>
<column name="Drug" encoded_name="44727567" type="char" size="5"/>
</table>
</DBSCHEMA>
The following two tables list the attributes of the
<table>
and <column>
elements of the schema file.
Attribute | Description |
---|---|
delimiter
|
The field delimiter character (TAB is represented as \t). |
commit_every
|
The batch size interval (as on the DB Export Advanced Options dialog box). |
date_format
|
The format used for representing dates. |
time_format
|
The format used for representing times. |
append_existing
|
true if the table to be loaded already contains data; false
otherwise. |
delete_datafile
|
true if the bulk load program should delete the data file on completion of
loading. |
Attribute | Description |
---|---|
name
|
The column name. |
encoded_name
|
The column name converted to the same encoding as the data file and output as a series of two-digit hexadecimal numbers. |
type
|
The data type of the column: one of integer , real ,
char , time , date , and
datetime . |
size
|
For the char data type, the maximum width of the column in
characters. |