There are two basic types of hashed file that you might
use in these circumstances: static and dynamic.
- Static Files. These are the most performant
if well designed. If poorly designed, however, they are likely to
offer the worst performance. Static files allow you to decide the
way in which the file is hashed. You specify:
- Hashing algorithm. The way data rows are
allocated to different groups depending on the value of their key
field or fields.
- Modulus. The number of groups the file has.
- Separation. The size of each group, expressed as a
number of 512-byte blocks.
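To make these three settings concrete, here is a minimal sketch of how a key is allocated to a group. The modulo hash is a stand-in for the real numbered hashing algorithms (listed later in this topic), and the function names are illustrative:

```python
BLOCK_SIZE = 512  # bytes per block, as used by the separation setting

def group_for_key(key: str, modulus: int) -> int:
    """Return the group number (0-based) that a key hashes to."""
    # Stand-in hash: sum of character codes. The real static file types
    # weight different parts of the key, depending on the type chosen.
    h = sum(ord(c) for c in key)
    return h % modulus

def group_capacity_bytes(separation: int) -> int:
    """Each group holds `separation` 512-byte blocks."""
    return separation * BLOCK_SIZE

# With modulus 7, every key maps to one of groups 0..6.
print(group_for_key("CUST00042", 7))   # 5
print(group_capacity_bytes(4))         # 2048
```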
Generally speaking, you
should use a static file if you have good knowledge of the size and
shape of the data you will be storing in the hashed file. You can
restructure a static hashed file between job runs if you want to tune
it. Do this using the RESIZE command, which can be issued using the
Command feature of the Administrator client. The command for resizing
a static file is:
RESIZE filename [type] [modulus] [separation]
Where:
- filename is the name of the file you are resizing.
- type specifies the hashing algorithm to use (see Hash File Design).
- modulus specifies the number of groups, in the range 1 through 8,388,608.
- separation specifies the size of the groups in 512-byte blocks, in the range 1 through 8,388,608.
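For example, the following command (with a hypothetical file name) rebuilds a static file to use hashing algorithm type 2, 211 groups, and 4-block (2048-byte) groups:

```
RESIZE CustomerLookup 2 211 4
```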
- Dynamic Files. These are hashed files that
change dynamically as data is written to them over time. This might
sound ideal, but if you leave a dynamic file to grow organically it
must perform repeated group split operations as data is written
to it, which can be very time consuming and can impair performance
where the file grows quickly. Dynamic files do not perform as
well as a well-designed static file, but they do perform better than a
badly designed one. When creating a dynamic file you can specify the
following information (although all of these have default values):
- Minimum modulus. The minimum number of groups
the file has. The default is 1.
- Group size. The group can be specified as
1 (2048 bytes) or 2 (4096 bytes). The default is 1.
- Split load. This specifies how much (as a
percentage) a file can be loaded before it is split. The file load
is calculated as follows:
File Load = ((total data bytes) / (total file bytes)) * 100
The split load defaults to 80.
- Merge load. This specifies how small (as
a percentage) a file load can become before groups are merged. File load
is calculated as for Split load. The default is 50.
- Large record. Specifies the number of bytes
above which a record (row) is treated as large. A large record is always
placed in an overflow group.
- Hash algorithm. Choose between GENERAL for
most key field types and SEQ.NUM for keys that are a sequential number
series.
- Record size. Optionally use this to specify
an average record size in bytes. This can then be used to calculate
group size and large record size.
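The split and merge thresholds above can be sketched as follows. The function names are illustrative; the calculation is the File Load formula given for Split load:

```python
def file_load_percent(total_data_bytes: int, total_file_bytes: int) -> float:
    """File Load = ((total data bytes) / (total file bytes)) * 100"""
    return total_data_bytes / total_file_bytes * 100

def should_split(total_data_bytes: int, total_file_bytes: int,
                 split_load: float = 80) -> bool:
    """A group split is due once the load exceeds the split load (default 80)."""
    return file_load_percent(total_data_bytes, total_file_bytes) > split_load

def should_merge(total_data_bytes: int, total_file_bytes: int,
                 merge_load: float = 50) -> bool:
    """Groups are merged back once the load drops below the merge load (default 50)."""
    return file_load_percent(total_data_bytes, total_file_bytes) < merge_load

print(file_load_percent(1_700_000, 2_000_000))  # 85.0
print(should_split(1_700_000, 2_000_000))       # True: 85% exceeds the default 80
print(should_merge(900_000, 2_000_000))         # True: 45% is below the default 50
```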
You can manually
resize a dynamic file using the RESIZE command issued using the Command
feature of the Administrator client. The command for
resizing a dynamic file is:
RESIZE filename [parameter [value]]
where:
- filename is the name of the file you are resizing.
- parameter is one of the following, corresponding to the arguments described above for creating a dynamic file:
GENERAL | SEQ.NUM
MINIMUM.MODULUS n
SPLIT.LOAD n
MERGE.LOAD n
LARGE.RECORD n
RECORD.SIZE n
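For example, the following commands (with a hypothetical file name) raise the minimum modulus and the split load of a dynamic file:

```
RESIZE OrderLookup MINIMUM.MODULUS 101
RESIZE OrderLookup SPLIT.LOAD 90
```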
By default, InfoSphere® DataStage® creates
a dynamic file with the default settings described above.
You can, however, use the Create File options on the
Hashed File stage Inputs page to specify the
type of file and its settings.
These options offer a choice of several types of static (hash)
file, and a dynamic file type. The different types of static file
reflect the different hashing algorithms they use. Choose a type according
to the type of your key, as shown below:
Type  Suitable for keys that are formed like this
2     Numeric, significant in last 8 chars
3     Mostly numeric with delimiters, significant in last 8 chars
4     Alphabetic, significant in last 5 chars
5     Any ASCII, significant in last 4 chars
6     Numeric, significant in first 8 chars
7     Mostly numeric with delimiters, significant in first 8 chars
8     Alphabetic, significant in first 5 chars
9     Any ASCII, significant in first 4 chars
10    Numeric, significant in last 20 chars
11    Mostly numeric with delimiters, significant in last 20 chars
12    Alphabetic, significant in last 16 chars
13    Any ASCII, significant in last 16 chars
14    Numeric, whole key is significant
15    Mostly numeric with delimiters, whole key is significant
16    Alphabetic, whole key is significant
17    Any ASCII, whole key is significant
18    Any chars, whole key is significant
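The table can be read as a lookup from key characteristics to a file type. The sketch below encodes it that way; the tuple convention and function name are illustrative, not part of the product:

```python
# Lookup mirroring the static file type table: (character class,
# significant part of key, number of significant chars) -> type number.
# A chars value of None means the whole key is significant.
STATIC_FILE_TYPES = {
    ("numeric", "last", 8): 2,
    ("mostly-numeric", "last", 8): 3,
    ("alphabetic", "last", 5): 4,
    ("ascii", "last", 4): 5,
    ("numeric", "first", 8): 6,
    ("mostly-numeric", "first", 8): 7,
    ("alphabetic", "first", 5): 8,
    ("ascii", "first", 4): 9,
    ("numeric", "last", 20): 10,
    ("mostly-numeric", "last", 20): 11,
    ("alphabetic", "last", 16): 12,
    ("ascii", "last", 16): 13,
    ("numeric", "whole", None): 14,
    ("mostly-numeric", "whole", None): 15,
    ("alphabetic", "whole", None): 16,
    ("ascii", "whole", None): 17,
    ("any", "whole", None): 18,
}

def static_file_type(char_class: str, significant_part: str, chars=None) -> int:
    """Return the static file type number for the given key characteristics."""
    return STATIC_FILE_TYPES[(char_class, significant_part, chars)]

print(static_file_type("numeric", "last", 8))  # 2
print(static_file_type("any", "whole"))        # 18
```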