base10Cluster

Description

The base10Cluster package implements a numerical clustering algorithm that groups numerically close items into buckets, each representing a distinct cluster. The algorithm processes the input numerical data set in two steps:

Binning: After initial processing, each numerical value is assigned to a bin determined by its base-10 logarithm.
Redistribution: After binning, the bins are redistributed to produce a balanced clustering across the buckets.

The base10Cluster algorithm handles out-of-range, empty, and null values by assigning them to a special bucket named EMPTY.
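
The binning step and the EMPTY bucket can be illustrated with a minimal sketch. It is not the package's implementation: the class and method names are hypothetical, the bin is assumed to be the floor of the base-10 logarithm, and non-positive, non-finite, or unparseable values are assumed to be the ones that count as out of range. The redistribution step is not sketched because its balancing rule is not described here.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;

// Hypothetical illustration of log10-based binning; not part of the ZADE API.
class Base10BinningSketch {
    // Key used in this sketch for the special EMPTY bucket.
    static final String EMPTY = "EMPTY";

    /** Returns floor(log10(value)), or an empty result when the field belongs in EMPTY. */
    static OptionalInt binOf(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty(); // null or empty field
        }
        double value;
        try {
            value = Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // not a number
        }
        // Assumption: values without a finite logarithm are treated as out of range.
        if (Double.isNaN(value) || Double.isInfinite(value) || value <= 0.0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of((int) Math.floor(Math.log10(value)));
    }

    /** Counts how many raw fields land in each bin; keys look like "1e2" or "EMPTY". */
    static Map<String, Integer> countPerBin(List<String> rawValues) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : rawValues) {
            OptionalInt bin = binOf(raw);
            String key = bin.isPresent() ? "1e" + bin.getAsInt() : EMPTY;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Under these assumptions, 250 and 999.9 share the bin for exponent 2, 1000 starts the bin for exponent 3, and 0.05 falls in the bin for exponent -2.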

base10Cluster is a single-pass algorithm and can process multiple input data sets in parallel by using multiple threads. For each input data set, it generates an output file listing the number of buckets and their corresponding minimum values. The preprocessing code uses this information to assign a string token identifier to a numeric value.
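
How a preprocessing step might turn the sorted bucket minimums into a token can be sketched as follows. This is an assumption-laden illustration, not the actual preprocessing code: the class name, the BUCKET_ token format, and the choice to map values below the smallest minimum to EMPTY are all hypothetical. The lookup picks the bucket with the largest minimum that does not exceed the value.

```java
import java.util.Arrays;

// Hypothetical lookup from a numeric value to a bucket token; not part of the ZADE API.
class MinimumsLookupSketch {
    /**
     * sortedMinimums must be in increasing order with no duplicates.
     * Returns "BUCKET_<index>" for the bucket whose minimum is the largest one <= value.
     */
    static String tokenFor(double value, double[] sortedMinimums) {
        // Assumption: NaN and values below the first minimum go to EMPTY.
        if (Double.isNaN(value) || sortedMinimums.length == 0 || value < sortedMinimums[0]) {
            return "EMPTY";
        }
        int pos = Arrays.binarySearch(sortedMinimums, value);
        // When the value is not found, binarySearch returns -(insertionPoint) - 1,
        // and the containing bucket sits just before the insertion point.
        int index = pos >= 0 ? pos : -pos - 2;
        return "BUCKET_" + index;
    }
}
```

With the illustrative minimums 0.5, 10, 250, and 4000, the value 249.9 maps to BUCKET_1 and the value 250 maps to BUCKET_2.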

Format

The base10Cluster component can be invoked using the zade.Base10Cluster instance function:

⋮
ZADE zade = new ZADE();
zade.Base10Cluster(args);
⋮

Parameters

int num_threads: The number of threads used for parallelization.
String[] file_names: The list of input file names in CSV format. When submitted via the command line, input file names must be separated by a semicolon (;).
String output_dir: The directory where the output file will be stored.
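
The relationship between num_threads and file_names can be pictured with a short sketch that splits a semicolon-separated command-line value and processes each file on a fixed-size thread pool. The class, the method names, and the perFile callback are hypothetical stand-ins; how zade.Base10Cluster actually parses its arguments and schedules work is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration of per-file parallelism; not part of the ZADE API.
class PerFileThreadsSketch {
    /** Splits "a.csv;b.csv" into individual names, skipping blank entries. */
    static String[] splitFileNames(String commandLineValue) {
        List<String> names = new ArrayList<>();
        for (String part : commandLineValue.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names.toArray(new String[0]);
    }

    /** Runs perFile once per input file on numThreads workers; results keep input order. */
    static <R> List<R> processAll(int numThreads, String[] fileNames, Function<String, R> perFile) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String name : fileNames) {
                Callable<R> task = () -> perFile.apply(name);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // waits for that file's task to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a file", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("processing a file failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch each file is an independent task, so a num_threads value larger than the number of input files brings no additional parallelism.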

Output files

minimums: A file, stored in output_dir, that lists the cluster minimum values, sorted in increasing order.
log: A file, stored in the current working directory, that contains the execution log messages from the function.