# Regression analysis of construction data with IBM SPSS Modeler

Civil construction support system

Civil construction is one of the most influential engineering segments in society,
since it encompasses basic social needs such as housing and urban infrastructure
(roads, overpasses, bridges, and dams). The most important material used in
civil construction works is concrete, which is extensively studied by
researchers in the area and presents highly non-linear properties, such as the
relationship between sturdiness and the elements that compose the concrete mixture
^{1}. This article addresses the application of the SPSS Modeler software to
estimate the sturdiness of concrete based on lab results data.

Dr. I-Cheng Yeh (Department of Civil Engineering, Chung-Hua University) created the
set of data used in this article, and developed two studies that analyze
the influence of the elements used in the concrete's composition on the
resulting sturdiness, and the difficulties in assessing concrete sturdiness,
considering the complexity of the relations between its different composing elements
^{2, 3}. These works compare actual concrete sturdiness results,
obtained in the laboratory, with results obtained through mathematical models and the
MLP neural network algorithm.

The IBM SPSS (Statistical Package for the Social Sciences) software package was initially created to analyze data associated with society, such as public opinions and behaviors. However, its set of pattern recognition and statistical analysis algorithms allows it to be applied in any area or segment that requires extracting relevant information from large quantities of data.

### SPSS Modeler as a data regression system

Tests conducted in this article are based on the SPSS Modeler software, currently at version 14.1, to implement a data regression system capable of establishing the sturdiness of concrete according to its mixture composition parameters and age (in days).

The SPSS Modeler provides users with data reading and writing resources, graphic
analysis, and statistical algorithms. This article presents the MLP neural network
algorithm, which can be applied in supervised classification (see Related topics for a link to my previous article) or regression, with the
purpose of approximating an unknown function *f: R^{n} → R* based on function
points, mostly obtained through experimentation. Figure 1
presents a three-dimensional chart that represents a set of sample points of an
unknown function *f: R² → R*.

##### Figure 1. Example of samples of function *f: R² → R*

Each coordinate point *(x, y, z)* in Figure 1 corresponds to a sample of
function *f: R² → R*: the input (domain element) is *{x, y}*,
representing a point on the horizontal plane, and the output (image element) of the
function is *{z}*, representing a point on the vertical axis. The purpose of
regression is to model the unknown function based on the collected samples,
so that the function output can be established for any input values.
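To make the idea concrete, the following sketch (not part of the original experiment) fits a simple linear model to samples of a hypothetical "unknown" function *f: R² → R* by gradient descent; both the function and the model are assumptions chosen for this illustration.

```python
# Illustrative sketch: samples of a hypothetical "unknown" function
# f(x1, x2) = 2*x1 + 3*x2 + 1 (an assumption for this example), fitted by a
# linear model trained with gradient descent on the squared error.
samples = [((i * 0.1, j * 0.1), 2.0 * i * 0.1 + 3.0 * j * 0.1 + 1.0)
           for i in range(10) for j in range(10)]

w = [0.0, 0.0]   # model weights for the two inputs
b = 0.0          # model intercept
rate = 0.05      # learning rate

for _ in range(2000):
    for (x1, x2), z in samples:
        err = (w[0] * x1 + w[1] * x2 + b) - z   # model output minus sample output
        w[0] -= rate * err * x1
        w[1] -= rate * err * x2
        b -= rate * err

# The fitted model can now estimate the output for any input values,
# including points that were not in the sample set.
estimate = w[0] * 0.55 + w[1] * 0.25 + b
```

After training, the model reproduces the sampled function closely, which is exactly the behavior the regression system is expected to show on the concrete data.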

The following section presents the neural network algorithm that is used in this article to implement regression.

## Multilayer perceptron neural network algorithm (MLP)

Scientific literature presents several regression algorithms, many of which are
implemented in the SPSS Modeler; the multilayer perceptron neural network algorithm
(MLP) was selected for this article. Simon Haykin presents the MLP neural network
algorithm as based on the functioning principles of biological neural structures,
as indicated in Figure 2. Neural computing research attempts
to organize mathematical models similarly to the structures and organization of
neurons in biological brains, to achieve similar processing abilities, including
capacities inherent to biological brains, such as learning based on examples,
trial and error, and knowledge generalization, among many others^{4}.

##### Figure 2. Biological neural structure presenting the structural elements of neurons (dendrites, axon) and synapse

Based on this analogy to biological neurons, the MLP neural network algorithm implements a neural network that is composed of layers of artificial neurons that are stimulated by input signals, which are transmitted through the network via synapses connecting neurons in different layers.

Figure 3 presents an artificial perceptron neuron model with
*N* inputs *{x_{1}, x_{2}, …, x_{N}}*, in
which each input *x_{i}* has an associated synapse *w_{i}*, and an output *y*.

##### Figure 3. Artificial neuron model with inputs *{x_{0}, x_{1}, …, x_{N}}*, weights *{w_{0}, w_{1}, …, w_{N}}*, and output *y*

There is also an additional neuron parameter, named *w_{0}*, known as the
bias, which can be interpreted as a synapse associated to a fixed input
*x_{0} = -1*. The output *y* of the neuron is based on the product between the input vector
*x = [x_{0}, x_{1}, x_{2}, …, x_{N}]^{T}* and the vector
*w = [w_{0}, w_{1}, w_{2}, …, w_{N}]^{T}* composed of the synapses, including the bias
*(w_{0})*. The neuron output is then obtained through the activation function of the neuron, *y =
φ(x . w)*, in which a hyperbolic tangent function (a sigmoid-nature function) is usually adopted, defined by

*φ(a) = tanh(a) = (e^{a} - e^{-a}) / (e^{a} + e^{-a})*

for a generic value *a*; however, it is convenient to use other activation functions in certain scenarios.
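A minimal Python sketch of this neuron model follows; the input and weight values are hypothetical, chosen only for illustration.

```python
import math

def neuron_output(x, w):
    """Perceptron neuron: the bias is handled as weight w[0] on a fixed input -1."""
    inputs = [-1.0] + list(x)                       # prepend x0 = -1 (bias input)
    activation = sum(wi * xi for wi, xi in zip(w, inputs))
    return math.tanh(activation)                    # hyperbolic tangent activation

# Hypothetical neuron with two inputs; w[0] is the bias synapse w0.
y = neuron_output([0.5, 0.2], [0.1, 0.4, 0.3])
```

The output stays in the interval (-1, 1), the range of the hyperbolic tangent.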

The artificial neuron model is *feedforward*, that is, connections are
directed from the inputs *{x_{0}, x_{1}, …, x_{N}}* to
the output *y* of the neuron. Figure 4 presents the layout of perceptron neurons in an MLP neural network with two neuron layers, one hidden and one output. In the neural network presented in Figure 4, each neuron of the hidden layer is connected to each neuron in the output layer; therefore, the inputs of the output-layer neurons correspond to the outputs of the hidden-layer neurons.

The analyst using the artificial neural network algorithm must choose how many neurons to use in the hidden layer, considering the set of input data. With too few neurons in the hidden layer, the neural network is not able to generalize the data; with too many, the overfitting phenomenon arises, in which the neural network learns the training data exclusively and does not generalize its learning to new data.
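The layered, feedforward arrangement can be sketched as follows; the network sizes and weight values here are hypothetical, not taken from the article's experiments.

```python
import math

def layer(inputs, weights):
    """One layer of perceptron neurons; weights[k][0] is neuron k's bias w0 on input -1."""
    return [math.tanh(sum(w * x for w, x in zip(wk, [-1.0] + inputs)))
            for wk in weights]

def mlp_forward(x, hidden_w, output_w):
    """Feedforward pass: hidden-layer outputs become the output-layer inputs."""
    return layer(layer(x, hidden_w), output_w)

# Hypothetical network: 2 inputs, 4 hidden neurons, 1 output neuron.
hidden_w = [[0.1, 0.2, -0.3], [0.0, -0.1, 0.4], [0.2, 0.3, 0.1], [-0.1, 0.1, 0.2]]
output_w = [[0.05, 0.1, -0.2, 0.3, 0.1]]
y = mlp_forward([0.5, -0.4], hidden_w, output_w)
```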

##### Figure 4. Example of neural network with hidden layer

The neural network training process is conducted with a backpropagation
algorithm, with the purpose of adjusting the values associated to the synapses so
that the neural network maps an input space to an output space, in which the input
vectors *x* are samples of the input space and each input vector is
associated to an output *z*, which can be represented by a vector *z =
{z_{1}, z_{2}, …, z_{M}}*, based on a scalar value
or a symbolic value. Specifically, for symbolic values, a neuron in the output layer
corresponds to each of the possible symbols associated to the input
vector.

During the neural network training process, a set of input data is initially
determined, to which the associated output is known, and random values are
attributed to each synapse in the neural network. The data is presented to the
neural network and the supplied output is compared to the actual output, generating
an error value *e*. The error value is then employed to adjust the neural
network synapses in the reverse direction, from output to inputs (backpropagation).
For the neurons of the output layer, each synapse is adjusted based on the equation

*w_{i} ← w_{i} + α . e . x_{i}*

where *α* is the neural network learning coefficient and
*x_{i}* is the i-th component of the input vector *x*. For neurons in the hidden layer, the following equation is applied:

*w_{i} ← w_{i} + α . e . φ'(w^{(output)} . y^{(output)}) . w_{i}^{(output)} . x_{i}*

where *φ'(w^{(output)} . y^{(output)})* represents the derivative of the activation function evaluated at the argument
*(w^{(output)} . y^{(output)})*, and *w^{(output)}* and *y^{(output)}* correspond to the vector of synapses that connect the neurons of the hidden layer to those of the output layer, and to the outputs of the neurons of the output layer. The book
*Neural networks: a comprehensive foundation* (Haykin, 1999)^{4} details how to reach each of the equations presented here.

The process of adjusting the synapse values is repeated until a stopping criterion is met, for example, a fixed number of repetitions. In each repetition, the outputs provided by the neural network get closer to the actual outputs, since the synapse correction equations minimize the error between the output provided by the neural network and the actual output.

For regression problems, an adopted methodology consists in using output neurons with a linear activation function and assigning one output neuron to each of the output vector's components; in cases where the output is a single scalar value, the neural network is designed with a single neuron in the output layer.
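Putting the training equations and the linear-output choice together, the following self-contained sketch trains a small MLP by backpropagation on a toy one-dimensional regression problem. The target function, network size, learning coefficient, and number of repetitions are assumptions chosen for this illustration, not the article's configuration.

```python
import math
import random

random.seed(0)
H, rate = 4, 0.05                       # hidden neurons and learning coefficient (alpha)

# Each hidden neuron holds [w0 (bias), w1]; the output neuron holds a bias
# plus one synapse per hidden neuron. All start with small random values.
hid = [[random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)] for _ in range(H)]
out = [random.uniform(-0.5, 0.5) for _ in range(H + 1)]

# Toy samples of an "unknown" function, here f(x) = x^2 on [-1, 1].
samples = [(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)]

def forward(x):
    h = [math.tanh(-w[0] + w[1] * x) for w in hid]          # hidden layer (tanh, bias input -1)
    y = -out[0] + sum(out[k + 1] * h[k] for k in range(H))  # single linear output neuron
    return h, y

for _ in range(3000):                   # stopping criterion: fixed number of repetitions
    for x, z in samples:
        h, y = forward(x)
        e = z - y                                            # error vs. actual output
        # Hidden-layer errors are backpropagated through the output synapses.
        grads = [e * out[k + 1] * (1.0 - h[k] ** 2) for k in range(H)]
        out[0] -= rate * e                                   # bias input is -1
        for k in range(H):
            out[k + 1] += rate * e * h[k]
        for k in range(H):
            hid[k][0] -= rate * grads[k]                     # bias input is -1
            hid[k][1] += rate * grads[k] * x

mae = sum(abs(z - forward(x)[1]) for x, z in samples) / len(samples)
```

After training, the average absolute error over the samples is small, mirroring the behavior measured on the concrete data later in the article.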

## Database presentation

The database used in this article is available to the general public at http://archive.ics.uci.edu/ml/
(the UCI Machine Learning Repository), maintained by the University of California,
Irvine, in collaboration with Rexa.info (http://rexa.info/) at the University of
Massachusetts Amherst, and supported by the National Science Foundation (US). The file
`Concrete_Data.xls` was chosen to create and analyze statistical
models with the IBM SPSS Modeler software. This file encompasses a spreadsheet of 9
columns and 1030 rows of data, as illustrated in Figure 5.

##### Figure 5. `Concrete_Data.xls` file used to create and test statistical models

As indicated in Figure 5, the file used in this article
is organized in nine columns, in which each line represents data collected
from a concrete mixture analyzed in a lab. The first seven columns correspond to
the concentration of each element in the mixture, in kg per m³ of
concrete; the following column corresponds to the age of the concrete, in days; and
the last column corresponds to the sturdiness of the concrete, which is measured in
MPa (megapascal, a pressure measurement unit). The data provided in the file
is detailed as follows:

- cement concentration (kg/m³);
- blast furnace slag (kg/m³);
- fly ash (kg/m³);
- water (kg/m³);
- superplasticizer (kg/m³);
- coarse aggregate (kg/m³);
- fine aggregate (kg/m³);
- age (days);
- concrete sturdiness (MPa).

All of these attributes are numerical variables whose values correspond to the measurement units presented in parentheses; thus, the neural network used in this article is designed to solve a regression problem in which the input space comprises the first eight columns of the file (cement concentration to age) and the output space corresponds to the ninth column (concrete sturdiness).
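The split between input and output columns can be sketched as follows. This assumes the `Concrete_Data.xls` spreadsheet has been exported to CSV with the nine columns in the order described above; the header names and the two embedded sample rows are illustrative values in the dataset's format, not code from the article.

```python
import csv
import io

# Hypothetical CSV export of Concrete_Data.xls (header names are assumptions).
csv_text = """cement,slag,fly_ash,water,superplasticizer,coarse_agg,fine_agg,age,strength
540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Input space: the first eight columns; output space: the ninth column.
inputs = [[float(r[c]) for c in list(r)[:8]] for r in rows]
targets = [float(r["strength"]) for r in rows]
```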

The following stage consists in feeding the SPSS Modeler software with data obtained to create and test a regression model that is based on the MLP neural network algorithm. The following section establishes how data is prepared and applied, and how statistical models are generated and analyzed by the SPSS Modeler.

## Methodology

After the data is organized in a spreadsheet model, it can be imported and manipulated by the elements of the SPSS Modeler software. Figure 6 presents the stream file (the SPSS Modeler file format) created to generate the results presented herein.

##### Figure 6. Stream file developed to analyze the data that is based on the MLP algorithm

In the stream file shown in Figure 6, there are seven elements, named nodes, one for each of the following functions:

- Data reading
- Data type adjustment
- Statistical model creation
- Statistical model generated
- Regression chart
- Results analysis
- Viewing the input data and the data obtained from the regression

The element that contains the MLP neural network algorithm is tested based on the following settings:

- Default settings for fixed parameters, see Figure 7
- Number of neurons in the hidden layer: 2, 4, and 6 (empirical values)

##### Figure 7. MLP algorithm configuration screen with default specifications

The following section presents results obtained based on the proposed methodology; actual concrete resistance values are compared to values obtained by the regression after the neural network training process.

## Results

Results were generated by using all 1030 records in the `Concrete_Data.xls` file.

Metrics that are applied to the results presented herein are the linear
correlation coefficient *r* and the absolute average error *E*, computed for two
generic data sets *x1* and *x2* of *N* values each. With *m_{1}* and *m_{2}*
denoting the averages of the two data sets, the correlation coefficient is based on the equation

*r = Σ_{i}(x1_{i} - m_{1})(x2_{i} - m_{2}) / ( √(Σ_{i}(x1_{i} - m_{1})^{2}) . √(Σ_{i}(x2_{i} - m_{2})^{2}) )*

and the absolute average error is based on the equation

*E = (1/N) Σ_{i} |x1_{i} - x2_{i}|*

The correlation coefficient measures the influence between the data sets, and can also be interpreted as a coefficient of similarity between the two data sets; the absolute average error measures the discrepancies between them.
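Both metrics follow directly from their standard definitions; a minimal Python version:

```python
import math

def correlation(x1, x2):
    """Pearson linear correlation coefficient between two data sets."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in x1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in x2))
    return cov / (s1 * s2)

def absolute_average_error(x1, x2):
    """Average of the absolute differences between paired values."""
    return sum(abs(a - b) for a, b in zip(x1, x2)) / len(x1)
```

Two identical data sets give a correlation of 1 and an error of 0; the figures reported for the concrete data come from comparing actual and estimated sturdiness values in this way.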

Results that are obtained with the MLP neural network, altering the number of hidden layer neurons, are:

- Results with two hidden neurons: linear correlation 0.911 and absolute average error 5.37
- Results with four hidden neurons: linear correlation 0.934 and absolute average error 4.61
- Results with six hidden neurons: linear correlation 0.925 and absolute average error 4.87

In this case, the data sets *x1* and *x2* correspond to the actual
concrete sturdiness values in the test file and to the values estimated by the MLP
neural network after the training process.

By analyzing the results obtained by the neural network with two, four, and six hidden-layer neurons, it is concluded that the best neural network configuration is four hidden neurons, which presents both a higher correlation and a lower error between actual data and estimated values, as indicated in Figure 8.
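The configuration choice amounts to a simple selection over the tested architectures; using the metrics reported above:

```python
# (hidden neurons) -> (linear correlation, absolute average error), from the tests above.
results = {2: (0.911, 5.37), 4: (0.934, 4.61), 6: (0.925, 4.87)}

# Pick the architecture with the highest correlation, breaking ties by lower error.
best = max(results, key=lambda h: (results[h][0], -results[h][1]))
```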

##### Figure 8. Neural network with optimal architecture obtained after the algorithm training process

As previously presented in the MLP section, each neural network input element has an associated synapse, represented by a numerical value that weighs the input: the higher the synapse value, the more relevant the input is for the result generated by the neural network. Figure 9 presents a relevance chart for each input field in the regression results; this information becomes available after the creation of the node corresponding to the statistical model generated by the SPSS Modeler.

##### Figure 9. Neural network input variables and their respective relevance to the regression process

Lastly, Figure 10 presents a scatter chart containing actual concrete sturdiness values on the horizontal axis and values estimated by the neural network with four hidden neurons on the vertical axis.

##### Figure 10. Results of the regression of data in the *Concrete_Data.xls* file

In addition to the points, Figure 10 also presents the identity line, which represents the ideal regression result, in which each concrete sample would present the exact same resistance as estimated by the neural network. In this case, the high concentration of points next to the identity line is visually evident, which is confirmed by correlation (0.934) and error (4.61), evidencing the efficiency of the proposed methodology.

## Conclusion

This article presents pattern recognition and machine learning concepts, and the application of the SPSS Modeler software, to support civil construction activities by estimating the sturdiness of concrete based on its composition and age. The results present a correlation of 0.934 between actual concrete sturdiness values and the values estimated by the regression, with an absolute average error of 4.61 MPa, which ensures the reliability of the regression system. Because of the field-application nature of the process, in which the analysis result must be available at the worksite, where the concrete is effectively used, it is important to consider the possibility of integrating the system with an interface compatible with mobile devices, thus allowing managers of construction site activities immediate access to such data whenever necessary.

_____________________________________________

1. Yeh, I.-C. Concrete Compressive Strength data set readme, http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Readme.txt (2007).
2. Yeh, I.-C. "Modeling of strength of high-performance concrete using artificial neural networks." Cement and Concrete Research, Vol. 28, No. 12, 1998.
3. Yeh, I.-C. "Analysis of strength of concrete using design of experiments and neural networks." Journal of Materials in Civil Engineering, 2006.
4. Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.
5. Marques, J. S. Reconhecimento de padrões: métodos estatísticos e neuronais (Pattern recognition: statistical and neural methods). IST Press, Lisbon, Portugal, 2nd edition, 2005.


#### Related topics

- Statistical analysis of medical data with IBM SPSS Modeler (Carneiro, Bianchi, developerWorks, Oct 2012): Read this article by the author and Marcelo Franceschi de Bianchi, which includes how the MLP neural network algorithm can be applied in supervised classification.
- Pattern recognition: statistical and neural methods by J.S. Marques, IST Press, 2nd edition, 2005.
- Neural Networks: A Comprehensive Foundation (2nd Edition), Simon Haykin: This book is one of the most comprehensive treatments of neural networks from an engineering perspective.
- The nature of statistical learning theory by Vladimir N. Vapnik: Learn more about the fundamental ideas that lie behind the statistical theory of learning and generalization from this book.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- IBM SPSS Modeler: Build predictive models quickly and intuitively without programming with this powerful data mining workbench.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.