Regression analysis of construction data with IBM SPSS Modeler

Civil construction support system

This is the second article in a series that shows how IBM SPSS Modeler can meet the needs of different industry segments. In this article, SPSS Modeler is used as a support system for analyzing the concrete used in civil construction, estimating the sturdiness of the concrete (its compression resistance), which is a critical factor in civil construction works. This article was first published by developerWorks Brazil and is an example of developerWorks worldwide offerings.


Alex Torquato Souza Carneiro, Software Engineer, IBM Brazil

Alex graduated in, and holds a master's degree in, Telecomputing Engineering from the Federal University of Ceará (Brazil). He has participated in several research and development projects in machine learning, data mining, computer vision, and digital signal and image processing since 2005. He is currently a software engineer at IBM Software Group Brazil, working with the IBM SPSS products. Alex also teaches Computer Science at Universidade Ibirapuera (São Paulo, Brazil).



29 November 2012

Introduction

Civil construction is one of the most influential engineering segments in society, since it addresses basic social needs such as housing and urban infrastructure (roads, overpasses, bridges, and dams). The most important material used in civil construction works is concrete, which has been studied extensively and presents highly non-linear properties, such as the relationship between its sturdiness and the components of the concrete mixture 1. This article applies the SPSS Modeler software to estimate the sturdiness of concrete from laboratory test data.

Dr. I-Cheng Yeh (Department of Civil Engineering, Chung-Hua University) created the data set that is used in this article and published two studies that analyze how the components of a concrete mixture influence the resulting sturdiness, and how difficult that sturdiness is to assess given the complex relationships among the components 2, 3. These works compare actual concrete sturdiness results, obtained in the laboratory, with results obtained through mathematical models and the MLP neural network algorithm.

The IBM SPSS (Statistical Package for the Social Sciences) software package was initially created to analyze data that is associated with society, such as public opinions and behaviors. However, its set of pattern recognition and statistical analysis algorithms allows it to be applied in any area or segment that requires extracting relevant information from large quantities of data.

SPSS Modeler as a data regression system

The tests conducted in this article use the SPSS Modeler software, version 14.1 at the time of writing, to implement a data regression system that estimates the sturdiness of concrete from the composition of the mixture and its age (in days).

SPSS Modeler provides resources for reading and writing data, graphical analysis, and statistical algorithms. This article presents the MLP neural network algorithm, which can be applied to supervised classification (covered in my previous article in this series) or to regression, with the purpose of approximating an unknown function f: Rⁿ → R from sample points of the function, mostly obtained through experimentation. Figure 1 presents a three-dimensional chart that represents the set of sample points of an unknown function f: R² → R.

Figure 1. Example of samples in function f: R² → R

Each coordinate point (x, y, z) in Figure 1 corresponds to a sample of the function f: R² → R, whose input (domain element) is {x, y}, representing a point on the horizontal plane, and whose output (image element) is {z}, representing a point on the vertical axis. The purpose of the regression is to model the unknown function from the collected samples so that the function output can be estimated for any input values.

The following section presents the neural network algorithm that is used in this article to implement regression.

Multilayer perceptron neural network algorithm (MLP)

Scientific literature presents several regression algorithms, many of which are implemented in SPSS Modeler. The multilayer perceptron (MLP) neural network algorithm was selected for this article. Simon Haykin presents the MLP neural network algorithm, which is based on the operating principles of biological neural structures, as indicated in Figure 2. Neural computing research attempts to organize mathematical models similarly to the structure and organization of neurons in biological brains in order to achieve similar processing abilities, including capacities inherent to biological brains such as learning from examples, trial and error, and knowledge generalization, among many others 4.

Figure 2. Biological neural structure presenting the structural elements of neurons (dendrites, axon) and synapse

Based on this analogy to biological neurons, the MLP neural network algorithm implements a neural network that is composed of layers of artificial neurons that are stimulated by input signals, which are transmitted through the network via synapses connecting neurons in different layers.

Figure 3 presents an artificial perceptron neuron model with N inputs {x1, x2, …, xN}, in which each input xi has an associated synapse wi, and an output y.

Figure 3. Artificial neuron model with inputs {x0, x1, …, xN}, weights {w0, w1, …, wN} and output y

There is also an additional neuron parameter, w0, known as the bias, which can be interpreted as a synapse associated with an input x0 = -1. The output of the neuron y is based on the inner product between the input vector x = [x0, x1, x2, …, xN]T and the vector w = [w0, w1, w2, …, wN]T composed of synapses, including the bias (w0),
$$\mathbf{x} \cdot \mathbf{w} = \sum_{i=0}^{N} x_i w_i$$

The neuron output is then obtained through the activation function of the neuron, y = tanh(x · w), in which a hyperbolic tangent function (a sigmoid-type function) is usually adopted, defined by
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$
for a generic value a; however, it is convenient to use other activation functions in certain scenarios.
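
The neuron model described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the convention from the text that the bias w0 is attached to a constant input x0 = -1; the function name `neuron_output` and the numeric values are made up for the example and do not come from SPSS Modeler.

```python
import math

def neuron_output(inputs, weights, bias_weight):
    """Output of one perceptron neuron with a tanh activation.

    The bias is handled as a weight w0 on a constant input x0 = -1.
    """
    x = [-1.0] + list(inputs)          # prepend x0 = -1
    w = [bias_weight] + list(weights)  # prepend w0
    activation = sum(xi * wi for xi, wi in zip(x, w))  # inner product x . w
    return math.tanh(activation)

# Illustrative values only: two inputs, two synapses, one bias.
y = neuron_output([0.5, -1.0], [0.8, 0.2], bias_weight=0.1)
print(y)
```

Because tanh is bounded, the output always lies strictly between -1 and 1, whatever the size of the inputs.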

The artificial neuron model is feedforward, that is, connections are directed from inputs {x0, x1, …, xN} to output y of the neuron. Figure 4 presents the layout of perceptron neurons in an MLP neural network, in which there are two neuron layers, one hidden and one output. In the neural network that is presented in Figure 4, each neuron of the hidden layer is connected to each neuron in the output layer. Therefore, the inputs of the output layer neurons correspond to the outputs of the hidden layer neurons. The analyst who uses the artificial neural network algorithm must choose how many neurons to use in the hidden layer, considering the set of input data. With too few neurons in the hidden layer, the neural network is not able to capture the structure of the data. With too many, it is prone to overfitting, in which the neural network learns the training data exclusively and does not generalize what it has learned to new data.

Figure 4. Example of neural network with hidden layer
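
A forward pass through a network with one hidden layer and one output layer could look like the sketch below. It assumes each neuron stores its weights as a row `[w0, w1, ..., wN]` with the bias on a constant input of -1, and it applies tanh in both layers; the helper names and the weight values are hypothetical and chosen only to make the result easy to check by hand.

```python
import math

def layer_output(inputs, weight_rows, activation):
    """Outputs of one fully connected layer.

    Each row of weight_rows holds one neuron's weights: [w0 (bias), w1, ..., wN].
    """
    x = [-1.0] + list(inputs)  # bias input x0 = -1
    return [activation(sum(w * xi for w, xi in zip(row, x)))
            for row in weight_rows]

def mlp_forward(inputs, hidden_weights, output_weights):
    """Feed the inputs through the hidden layer, then through the output layer."""
    hidden = layer_output(inputs, hidden_weights, math.tanh)
    # The outputs of the hidden neurons are the inputs of the output neurons.
    return layer_output(hidden, output_weights, math.tanh)

# Two inputs, two hidden neurons, one output neuron (illustrative weights).
hidden_weights = [[0.0, 1.0, 0.0],   # hidden neuron 1 looks only at input 1
                  [0.0, 0.0, 1.0]]   # hidden neuron 2 looks only at input 2
output_weights = [[0.0, 1.0, -1.0]]  # output = tanh(h1 - h2)
result = mlp_forward([0.3, 0.3], hidden_weights, output_weights)
print(result)
```

With these weights the two hidden outputs are equal, so the single output neuron receives a zero activation.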

The neural network is trained with the backpropagation algorithm, which adjusts the values associated with the synapses so that the neural network maps an input space to an output space. The input vectors x are samples of the input space, and each input vector is associated with an output z, which can be a vector z = {z1, z2, …, zM}, a scalar value, or a symbolic value. Specifically, for symbolic values, one neuron in the output layer corresponds to each of the possible symbols that can be associated with the input vector.

During the neural network training process, a set of input data for which the associated output is known is initially selected, and random values are assigned to each synapse in the neural network. The data is presented to the neural network and the output that it supplies is compared to the actual output, generating an error value e. The error value is then used to adjust the synapses of the neural network in reverse order, from the output to the inputs (backpropagated), based on the equation
equation3
where a is the learning coefficient of the neural network and xi is the i-th component of the input vector x, for neurons in the output layer of the neural network; the following equation is applied for neurons in the hidden layer
equation4
where tanh′(w(output) · y(output)) represents the derivative of the activation function evaluated at the argument (w(output) · y(output)), and w(output) and y(output) correspond, respectively, to the vector of synapses that connect neurons in the hidden layer to those in the output layer and to the outputs of neurons in the output layer. The book Neural networks: a comprehensive foundation (Haykin, 1999) 4 details how each of the equations presented here is derived.

The process of adjusting synapse values is repeated until a stopping criterion is met, for example, a fixed number of repetitions. In each repetition, the outputs that are provided by the neural network get closer to the actual outputs, since the synapse correction equations minimize the error between the output that is provided by the neural network and the actual output.

For regression problems, a common approach consists of using output neurons with a linear activation function and assigning one output neuron to each component of the output vector; when the output is a single scalar value, the neural network is designed with a single neuron in the output layer.
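
The training procedure can be illustrated with a small, self-contained sketch. It is not the implementation used by SPSS Modeler: it assumes plain online gradient descent on the squared error, one tanh hidden layer, and a single linear output neuron, and it trains on samples of a made-up function. The names `forward` and `train`, the learning coefficient, and the number of repetitions are all choices made for this example.

```python
import math
import random

def forward(inputs, hidden_w, output_w):
    """Return (x, h, y): biased inputs, biased hidden outputs, linear output."""
    x = [-1.0] + list(inputs)                      # bias input x0 = -1
    h = [-1.0] + [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                  for row in hidden_w]             # h[0] is the output bias input
    y = sum(w * hi for w, hi in zip(output_w, h))  # linear activation (regression)
    return x, h, y

def train(samples, n_hidden, rate=0.05, epochs=300, seed=0):
    """Online gradient descent; returns the weights and the MSE of each epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Random initial values for every synapse.
    hidden_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    output_w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):                        # fixed number of repetitions
        squared_error = 0.0
        for inputs, target in samples:
            x, h, y = forward(inputs, hidden_w, output_w)
            e = target - y                         # error of this sample
            squared_error += e * e
            # Error propagated back to hidden neuron j, scaled by the
            # tanh derivative 1 - tanh(u)^2.
            deltas = [e * output_w[j] * (1.0 - h[j] ** 2)
                      for j in range(1, n_hidden + 1)]
            output_w = [w + rate * e * hi for w, hi in zip(output_w, h)]
            hidden_w = [[w + rate * d * xi for w, xi in zip(row, x)]
                        for row, d in zip(hidden_w, deltas)]
        history.append(squared_error / len(samples))
    return hidden_w, output_w, history

# Toy samples of a made-up function f(x1, x2) = sin(x1) + 0.5 * x2.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
samples = [((a, b), math.sin(a) + 0.5 * b) for a in grid for b in grid]
hidden_w, output_w, history = train(samples, n_hidden=4)
print(f"MSE first epoch: {history[0]:.4f}, last epoch: {history[-1]:.4f}")
```

The hidden-layer corrections are computed before the output weights are changed, so both layers are adjusted from the same forward pass.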


Database presentation

The data set that is used in this article is available to the general public from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/, maintained by the University of California, Irvine, in collaboration with Rexa.info (http://rexa.info/) at the University of Massachusetts Amherst and with support from the National Science Foundation (US). The file Concrete_Data.xls was chosen to create and analyze statistical models with the IBM SPSS Modeler software. This file contains a spreadsheet of 9 columns and 1030 rows of data, as illustrated in Figure 5.

Figure 5. Concrete_Data.xls file that is used to create and test statistical models

As indicated in Figure 5, the file that is used in this article is organized in nine columns, in which each row represents data that is collected from a concrete mixture analyzed in a lab. The first seven columns correspond to the concentration of each component in the mixture, in kg per m³ of concrete; the following column corresponds to the age of the concrete, in days; and the last column corresponds to the sturdiness of the concrete, which is measured in MPa (megapascal, a unit of pressure). The data that is provided in the file is detailed as follows:

  • cement concentration (kg/m³);
  • blast furnace slag (kg/m³);
  • fly ash (kg/m³);
  • water (kg/m³);
  • superplasticizer (kg/m³);
  • coarse aggregate (kg/m³);
  • fine aggregate (kg/m³);
  • age (days);
  • concrete sturdiness (MPa).

All of these attributes are numerical variables whose values are expressed in the measurement unit presented in parentheses; thus, the neural network that is used in this article is designed to solve a regression problem in which the input space comprises the first eight columns of the file (cement concentration to age) and the output space corresponds to the ninth column of the file (concrete sturdiness).
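
The split between input and output columns can be expressed as a short Python sketch. It assumes each record has already been read from the spreadsheet into a list of nine numbers in file order; the helper `split_record` is hypothetical, and the two rows below are illustrative values, not records taken from Concrete_Data.xls.

```python
def split_record(row):
    """Split one nine-column record into (inputs, target).

    inputs: the first eight columns (seven concentrations in kg/m³, then age in days)
    target: the ninth column (concrete sturdiness in MPa)
    """
    if len(row) != 9:
        raise ValueError(f"expected 9 columns, got {len(row)}")
    return list(row[:8]), row[8]

# Illustrative values only, not taken from the data file.
rows = [
    [300.0, 0.0, 0.0, 180.0, 0.0, 1000.0, 800.0, 28, 30.0],
    [250.0, 100.0, 50.0, 170.0, 5.0, 950.0, 750.0, 7, 20.0],
]
samples = [split_record(row) for row in rows]
for inputs, target in samples:
    print(len(inputs), "inputs ->", target, "MPa")
```

Keeping the split in one place makes it harder to feed the target column to the network as an input by mistake.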

The next stage consists of feeding this data into the SPSS Modeler software to create and test a regression model that is based on the MLP neural network algorithm. The following section describes how the data is prepared and applied, and how statistical models are generated and analyzed by SPSS Modeler.


Methodology

After the data is organized in a spreadsheet, it can be imported and manipulated by the elements of the SPSS Modeler software. Figure 6 presents the stream file (the SPSS Modeler file format) that was created to generate the results that are presented here.

Figure 6. Stream file developed to analyze the data that is based on the MLP algorithm

In the stream file shown in Figure 6, there are seven elements, named nodes; one for each of the following functions:

The element that contains the MLP neural network algorithms is tested based on the following settings:

  • Default settings for fixed parameters, see Figure 7
  • Number of neurons in the hidden layer: 2, 4, and 6 (empirical values)

Figure 7. MLP algorithm configuration screen with default specifications

The following section presents results obtained based on the proposed methodology; actual concrete resistance values are compared to values obtained by the regression after the neural network training process.


Results

Results were generated by using all 1030 records in the Concrete_Data.xls file.

The metrics that are applied to the results presented here are the linear correlation coefficient, which is based on the equation
$$r = \frac{\sum_{i=1}^{n} (x_{1,i} - \bar{x}_1)(x_{2,i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n} (x_{1,i} - \bar{x}_1)^2}\,\sqrt{\sum_{i=1}^{n} (x_{2,i} - \bar{x}_2)^2}}$$
and the absolute average error, which is based on the equation
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_{1,i} - x_{2,i} \right|$$
for two generic data sets x1 and x2 with n elements each. The correlation coefficient measures the strength of the linear relationship between the data sets, and can also be interpreted as a coefficient of similarity between them; the absolute average error measures the discrepancy between the two data sets.
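
Both metrics are straightforward to compute. The sketch below assumes the standard Pearson definition of the linear correlation coefficient and the mean of absolute differences for the error; the function names and the four sample values are illustrative and are not results from this article.

```python
import math

def linear_correlation(x1, x2):
    """Pearson linear correlation coefficient between two equal-length data sets."""
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    covariance = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    spread1 = math.sqrt(sum((a - mean1) ** 2 for a in x1))
    spread2 = math.sqrt(sum((b - mean2) ** 2 for b in x2))
    return covariance / (spread1 * spread2)

def absolute_average_error(x1, x2):
    """Mean of the absolute differences between paired values."""
    return sum(abs(a - b) for a, b in zip(x1, x2)) / len(x1)

# Illustrative values only (for example, actual and estimated values in MPa).
actual = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
print(linear_correlation(actual, estimated))
print(absolute_average_error(actual, estimated))
```

The two metrics are complementary: a constant offset between the data sets leaves the correlation at 1 but shows up directly in the absolute average error.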

Results that are obtained with the MLP neural network, altering the number of hidden layer neurons, are:

  • Results with two hidden neurons

    linear correlation 0.911 and absolute average error 5.37

  • Results with four hidden neurons

    linear correlation 0.934 and absolute average error 4.61

  • Results with six hidden neurons

    linear correlation 0.925 and absolute average error 4.87

In this case, the data sets x1 and x2 correspond, respectively, to the actual concrete resistance values in the tests file and to the values estimated by the MLP neural network after the training process.

Comparing the results that are obtained by the neural network with two, four, and six hidden layer neurons, the best neural network configuration is the one with four hidden neurons, which presents both the highest correlation and the lowest error between actual data and estimated values, as indicated in Figure 8.

Figure 8. Neural network with optimal architecture obtained after the algorithm training process

As previously presented in the MLP section, each neural network input element has an associated synapse, which is represented by a numerical value that weights the input; the higher the synapse value, the more relevant the input is to the result that is generated by the neural network. Figure 9 presents a relevance chart for each input field in the regression, information that becomes available after the creation of the node corresponding to the statistical model generated by SPSS Modeler.

Figure 9. Neural network input variables and their respective relevance to the regression process

Lastly, Figure 10 presents a scatter plot with actual concrete resistance values on the horizontal axis and the values that are estimated by the neural network with four hidden neurons on the vertical axis.

Figure 10. Results of the regression of data in the Concrete_Data.xls file

In addition to the points, Figure 10 also presents the identity line, which represents the ideal regression result, in which each concrete sample would present exactly the resistance estimated by the neural network. The high concentration of points next to the identity line is visually evident and is confirmed by the correlation (0.934) and the error (4.61), which supports the effectiveness of the proposed methodology.


Conclusion

This article presents pattern recognition and machine learning concepts, and the application of the SPSS Modeler software to support civil construction activities by estimating the sturdiness of concrete from its composition and age. The correlation between actual concrete sturdiness values and the values estimated by the regression is 0.934, with an absolute average error of 4.61 MPa, which indicates that the regression system is reliable. Because the process is applied in the field, where the analysis result must be available at the worksite at which the concrete is used, it is worth considering the integration of the system with an interface compatible with mobile devices, giving managers of construction site activities immediate access to such data whenever necessary.

_____________________________________________

1. Yeh, I.-C. http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Readme.txt (2007)
2. Yeh, I.-C. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research. Vol. 28, No. 12, 1998
3. Yeh, I.-C. Analysis of strength of concrete using design of experiments and neural networks. Journal of materials in civil engineering, 2006
4. Haykin, S. Neural networks: a comprehensive foundation. Publisher: Prentice Hall. Hamilton, Canada. 2nd edition, 1999
5. Marques, J. S. Reconhecimento de padrões: métodos estatísticos e neuronais. Publisher: IST Press, Lisbon, Portugal. 2nd edition, 2005
