Machine learning follows a process of preparing data, training an algorithm and generating a machine learning model, and then making and refining predictions.
Preparing the data
Machine learning requires data that is analyzed, formatted and conditioned to build a machine learning model. Judith Hurwitz and Daniel Kirsch, authors of Machine Learning For Dummies, advise that “machine learning requires the right set of data that can be applied to a learning process.” Data preparation typically involves these tasks:
- Select a sample subset of data. Make and track assumptions about the data to select attributes germane to the problem you want the algorithm to train for or solve. For example, filter or focus on types of product or customer data and eliminate data about where a product was manufactured.
- Merge or join data sets to aggregate records. Merging simplifies the data and makes it easier to manage. For example, a customer data set and a customer purchases data set could be joined and condensed into a single, simpler attribute, such as each customer's spending on a product.
- Format and sort the data for modeling. Choose the format, for example a flat file or a relational database. Certain algorithms may require data to be sorted in a specific way. For example, fields for customers may be grouped by where the customer purchased or where they live. These textual location fields may need to be encoded as numbers and sorted numerically.
- Clean the data by removing or replacing any blank or missing values. Statistical analysis tools can help inspect the data for errors and deviations. The goal is to ensure that the data is accurate, complete and relevant.
- Normalize the data, that is, adjust values that are measured on different scales to a common scale. For example, one data set may record scores as raw numbers and another as percentages. To compare them, the values must first be converted to the same scale.
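The five tasks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (customer_id, city, factory, satisfaction_pct, amount) and the choice of mean imputation and min-max scaling are all invented for the example, not prescribed by the process itself.

```python
from statistics import mean

# Hypothetical raw records; every field name and value is invented for illustration.
customers = [
    {"customer_id": 1, "city": "Austin", "factory": "Plant A", "satisfaction_pct": 80.0},
    {"customer_id": 2, "city": "Boston", "factory": "Plant B", "satisfaction_pct": None},
    {"customer_id": 3, "city": "Austin", "factory": "Plant A", "satisfaction_pct": 60.0},
]
purchases = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
    {"customer_id": 3, "amount": 300.0},
]

def prepare(customers, purchases):
    # 1. Select: keep only attributes relevant to the problem (drop "factory").
    rows = [{"customer_id": c["customer_id"], "city": c["city"],
             "satisfaction_pct": c["satisfaction_pct"]} for c in customers]

    # 2. Merge: condense the purchases into one total-spending attribute per customer.
    totals = {}
    for p in purchases:
        totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]
    for r in rows:
        r["total_spend"] = totals.get(r["customer_id"], 0.0)

    # 3. Format: give each textual location a number, then sort by it.
    city_codes = {city: i for i, city in enumerate(sorted({r["city"] for r in rows}))}
    for r in rows:
        r["city_code"] = city_codes[r["city"]]
    rows.sort(key=lambda r: (r["city_code"], r["customer_id"]))

    # 4. Clean: replace missing satisfaction values with the mean of the known ones.
    known = [r["satisfaction_pct"] for r in rows if r["satisfaction_pct"] is not None]
    fill = mean(known)
    for r in rows:
        if r["satisfaction_pct"] is None:
            r["satisfaction_pct"] = fill

    # 5. Normalize: rescale both numeric attributes to the common range 0..1.
    for key in ("total_spend", "satisfaction_pct"):
        lo = min(r[key] for r in rows)
        hi = max(r[key] for r in rows)
        span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
        for r in rows:
            r[key + "_scaled"] = (r[key] - lo) / span
    return rows

prepared = prepare(customers, purchases)
```

After this runs, each row carries a numeric city code, a filled-in satisfaction value, and two scaled attributes that can be compared directly. Replacing missing values with the mean is only one option; removing incomplete rows is another.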
Training the algorithm
Machine learning uses the prepared data to train a machine learning algorithm. An algorithm is a computerized procedure or recipe. When the algorithm is trained on the data, a machine learning model is generated. Selecting the right algorithm is essential to applying machine learning successfully. Selection is largely influenced by the application and the data available. But there are some commonly used algorithms and applications:
- Regression algorithms
Linear and logistic regression are examples of regression algorithms used to understand relationships in data. Linear regression predicts the value of a dependent variable from the values of one or more independent variables. Logistic regression can be used when the dependent variable is binary in nature: A or B. With linear regression, for example, a salesperson’s annual sales (the dependent variable) can be estimated from its relationship to education or years of experience (the independent variables).
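As a minimal sketch of the linear case, the code below fits a straight line by ordinary least squares with a single independent variable. The sales figures are hypothetical, and the names fit_line and predict_sales are introduced here only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the shared 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data: years of experience and annual sales (in $1,000s).
years = [1, 2, 3, 4, 5]
sales = [110, 135, 155, 185, 205]

intercept, slope = fit_line(years, sales)

def predict_sales(years_of_experience):
    """Use the fitted line to estimate sales for a new salesperson."""
    return intercept + slope * years_of_experience
```

With this made-up data the fitted line has a slope of 24 and an intercept of 86, so predict_sales(6) estimates 230. A real model would likely use several independent variables and a library implementation rather than this hand-rolled fit.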
- Decision trees
Decision trees use classification to make recommendations based on a set of decision rules. For example, to decide whether to bet on a horse to win, place or show, a decision tree could take data about the horse (age, winning percentage, pedigree) and apply rules to those factors to recommend an action or decision.
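To make the idea of decision rules concrete, here is a small sketch that applies a hand-written tree to a horse's data. It shows only how a tree is used once it exists; a learning algorithm would normally derive the rules from training data. The factors chosen (win_pct, age), the thresholds and the recommendations are all invented for illustration.

```python
# A hand-written tree. Internal nodes test one factor against a threshold;
# leaves hold the recommended bet. All thresholds are invented for illustration.
TREE = {
    "factor": "win_pct", "threshold": 0.30,
    "below": {
        "factor": "age", "threshold": 6,
        "below": {"recommend": "show"},
        "at_or_above": {"recommend": "no bet"},
    },
    "at_or_above": {
        "factor": "age", "threshold": 6,
        "below": {"recommend": "win"},
        "at_or_above": {"recommend": "place"},
    },
}

def recommend(horse, node=TREE):
    """Walk from the root to a leaf, following the rule at each node."""
    while "recommend" not in node:
        branch = "below" if horse[node["factor"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["recommend"]
```

For instance, recommend({"age": 4, "win_pct": 0.45}) follows the high-winning-percentage branch and then the young-horse branch, returning "win".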
- Instance-based algorithms
A good example of an instance-based algorithm is k-nearest neighbors, or k-NN. It classifies a data point by estimating how likely it is to be a member of one group or another based on its proximity to other data points whose groups are already known.
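A minimal k-NN sketch follows, assuming two-dimensional points, Euclidean distance and a simple majority vote. The training points and the group labels "A" and "B" are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(training, point, k=3):
    """Return (label, share of the k nearest neighbors that carry that label)."""
    # Sort the labeled points by distance to the new point and keep the closest k.
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)

# Hypothetical labeled data: ((x, y), group).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
```

Calling knn_predict(training, (2.0, 2.0)) returns group "A" with a share of 1.0, because all three nearest neighbors belong to A. For the in-between point (4.0, 4.0), two of the three nearest neighbors are in A, so the share drops to about 0.67, which is a rough measure of how likely the membership is.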
- Clustering algorithms
Think of clusters as groups. Clustering focuses on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without prior knowledge about the groups and their characteristics. Types of clustering algorithms include K-means, TwoStep and Kohonen clustering.
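The sketch below illustrates K-means, the first of these, in plain Python. It assumes Euclidean distance and, for simplicity, uses the first k points as starting centroids; production implementations typically choose starting points more carefully. The sample points are hypothetical.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = list(points[:k])  # simplistic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical unlabeled records; no group information is supplied.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

Here the algorithm is told only how many groups to look for. It labels the three points near the origin as group 0 and the three points near (8.5, 8.5) as group 1 without any prior knowledge of those groups.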
Predicting and refining
Once the data is prepared and the algorithm trained, the machine learning model can make determinations or predictions about the data on its own. For example:
Consider a data set that has two basic values for cars: weight and speed. Values can be plotted on a graph that shows light cars tend to be fast and heavy cars tend to be slow.
When the machine learning model is provided with data about cars, it uses the algorithm to determine or predict whether a car will tend to be fast or slow, or light or heavy. It does this without explicit human intervention. In general, the more data provided, the more the model learns and the more accurate its predictions become.
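The car example can be sketched as a very small model that learns a single weight cutoff from labeled examples and then predicts "fast" or "slow" for new cars. The weights, the labels and the midpoint rule are assumptions chosen for illustration; a real model would be trained with one of the algorithms described earlier.

```python
from statistics import mean

def train(cars):
    """Learn one weight cutoff: the midpoint between the average fast car
    and the average slow car."""
    fast = [weight for weight, label in cars if label == "fast"]
    slow = [weight for weight, label in cars if label == "slow"]
    return (mean(fast) + mean(slow)) / 2

def predict(cutoff, weight):
    """Light cars (below the cutoff) are predicted fast, heavy cars slow."""
    return "fast" if weight < cutoff else "slow"

# Hypothetical training data: (weight in kg, observed label).
cars = [(900, "fast"), (1100, "fast"), (1200, "fast"),
        (1800, "slow"), (2000, "slow"), (2300, "slow")]
cutoff = train(cars)
```

With this data the learned cutoff is about 1,550 kg, so predict(cutoff, 1000) returns "fast" and predict(cutoff, 2100) returns "slow". Refining the model amounts to calling train again with more records, which moves the cutoff to reflect the additional data.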