Machine learning works through mathematical logic. The relevant characteristics (or "features") of each data point must therefore be expressed numerically, so that the data itself can be fed into a mathematical algorithm that will "learn" to map a given input to the desired output.
Data points in machine learning are usually represented in vector form, in which each element (or dimension) of a data point’s vector embedding corresponds to its numerical value for a specific feature. For data modalities that are inherently numerical, such as financial data or geospatial coordinates, this is relatively straightforward. But many data modalities, such as text, images, social media graph data or app user behaviors, are not inherently numerical, and therefore entail less immediately intuitive feature engineering to be expressed in an ML-ready way.
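As a rough illustration of expressing non-numerical data numerically, the sketch below turns a made-up app-user record into a vector using one-hot encoding. One-hot encoding is only one of several possible encodings, and the field names (`sessions_per_week`, `device`) and the category list are hypothetical, chosen purely for this example.

```python
# Hypothetical set of categories for a "device" feature (assumed for illustration).
DEVICES = ["desktop", "mobile", "tablet"]

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]

def to_feature_vector(record):
    """Concatenate the numerical feature with the one-hot encoded category."""
    numeric = [float(record["sessions_per_week"])]
    return numeric + one_hot(record["device"], DEVICES)

# A made-up user record mixing a numerical and a categorical feature.
user = {"sessions_per_week": 5, "device": "mobile"}
vector = to_feature_vector(user)
print(vector)  # [5.0, 0.0, 1.0, 0.0]
```

Each category gets its own dimension, so the resulting vector has one element for the numerical feature plus one per possible device.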
The (often manual) process of choosing which aspects of data to use in machine learning algorithms is called feature selection. Feature extraction techniques refine data down to only its most relevant, meaningful dimensions. Both are subsets of feature engineering, the broader discipline of preprocessing raw data for use in machine learning. One notable distinction of deep learning is that it typically operates on raw data and automates much of the feature engineering—or at least the feature extraction—process. This makes deep learning more scalable, albeit less interpretable, than traditional machine learning.
Machine learning model parameters and optimization
For a practical example, consider a simple linear regression algorithm for predicting home sale prices based on a weighted combination of three variables: square footage, age of house and number of bedrooms. Each house is represented as a vector embedding with 3 dimensions: [square footage, bedrooms, age]. A 30-year-old house with 4 bedrooms and 1900 square feet could be represented as [1900, 4, 30] (though for mathematical purposes those numbers might first be scaled, or normalized, to a more uniform range).
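The scaling step mentioned above can be sketched as follows. This is a minimal example that assumes min-max scaling, which is one common choice among several. Only the first house comes from the text; the other two rows are invented so that each column has a range to scale against.

```python
# Each house is [square footage, bedrooms, age].
# The first row is the example from the text; the other two are made up.
houses = [
    [1900.0, 4.0, 30.0],
    [1200.0, 2.0, 10.0],
    [2600.0, 5.0, 50.0],
]

def min_max_scale(rows):
    """Rescale each column to the range [0, 1] using that column's min and max."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]

scaled = min_max_scale(houses)
print(scaled[0])  # roughly [0.5, 0.667, 0.5]
```

After scaling, square footage no longer dominates the other features simply because its raw values are larger.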
The algorithm is a straightforward mathematical function:
Price = (A * square footage) + (B * number of bedrooms) – (C * age) + Base Price
Here, A, B and C are the model parameters: adjusting them will adjust how heavily the model weighs each variable. The goal of machine learning is to find the optimal values for such model parameters: in other words, the parameter values that result in the overall function outputting the most accurate results. While most real-world instances of machine learning involve more complex algorithms with a greater number of input variables, the principle remains the same: optimizing the algorithm's adjustable parameters to yield greater accuracy.
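To make the idea of optimizing parameters concrete, here is a minimal sketch that fits the price function with plain gradient descent on the mean squared error. The data is invented for illustration, the features are pre-scaled (square footage in thousands, age in decades, price in thousands of dollars), and the base price is treated as a fourth adjustable parameter. Gradient descent is an assumed choice of optimizer; linear regression can also be solved in closed form.

```python
# Toy data, invented for illustration:
# ([sq ft in thousands, bedrooms, age in decades], price in $1000s)
DATA = [
    ([1.9, 4.0, 3.0], 410.0),
    ([1.2, 2.0, 1.0], 280.0),
    ([2.6, 5.0, 5.0], 500.0),
    ([1.5, 3.0, 2.0], 330.0),
    ([2.2, 4.0, 0.5], 490.0),
]

def predict_price(house, a, b, c, base):
    """Price = (A * square footage) + (B * bedrooms) - (C * age) + base price."""
    sqft, bedrooms, age = house
    return a * sqft + b * bedrooms - c * age + base

def mean_squared_error(params):
    """Average squared gap between predicted and actual prices."""
    return sum((predict_price(h, *params) - p) ** 2 for h, p in DATA) / len(DATA)

def fit(steps=5000, learning_rate=0.01):
    """Gradient descent on the mean squared error, starting from all zeros."""
    a = b = c = base = 0.0
    n = len(DATA)
    for _ in range(steps):
        grad_a = grad_b = grad_c = grad_base = 0.0
        for house, price in DATA:
            sqft, bedrooms, age = house
            error = predict_price(house, a, b, c, base) - price
            grad_a += 2 * error * sqft / n
            grad_b += 2 * error * bedrooms / n
            grad_c += 2 * error * (-age) / n  # age enters the formula with a minus sign
            grad_base += 2 * error / n
        # Nudge each parameter in the direction that reduces the error.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
        base -= learning_rate * grad_base
    return a, b, c, base

initial_error = mean_squared_error((0.0, 0.0, 0.0, 0.0))
fitted = fit()
print("error before fitting:", initial_error)
print("error after fitting: ", mean_squared_error(fitted))
print("fitted A, B, C, base:", fitted)
```

The error after fitting should be far lower than the error with all parameters at zero, which is what "learning" means in this setting: repeatedly adjusting A, B, C and the base price so that predictions move closer to the observed prices.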