Supervised learning techniques use a labeled training dataset to learn the relationships between inputs and outputs. Data scientists manually create ground truth training datasets containing input data paired with the corresponding labels. Supervised learning trains the model to predict the correct outputs for unseen data in real-world use cases.
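As a minimal sketch of this workflow (the text names no specific library, so scikit-learn here is an illustrative assumption), the snippet below fits a classifier on labeled examples and then predicts a label for an unseen input:

```python
from sklearn.linear_model import LogisticRegression

# Labeled training data: inputs X paired with ground-truth labels y
X_train = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y_train = [1, 0, 1, 0]

# Training learns the relationship between inputs and labels
model = LogisticRegression()
model.fit(X_train, y_train)

# The trained model predicts a label for unseen input data
print(model.predict([[0.9, 0.1]]))
```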
During training, the model’s algorithm processes large datasets to explore potential correlations between inputs and outputs. Model performance is then evaluated on held-out test data to determine whether training was successful. Cross-validation extends this check by repeatedly splitting the dataset so that each portion takes a turn as the test set.
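One way to see this in practice, again assuming scikit-learn as the toolkit, is k-fold cross-validation, which trains and evaluates the model on several different splits of the same dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the test set
# while the model trains on the remaining four folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across the five folds
```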
The gradient descent family of algorithms, including stochastic gradient descent (SGD), is the most commonly used class of optimization algorithms, or learning algorithms, for training neural networks and other machine learning models. The model’s optimization algorithm assesses accuracy through the loss function: a function that measures the discrepancy between the model’s predictions and actual values.
The gradient of the loss function indicates the direction in which the model’s parameters should be adjusted to reduce error. Throughout training, the optimization algorithm updates the model’s parameters, its operating rules or “settings,” to minimize the loss.
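To make the update rule concrete, here is a minimal sketch of gradient descent on a one-parameter linear model. The mean-squared-error loss, learning rate, and toy data are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Toy data generated from y = 3x; training should recover w close to 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 6.0, 9.0, 12.0])

w = 0.0              # initial parameter ("setting")
learning_rate = 0.01

for step in range(200):
    predictions = w * x
    # Loss: mean squared discrepancy between predictions and actual values
    loss = np.mean((predictions - y) ** 2)
    # The gradient of the loss with respect to w points uphill;
    # stepping against it adjusts w in the direction that reduces error
    gradient = np.mean(2 * (predictions - y) * x)
    w -= learning_rate * gradient

print(w)  # converges toward 3.0
```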
Because large datasets typically contain many features, data scientists often simplify them through dimensionality reduction. This data science technique reduces the number of features to those most crucial for predicting the data labels, preserving accuracy while increasing efficiency.
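The text does not name a specific method, so principal component analysis (PCA) serves here as one common illustration: it projects the original features onto a smaller set of components that retain most of the variance in the data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): four original features

# Reduce to the two components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```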