Lesson 6: Supervised Learning Models
Supervised learning is the most widely-used machine learning paradigm. In this lesson, you'll master key algorithms including linear regression, logistic regression, decision trees, and random forestsβeach suited to different problem types.
Linear Regression
Linear regression is the simplest supervised learning algorithm. It fits a straight line through data to predict continuous values. The model assumes a linear relationship between input features and output.
When to use: Predicting continuous values like house prices, temperature, or sales. Best when relationship is roughly linear.
// Linear Regression with ML.NET
var mlContext = new MLContext();
var data = mlContext.Data.LoadFromTextFile("prices.csv", hasHeader: true);
var splitData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
.Append(mlContext.Regression.Trainers.Sdca());
var model = pipeline.Fit(splitData.TrainSet);
var predictions = model.Transform(splitData.TestSet);
var metrics = mlContext.Regression.Evaluate(predictions);
Console.WriteLine($"RΒ²: {metrics.RSquared:F4}"); // Higher is better
Console.WriteLine($"RMSE: {metrics.RootMeanSquaredError:F4}"); // Lower is better
Logistic Regression
Logistic regression extends linear regression for binary classification. Instead of predicting a value, it predicts the probability of a class (0 or 1). The output is transformed by a sigmoid function to produce probabilities between 0 and 1.
When to use: Binary classification like spam detection, disease diagnosis, or churn prediction.
// Binary Classification with Logistic Regression
var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
.Append(mlContext.BinaryClassification.Trainers
.SdcaLogisticRegression("Label", "Features"));
var model = pipeline.Fit(splitData.TrainSet);
var predictions = model.Transform(splitData.TestSet);
var metrics = mlContext.BinaryClassification.Evaluate(predictions);
Console.WriteLine($"Accuracy: {metrics.Accuracy:F4}");
Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:F4}"); // 1.0 = Perfect
Decision Trees
Decision trees learn by recursively splitting data on features that best separate classes. They create a tree-like structure of yes/no questions to make predictions. Easy to understand and visualize.
Advantages: Interpretable, handles non-linear relationships, requires little preprocessing
Disadvantages: Prone to overfitting, can be unstable
// Decision Tree Classification
var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
.Append(mlContext.MulticlassClassification.Trainers
.LightGbm("Label", "Features", numberOfLeaves: 4, minimumExampleCountPerLeaf: 2));
var model = pipeline.Fit(splitData.TrainSet);
Random Forests
Random forests combine predictions from many decision trees. Each tree is trained on a random subset of data and features. Final prediction is the average (regression) or majority vote (classification) from all trees.
Advantages: More stable than single trees, handles non-linear data, reduces overfitting, provides feature importance
Best for: Most general problems. Often the first choice for tabular data.
// Random Forest with ML.NET (via LightGBM)
var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
.Append(mlContext.BinaryClassification.Trainers.FastForest(
labelColumnName: "Label",
featureColumnName: "Features",
numberOfTrees: 100,
numberOfLeaves: 20));
var model = pipeline.Fit(splitData.TrainSet);
Comparing Supervised Algorithms
Linear Regression
Speed: Very Fast | Accuracy: Medium | Interpretability: Excellent | Non-linearity: No
Logistic Regression
Speed: Very Fast | Accuracy: Medium | Interpretability: Good | Probability Scores: Yes
Decision Trees
Speed: Fast | Accuracy: Medium-High | Interpretability: Excellent | Overfitting Risk: High
Random Forests
Speed: Medium | Accuracy: High | Interpretability: Medium | Scalability: Good
Evaluating Classification Models
Different metrics evaluate different aspects of classification performance:
- Accuracy: Overall correct predictions. Good for balanced datasets.
- Precision: Of predicted positives, how many are actually correct? Important when false positives are costly.
- Recall (Sensitivity): Of actual positives, how many did we find? Important when false negatives are costly.
- F1-Score: Harmonic mean of precision and recall. Good for imbalanced datasets.
- AUC-ROC: Area under the ROC curve. Measures performance across all thresholds (1.0 = perfect).
// Evaluating classification performance
var metrics = mlContext.BinaryClassification.Evaluate(predictions);
Console.WriteLine($"Accuracy: {metrics.Accuracy:F4}");
Console.WriteLine($"Precision: {metrics.PositivePrecision:F4}");
Console.WriteLine($"Recall: {metrics.PositiveRecall:F4}");
Console.WriteLine($"F1: {(2 * metrics.PositivePrecision * metrics.PositiveRecall) / (metrics.PositivePrecision + metrics.PositiveRecall):F4}");
Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:F4}");
π§ Quick Check β Lesson 6
Which algorithm is best suited for predicting a house price given square footage and location?
π§ Quick Check β Lesson 6
What is the main advantage of Random Forests over a single Decision Tree?
Lesson Summary
Linear Regression predicts continuous values assuming linear relationships. Fast and interpretable but limited to linear patterns.
Logistic Regression predicts class probabilities for binary classification via sigmoid transformation.
Decision Trees recursively split on features to create interpretable rules but prone to overfitting.
Random Forests combine multiple trees for better stability and accuracy. Often the best choice for tabular data.
Use appropriate evaluation metrics: accuracy, precision, recall, F1-score, and AUC-ROC depending on your problem.