Lesson 2: Fundamentals of Machine Learning
Every successful machine learning project follows a consistent pipeline. In this lesson, you'll learn how ML systems are built from start to finish, including data preparation, model training, evaluation, and deployment considerations.
The Machine Learning Pipeline
A machine learning project isn't just about algorithms. It's a systematic process with six stages: data collection, preprocessing, splitting, training, evaluation, and deployment. Understanding each stage helps your models perform well in production.
Train-Test Split: A Critical Step
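Never evaluate your model on the same data you trained it on. Doing so rewards memorization rather than generalization and leads to overly optimistic performance estimates. The same problem arises, more subtly, whenever information from the test set leaks into training (data leakage).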
The golden rule: Split your data into at least two parts:
- Training Set (70-80%): Used to train the model
- Test Set (20-30%): Used only for evaluation—the model never sees this data during training
When you also need to tune hyperparameters or compare models, and you have enough data, use three sets:
- Training Set (60%): Train the model
- Validation Set (20%): Tune hyperparameters and select the best model
- Test Set (20%): Final performance evaluation
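// Train-Test Split with ML.NET
using Microsoft.ML;

var mlContext = new MLContext(seed: 0); // fixed seed makes the splits reproducible
// ModelInput is your own class describing the columns.
// LoadFromTextFile defaults to tab-separated input, so a CSV needs separatorChar: ','.
var data = mlContext.Data.LoadFromTextFile<ModelInput>("data.csv", separatorChar: ',', hasHeader: true);

// 80-20 split
var splitData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
var trainingData = splitData.TrainSet;
var testData = splitData.TestSet;

// 60-20-20 split with validation: hold out 40%, then cut the held-out part in half
var trainTestData = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);
var trainData = trainTestData.TrainSet;
var tempData = trainTestData.TestSet;
var valTestData = mlContext.Data.TrainTestSplit(tempData, testFraction: 0.5);
var validationData = valTestData.TrainSet; // for tuning and model selection
var finalTestData = valTestData.TestSet;   // used only for the final evaluation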
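To see what a split does conceptually, the sketch below shuffles row indices and cuts them into three groups in plain C#. `SplitIndices` is a hypothetical helper written for illustration, not an ML.NET API, and it makes no claim about how `TrainTestSplit` works internally.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of ML.NET): shuffles row indices with a fixed
// seed, then cuts them into train/validation/test index sets.
static (int[] Train, int[] Validation, int[] Test) SplitIndices(
    int rowCount, double validationFraction, double testFraction, int seed)
{
    if (rowCount < 0) throw new ArgumentOutOfRangeException(nameof(rowCount));
    if (validationFraction < 0 || testFraction < 0 || validationFraction + testFraction >= 1)
        throw new ArgumentException("Fractions must be non-negative and sum to less than 1.");

    // Fisher-Yates shuffle so the split does not depend on the original row order.
    var indices = Enumerable.Range(0, rowCount).ToArray();
    var random = new Random(seed);
    for (int i = indices.Length - 1; i > 0; i--)
    {
        int j = random.Next(i + 1);
        (indices[i], indices[j]) = (indices[j], indices[i]);
    }

    int testCount = (int)Math.Round(rowCount * testFraction);
    int validationCount = (int)Math.Round(rowCount * validationFraction);
    int trainCount = rowCount - testCount - validationCount;

    return (
        indices.Take(trainCount).ToArray(),
        indices.Skip(trainCount).Take(validationCount).ToArray(),
        indices.Skip(trainCount + validationCount).ToArray());
}

// 60-20-20 split of 1,000 rows.
var split = SplitIndices(rowCount: 1000, validationFraction: 0.2, testFraction: 0.2, seed: 42);
Console.WriteLine($"Train: {split.Train.Length}, Validation: {split.Validation.Length}, Test: {split.Test.Length}");
```

Because every index lands in exactly one group, no row can be used for both training and evaluation. The fixed seed is a design choice that keeps the split repeatable between runs.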
Understanding Overfitting and Underfitting
The balance between model complexity and generalization is critical:
Underfitting
Problem: Model is too simple to capture patterns. Poor performance on both training and test data.
Solution: Use a more complex model, add more informative features, or train longer.
Good Fit
Sweet spot: Model generalizes well. Good performance on both training and test data.
Goal: Achieve this balance in all projects.
Overfitting
Problem: Model memorizes training data. Great on training set, poor on test set.
Solution: Simplify the model, add regularization, or collect more data.
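The three cases above can be told apart by comparing training error with test error. The following is a minimal sketch: `DiagnoseFit` is a hypothetical helper, and both thresholds (`acceptableError`, `maxGap`) are illustrative assumptions that you would choose per problem.

```csharp
using System;

// Hypothetical helper: classifies the fit from two error values (lower is better).
// Both thresholds are illustrative assumptions, not universal constants.
static string DiagnoseFit(double trainError, double testError,
                          double acceptableError, double maxGap)
{
    // High error even on the training data: the model is too simple.
    if (trainError > acceptableError)
        return "underfitting";

    // Low training error but a much higher test error: the model memorized.
    if (testError - trainError > maxGap)
        return "overfitting";

    return "good fit";
}

Console.WriteLine(DiagnoseFit(trainError: 0.40, testError: 0.42, acceptableError: 0.15, maxGap: 0.05)); // underfitting
Console.WriteLine(DiagnoseFit(trainError: 0.02, testError: 0.30, acceptableError: 0.15, maxGap: 0.05)); // overfitting
Console.WriteLine(DiagnoseFit(trainError: 0.10, testError: 0.12, acceptableError: 0.15, maxGap: 0.05)); // good fit
```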
The Bias-Variance Tradeoff
Bias and variance are two sources of error in machine learning models:
- Bias: Error from overly simple or incorrect assumptions about the data. High bias leads to underfitting. Simple models tend to have high bias.
- Variance: Error from sensitivity to fluctuations in the training data. High variance leads to overfitting. Complex models tend to have high variance.
The goal is to minimize total error by balancing bias and variance. Reducing one typically increases the other, so the best model sits between the two extremes.
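For squared-error loss, this balance is commonly written as the standard bias-variance decomposition. The notation is introduced here and does not come from the lesson: $f$ is the true function, $\hat{f}$ is the model fitted on a randomly drawn training set, and $\sigma^2$ is the variance of the noise in $y = f(x) + \varepsilon$.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The last term does not depend on the model, so only the first two can be traded against each other.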
Key Metrics for Model Evaluation
Different problems require different metrics:
Regression
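R² Score: Proportion of variance in the target explained by the model (at most 1, higher is better; can be negative on test data when the model does worse than predicting the mean)
RMSE: Root mean squared error, in the same units as the target; penalizes large errors more heavily than MAE (lower is better)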
MAE: Mean absolute error (lower is better)
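To make the three regression metrics concrete, here is a minimal plain-C# sketch that computes them by hand. `RegressionMetrics` is a hypothetical helper name and the sample values are made up for illustration.

```csharp
using System;
using System.Linq;

// Illustrative helper (hypothetical name): computes R², RMSE, and MAE
// from actual and predicted values of equal, non-zero length.
static (double RSquared, double Rmse, double Mae) RegressionMetrics(
    double[] actual, double[] predicted)
{
    if (actual.Length == 0 || actual.Length != predicted.Length)
        throw new ArgumentException("Arrays must be non-empty and the same length.");

    double[] errors = actual.Zip(predicted, (a, p) => a - p).ToArray();
    double mean = actual.Average();

    double sumSquaredErrors = errors.Sum(e => e * e);
    double totalSumSquares = actual.Sum(a => (a - mean) * (a - mean));

    double mae = errors.Average(e => Math.Abs(e));
    double rmse = Math.Sqrt(sumSquaredErrors / errors.Length);
    // R² = 1 - SSE / SST; negative when the model is worse than predicting the mean.
    // Not meaningful if all actual values are identical (SST is zero).
    double rSquared = 1.0 - sumSquaredErrors / totalSumSquares;

    return (rSquared, rmse, mae);
}

// Made-up sample values for illustration.
var yTrue = new[] { 3.0, 5.0, 7.0, 9.0 };
var yPred = new[] { 2.0, 5.0, 8.0, 9.0 };
var result = RegressionMetrics(yTrue, yPred);
Console.WriteLine($"R²: {result.RSquared:F4}, RMSE: {result.Rmse:F4}, MAE: {result.Mae:F4}");
```

For these four points, the errors are 1, 0, -1, and 0, which gives R² = 0.9, RMSE ≈ 0.7071, and MAE = 0.5.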
Classification
Accuracy: % correct predictions
Precision: True positives / all predicted positives
Recall: True positives / actual positives
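The classification metrics above all follow from the four confusion-matrix counts. The sketch below computes them in plain C#. `ClassificationMetrics` is a hypothetical helper, the counts are made up, and returning 0 when a denominator is zero is a convention chosen here rather than a standard.

```csharp
using System;

// Illustrative helper (hypothetical name): derives the three metrics from
// the four confusion-matrix counts of a binary classifier.
static (double Accuracy, double Precision, double Recall) ClassificationMetrics(
    int truePositives, int falsePositives, int trueNegatives, int falseNegatives)
{
    int total = truePositives + falsePositives + trueNegatives + falseNegatives;
    int predictedPositives = truePositives + falsePositives;
    int actualPositives = truePositives + falseNegatives;

    // Guard against division by zero; returning 0 is a convention chosen for this sketch.
    double accuracy = total == 0 ? 0 : (double)(truePositives + trueNegatives) / total;
    double precision = predictedPositives == 0 ? 0 : (double)truePositives / predictedPositives;
    double recall = actualPositives == 0 ? 0 : (double)truePositives / actualPositives;

    return (accuracy, precision, recall);
}

// Made-up example: 1,000 emails, 100 of them spam; the filter flags 90, and 80 of those are spam.
var scores = ClassificationMetrics(truePositives: 80, falsePositives: 10, trueNegatives: 890, falseNegatives: 20);
Console.WriteLine($"Accuracy: {scores.Accuracy:F4}, Precision: {scores.Precision:F4}, Recall: {scores.Recall:F4}");
```

Here accuracy is 970/1000 = 0.97, precision is 80/90 ≈ 0.8889, and recall is 80/100 = 0.8. The gap between accuracy and recall shows why accuracy alone can mislead when one class is rare.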
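// Evaluating a regression model
// regressionModel is assumed to be an already trained ITransformer; the label column is assumed to be named "Label"
var regressionPredictions = regressionModel.Transform(testData);
var regressionMetrics = mlContext.Regression.Evaluate(regressionPredictions);
Console.WriteLine($"R²: {regressionMetrics.RSquared:F4}");
Console.WriteLine($"RMSE: {regressionMetrics.RootMeanSquaredError:F4}");
Console.WriteLine($"MAE: {regressionMetrics.MeanAbsoluteError:F4}");

// Evaluating a binary classification model
// classificationModel is assumed to be a trained, calibrated model that outputs a Probability column
var classificationPredictions = classificationModel.Transform(testData);
var classificationMetrics = mlContext.BinaryClassification.Evaluate(classificationPredictions);
Console.WriteLine($"Accuracy: {classificationMetrics.Accuracy:F4}");
Console.WriteLine($"Precision: {classificationMetrics.PositivePrecision:F4}");
Console.WriteLine($"Recall: {classificationMetrics.PositiveRecall:F4}");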
🧠 Quick Check — Lesson 2
Why is it important to separate training and test data?
A model performs excellently on training data but poorly on test data. What is the problem?
Lesson Summary
The ML Pipeline has 6 stages: data collection, preprocessing, splitting, training, evaluation, and deployment.
Train-Test Split prevents data leakage. Always evaluate on data the model never saw during training.
Underfitting: Model too simple → poor performance. Overfitting: Model too complex → memorizes noise.
Bias-Variance Tradeoff: Balance between model simplicity and complexity. Goal is to minimize total error.
Choose evaluation metrics appropriate for your problem: R², RMSE, MAE for regression; Accuracy, Precision, Recall for classification.