Lesson 2: Fundamentals of Machine Learning

Every successful machine learning project follows a consistent pipeline. In this lesson, you'll learn how ML systems are built from start to finish, including data preparation, model training, evaluation, and deployment considerations.

The Machine Learning Pipeline

A machine learning project isn't just about algorithms; it's a systematic process with multiple stages. Understanding each stage helps you build models that hold up in production.

Complete ML Pipeline

1. Data Collection: Gather raw data from various sources
2. Data Preprocessing: Clean, normalize, and engineer features
3. Data Splitting: Divide into training, validation, and test sets
4. Model Training: Train model on training data
5. Model Evaluation: Tune hyperparameters on the validation set, then test on unseen data
6. Deployment: Deploy to production and monitor performance

Train-Test Split: A Critical Step

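Never evaluate your model on the same data you trained it on. A model can score well there simply by memorizing examples, so the result is an overly optimistic estimate of how it will perform on new data. More generally, letting test data influence training in any way is a form of data leakage.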

The golden rule: Split your data into at least two parts:

Training Set (70-80%): Used to train the model
Test Set (20-30%): Used only for evaluation—the model never sees this data during training

When you also tune hyperparameters or compare models, and have enough data, use three sets:

Training Set (60%): Train the model
Validation Set (20%): Tune hyperparameters and select the best model
Test Set (20%): Final performance evaluation

// Train-Test Split with ML.NET
using Microsoft.ML;

// ModelInput is your own class describing the columns in data.csv
var mlContext = new MLContext(seed: 0); // fixed seed makes the shuffle reproducible
var data = mlContext.Data.LoadFromTextFile<ModelInput>(
    "data.csv", separatorChar: ',', hasHeader: true); // the default separator is tab

// 80-20 split
var splitData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
var trainingData = splitData.TrainSet;
var testData = splitData.TestSet;

// 60-20-20 split with validation: hold out 40%, then halve the holdout
var trainTestData = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);
var trainData = trainTestData.TrainSet;
var tempData = trainTestData.TestSet;
var valTestData = mlContext.Data.TrainTestSplit(tempData, testFraction: 0.5);
var validationData = valTestData.TrainSet;
var finalTestData = valTestData.TestSet; // renamed: testData is already declared above
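
To make the mechanics concrete, here is a minimal sketch in plain C# (no ML.NET dependency) of what a shuffled 60-20-20 split does. The `SplitSketch` class and its `ThreeWaySplit` method are hypothetical names used for illustration, not part of ML.NET, and the fixed seed is an assumption made so the split is reproducible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not an ML.NET API.
public static class SplitSketch
{
    // Shuffles a copy of the rows with a seeded Fisher-Yates shuffle,
    // then slices it into train / validation / test partitions.
    public static (List<T> Train, List<T> Validation, List<T> Test) ThreeWaySplit<T>(
        IReadOnlyList<T> rows, double validationFraction, double testFraction, int seed)
    {
        if (validationFraction < 0 || testFraction < 0 ||
            validationFraction + testFraction >= 1)
        {
            throw new ArgumentException(
                "Fractions must be non-negative and sum to less than 1.");
        }

        var shuffled = rows.ToList();
        var rng = new Random(seed);
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // random index with 0 <= j <= i
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        int testCount = (int)Math.Round(shuffled.Count * testFraction);
        int validationCount = (int)Math.Round(shuffled.Count * validationFraction);
        int trainCount = shuffled.Count - validationCount - testCount;

        // Each row lands in exactly one partition.
        var train = shuffled.Take(trainCount).ToList();
        var validation = shuffled.Skip(trainCount).Take(validationCount).ToList();
        var test = shuffled.Skip(trainCount + validationCount).ToList();
        return (train, validation, test);
    }
}
```

For example, `SplitSketch.ThreeWaySplit(Enumerable.Range(0, 100).ToList(), 0.2, 0.2, seed: 42)` returns partitions of 60, 20, and 20 rows, with no row appearing in more than one partition.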

Understanding Overfitting and Underfitting

The balance between model complexity and generalization is critical:

Underfitting

Problem: Model is too simple to capture patterns. Poor performance on both training and test data.

Solution: Use a more complex model, add features, or train longer.

Good Fit

Sweet spot: Model generalizes well. Good performance on both training and test data.

Goal: Achieve this balance in all projects.

Overfitting

Problem: Model memorizes training data. Great on training set, poor on test set.

Solution: Simplify the model, add regularization, or collect more data.
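
The three cases above can be told apart by comparing training error with test error. The sketch below is a rough heuristic, not a standard algorithm: `FitDiagnosis`, `FitStatus`, and both thresholds are hypothetical, and the caller has to pick threshold values that make sense for the problem at hand.

```csharp
using System;

public enum FitStatus { Underfitting, GoodFit, Overfitting }

// Hypothetical helper for illustration only.
public static class FitDiagnosis
{
    // trainError / testError: any "lower is better" metric, such as RMSE.
    // acceptableError: the largest error you would consider good enough.
    // maxGap: the largest (test - train) gap you tolerate before
    //         calling the model overfit.
    public static FitStatus Diagnose(double trainError, double testError,
                                     double acceptableError, double maxGap)
    {
        // Poor even on the data it was trained on: the model is too simple.
        if (trainError > acceptableError)
            return FitStatus.Underfitting;

        // Fine on training data but much worse on unseen data: it memorized.
        if (testError - trainError > maxGap)
            return FitStatus.Overfitting;

        return FitStatus.GoodFit;
    }
}
```

With `acceptableError: 1.0` and `maxGap: 0.5`, errors of (5.0, 5.2) are classified as underfitting, (0.1, 2.0) as overfitting, and (0.4, 0.6) as a good fit.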

The Bias-Variance Tradeoff

Bias and variance are two sources of error in machine learning models:

Bias: Error from incorrect assumptions. High bias = underfitting. Simple models tend to have high bias.
Variance: Error from sensitivity to training data fluctuations. High variance = overfitting. Complex models tend to have high variance.

The goal is to minimize total error by balancing bias and variance:

Bias-Variance Tradeoff

High Bias, Low Variance: Underfitting

Simple linear model that misses complex patterns

Low Bias, Low Variance: Perfect Balance (Goal)

Model captures true patterns without overfitting

Low Bias, High Variance: Overfitting

Complex model that memorizes noise in training data
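
For squared-error loss, "total error" has a standard decomposition. Assume the data is generated as y = f(x) + ε, where ε is zero-mean noise with variance σ², and let f̂ be the model learned from a randomly drawn training set. These symbols are introduced here for illustration and do not appear elsewhere in the lesson. The expected error at a point x, averaged over training sets and noise, then splits into three terms:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more complex usually shrinks the first term and grows the second. The third term is a floor that no model can go below.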

Key Metrics for Model Evaluation

Different problems require different metrics:

Regression

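R² Score: Fraction of the target's variance explained by the model (at most 1, higher is better; can be negative on test data when the model does worse than always predicting the mean)

RMSE: Root mean squared error, in the same units as the target; penalizes large errors more heavily than MAE (lower is better)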

MAE: Mean absolute error (lower is better)
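
To show what these three numbers measure, here is a minimal sketch that computes them by hand in plain C#. `RegressionMetricsSketch` is a hypothetical helper written for this lesson, not an ML.NET type; in practice ML.NET computes the metrics for you.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not an ML.NET API.
public static class RegressionMetricsSketch
{
    // MAE: average of |actual - predicted|.
    public static double MeanAbsoluteError(double[] actual, double[] predicted)
    {
        CheckLengths(actual, predicted);
        return actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();
    }

    // RMSE: square root of the average squared error.
    public static double RootMeanSquaredError(double[] actual, double[] predicted)
    {
        CheckLengths(actual, predicted);
        return Math.Sqrt(actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average());
    }

    // R² = 1 - SS_res / SS_tot, where SS_tot is the squared error of
    // always predicting the mean of the actual values.
    public static double RSquared(double[] actual, double[] predicted)
    {
        CheckLengths(actual, predicted);
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    private static void CheckLengths(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");
    }
}
```

With actual values {3, 5, 7, 9} and predictions {2, 5, 8, 9}, these methods give MAE = 0.5, RMSE ≈ 0.7071, and R² = 0.9.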

Classification

Accuracy: Fraction of all predictions that are correct

Precision: True positives / predicted positives

Recall: True positives / actual positives
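
These three ratios can be computed directly from the four confusion-matrix counts. The sketch below uses a hypothetical `ClassificationMetricsSketch` helper (not an ML.NET type). Returning 0 when a denominator is zero is a convention chosen here for simplicity, not a universal rule.

```csharp
using System;

// Hypothetical helper for illustration; not an ML.NET API.
// tp = true positives, fp = false positives,
// tn = true negatives, fn = false negatives.
public static class ClassificationMetricsSketch
{
    // Correct predictions divided by all predictions.
    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // Of everything predicted positive, how much really was positive?
    public static double Precision(int tp, int fp) =>
        tp + fp == 0 ? 0.0 : (double)tp / (tp + fp);

    // Of everything that really was positive, how much did we find?
    public static double Recall(int tp, int fn) =>
        tp + fn == 0 ? 0.0 : (double)tp / (tp + fn);
}
```

For a model with 8 true positives, 2 false positives, 85 true negatives, and 5 false negatives, this gives accuracy 0.93, precision 0.8, and recall ≈ 0.615. A model that predicts "negative" for every one of 1000 samples, 10 of which are positive, scores 0.99 accuracy but 0 recall, which is why accuracy alone can mislead on imbalanced data.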

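// Evaluating regression model
// regressionModel is a trained regression model; Evaluate expects the
// default "Label" and "Score" columns unless you pass other names.
var regressionPredictions = regressionModel.Transform(testData);
var regressionMetrics = mlContext.Regression.Evaluate(regressionPredictions);
Console.WriteLine($"R²: {regressionMetrics.RSquared:F4}");
Console.WriteLine($"RMSE: {regressionMetrics.RootMeanSquaredError:F4}");
Console.WriteLine($"MAE: {regressionMetrics.MeanAbsoluteError:F4}");

// Evaluating classification model
// classificationModel is a separately trained binary classifier.
var classificationPredictions = classificationModel.Transform(testData);
var classificationMetrics = mlContext.BinaryClassification.Evaluate(classificationPredictions);
Console.WriteLine($"Accuracy: {classificationMetrics.Accuracy:F4}");
Console.WriteLine($"Precision: {classificationMetrics.PositivePrecision:F4}");
Console.WriteLine($"Recall: {classificationMetrics.PositiveRecall:F4}");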

🧠 Quick Check — Lesson 2

Why is it important to separate training and test data?

A model performs excellently on training data but poorly on test data. What is the problem?

Lesson Summary

✅ The ML Pipeline has 6 stages: data collection, preprocessing, splitting, training, evaluation, and deployment.

✅ Train-Test Split prevents data leakage. Always evaluate on data the model never saw during training.

✅ Underfitting: Model too simple → poor performance. Overfitting: Model too complex → memorizes noise.

✅ Bias-Variance Tradeoff: Balance between model simplicity and complexity. Goal is to minimize total error.

✅ Choose evaluation metrics appropriate for your problem: R², RMSE, MAE for regression; Accuracy, Precision, Recall for classification.

Up Next

Lesson 3: Neural Networks Basics
