Intermediate AI & ML Lesson 2 of 6

Lesson 2: Fundamentals of Machine Learning

Every successful machine learning project follows a consistent pipeline. In this lesson, you'll learn how ML systems are built from start to finish, including data preparation, model training, evaluation, and deployment considerations.

The Machine Learning Pipeline

A machine learning project involves more than choosing an algorithm: it is a systematic process with multiple stages. Understanding each stage helps you build models that hold up in production.

Complete ML Pipeline
1. Data Collection
Gather raw data from various sources
2. Data Preprocessing
Clean, normalize, and engineer features
3. Data Splitting
Divide into training, validation, and test sets
4. Model Training
Train model on training data
5. Model Evaluation
Tune hyperparameters on validation data, then measure final performance on unseen test data
6. Deployment
Deploy to production and monitor performance
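
The six stages can be sketched end to end in plain C# with no ML library. Everything here is an assumption made for illustration: the `MiniPipeline` class is hypothetical, the data is synthetic (points on the line y = 2x + 1, with one missing value), and a one-feature least-squares line stands in for a real model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiniPipeline
{
    // Runs all six stages on synthetic data.
    // Returns the fitted predictor and its RMSE on the held-out test rows.
    public static (Func<double, double> Predict, double TestRmse) Run(int seed = 42)
    {
        // 1. Data collection: synthetic rows on the line y = 2x + 1,
        //    with one deliberately missing target (NaN) to clean up.
        List<(double X, double Y)> raw = Enumerable.Range(0, 20)
            .Select(i => (X: (double)i, Y: i == 7 ? double.NaN : 2.0 * i + 1.0))
            .ToList();

        // 2. Preprocessing: drop rows with a missing target.
        var clean = raw.Where(r => !double.IsNaN(r.Y)).ToList();

        // 3. Splitting: shuffle with a fixed seed, then hold out 20% for testing.
        var rng = new Random(seed);
        var shuffled = clean.OrderBy(_ => rng.Next()).ToList();
        int trainCount = (int)(shuffled.Count * 0.8);
        var train = shuffled.Take(trainCount).ToList();
        var test = shuffled.Skip(trainCount).ToList();

        // 4. Training: closed-form least squares for a single feature.
        double meanX = train.Average(r => r.X);
        double meanY = train.Average(r => r.Y);
        double slope = train.Sum(r => (r.X - meanX) * (r.Y - meanY))
                     / train.Sum(r => (r.X - meanX) * (r.X - meanX));
        double intercept = meanY - slope * meanX;
        Func<double, double> predict = x => slope * x + intercept;

        // 5. Evaluation: RMSE on rows the model never saw during training.
        double testRmse = Math.Sqrt(test.Average(r => Math.Pow(r.Y - predict(r.X), 2)));

        // 6. Deployment: in this sketch, just hand the predictor back to the caller.
        return (predict, testRmse);
    }
}
```

Because the synthetic data is noise-free and exactly linear, `MiniPipeline.Run()` returns a test RMSE that is zero up to floating-point rounding. Real data will not behave this way, and a production deployment step would also save the model and monitor it.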

Train-Test Split: A Critical Step

Never evaluate your model on the same data you trained it on. A score measured on training data rewards memorization rather than generalization, so it gives an overly optimistic estimate of how the model will perform on new data.

The golden rule: Split your data into at least two parts:

  • Training Set (70-80%): Used to train the model
  • Test Set (20-30%): Used only for evaluation—the model never sees this data during training

When you also need to tune hyperparameters or compare models, use three sets so the test set stays untouched until the end:

  • Training Set (60%): Train the model
  • Validation Set (20%): Tune hyperparameters and select the best model
  • Test Set (20%): Final performance evaluation
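
// Train-Test Split with ML.NET
// Requires the Microsoft.ML NuGet package. ModelInput is your own class whose
// properties are mapped to CSV columns with [LoadColumn] attributes.
using Microsoft.ML;

var mlContext = new MLContext();
// LoadFromTextFile defaults to tab-separated input, so set the separator for CSV
var data = mlContext.Data.LoadFromTextFile<ModelInput>(
    "data.csv", separatorChar: ',', hasHeader: true);

// 80-20 split
var splitData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
var trainingData = splitData.TrainSet;
var testData = splitData.TestSet;

// 60-20-20 split with validation: hold out 40%, then cut the holdout in half
var trainTestData = mlContext.Data.TrainTestSplit(data, testFraction: 0.4);
var trainData = trainTestData.TrainSet;
var tempData = trainTestData.TestSet;
var valTestData = mlContext.Data.TrainTestSplit(tempData, testFraction: 0.5);
var validationData = valTestData.TrainSet;
var finalTestData = valTestData.TestSet; // renamed: testData is already declared above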
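
To make the idea behind a split concrete, here is a minimal sketch in plain C# that shuffles rows with a fixed seed and cuts them into three partitions. The `DataSplitter` class and its parameter names are hypothetical. This illustrates the concept only and is not how ML.NET implements `TrainTestSplit`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DataSplitter
{
    // Shuffles the rows with a fixed seed, then cuts them into
    // training, validation, and test partitions with no overlap.
    public static (List<T> Train, List<T> Validation, List<T> Test) Split<T>(
        IEnumerable<T> rows, double validationFraction, double testFraction, int seed)
    {
        if (validationFraction < 0 || testFraction < 0 || validationFraction + testFraction >= 1)
            throw new ArgumentException("Fractions must be non-negative and sum to less than 1.");

        // Fisher-Yates shuffle, so the partitions do not inherit the file's row order.
        var shuffled = rows.ToList();
        var rng = new Random(seed);
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        int testCount = (int)Math.Round(shuffled.Count * testFraction);
        int validationCount = (int)Math.Round(shuffled.Count * validationFraction);
        int trainCount = shuffled.Count - validationCount - testCount;

        return (shuffled.Take(trainCount).ToList(),
                shuffled.Skip(trainCount).Take(validationCount).ToList(),
                shuffled.Skip(trainCount + validationCount).ToList());
    }
}
```

For example, `DataSplitter.Split(Enumerable.Range(0, 100), 0.2, 0.2, seed: 42)` yields partitions of 60, 20, and 20 rows. Fixing the seed makes the split reproducible, so metrics from different runs are comparable.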

Understanding Overfitting and Underfitting

The balance between model complexity and generalization is critical:

Underfitting

Problem: Model is too simple to capture patterns. Poor performance on both training and test data.

Solution: Use a more complex model, add features, or train longer.

Good Fit

Sweet spot: Model generalizes well. Good performance on both training and test data.

Goal: Achieve this balance in all projects.

Overfitting

Problem: Model memorizes training data. Great on training set, poor on test set.

Solution: Simplify the model, add regularization, or collect more data.
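
An extreme case of overfitting can be sketched with a "model" that does nothing but memorize. In the hypothetical `MemorizerDemo` below, labels are coin flips, so there is no pattern to learn. The lookup table still scores perfectly on the rows it has seen and only about chance on rows it has not. The class, the data, and the fallback guess are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MemorizerDemo
{
    // A deliberately overfit "model": it stores every training row in a
    // lookup table and guesses false for any id it has never seen.
    public static Func<int, bool> Train(IEnumerable<(int Id, bool Label)> trainingRows)
    {
        var table = trainingRows.ToDictionary(r => r.Id, r => r.Label);
        return id => table.TryGetValue(id, out bool label) && label;
    }

    // Fraction of rows where the model's prediction matches the label.
    public static double Accuracy(Func<int, bool> model, List<(int Id, bool Label)> rows)
        => rows.Count(r => model(r.Id) == r.Label) / (double)rows.Count;

    // Builds 1000 rows with random labels, trains on the first 800,
    // and reports accuracy on the training rows and on the held-out 200.
    public static (double Train, double Test) Run(int seed = 1)
    {
        var rng = new Random(seed);
        var rows = Enumerable.Range(0, 1000)
            .Select(i => (Id: i, Label: rng.Next(2) == 1))
            .ToList();
        var train = rows.Take(800).ToList();
        var test = rows.Skip(800).ToList();

        var model = Train(train);
        return (Accuracy(model, train), Accuracy(model, test));
    }
}
```

`MemorizerDemo.Run()` reports a training accuracy of exactly 1.0, while the test accuracy lands near 0.5. A large gap between the two numbers is the signature of overfitting described above.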

The Bias-Variance Tradeoff

Bias and variance are two sources of error in machine learning models:

  • Bias: Error from overly simple or incorrect assumptions about the data. High bias leads to underfitting. Simple models tend to have high bias.
  • Variance: Error from sensitivity to fluctuations in the training data. High variance leads to overfitting. Complex models tend to have high variance.

The goal is to minimize total error by balancing bias and variance:

Bias-Variance Tradeoff
High Bias, Low Variance: Underfitting
Simple linear model that misses complex patterns
Low Bias, Low Variance: Perfect Balance (Goal)
Model captures true patterns without overfitting
Low Bias, High Variance: Overfitting
Complex model that memorizes noise in training data
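
For squared-error loss, the tradeoff has a standard algebraic form. As an assumption for this sketch (the lesson does not fix a data model), suppose the data follow y = f(x) + ε with zero-mean noise of variance σ², and the fitted model f̂ depends on a randomly drawn training set. The expected error at a point x then splits into three terms:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term is fixed by the data, so only bias and variance can be traded against each other. Making a model more flexible typically lowers the first term and raises the second.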

Key Metrics for Model Evaluation

Different problems require different metrics:

Regression

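R² Score: Proportion of variance in the target explained by the model. 1 is a perfect fit, 0 is no better than always predicting the mean, and the score can be negative on held-out data (higher is better)

RMSE: Square root of the mean squared error, in the same units as the target. Penalizes large errors more heavily than MAE (lower is better)

MAE: Mean absolute error, the average size of the errors regardless of direction (lower is better)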
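
The three regression metrics can be computed by hand in a few lines, which helps when sanity-checking a library's output. This is a minimal sketch using the standard definitions. The `RegressionMetrics` class name is hypothetical, and inputs are assumed to be equal-length arrays with non-constant actual values.

```csharp
using System;
using System.Linq;

public static class RegressionMetrics
{
    // MAE: average absolute difference between actual and predicted values.
    public static double MeanAbsoluteError(double[] actual, double[] predicted)
        => actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();

    // RMSE: square root of the average squared difference.
    public static double RootMeanSquaredError(double[] actual, double[] predicted)
        => Math.Sqrt(actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average());

    // R² = 1 - SS_res / SS_tot, where SS_tot is the squared error of
    // always predicting the mean of the actual values.
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }
}
```

With actual values 3, 5, 7, 9 and predictions 2, 5, 8, 12, the errors are 1, 0, -1, -3, giving MAE = 1.25, RMSE = √2.75 ≈ 1.658, and R² = 1 - 11/20 = 0.45. Reversing the predictions to 9, 7, 5, 3 gives R² = 1 - 80/20 = -3, which shows that R² is not bounded below by 0.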

Classification

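Accuracy: Correct predictions / all predictions. Can be misleading when classes are imbalanced

Precision: True positives / predicted positives, i.e. TP / (TP + FP)

Recall: True positives / actual positives, i.e. TP / (TP + FN)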
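
All three classification metrics come from the same four confusion-matrix counts. The sketch below computes them directly from label arrays. The `ClassificationMetrics` class is hypothetical, and returning 0 when a denominator is empty is a convention chosen here, not a universal rule.

```csharp
using System;

public static class ClassificationMetrics
{
    // Counts true/false positives and negatives, then derives the metrics.
    public static (double Accuracy, double Precision, double Recall) Compute(
        bool[] actual, bool[] predicted)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");

        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (predicted[i] && actual[i]) tp++;        // predicted positive, is positive
            else if (predicted[i]) fp++;                // predicted positive, is negative
            else if (actual[i]) fn++;                   // predicted negative, is positive
            else tn++;                                  // predicted negative, is negative
        }

        double accuracy = (tp + tn) / (double)actual.Length;
        // Assumed convention: report 0 when the ratio would be 0/0.
        double precision = tp + fp == 0 ? 0.0 : tp / (double)(tp + fp);
        double recall = tp + fn == 0 ? 0.0 : tp / (double)(tp + fn);
        return (accuracy, precision, recall);
    }
}
```

Consider 10 examples with 3 actual positives, where the model flags 2 examples as positive and only 1 of them is correct. That gives TP = 1, FP = 1, FN = 2, TN = 6, so accuracy is 0.7, precision is 0.5, and recall is 1/3. A model that predicts negative for everything on a dataset with 1 positive in 20 reaches 0.95 accuracy with a recall of 0, which is why accuracy alone is not enough on imbalanced data.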

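// Evaluating a regression model
// regressionModel and classificationModel are trained ITransformer instances, and
// testData is the held-out split for whichever model is being evaluated.
// Evaluate reads the "Label" and "Score" columns by default.
var regressionPredictions = regressionModel.Transform(testData);
var regressionMetrics = mlContext.Regression.Evaluate(regressionPredictions);
Console.WriteLine($"R²: {regressionMetrics.RSquared:F4}");
Console.WriteLine($"RMSE: {regressionMetrics.RootMeanSquaredError:F4}");
Console.WriteLine($"MAE: {regressionMetrics.MeanAbsoluteError:F4}");

// Evaluating a binary classification model
// (for a trainer without calibrated probabilities, use EvaluateNonCalibrated instead)
var classificationPredictions = classificationModel.Transform(testData);
var classificationMetrics = mlContext.BinaryClassification.Evaluate(classificationPredictions);
Console.WriteLine($"Accuracy: {classificationMetrics.Accuracy:F4}");
Console.WriteLine($"Precision: {classificationMetrics.PositivePrecision:F4}");
Console.WriteLine($"Recall: {classificationMetrics.PositiveRecall:F4}");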

🧠 Quick Check — Lesson 2

Why is it important to separate training and test data?

A model performs excellently on training data but poorly on test data. What is the problem?

Lesson Summary

The ML Pipeline has 6 stages: data collection, preprocessing, splitting, training, evaluation, and deployment.

Train-Test Split keeps evaluation honest. Always evaluate on data the model never saw during training.

Underfitting: Model too simple → poor performance. Overfitting: Model too complex → memorizes noise.

Bias-Variance Tradeoff: Simple models tend toward high bias, complex models toward high variance. The goal is to minimize total error by balancing the two.

Choose evaluation metrics appropriate for your problem: R², RMSE, MAE for regression; Accuracy, Precision, Recall for classification.

Up Next

Lesson 3: Neural Networks Basics
