Module 5: Supervised Learning - Regression & Classification Models

This is it. You've learned the core concepts, mastered the math intuition, set up your Python environment, and prepared your data. Now, you get to do what machine learning is all about: **training models.**

In this module, we'll focus on **Supervised Learning**, the most common type of machine learning. As a quick recap, this is where we act as a "supervisor" by providing the model with labeled data, meaning data that includes the correct answers. The model's job is to learn the relationship between the inputs (features) and the outputs (labels). We'll cover the two main types of supervised problems, **Regression** and **Classification**, and you'll train four foundational models using Python's `scikit-learn` library.

Part 1: Regression — Predicting Continuous Values 📈

A regression problem is one where your target variable, the thing you want to predict, is a continuous numerical value. Think of questions like "How much?", "How many?", or "What is the temperature?".

Predicting the **price** of a house.
Forecasting the **number of sales** next quarter.
Estimating the **age** of a person from a photo.

Algorithm 1: Linear Regression (The Workhorse)

Intuition: As we saw in Module 2, Linear Regression is all about finding the "line of best fit" through your data points. It assumes a linear relationship between your features and your target label. For a single feature, this is the familiar equation $y = mx + b$. With multiple features (like bedrooms, square footage, etc.), the line generalizes to a plane or hyperplane, $y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, but the core idea is the same: the model learns the "weights" (coefficients) $w_i$ and the intercept $b$ that minimize the squared error between its predictions and the true values.
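To make the "line of best fit" concrete, here is a minimal sketch of single-feature least squares in plain Python. The helper name `fit_line` and the toy data are illustrative assumptions, not part of the course material; `scikit-learn`'s `LinearRegression` solves the same ordinary least-squares problem for any number of features.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b that minimize squared error for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # The best-fit line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Toy data generated from y = 2x + 1 (hypothetical values for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
print("slope:", m, "intercept:", b)
```

Because the toy data lie exactly on a line, the fit recovers the slope and intercept used to generate them. Real data are noisy, so the fitted line only approximates the points.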

Python Implementation: Let's use the California Housing dataset we loaded in previous modules. The goal is to predict the median house value (`MedHouseVal`), which the dataset expresses in units of 100,000 US dollars.

# --- 1. Import necessary libraries ---
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# --- 2. Load and prepare the data ---
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="MedHouseVal")  # target: median house value

# --- 3. Split the data into training and testing sets ---
# We train the model on the training set and evaluate it on the unseen testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 4. Create and train the model ---
# Instantiate the model
model = LinearRegression()

# Train the model using the .fit() method. This is the "learning" step!
model.fit(X_train, y_train)

# --- 5. Make predictions ---
# Use the trained model to make predictions on the test data
predictions = model.predict(X_test)

print("First 5 predictions:", predictions[:5])
print("Actual first 5 values:", y_test.values[:5])

The Scikit-Learn API: Notice the pattern: create a model object, call `.fit()` to train it on the training data, then call `.predict()` to generate predictions for data it has not seen. You will see this consistent pattern across almost all models in `scikit-learn`, which makes it easy to experiment with different algorithms.
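As a rough illustration of how far that pattern carries, the sketch below runs the same two calls over the three classifiers covered in Part 2 of this module. The dictionary names `models` and `predictions_by_model` are assumptions made for this example, and the hyperparameters simply mirror the ones used later in the module.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three different algorithms, one shared interface
models = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

predictions_by_model = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)                          # learn from training data
    predictions_by_model[name] = estimator.predict(X_test)   # predict on unseen data
    print(name, predictions_by_model[name][:5])
```

Swapping in another algorithm only means adding one more entry to the dictionary; the loop itself does not change.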

Part 2: Classification — Predicting Discrete Categories 🏷️

A classification problem is one where your target variable is a discrete category. The questions here are more like "Which one?", "Is it A or B?", or "Does it belong to this class?".

Classifying an email as **"Spam" or "Not Spam"**.
Determining if a bank transaction is **"Fraudulent" or "Legitimate"**.
Identifying a flower in a photo as an **"Iris Setosa", "Iris Versicolor", or "Iris Virginica"**.

Algorithm 2: Logistic Regression (The Probabilistic Classifier)

Intuition: Don't let the name fool you! Despite having "regression" in its name, Logistic Regression is a **classification** algorithm. It works by calculating the probability that a given data point belongs to a particular class. In the two-class case, it takes a linear combination of the input features and passes it through a "sigmoid" function, which squashes the output to a value between 0 and 1. We can then set a threshold (typically 0.5) to assign a class. For example, if the probability of an email being spam is > 0.5, we classify it as spam. With more than two classes, the same idea generalizes: the model produces one probability per class (via the softmax function) and predicts the most probable class. `scikit-learn` handles this automatically.
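The sigmoid-and-threshold step can be sketched in a few lines of plain Python. The weights, bias, and feature values below are made up for illustration; in practice the model learns the weights and bias during `.fit()`.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters for a two-feature spam model
weights = [1.5, -0.5]
bias = -1.0

# Hypothetical feature values for one email
features = [2.0, 1.0]

# Linear combination of the inputs, exactly as in Linear Regression
z = sum(w * x for w, x in zip(weights, features)) + bias

# Squash to a probability, then apply the 0.5 threshold
probability = sigmoid(z)
label = "spam" if probability > 0.5 else "not spam"
print(round(probability, 3), label)
```

Note that `sigmoid(0)` is exactly 0.5, so the 0.5 threshold corresponds to asking whether the linear combination `z` is positive or negative.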

Python Implementation: We'll use the famous Iris dataset, which contains measurements of 3 different species of Iris flowers.

# --- 1. Import libraries ---
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # repeated so this example runs on its own

# --- 2. Load and prepare data ---
iris = load_iris()
X, y = iris.data, iris.target

# --- 3. Split data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 4. Create and train the model ---
# Notice that the fit/predict API is the same as for Linear Regression!
log_reg_model = LogisticRegression(max_iter=200) # raise the solver's iteration limit (default 100) so it can converge
log_reg_model.fit(X_train, y_train)

# --- 5. Make predictions ---
predictions = log_reg_model.predict(X_test)

print("Predictions:", predictions)
print("Actual values:", y_test)
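Since Logistic Regression is described above as a probabilistic classifier, it can help to look at the probabilities behind the predicted labels. The following is a small self-contained sketch using `predict_proba`; the variable names `clf`, `probabilities`, and `predicted` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
probabilities = clf.predict_proba(X_test[:3])
print(np.round(probabilities, 3))

# Picking the most probable class reproduces what .predict() returns
predicted = clf.classes_[np.argmax(probabilities, axis=1)]
print(predicted, clf.predict(X_test[:3]))
```

Looking at the probabilities shows how confident the model is in each prediction, which the bare class labels hide.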

Algorithm 3: Decision Trees (The Flowchart Model)

Intuition: This is one of the most intuitive models. A Decision Tree learns to make predictions by creating a set of if-then-else rules, just like a flowchart. It splits the data based on different feature values to create the "purest" possible groups at each step, meaning groups whose members mostly share the same label. Because you can trace the path of decisions behind any prediction, it's known as a "white-box" model: easy to interpret, at least while the tree stays small.
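One way to see those if-then-else rules is to print them. The sketch below uses `export_text` from `sklearn.tree` on a deliberately shallow tree; `max_depth=2` is an assumption made here only to keep the output short, and it is not the setting used in the implementation that follows.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed flowchart readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# Render the learned splits as indented if-then-else rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf reports the predicted class, which is what makes the model straightforward to audit by eye.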

Python Implementation: We can use the same Iris dataset to see how a different model performs.

# --- 1. Import ---
from sklearn.tree import DecisionTreeClassifier

# (We can reuse the X_train, X_test, etc. from the previous example)

# --- 4. Create and train the model ---
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# --- 5. Make predictions ---
predictions = tree_model.predict(X_test)

print("Predictions:", predictions)
print("Actual values:", y_test)

Algorithm 4: K-Nearest Neighbors (KNN) (The "Social" Model)

Intuition: KNN is a simple yet powerful algorithm based on the idea that "birds of a feather flock together." To classify a new, unknown data point, it looks at the 'K' closest data points to it in the training set (its "neighbors"). It then assigns the new point to the class that is most common among those neighbors. If K=5 and 3 of the 5 closest neighbors are "Spam", the new email is classified as "Spam".
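The voting procedure is short enough to sketch directly. This is an illustrative plain-Python version, not how `scikit-learn` implements it internally; the function name `knn_predict` and the toy points are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data: two loose clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (3.0, 3.0), (3.2, 2.9), (3.1, 3.3)]
train_labels = ["spam", "spam", "spam",
                "not spam", "not spam", "not spam"]

# With k=5, the neighbors are 3 "spam" and 2 "not spam", so "spam" wins the vote
print(knn_predict(train_points, train_labels, query=(1.5, 1.5), k=5))
```

Notice that there is no real training step here: the model simply stores the training data and does all of its work at prediction time.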

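Crucial Point: Because KNN works based on **distance**, it is highly sensitive to the scale of your features: a feature measured in large units can dominate the distance calculation and drown out the others. You should almost always perform feature scaling (like Standardization from Module 4) before using KNN. The Iris features are all measured in centimeters, so the effect there is modest, but scaling is a habit worth building.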
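A tiny numeric sketch shows why scale matters for distance-based models. The three people and the two standard deviations below are invented for illustration; in real code you would let `StandardScaler` estimate the statistics from the training data.

```python
import math

# Hypothetical people described by (age in years, income in dollars)
a = (25, 50_000)
b = (60, 50_500)   # very different age, almost the same income
c = (26, 52_000)   # almost the same age, slightly different income

# On the raw features, income differences swamp age differences
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
print("raw:   ", round(raw_ab, 1), round(raw_ac, 1))

# Assumed population standard deviations (illustrative values only)
AGE_STD, INCOME_STD = 15.0, 20_000.0

def scale(person):
    """Divide each feature by its assumed standard deviation.

    Standardization also subtracts the mean, but that shift does not
    change the distance between two points, so it is omitted here.
    """
    age, income = person
    return (age / AGE_STD, income / INCOME_STD)

scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))
print("scaled:", round(scaled_ab, 3), round(scaled_ac, 3))
```

On the raw numbers, `a` looks closer to the 60-year-old `b` than to the 26-year-old `c`, purely because income is measured in larger units. After scaling, the ordering flips, and `c` becomes the nearer neighbor.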

Python Implementation: Let's apply this to the Iris dataset, but this time with the proper preprocessing step.

# --- 1. Import ---
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# (We reuse the X_train, X_test, etc. from before)

# --- PREPROCESSING STEP: SCALING ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use the same scaler fitted on the training data

# --- 4. Create and train the model ---
# We choose K=5 for this example
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train) # Train on the SCALED data

# --- 5. Make predictions ---
predictions = knn_model.predict(X_test_scaled) # Predict on the SCALED test data

print("Predictions:", predictions)
print("Actual values:", y_test)

You're a Model Builder! 🎉

Incredible work! You have now officially trained four of the most fundamental machine learning models. You've seen the clear distinction between regression and classification and witnessed the power and consistency of the `scikit-learn` library.

But training a model is only half the story. How do we know if our predictions are any good? How do we measure success? And what about problems where we don't have labeled data at all? That's what's coming next. Prepare to explore the world of **Unsupervised Learning** and the critical art of **Model Evaluation**.