How to Create and Train a Machine Learning Model from Scratch

Building and Training a Machine Learning Model from Scratch: A Beginner's Guide with Real-World Examples

Machine learning (ML) has revolutionized the way we approach problem-solving, enabling computers to learn from data and make decisions or predictions. In this blog, we’ll walk through the process of creating and training a machine learning model from scratch using Python and popular libraries such as scikit-learn.

1. Setting Up the Environment

Before we begin, we need to set up the tools necessary to build a machine learning model. We'll use popular Python libraries: numpy for numerical computation, pandas for data handling, scikit-learn for machine learning algorithms, and matplotlib and seaborn for visualization.

Run the following command in your terminal or command prompt to install these libraries:

pip install numpy pandas scikit-learn matplotlib seaborn

2. Understanding the Machine Learning Process

Creating a machine learning model can be broken down into these core steps:

  1. Problem Definition: What are you trying to solve or predict?

  2. Data Collection: Gather or load the dataset.

  3. Data Preprocessing: Clean and prepare the data.

  4. Model Selection: Choose the appropriate algorithm.

  5. Model Training: Train the model using your data.

  6. Model Evaluation: Test the model's performance.

  7. Model Tuning: Optimize the model for better accuracy.

  8. Deployment: Use the model in real-world applications.


3. Loading and Exploring the Data

For this guide, we'll use the famous Iris dataset that ships with scikit-learn. It contains 150 flower samples classified into three species (setosa, versicolor, and virginica) based on the lengths and widths of their petals and sepals.

Here’s how you can load and view the data:

import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Convert to DataFrame for easy visualization
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Display the first few rows
print(df.head())

Explanation:

  • We load the Iris dataset and convert it into a pandas DataFrame so we can easily explore it. The dataset includes columns like petal and sepal measurements and a target column indicating the flower species.
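
Before going further, it often helps to explore the data a little. Here is an optional sketch of a few quick checks; it assumes the df built above and uses the seaborn library we installed earlier:

import seaborn as sns
import matplotlib.pyplot as plt

# Summary statistics for each feature
print(df.describe())

# How many samples of each species? (0, 1, and 2 are the three species)
print(df['species'].value_counts())

# Check for missing values in each column
print(df.isnull().sum())

# Pairwise scatter plots, colored by species
sns.pairplot(df, hue='species')
plt.show()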

4. Preprocessing the Data

Data is rarely perfect, so we need to clean it before training the model. This includes handling missing values, scaling the features so they're on a comparable range, and encoding categorical variables.

Luckily, the Iris dataset is already clean, so we only need to scale the features:

from sklearn.preprocessing import StandardScaler

# Separate the features and target
X = df.drop('species', axis=1)
y = df['species']

# Standardize the features (scaling)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Explanation:

  • Feature scaling ensures that the machine learning model doesn’t give more weight to features just because they have a larger range of values (e.g., petal length vs. sepal length).
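
As a quick sanity check, you can verify what StandardScaler did: after scaling, each feature should have a mean of roughly 0 and a standard deviation of roughly 1. A small sketch using the X and X_scaled from above:

import numpy as np

# Each feature originally has its own mean and spread
print("Means before scaling:", X.mean().values.round(2))

# After standardization, means are ~0 and standard deviations are ~1
print("Means after scaling:", X_scaled.mean(axis=0).round(2))
print("Std devs after scaling:", X_scaled.std(axis=0).round(2))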

5. Splitting the Data

Now that the data is clean and scaled, we need to split it into two parts:

  1. Training set (80%): This data will be used to train the model.

  2. Test set (20%): This data will be used to evaluate the model.

from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")

Explanation:

  • We split the dataset into training and testing sets. By keeping a separate test set, we can evaluate how well the model generalizes to new data.
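
With a dataset this small, it can also help to keep the class proportions the same in both splits. scikit-learn supports this via the stratify argument; this is an optional variation on the split above:

# Stratified split: each species appears in the same proportion
# in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)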

6. Choosing and Training a Model

We’ll start by using a simple algorithm, K-Nearest Neighbors (KNN), which classifies a new point based on its closest neighbors in the dataset.

from sklearn.neighbors import KNeighborsClassifier

# Initialize the model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model on the training data
knn.fit(X_train, y_train)

Explanation:

  • K-Nearest Neighbors is easy to understand: for each new data point, it looks at the 'k' nearest points in the training data and assigns the majority class label.
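
To make the voting idea concrete, here is a minimal from-scratch sketch of what KNN does for a single point. knn_predict_one is a hypothetical helper for illustration only; in practice, use scikit-learn's optimized implementation:

import numpy as np
from collections import Counter

def knn_predict_one(x_new, X_train, y_train, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Positions of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their class labels
    labels = np.asarray(y_train)
    return Counter(labels[nearest]).most_common(1)[0][0]

# Example: predict the class of the first test sample
print(knn_predict_one(X_test[0], X_train, y_train, k=3))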

7. Evaluating the Model

After training, it's crucial to test the model on the unseen test data to measure its accuracy and other performance metrics.

from sklearn.metrics import accuracy_score

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Explanation:

  • Accuracy is a basic metric to measure how many predictions were correct. Here, we predict the species of flowers in the test set and compare them with the actual values to compute accuracy.
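
Accuracy alone can hide which species the model confuses with which, so it's worth printing a per-class breakdown as well. A short sketch using scikit-learn's built-in reporting tools:

from sklearn.metrics import classification_report, confusion_matrix

# Precision, recall, and F1-score for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Rows are actual species, columns are predicted species
print(confusion_matrix(y_test, y_pred))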

8. Tuning the Model

If you’re not satisfied with the initial accuracy, you can tune the model by adjusting its parameters (hyperparameters). For example, we can experiment with different values of k in the KNN model.

# Try with different number of neighbors (k)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate the model again
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with k=5: {accuracy * 100:.2f}%")

Explanation:

  • By changing the number of neighbors (k), we can potentially improve the accuracy of the KNN model. You can experiment with different values to see what works best.
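
Rather than trying values of k one at a time, you can search over a range with cross-validation, which avoids tuning the model to one particular test set. A sketch using scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV

# Try several values of k with 5-fold cross-validation
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best k: {grid.best_params_['n_neighbors']}")
print(f"Best cross-validated accuracy: {grid.best_score_ * 100:.2f}%")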

9. Deploying the Model

Once you're happy with the model's performance, you can save it and deploy it for real-world applications. In Python, you can use the joblib library to save the trained model and later load it when needed.

import joblib

# Save the model to a file
joblib.dump(knn, 'knn_model.pkl')

# Load the model back
knn_loaded = joblib.load('knn_model.pkl')

# Now you can use it to make predictions
y_new_pred = knn_loaded.predict(X_test)

Explanation:

  • Saving the model allows you to use it later without having to retrain it. This is useful when deploying the model in production.
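
One caveat: this model was trained on scaled features, so any new raw measurements must pass through the same scaler before prediction. A common pattern is to bundle the scaler and the model into a single Pipeline and save that instead. A sketch of this approach (knn_pipeline.pkl is just an example filename):

from sklearn.pipeline import Pipeline

# Bundle preprocessing and model so raw inputs are scaled automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X, y)  # fit on the raw (unscaled) features

joblib.dump(pipeline, 'knn_pipeline.pkl')

# Later: load it and predict directly from raw measurements
loaded = joblib.load('knn_pipeline.pkl')
print(loaded.predict([[5.1, 3.5, 1.4, 0.2]]))  # one raw sample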

Conclusion

In this blog, we’ve walked through the basic steps of creating and training a machine learning model from scratch using Python. By following these steps, you can build models that make predictions, classify data, or even help automate tasks.

Here’s a recap of the process:

  1. Prepare the data: Clean, scale, and split it into training and test sets.

  2. Choose a model: Select an algorithm like KNN, Decision Trees, etc.

  3. Train the model: Use the training set to fit the model.

  4. Evaluate the model: Check its accuracy and performance on the test set.

  5. Tune the model: Adjust parameters to improve accuracy.

  6. Deploy the model: Save it for later use in real-world applications.

Machine learning is a powerful tool that can solve a wide range of problems, and by mastering the basics, you’re well on your way to building intelligent systems. Happy coding!