# Building a Spam Detector with a Naive Bayes Classifier

[Spam detection](https://blog.logrocket.com/email-spam-detector-python-machine-learning/) is a core task in natural language processing (NLP) and machine learning: identifying and filtering out unwanted or malicious messages. In this [tutorial](https://bytescrum.com/), we'll create a simple spam detector using the [Naive Bayes](https://www.ibm.com/topics/naive-bayes#:~:text=Na%C3%AFve%20Bayes%20is%20part%20of,important%20to%20differentiate%20between%20classes.) classifier with Python's scikit-learn library.

## **Introduction**

Spam detection plays a crucial role in email systems, messaging apps, and other communication platforms, as it helps users avoid irrelevant or harmful messages. [Machine learning](https://blog.bytescrum.com/introduction-to-machine-learning) models, particularly the Naive Bayes classifier, are widely used for spam detection due to their simplicity and effectiveness.

## **Dataset Exploration**

We'll [start](https://blog.bytescrum.com/understanding-dataframes-in-machine-learning-a-comprehensive-guide) by loading the SMS Spam Collection dataset, which contains SMS messages labeled as 'spam' or 'ham' (not spam). Let's explore the dataset and visualize the distribution of spam and ham messages:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])

# Convert labels to binary (0 for ham, 1 for spam)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Visualize the distribution of spam and ham messages
spam_count = df['label'].sum()
ham_count = len(df) - spam_count

plt.figure(figsize=(6, 6))
plt.pie([ham_count, spam_count], labels=['Ham', 'Spam'], autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Distribution of Spam and Ham Messages')
plt.show()
```

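According to the dataset's UCI description, the collection contains 5,574 messages, of which 747 (about 13.4%) are labeled spam and 4,827 (about 86.6%) are labeled ham. The counts in your `df` may differ slightly depending on how pandas parses quote characters in the file.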
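To check the totals and percentages for your own copy of the file, you can compute them from the label column. The sketch below is a minimal illustration: `summarize_labels` is a hypothetical helper, and `toy_labels` is a hand-made stand-in for `df['label']` after it has been mapped to 0 and 1.

```python
import pandas as pd

def summarize_labels(labels: pd.Series) -> dict:
    """Return the message count and spam/ham percentages for 0/1 labels."""
    total = len(labels)
    spam = int(labels.sum())  # labels are 1 for spam, 0 for ham
    ham = total - spam
    return {
        "total": total,
        "spam_pct": 100 * spam / total,
        "ham_pct": 100 * ham / total,
    }

# Toy stand-in for df['label'] (assumption: already mapped to 0/1)
toy_labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])
summary = summarize_labels(toy_labels)
print(summary)  # {'total': 8, 'spam_pct': 25.0, 'ham_pct': 75.0}
```

With the real data, calling `summarize_labels(df['label'])` should give the numbers behind the pie chart.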

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In spam detection and email filtering, "ham" refers to legitimate, non-spam messages.</div>
</div>

## **Data Preprocessing**

Before training the model, we need to [preprocess](https://www.javatpoint.com/data-preprocessing-machine-learning) the text data. This involves removing punctuation, converting text to lowercase, and removing stopwords (common words that do not contribute much to the meaning of the text):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and tokenizer data if not already present
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = text.lower()  # Convert text to lowercase
    word_tokens = word_tokenize(text)  # Tokenize text into words
    filtered_text = [word for word in word_tokens if word not in STOP_WORDS]  # Remove stopwords
    return ' '.join(filtered_text)

df['message'] = df['message'].apply(preprocess_text)
```

After preprocessing, each message in the dataset is cleaned and ready for further processing.
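To see what these steps do to a single message without downloading any NLTK data, here is a dependency-free sketch. It is an approximation of the function above, not a replacement: `preprocess_text_toy` and `TOY_STOP_WORDS` are hypothetical names, the stopword set is a tiny hand-picked list rather than NLTK's English list, and tokens are split on whitespace instead of using `word_tokenize`.

```python
import string

# Assumption: a tiny illustrative stopword list, NOT NLTK's full English list
TOY_STOP_WORDS = {"a", "to", "you", "have", "the", "now"}

def preprocess_text_toy(text: str) -> str:
    """Strip punctuation, lowercase, and drop stopwords (whitespace tokens)."""
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in TOY_STOP_WORDS)

sample = "WINNER!! You have won a FREE prize. Call now to claim."
cleaned = preprocess_text_toy(sample)
print(cleaned)  # winner won free prize call claim
```

One trade-off worth keeping in mind: lowercasing and stripping punctuation also discard cues such as "!!" and all-caps words, which can themselves hint at spam.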

## **Feature Extraction**

To train our model, we need to convert the text data into numerical features. We'll use the Bag-of-Words (BoW) model, which represents each message as a vector of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(df['message'])
y = df['label']
```

The `CountVectorizer` converts a collection of text documents into a matrix of token counts, where each row represents a document and each column represents a unique word in the corpus.

## **Model Training and Evaluation**

Next, we'll split the dataset into training and testing sets, and train a Multinomial Naive Bayes classifier:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display evaluation metrics
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("\nConfusion Matrix:")
print(conf_matrix)
```
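Note that the code above fits `CountVectorizer` on all messages before splitting, so the vocabulary is learned partly from the test messages. One common alternative is to split the raw text first and chain the vectorizer and classifier in a scikit-learn `Pipeline`, which also lets you classify new raw messages directly. The following is a minimal sketch on made-up data; `messages`, `labels`, and `spam_pipeline` are illustrative names, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for df['message'] and df['label'] (assumption: 1 = spam, 0 = ham)
messages = [
    "free prize call now", "win cash prize free", "claim free cash now",
    "urgent free prize claim", "see you at lunch", "call me after work",
    "are we meeting today", "lunch at noon today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test messages
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

spam_pipeline = Pipeline([
    ("bow", CountVectorizer()),  # vocabulary learned from training text only
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(msg_train, y_train)

# Classify unseen raw text; the pipeline vectorizes it internally
new_messages = ["free cash prize", "lunch today"]
predictions = spam_pipeline.predict(new_messages)
print(predictions)  # [1 0]
```

With the real dataset you would pass `df['message']` and `df['label']` to `train_test_split` instead of the toy lists, then compute the same metrics on `spam_pipeline.predict(msg_test)`.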

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can download the SMS Spam Collection dataset from <a target="_blank" rel="noopener noreferrer nofollow" href="https://archive.ics.uci.edu/dataset/228/sms+spam+collection" style="pointer-events: none">https://archive.ics.uci.edu/dataset/228/sms+spam+collection</a></div>
</div>

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716448053675/48ed82cf-255c-410e-8db8-23e77960ccca.png align="center")

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716448063775/d8928c66-b000-46e6-bb02-02b23d705047.png align="center")

<details data-node-type="hn-details-summary"><summary>Conclusion</summary><div data-type="detailsContent">In this tutorial, we built a spam detector using the Naive Bayes classifier. We covered dataset exploration, data preprocessing, feature extraction, model training, and evaluation. This serves as a foundational example for developers looking to implement spam detection systems using machine learning.</div></details>

Happy coding!

Give this code a try and share your thoughts in the comments.
