Demystifying Naive Bayes Classifier

Classification and prediction are two of the most essential aspects of machine learning, and Naive Bayes is a basic but surprisingly powerful algorithm for predictive modeling. The Naive Bayes algorithm is known not only for its simplicity but also for its effectiveness. A Naive Bayes model is easy to build and needs no complicated iterative parameter estimation, which makes it particularly useful for huge datasets. Despite its simplicity, the Naive Bayes classifier often does surprisingly well and is widely used; on some problems, such as text classification, it can hold its own against more sophisticated classification methods.

What is Naive Bayes Classifier?

Before we see the implementation of the algorithm, it is very important to understand what the algorithm is all about. To better understand the algorithm, let’s see what each word in ‘Naive Bayes Classifier’ means:

Naive

It is called the ‘Naive’ Bayes classifier because it assumes that all the features are independent of each other: within a class, the presence of one feature is taken to be unrelated to the presence of any other feature. The assumption is naive because it rarely holds in real data.

For example, a fruit may be considered a banana if it is yellow, long and about 5 inches in length. Even if these features depend on each other or on the existence of other features, a Naive Bayes classifier treats each of them as contributing independently to the probability that the fruit is a banana.
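A minimal sketch of what the independence assumption means in practice. All probabilities below are made-up numbers for illustration, and the names `priors` and `likelihoods` are hypothetical. The score of each class is simply its prior multiplied by one probability per feature, with no term that captures how the features interact:

```python
import math

# Hypothetical prior probability of each class.
priors = {"banana": 0.5, "other": 0.5}

# Hypothetical per-feature probabilities P(feature | class).
likelihoods = {
    "banana": {"yellow": 0.9, "long": 0.8},
    "other": {"yellow": 0.3, "long": 0.1},
}

observed = ["yellow", "long"]

# Naive assumption: multiply the per-feature probabilities independently.
scores = {
    label: priors[label] * math.prod(likelihoods[label][f] for f in observed)
    for label in priors
}

# Normalize the scores so they sum to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```

With these numbers the fruit is judged far more likely to be a banana than anything else, even though "yellow" and "long" were never considered jointly.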

Bayes

The classifier is built on Bayes’ theorem. In statistics and probability theory, Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It serves as a way to calculate conditional probability.

Given a hypothesis (H) and evidence (E), Bayes’ theorem states that the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E), is:

P(H|E) = P(H) × P(E|H) / P(E)

For this reason, P(H) is called the ‘prior probability’, while P(H|E) is called the ‘posterior probability’. The factor that relates the two, P(E|H)/P(E), is called the ‘likelihood ratio’. Using these terms, Bayes’ theorem can be rephrased as follows:

“The posterior probability equals the prior probability times the likelihood ratio.”
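The statement above can be checked with a small worked example. The scenario (spam detection) and every number in it are assumptions chosen purely for illustration. H is "the email is spam" and E is "the email contains the word free":

```python
# Hypothetical numbers chosen for illustration only.
p_h = 0.2               # P(H): prior probability that an email is spam
p_e_given_h = 0.5       # P(E|H): probability of the word given spam
p_e_given_not_h = 0.05  # P(E|not H): probability of the word given not spam

# P(E) by the law of total probability.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# The factor that relates prior and posterior, as defined in the text.
likelihood_ratio = p_e_given_h / p_e

# Posterior = prior times likelihood ratio.
posterior = p_h * likelihood_ratio
print(round(posterior, 3))  # about 0.714
```

Seeing the evidence raises the probability of spam from the prior of 0.2 to roughly 0.71.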

Classifier

In machine learning, classification is the problem of identifying to which of a set of categories a new observation belongs, based on a training set of data containing observations (records) whose category is known. A classifier is a model that performs this task.

Hence, it is a supervised machine learning algorithm.

The Three Flavors of Naive Bayes Classifier

The three most commonly used flavors of the Naive Bayes classifier in scikit-learn are MultinomialNB, BernoulliNB, and GaussianNB. They differ in the distribution they assume for the features. Let’s see them in detail.

1. Multi-variate Bernoulli Naive Bayes (BernoulliNB)

The multivariate Bernoulli model is useful if your feature vectors are binary, that is, every feature takes only two values (0 and 1). One application is text classification with a bag-of-words model, where 1 represents “word occurs in the document” and 0 represents “word does not occur in the document”.
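A minimal sketch of BernoulliNB on binary word-occurrence features. The three-word vocabulary, the four training documents and their labels are hypothetical toy data, not taken from any real dataset:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns stand for a hypothetical vocabulary: ["free", "meeting", "winner"].
# 1 = word occurs in the document, 0 = word does not occur.
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
])
y = np.array(["spam", "spam", "ham", "ham"])

model = BernoulliNB()  # default Laplace smoothing (alpha=1.0)
model.fit(X, y)

# A new document containing "free" and "winner" but not "meeting".
prediction = model.predict(np.array([[1, 0, 1]]))
print(prediction[0])
```

Note that BernoulliNB also uses the absence of a word as evidence, which is what distinguishes it from a count-based model.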

2. Multinomial Naive Bayes (MultinomialNB)

The multinomial Naive Bayes model is typically used when we have discrete data. If we have a text classification problem, we can take the idea of Bernoulli trials one step further: instead of recording only whether a “word occurs in the document”, we use the “count of how often the word occurs in the document” to predict the class or label. (Example: movie reviews rated from 1 to 5, where each rating is associated with its own word frequencies. Multinomial NB can be used here to classify reviews as 1-star or 5-star.)
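A minimal sketch of MultinomialNB on word counts. The four reviews and their star ratings are invented for illustration; CountVectorizer is one common way to turn text into counts and is an assumption here, not necessarily what the original implementation used:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews with 1-star or 5-star labels.
reviews = [
    "great food great service",
    "loved it great place",
    "terrible food rude service",
    "awful place terrible experience",
]
stars = [5, 5, 1, 1]

# Turn each review into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

model = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
model.fit(X, stars)

# Reuse the fitted vocabulary to encode a new review.
new_review = vectorizer.transform(["great place great food"])
predicted = model.predict(new_review)
print(predicted[0])
```

Because "great" appears twice in the new review, it counts twice as evidence, which a binary occurrence model would not capture.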

3. Gaussian Naive Bayes (GaussianNB)

Here, we assume that the features follow a normal distribution. Instead of discrete counts, all our features are continuous. (Example: the popular Iris dataset, where the features are sepal width, petal width, sepal length and petal length.)
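A rough sketch of the idea behind the Gaussian variant, using only the Python standard library. The measurements below are hypothetical, only one feature is used, and the sample standard deviation is an illustrative choice, so this is not a reproduction of scikit-learn's GaussianNB internals:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical petal lengths (cm) for two classes.
petal_length = {
    "setosa": [1.4, 1.5, 1.3, 1.6],
    "versicolor": [4.5, 4.7, 4.2, 4.4],
}

# Fit one normal distribution per class from the class mean and spread.
dists = {
    label: NormalDist(mean(values), stdev(values))
    for label, values in petal_length.items()
}

# Evaluate how plausible a new measurement is under each class.
new_value = 1.5
densities = {label: dist.pdf(new_value) for label, dist in dists.items()}

# With equal priors, the class with the highest density wins.
best = max(densities, key=densities.get)
print(best)
```

With several features, the naive assumption means the per-feature densities are multiplied together for each class.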

Implementing the Algorithm

Having discussed the algorithm, let’s dive into a quick implementation in Python using a sample dataset: the Iris dataset. As noted above, its features are widths and lengths, which vary continuously and cannot be expressed as occurrence counts. In other words, the data is continuous, so we use Gaussian Naive Bayes here.

Once we import our Iris Dataset, we perform pre-processing and create test and train splits.
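A minimal sketch of this step, assuming the copy of the Iris dataset bundled with scikit-learn. The 25% test size and `random_state=0` are assumptions made for reproducibility, since the original split parameters are not shown:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the four measurements (X) and the species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```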

Once this is done, we perform feature scaling. Feature scaling standardizes the range of the independent variables so that they can be compared on a common scale. It is applied to continuous variables.
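A minimal sketch of feature scaling with scikit-learn's StandardScaler, which is one common choice and an assumption here. The small arrays are hypothetical stand-ins for the training and test splits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for two continuous features.
X_train = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2], [5.9, 3.0]])
X_test = np.array([[5.5, 3.1]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting the scaler on the training split alone keeps information about the test split out of the model.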

Now, we can fit our Gaussian Naive Bayes Classifier to our training set. After this is done, we predict the test set results and check for the accuracy.
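The steps described above can be sketched end to end as follows. The split ratio, `random_state=0` and the use of StandardScaler are assumptions, so the accuracy printed by this sketch may differ from the figure reported in the article:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Load the data and create train and test splits (assumed parameters).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features using statistics from the training split only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit the Gaussian Naive Bayes classifier to the training set.
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the test set and measure the share of correct predictions.
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
```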

We get an accuracy of 97.36%. For the complete code, please click here.

If you would like to see an implementation of Multinomial Naive Bayes Classifier to classify Yelp Reviews into 1-star or 5-star categories based off the text content in the reviews using Yelp Review Data Set, please click here.

Last but not least, it is important to know where the algorithm can be used.

Applications of the Algorithm

The Naive Bayes algorithm has its applications in multiple real-life scenarios such as:

1. Text Classification

The Naive Bayes algorithm is used as a probabilistic learning method for text classification. The Naive Bayes classifier is one of the most successful known algorithms for classifying text documents, i.e., deciding which category a text document belongs to (Spam/Not Spam).

2. Sentiment Analysis

It can be used to analyze tweets, comments, and reviews—whether they are negative, positive or neutral.

3. Recommendation System

The Naive Bayes algorithm, together with algorithms like collaborative filtering, is used to build hybrid recommendation systems that use machine learning and data mining techniques to predict whether a user would like a given resource.

4. Face Recognition

It can be used to decide whether an image contains a face or not. It can also be used to identify the emotion or expression shown on the face.

In conclusion, Naive Bayes classification has proved its worth in the field of machine learning despite its simplicity. But this is not the end. It is just one of the many algorithms that can be used effectively in machine learning.

Venkatesh U., Analytics Associate

Venkatesh has a passion for Data Science and an unquenchable thirst for knowledge. He believes that “Data is useless by itself, data is only useful when you apply it.” He loves applying statistical and analytical techniques to data and likes exploring several approaches to build a perfect model. He earned his M.S. in Information Technology from the University of North Carolina at Charlotte and specializes in Data Science and Business Analytics. In addition, Venkatesh has a bachelor’s degree in Electrical Engineering from India. When he isn’t building predictive models, you can find him watching and playing cricket, traveling places with his friends and hard sprinting at the gym.