A huge amount of text data is generated every day. People use social media apps, messaging apps, and blogs on a daily basis, and all of these applications produce large volumes of unstructured text. The 6th edition of the DOMO report illustrates the scale of this output.
With that much text generated every day, an important question comes up: how do we process it and convert it into structured data?
Suppose a large corporation receives 1,000 reviews per day and wants to analyze all of them because a crucial business decision depends on the results. Computers work most easily with structured data such as spreadsheets. We want the computer to understand language the way we humans do, so that the reviews can be analyzed very quickly. This is where NLP comes into the picture. If we already have a trained model, we can get an answer in minutes.
What is Natural Language Processing?
“Natural language processing (NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way.”
Here are some common examples of NLP used in our day-to-day lives:
Most of us use smart assistants like Alexa and Siri, which recognize our speech and give us answers based on context. NLP is what lets them understand that context. As the New York Times article “Why We May Soon Be Living in Alexa’s World” explained: “Something bigger is afoot. Alexa has the best shot of becoming the third great consumer computing platform of this decade.”
Email filtering is a very basic yet very useful feature of Gmail, which displays our emails categorized into Primary, Social, and Promotions. This helps us review and respond to the important emails first. The Spam folder is also an application of NLP. Can you imagine your inbox with all emails combined, spam and promotions included, without any categorization?! 😂
When your first language is not English, machine translation can be the most useful application of NLP. It takes text in one language as input and renders it in the desired language using the right grammar. Google Translate supports about 109 languages, allowing us to travel almost anywhere in the world without a language barrier!
How does NLP work?
As NLP gets more popular day by day, you’re probably wondering how it really works. How does Gmail categorize my emails, for example? Consider the following sentence:
“What is the matter here?”, asked the first lawyer, whose name was Speed.
In the above sentence, “the” has the highest frequency. We understand that “the” is just a filler word, but a computer might treat it as the most important word and conclude that the sentence is about “the”. Therefore, we must teach the computer basic concepts of the English language.
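To make the frequency problem concrete, here is a minimal sketch that counts words in the sentence above using only the Python standard library. The regular expression is a simplifying assumption for illustration, not how a full NLP library tokenizes text.

```python
import re
from collections import Counter

sentence = '"What is the matter here?", asked the first lawyer, whose name was Speed.'

# Lowercase the text and keep only alphabetic runs,
# so that "The" and "the" would be counted as the same word.
words = re.findall(r"[a-z]+", sentence.lower())
counts = Counter(words)

# The single most frequent word and its count.
print(counts.most_common(1))  # [('the', 2)]
```

A naive frequency count ranks “the” first, which is exactly why the pre-processing steps that follow are needed.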
This requires building a pre-processing pipeline, which is demonstrated below. (There is no fixed order for the pre-processing steps; it depends entirely on the data.)
Demonstration of a Pre-Processing Pipeline
Building a Pre-Processing Pipeline
Sample Data for Coding
“COVID-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus 2. It was first identified in December 2019 in Wuhan, China, and has resulted in an ongoing pandemic. The first case may be traced back to 17 November 2019. As of 14 June 2020, more than 7.83 million cases have been reported across 188 countries and territories, resulting in more than 431,000 deaths. More than 3.73 million people have recovered.” (Source: Wikipedia COVID-19)
The NLTK library provides all the features needed for pre-processing the data, so this blog uses it throughout. However, other libraries can be used too.
Step 1: Sentence Segmentation
First, we need to break down the sample data into separate sentences. In the sample data, we can see that every sentence has a different context. Separate sentences are easier for a computer to process than one whole paragraph. There are many ways to separate sentences, but the most basic method is to split the text wherever a period (.) is found.
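As a rough sketch of that basic method, the snippet below splits the sample data after a period, question mark, or exclamation mark that is followed by whitespace. It uses only the standard library; `split_sentences` is a hypothetical helper name, and NLTK's `sent_tokenize` is the more robust option for real data.

```python
import re

text = (
    "COVID-19 is an infectious disease caused by severe acute respiratory "
    "syndrome coronavirus 2. It was first identified in December 2019 in "
    "Wuhan, China, and has resulted in an ongoing pandemic. The first case "
    "may be traced back to 17 November 2019. As of 14 June 2020, more than "
    "7.83 million cases have been reported across 188 countries and "
    "territories, resulting in more than 431,000 deaths. More than 3.73 "
    "million people have recovered."
)

def split_sentences(paragraph):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

Requiring whitespace after the period keeps numbers such as 7.83 intact. The rule would still break on abbreviations like "Dr. Smith", which is one reason to prefer a trained sentence tokenizer.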
Here, we see separate sentences as an output.
Step 2: Word Segmentation
The data has been divided into separate sentences. Now let us divide each sentence into words. This process is called Word Tokenization. NLTK's word_tokenize splits a sentence into words at whitespace and also separates punctuation marks into tokens of their own.
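The idea can be sketched with a single regular expression. This is a simplified stand-in for NLTK's `word_tokenize`, not a reimplementation of it; `TOKEN_PATTERN` and `tokenize` are names chosen for this example.

```python
import re

# A word is a run of letters/digits that may contain internal '-', '.' or ','
# (so "COVID-19", "7.83" and "431,000" stay whole); any other non-space
# character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.,]\w+)*|[^\w\s]")

def tokenize(sentence):
    """Return the list of word and punctuation tokens in a sentence."""
    return TOKEN_PATTERN.findall(sentence)

tokens = tokenize("More than 3.73 million people have recovered.")
print(tokens)
# ['More', 'than', '3.73', 'million', 'people', 'have', 'recovered', '.']
```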
You will notice that the code is considering ‘.’ as one word. We will remove that in the next steps.
Now, we have sentences divided into words.
Step 3: Text Lemmatization
In English, one word can appear in many different forms, for example:
Look at that apple.
Look at those apples.
We can see that both sentences use the same word, apple, but in different forms. We want to reduce each form to its root word. Text lemmatization is the process of finding the dictionary base form (the lemma) of each word in a sentence.
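A toy illustration of the idea follows. The lookup table `LEMMAS` is hand-written and hypothetical; a real lemmatizer, such as NLTK's `WordNetLemmatizer`, consults a full dictionary instead.

```python
import re

# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {
    "apples": "apple",
    "driving": "drive",
    "drove": "drive",
}

def lemmatize(word):
    """Return the base form of a word, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)

for sentence in ["Look at that apple.", "Look at those apples."]:
    words = re.findall(r"[A-Za-z]+", sentence)
    print([lemmatize(word) for word in words])
# ['look', 'at', 'that', 'apple']
# ['look', 'at', 'those', 'apple']
```

Both sentences now contain the same root word, "apple", which is what lets later analysis treat them alike.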
The above example explains what Text Lemmatization is.
Step 4: Text Stemming
Text Stemming is the process of stripping affixes (usually suffixes) from words using simple rules. Lemmatization and Stemming can be confused with each other because both algorithms sometimes produce the same output. The ‘Driving’, ‘Drive’, and ‘Drove’ example helps make the difference clearer.
You will notice that for words like ‘Driving’ and ‘Drive’, Stemming and Lemmatization produce the same output, but for ‘Drove’ the outputs differ. Lemmatization gives us the dictionary root of a word, while Stemming only strips affixes.
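The sketch below shows why a rule-based stemmer cannot handle ‘Drove’. It is a deliberately crude suffix stripper written for this illustration, not the Porter algorithm behind NLTK's `PorterStemmer`; the `SUFFIXES` list and the minimum stem length are assumptions.

```python
# Hypothetical suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

for word in ["Driving", "Drive", "Drove", "apples"]:
    print(word, "->", stem(word))
# Driving -> driv
# Drive -> drive
# Drove -> drove
# apples -> apple
```

Because ‘Drove’ has no suffix to strip, the stemmer leaves it unchanged, whereas a dictionary-based lemmatizer can map it to ‘drive’. This toy also returns the truncated ‘driv’; fuller algorithms such as Porter's add repair rules, so their output can differ from what is shown here.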
Step 5: Remove Stop Words
The English language has many very common words, such as “the”, “is”, and “and”, that carry little meaning on their own and that we may want to remove before performing any analysis. In NLP, these are called Stop Words.
The NLTK library ships with a list of Stop Words, but sometimes you might want to define your own. NLTK's list includes words like “again” and “before”, which in some cases you may not want to remove. In that situation, you can define your own Stop Words list.
Now, let’s write code for removing stop words. We also want to remove punctuation marks from the data.
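A minimal sketch of this step, using a custom list as described above. `STOP_WORDS` here is a short, hypothetical list for illustration; NLTK's `stopwords.words('english')` provides a much longer one.

```python
import string

# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "in", "it", "was", "and", "has", "by", "of", "to"}
PUNCTUATION = set(string.punctuation)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and single punctuation tokens."""
    return [
        token
        for token in tokens
        if token.lower() not in stop_words and token not in PUNCTUATION
    ]

tokens = [
    "It", "was", "first", "identified", "in", "December", "2019", "in",
    "Wuhan", ",", "China", ",", "and", "has", "resulted", "in", "an",
    "ongoing", "pandemic", ".",
]
print(remove_stop_words(tokens))
# ['first', 'identified', 'December', '2019', 'Wuhan', 'China',
#  'resulted', 'ongoing', 'pandemic']
```

Passing the list as a parameter makes it easy to swap in a different set, for instance one that keeps “again” and “before”.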
We’ve covered each step in detail and have learned how to build a reusable pipeline. In Part 2 of this blog series, we will discuss feature extraction and building classifiers based on the data.
If you have any further questions on this tutorial or would like to set up a time to discuss your NLP project with us, please feel free to contact us.
Jinal Butani, Data Scientist
Jinal is a Data Scientist at Syntelli, and her main area of focus is Business Intelligence. She graduated from UNCC in 2019 with a Master’s in Computer Science, and also holds a Bachelor’s in Information Technology. She has experience with successfully managing several Business Intelligence and Data Science projects. Jinal’s passion lies in extracting knowledge from data through visualizations, and building predictive models.
When Jinal isn’t working with data, you can find her performing classical dance, cooking and exploring the world.