A tete-a-tete on NLTK with my PC!

MyPC :
So! What are we discussing, today?
Me :
I know it’s been a long time since we had an honest talk about what is trending in our world of Computer Science. Today I felt I should tell you about a few of the challenges I face in my day-to-day analytical activities, and how NLTK helped me overcome each of them to a great extent.
So, there are a lot of mundane tasks that I have to do manually when it comes to analyzing unstructured data and using it to serve various use-cases. One such domain where I have had to work on a lot of unstructured data is Natural Language Processing & Generation.
MyPC :
So when you say Unstructured data, what exactly do you mean? Is it the loosely formatted data that you sometimes download onto my disks to perform analytics?
Me :
Yes, you got that right. The concept and idea of unstructured data is fairly universally known, since many companies work with it; still, let me give you and the readers a brief background of it.
Unstructured data / information is any information that either has no pre-defined data model or is not organized in a pre-defined format / manner.
It is generally categorized as qualitative data, and cannot be processed or analyzed via conventional data tools and methods.
MyPC :
How can you best manage Unstructured data?
Me :
Since Unstructured data does not have a pre-defined model, it is best managed in non-relational (NoSQL) databases. One can also use Data lakes to preserve the unstructured data in raw form.
MyPC :
Can you give me some examples, benefits, disadvantages and applications of Unstructured data?
Me :
Yea sure! Some of the prominent examples of unstructured data are : text, mobile activity, social media posts, IoT sensor data, etc.
Some BENEFITS of Unstructured Data:-
- Unstructured data is always stored in its native format and remains undefined until it is needed. It therefore stays adaptable for longer, and a wider variety of file formats can live in the data store. This in turn widens the data pool and enables data scientists to prepare and analyze only the data they actually need.
- Since you do not have to predefine the data, it can be collected quickly and easily.
- Unstructured data allows for massive storage and supports a pay-as-you-use pricing model which helps in cutting cost and also eases scalability.
Some DISADVANTAGES of Unstructured Data :-
- One needs solid data-science expertise in order to prepare and analyze unstructured data, due to its undefined and unformatted nature.
- Data managers require specialized tools to manipulate unstructured data, which limits their product choices.
Some APPLICATIONS of Unstructured Data :-
In the modern world of big data, unstructured data is very abundant and prolific. As mentioned above, it could be anything : image / audio / sensor data / media / text data and much more.
Some of the prominent examples of Unstructured data are :-
- Rich Media : Media and entertainment data, surveillance data, geo-spatial data, audio, weather data, etc.
- Document collections : Invoices, records, emails, productivity applications, etc.
- Internet of Things (IoT) : Sensor data, ticker data
- Analytics : Machine Learning, Artificial Intelligence
MyPC :
So Shivam, you were telling me that you have to work on a lot of Unstructured data while performing Natural Language Processing. What is NLP, again?
Me :
Natural Language Processing (NLP) is a field that focuses on making natural human language usable by computer software. The push in the Data Science industry towards toolkits like NLTK comes from the sheer abundance of unstructured data out there. Companies that build IoT products, for instance, certainly require Data Science as part of building their models and algorithms.
One such kind of unstructured data is natural language data, which exists in the data universe as user-generated content, e.g. product reviews, social media posts (LinkedIn / Facebook / Instagram), testimonials, emails, etc.
The models are trained and qualified on structured data. However, the raw data that is available out there, to be fed into these models, is in a very crude and unstructured format. Especially in applications like chatbots, recommendation engines, sentiment analysis models, etc., it becomes imperative to convert the unstructured natural language data into structured / formatted data.

In the current landscape, there are many tools and open-source packages that the programmers and Data Scientists can use for performing NLP.
One such popular package in Python is NLTK (Natural Language Toolkit).
And here’s how you can install it.
pip install nltk==3.5
pip install numpy matplotlib
You can also download a lot of test datasets provided under NLTK. Here’s how you can do it :-
import nltk
nltk.download()
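Note that nltk.download() opens NLTK’s interactive downloader. If you would rather fetch things non-interactively, you can also download individual resources by name; for instance, the ones used later in this post:
import nltk

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')    # WordNet, used for lemmatization
nltk.download('omw-1.4')    # Open Multilingual WordNet (WordNet dependency)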
MyPC :
Alright, that is great. I will definitely search more around NLTK. However, you were talking about some day to day challenges that you used to face in your job (particularly the analytical activities) and NLTK really helped you to a great extent. Can you take me through some of those challenges of yours and how you overcame each of those using NLTK?
Me :
Oh yes, certainly!
Scenario 01 : Once, one of our clients shared a corpus of text data with us and asked us to analyse it and perform the usual NLP practices on it, starting from tokenizing to stemming to the generation of the Syntax / Parse Tree, and finally to create a wordcloud out of the raw text data provided.
So, here’s what I did :-
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
text = "Natural language processing is an exciting area. A lot of data-centric and data-oriented companies who work on natural language, have started using NLTK to perform the usual NLP practices on the raw english language data."
Conversion to lower case
So generally, for your knowledge: NLTK is case sensitive.
Hence, to avoid redundancy of words in the token list that gets generated, we generally convert all letters / words to lower case. The reason for doing this is to not let our model get confused and count words like “Earth” and “earth” as two different words.
Basically, the conversion to lower case is an efficient and practically inevitable step. Such standardization practices not only keep the model’s efficiency up to our standards, but also help avoid the creation of useless data in the system.
import re

# Lower-case the text and replace everything that is not a letter or a digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
(For clarity : the sentence and word tokenization outputs shown just below were produced on the original text; this lower-cased, punctuation-free version is the one we split, de-noise and stem further down.)
Breaking the text into sentence tokens
print(sent_tokenize(text))
Output :-
[‘Natural language processing is an exciting area.’, ‘A lot of data-centric and data-oriented companies who work on natural language, have started using NLTK to perform the usual NLP practices on the raw english language data.’]
Breaking the text into word tokens
print(word_tokenize(text))
Output :-
['Natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', '.', 'A', 'lot', 'of', 'data-centric', 'and', 'data-oriented', 'companies', 'who', 'work', 'on', 'natural', 'language', ',', 'have', 'started', 'using', 'NLTK', 'to', 'perform', 'the', 'usual', 'NLP', 'practices', 'on', 'the', 'raw', 'english', 'language', 'data', '.']
Next, let’s remove Stop Words
Out there in the world, there is a lot of noise-prone data (especially data that involves natural language). Generally, when Data Scientists build NLP models and gradually execute them, they encounter a lot of noise among the words.
This noise is basically the stop words like ‘the’, ‘he’, ‘her’, etc., which do not help / contribute to the results that are to be obtained from the NLP models. Hence, these must be removed for cleaner processing inside the model.
Fortunately, NLTK makes it easy to eliminate such stop words, as it ships a predefined list of English stop words that we can filter against.
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words("english"))
Output :-
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# Split the normalized string into word tokens first, then filter out the stop words
text = text.split()
words = [w for w in text if w not in stopwords.words("english")]
print(text)  # the full token list; 'words' holds the stop-word-free version
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', 'a', 'lot', 'of', 'data', 'centric', 'and', 'data', 'oriented', 'companies', 'who', 'work', 'on', 'natural', 'language', 'have', 'started', 'using', 'nltk', 'to', 'perform', 'the', 'usual', 'nlp', 'practices', 'on', 'the', 'raw', 'english', 'language', 'data']
Next we perform Stemming
In our natural language texts, we will often encounter words which have similar or related meanings to each other, like ‘programming’, ‘programmed’, ‘programmatic’, etc. Now if you look closely, all these words share a root word, and they convey a similar sort of meaning. In this case, the root word would be ‘program’.
So, it is much better and more efficient to extract the root word and eliminate the rest. The root word formed here is called a ‘stem’, and it is not mandatory that the stem actually exists as a proper word with a proper meaning. Just by trimming the suffix and prefix, we can and do generate the stems.
MyPC :
Can you tell me some libraries that can take care of the Stemming problem?
Me :
NLTK has stemmer classes like SnowballStemmer, LancasterStemmer, and PorterStemmer to tackle this problem.
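If you are curious how they differ, here is a minimal sketch (my own illustration, not part of the client scenario) that runs all three stemmers on the same few words; note that SnowballStemmer expects a language name, and the stems produced can differ slightly between the three:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words_to_stem = ['programming', 'programmed', 'programmatic']

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs the language

for w in words_to_stem:
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))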
I will make use of PorterStemmer here :-
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in text]
print(stemmed)
Output :-
['natur', 'languag', 'process', 'is', 'an', 'excit', 'area', 'a', 'lot', 'of', 'data', 'centric', 'and', 'data', 'orient', 'compani', 'who', 'work', 'on', 'natur', 'languag', 'have', 'start', 'use', 'nltk', 'to', 'perform', 'the', 'usual', 'nlp', 'practic', 'on', 'the', 'raw', 'english', 'languag', 'data']
MyPC :
I have heard of ‘Lemmatization’ as one process, where you can find the base forms of the respective words. Can you tell me more about it?
Me :
Yes, sure!
Lemmatization is the process (like you said) where the base form of each word in your text / corpus gets extracted. The word that gets extracted is called the ‘lemma’, and unlike a stem, it can actually be looked up in an English dictionary. NLTK gives us the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.
Alright then! Let’s then reduce the words to their root form :-
nltk.download('wordnet')  # downloading wordnet beforehand is advisable
nltk.download('omw-1.4')  # dependency
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in text]
print(lemmed)
Output :-
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', 'a', 'lot', 'of', 'data', 'centric', 'and', 'data', 'oriented', 'company', 'who', 'work', 'on', 'natural', 'language', 'have', 'started', 'using', 'nltk', 'to', 'perform', 'the', 'usual', 'nlp', 'practice', 'on', 'the', 'raw', 'english', 'language', 'data']
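One thing worth noticing in the output above : words like ‘started’ and ‘using’ came through unchanged, because the WordNet Lemmatizer treats every word as a noun unless told otherwise. Passing a part-of-speech hint changes that. A small sketch of my own (expected results hedged in the comments):
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("started"))           # default POS is noun, so it should stay 'started'
print(lemmatizer.lemmatize("started", pos="v"))  # treated as a verb, it should reduce to 'start'
print(lemmatizer.lemmatize("using", pos="v"))    # should reduce to 'use'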
Remember : Lemmatization is relatively slower compared to Stemming, since the latter does not need to look anything up in a dictionary and simply follows its algorithm to generate the root words.
Syntax Tree Generation or Parse Tree
Now comes the part that is the end goal.
We can define an English grammar (as chunking rules) and then use NLTK's RegexpParser to extract phrases / parts of speech from the text, and use the draw functions to visualize the resulting tree.
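Since that part is continued below, here is just a minimal sketch of the idea (my own illustration of the standard NLTK chunking workflow, not the client solution) : POS-tag the word tokens, define a small noun-phrase grammar, and let RegexpParser build a tree that can then be printed or drawn:
import nltk

nltk.download('averaged_perceptron_tagger')  # POS tagger model

sentence = "Natural language processing is an exciting area."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # list of (word, POS tag) pairs

# A tiny grammar : a noun phrase is an optional determiner,
# any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

tree = parser.parse(tagged)
print(tree)
# tree.draw()  # opens a window with the parse tree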
[cont…..]