In today’s world of internet and online services, data is being generated in enormous volumes and at great speed. Data science professionals mostly handle tabular or relational data, whose columns hold categorical or numerical values, but data also arrives in varied forms such as images, text, video and audio.
Businesses are waking up to the reality that data is the new fuel they need to power their operations and market strategies for greater profitability. As a result, they are turning to unstructured online content such as website text, blog articles and social media posts to understand customer activities, opinions and feedback and so drive their business. For instance, it is only through text analytics that YouTube can analyze and understand people’s viewpoints on a video.
This has led text analytics to evolve faster than ever before in order to keep up with textual data that is also growing rapidly. Python is one of the most popular programming languages today, partly because many operations take less time and effort to write in it. Its syntax and code readability also make it easy to learn and to work with. This makes Python a strong option for building the infrastructure needed to train, run and try out a machine learning model for text analysis. In this article, we explore some valuable analytical and machine learning techniques you need to master for a smooth introduction to text analysis.
What’s the difference between structured and unstructured data?
In the pursuit of becoming a data scientist in this era of rapidly evolving technologies, fluency in handling both structured and unstructured data is both mandatory and inevitable.
Structured data is the most familiar type of data: it is well-formatted and highly organized. This type of data conforms to a tabular format with clearly labelled columns and rows. The term structured is often used to refer to what is commonly known as quantitative data, and working with such data and running analytical algorithms on it is very straightforward.
On the other hand, unstructured data is not organized in a pre-defined manner. The information is typically text-heavy but may contain data such as dates, numbers and facts as well. Common examples of unstructured data include PDFs, Word files, audio files, video files and NoSQL databases.
Natural Language Processing (NLP)
NLP is the branch of data science that helps computers process, analyze and manipulate natural human languages. It seeks to bridge the gap between human communication and computer understanding.
NLP is applied across all industries that generate and consume big data. It is the technology behind virtual assistants, speech-to-text, chatbots, machine translation and much more. In this article, we shall be working with a dataset of messages which are classified as spam or ham. The goal is to explore the data and then create an accurate classification model that identifies whether a message is spam or not. You can download the dataset here!
1. Loading the required Packages
- Natural Language Toolkit (NLTK) – This is the leading module for natural language processing in Python. It provides easy-to-use interfaces and is very rich in libraries and functions.
- Stop words – Stop words are commonly used words such as “the”, “a”, “is” and “on” that don’t add much meaning to a sentence or to our model. The stopwords module contains the English stop words, which we shall have our model ignore.
- Sklearn – It is the most popular machine learning and predictive analytics module in Python. It offers a wide range of functionality for preparing features, creating supervised and unsupervised models and measuring the performance of a model.
- WordCloud – This is the package we shall use to create visual representations of our text data.
- Pandas and NumPy – We require these two libraries to work with data frames and arrays.
- Matplotlib and Seaborn – These are plotting packages in Python.
2. Loading and preparing the data
We use pandas for loading and tidying up the dataset.
We need to do the following to make our data organized and ready for analysis:
- Get rid of the unnamed columns
- Rename the columns
- Then add a column of labels where 0 represents ham and 1 represents spam
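The three steps above can be sketched as follows. This is a minimal illustration rather than the original code: the in-memory CSV, the raw column names `v1` and `v2`, and the new names `label`, `text` and `label_num` are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: two real columns plus
# empty trailing columns, which pandas names "Unnamed: 2", "Unnamed: 3".
raw_csv = io.StringIO(
    "v1,v2,,\n"
    "ham,See you at lunch,,\n"
    "spam,WINNER! Claim your free prize now,,\n"
    "ham,Ok call me later,,\n"
)
df = pd.read_csv(raw_csv)

# 1. Drop every column whose name starts with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

# 2. Rename the remaining columns to something descriptive.
df = df.rename(columns={"v1": "label", "v2": "text"})

# 3. Add a numeric label column: 0 for ham, 1 for spam.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

print(df)
```

With the real dataset, only the `pd.read_csv` call would change; some exports of this kind of file also need an explicit `encoding` argument.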
3. Exploring the data
Now that our data is organized, we can perform some Exploratory Analysis.
Let us begin by having a look at frequency summaries of each class by using the describe function.
We have 4825 rows classified as ham messages, of which 4516 are unique. There are 747 spam messages, of which 653 are unique. We can use seaborn’s countplot function to visualize the frequencies.
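A per-class summary of this kind can be produced with a grouped `describe`. The sketch below uses a tiny made-up data frame, and the column names `label` and `text` are assumptions, so the numbers differ from those quoted above.

```python
import pandas as pd

# Toy data (hypothetical): one duplicated ham message, two distinct spam messages.
df = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam", "spam"],
    "text": ["Ok", "Ok", "See you soon", "Free entry", "Call now"],
})

# For a text column, describe() reports count, unique, top and freq per class.
summary = df.groupby("label")["text"].describe()
print(summary)

# Plain class frequencies, the same numbers a count plot would draw.
class_counts = df["label"].value_counts()
print(class_counts)
```

Here `count` is the number of rows in each class and `unique` is the number of distinct messages. Passing the same frame to `sns.countplot(x="label", data=df)` would draw the class frequencies as bars.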
We want to see which words are common in ham messages and which are common in spam messages. Before we create the word-frequency summary, we need to clean up the data by:
- Breaking all the sentences into words
- Getting rid of punctuation marks
- Converting all the words to lower case
- Getting rid of stop words and all words with fewer than two characters.
We create a words_cleaner function that takes one argument, the data being cleaned, and returns the clean words.
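One possible shape for such a function is sketched below. It is an illustration under stated assumptions, not the original implementation: the small `STOP_WORDS` set is a hypothetical stand-in for NLTK’s English stop-word list (which needs a one-time corpus download), and the function returns a list rather than a set so that repeated words can still be counted afterwards.

```python
import string

# Tiny illustrative stop-word set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "is", "on", "to", "you", "and", "in", "it", "of", "for", "at"}


def words_cleaner(messages):
    """Return a list of clean, lower-case words from an iterable of messages."""
    strip_punct = str.maketrans("", "", string.punctuation)
    clean_words = []
    for message in messages:
        # Remove punctuation, lower-case the text, then split it into words.
        for word in message.translate(strip_punct).lower().split():
            # Drop stop words and words with fewer than two characters.
            if word not in STOP_WORDS and len(word) >= 2:
                clean_words.append(word)
    return clean_words


sample = ["See you at the office, OK?", "I is on a call!"]
print(words_cleaner(sample))
```

Punctuation is stripped before splitting here, which has the same effect as splitting first and cleaning each word.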
Now we can extract all the words in ham messages and create a data frame of their frequencies.
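As a rough sketch of this step, the standard library’s `Counter` can do the counting before the result is handed to pandas. The word list below is made up for illustration, and the column names `word` and `count` are assumptions.

```python
from collections import Counter

import pandas as pd

# Hypothetical cleaned ham words, e.g. the output of a word-cleaning function.
ham_words = ["call", "home", "call", "later", "home", "call", "lunch"]

# Count every word, then keep the ten most frequent ones.
word_counts = Counter(ham_words)
top_words = pd.DataFrame(word_counts.most_common(10), columns=["word", "count"])

print(top_words)
```

Calling `top_words.plot.bar(x="word", y="count")` on the resulting frame is one way to draw the frequencies as a bar graph.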
Now we have a data frame of top 10 most used words in Ham messages. We can now create a bar graph to visualize the frequencies.
You can now do the same for Spam messages.
To get a complete view of word frequency in both classes, we can use word clouds. It’s pretty easy to create word clouds in Python: we just need a function that takes two arguments, the data and a background color, and returns a word cloud.
Ham Messages word cloud
Now that we have a function for creating word clouds, we only need one line of code to produce one. In a word cloud, the bigger the word, the more frequent it is.
Spam Messages Wordcloud
Spam messages mostly contain words like Free, Call, Text, Mobile, Claim and Call Now.
Creating a model to classify a message as either spam or ham.
Before creating the classifier, we need to clean our feature (the independent variable), i.e. the Text column. To do so, we shall follow this simple procedure:
- Remove all punctuation marks in each Text message
- Convert the text message to lower case
- Split each message into single words
- Remove all the stopwords
- Stem the words using PorterStemmer. This involves reducing each word to its root form, e.g. with punctuation already removed, car, cars, car’s and cars’ all reduce to car, while loving, love and lovely all reduce to love.
- Join the cleaned words back into sentences.
To achieve this, we shall create a function that loops through our data, cleaning one message at a time, and returns an array of cleaned messages.
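The procedure might look roughly like the sketch below. It uses NLTK’s `PorterStemmer`, which needs no corpus download; the function name `clean_messages` and the short `STOP_WORDS` set are assumptions standing in for the original code and for NLTK’s full English stop-word list.

```python
import string

from nltk.stem import PorterStemmer

# Tiny illustrative stop-word set (assumption); replace with a full list in practice.
STOP_WORDS = {"the", "a", "is", "on", "are", "to", "and"}

stemmer = PorterStemmer()


def clean_messages(messages):
    """Return one cleaned, stemmed string per input message."""
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = []
    for message in messages:
        # Remove punctuation, lower-case, and split into single words.
        words = message.translate(strip_punct).lower().split()
        # Drop stop words and reduce the remaining words to their stems.
        stems = [stemmer.stem(word) for word in words if word not in STOP_WORDS]
        # Join the stems back into a sentence.
        cleaned.append(" ".join(stems))
    return cleaned


corpus = clean_messages(["The cars are waiting!", "Loving the calls"])
print(corpus)
```

Each cleaned message stays a single string, so the result can be passed directly to a text vectorizer.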
See how the dataset looks after cleaning
Training the model.
We shall use a Naive Bayes classifier, which is known to give good results in text classification, especially email filtering.
Below is a snippet of the feature preparation.
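The original snippet is not reproduced here, so what follows is only a sketch of one common way to prepare features and train the model: a bag-of-words matrix built with `CountVectorizer`, followed by `MultinomialNB`. The toy messages, the 75/25 split and the choice of vectorizer are assumptions, not necessarily what was used for the results reported in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical cleaned messages: 1 = spam, 0 = ham.
messages = [
    "free prize claim now", "free cash call now",
    "win free mobile text", "free entry claim prize",
    "see you at lunch", "lunch at home today",
    "running late for lunch", "call me after lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the messages for testing, keeping classes balanced.
train_text, test_text, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# Learn the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Fit the Naive Bayes classifier on the word counts.
model = MultinomialNB()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions)
```

Fitting the vectorizer on the training messages alone keeps words that appear only in the test set from leaking into the model.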
Testing the Accuracy of the model
Our model is 97.77% accurate, which is a commendable performance.
We can also generate a confusion matrix to take a closer look at the performance of the model. The values on the leading diagonal are the test values our model predicted correctly.
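Both measures are available in `sklearn.metrics`. The sketch below uses small hand-written label lists (hypothetical values, not the article’s test set) so that the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels: 0 = ham, 1 = spam.
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Fraction of test messages that were classified correctly.
accuracy = accuracy_score(y_test, predictions)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)

print(f"Accuracy: {accuracy:.2%}")
print(matrix)
```

In this toy case the leading diagonal holds 5 correctly predicted ham messages and 3 correctly predicted spam messages, so accuracy is 8 out of 10. The off-diagonal cells hold one ham message flagged as spam and one spam message that slipped through.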
Putting it All Together
By common estimates, only about 20 percent of today’s data is generated in a structured format; most of it exists as text, which is highly unstructured. We therefore need text analysis to produce meaningful insights from the data.
However, analyzing these texts manually is time-consuming, ineffective and tedious, as human beings can only cope with a certain amount of information, no matter how hard-working they are. This is why text analysis with machine learning is essential for an organization: it frees personnel to focus on more relevant and motivating tasks, and it helps extract valuable insights.
Mastering text analysis in Python will make your job much easier. This can only be achieved by placing yourself in an environment where you can constantly learn and keep reskilling and upskilling to master the concepts that will keep you marketable and a highly sought-after talent in the dynamic job space.