Twitter Sentiment Analysis

By: Johann Antisseril

Part 1 - Overview

1.1 Introduction

Over the last couple of years, there has been a lot of change within the United States. Everything from the Capitol riots and COVID-19 to social reform movements has changed how we function as a society. Especially during these times, Americans look to the President to lead the country and show support. In this project, the sentiment of the President's tweets (Twitter: @POTUS) will be analyzed. The objective is to see how the President addresses the nation through Twitter. Toward the end of the project, several machine learning models will be trained to determine which one best models the President's tweets based on sentiment.

Part 2 - Exploratory Analysis

2.1 Getting the Data

To get the data, the Python library Tweepy was used. It wraps Twitter's API to retrieve tweets or even post tweets; for this project, the focus is on retrieval. The hope was to retrieve around 5000 tweets from the President's Twitter account, but because of Tweepy's limitations and Twitter's restrictions on how many tweets can be retrieved at once, only about 2243 tweets could be pulled. To avoid pulling data every time the code was rerun, and to avoid hitting the retrieval limit, the tweets were saved to a CSV file called tweets.csv. Since tweets can contain emojis, @ or # symbols, and URLs, these were removed before being written to the CSV file. Because TweetRetrieval.ipynb contains sensitive authentication codes, they have been replaced with "xxxx", but the general logic and the rest of the code remain intact and can be viewed.
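The retrieval notebook is not reproduced here, but a minimal sketch of the Tweepy flow described above might look like the following (the credential strings, the strip_noise helper, and the tweets.csv column name are placeholders, not the notebook's actual code):

```python
import csv
import re
import tweepy

# Placeholder credentials -- the real keys are redacted ("xxxx") in TweetRetrieval.ipynb
auth = tweepy.OAuthHandler("xxxx", "xxxx")
auth.set_access_token("xxxx", "xxxx")
api = tweepy.API(auth, wait_on_rate_limit=True)

def strip_noise(text):
    """Remove URLs, @ mentions, hashtags, and non-ASCII characters such as emojis."""
    text = re.sub(r"http\S+|www\.\S+", "", text)    # URLs
    text = re.sub(r"[@#]\w+", "", text)             # @ and # tokens
    return text.encode("ascii", "ignore").decode()  # emojis / non-ASCII

# Page through @POTUS's timeline; the API caps how far back it will go,
# which is why only ~2243 tweets came back instead of the hoped-for 5000.
with open("tweets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tweet"])
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name="POTUS",
                                tweet_mode="extended",
                                count=200).items(5000):
        writer.writerow([strip_noise(status.full_text)])
```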

2.2 Importing the CSV

The first cell below installs the two libraries needed before anything else is done. The cell after that contains all the imports from the Python libraries used in this project. The CSV file is then read in, all duplicate tweets are dropped, and the index is reset so the dataframe is much cleaner to work with.
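As a rough sketch, assuming the CSV has a single text column named "tweet" (an assumed column name), the import and de-duplication step could look like this:

```python
import pandas as pd

# Read the saved tweets, drop exact duplicates, and reset the index
df = pd.read_csv("tweets.csv")
df = df.drop_duplicates(subset="tweet").reset_index(drop=True)
df.head()
```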

2.3 Cleaning Tweets

Everything unnecessary is removed from the tweets: the 'RT' marker is removed, all characters are lowercased for uniformity, all punctuation is removed, and any linked websites are removed as well. Stop words are commonly used words that usually carry very little meaning. Since they hold little value for the sentiment of a sentence, they can be removed too.
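A minimal cleaning helper along these lines, continuing from the dataframe above and assuming NLTK's English stop-word list (the clean_tweet function and clean_tweet column are illustrative names, not necessarily the notebook's):

```python
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    """Lowercase, then strip the RT marker, links, punctuation, and stop words."""
    text = text.lower()
    text = re.sub(r"\brt\b", "", text)            # retweet marker
    text = re.sub(r"http\S+|www\.\S+", "", text)  # leftover links
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

df["clean_tweet"] = df["tweet"].apply(clean_tweet)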

2.4 Applying Sentiment Polarity

Using TextBlob's sentiment.polarity attribute on every tweet, each tweet receives a sentiment polarity value, which will be used to analyze how the President tweets. The code then interprets the polarity (<0 is Negative, 0 is Neutral, and >0 is Positive), which makes the analysis easier since the tweets are now placed into categories.
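A sketch of this step, using the cleaned text column from above (the label_sentiment helper and the polarity/sentiment column names are illustrative):

```python
from textblob import TextBlob

def label_sentiment(polarity):
    """Map a polarity score to the three categories used in this project."""
    if polarity < 0:
        return "Negative"
    elif polarity == 0:
        return "Neutral"
    return "Positive"

df["polarity"] = df["clean_tweet"].apply(lambda t: TextBlob(t).sentiment.polarity)
df["sentiment"] = df["polarity"].apply(label_sentiment)
```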

2.5.1 Overall Sentiment of POTUS's Tweets

Here, the tweets tied to each unique label (Positive, Negative, Neutral) are grouped together. The number of tweets per label is counted, stored in a dataframe, and then plotted as a pie chart.
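One way to produce those counts and the pie chart with pandas and matplotlib, as a sketch rather than the notebook's exact plotting code:

```python
import matplotlib.pyplot as plt

# Count tweets per sentiment label and plot the shares
counts = df["sentiment"].value_counts()
plt.pie(counts, labels=counts.index, autopct="%1.1f%%")
plt.title("Overall Sentiment of @POTUS Tweets")
plt.show()
```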

The results shown by the graph below were not too surprising. A majority of the President's tweets are positive, the next largest category is neutral, and the smallest is negative. This is expected, as the leader of a nation is generally expected to convey hope and facts, which are usually read as positive or neutral; only when the President addresses something as serious as a loss of life is a tweet typically interpreted as negative. The sentiment picked up by this code therefore met expectations.

2.5.2 Boxplot of Sentiment Values

Since the previous plot showed roughly what was expected of the three categories, the data needed to be explored from a different perspective. To see where most of the sentiment values lie, a boxplot was drawn.
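The boxplot can be drawn directly from the polarity column, for example (a sketch; the notebook's styling may differ):

```python
import matplotlib.pyplot as plt

# Horizontal boxplot of the raw polarity scores
plt.boxplot(df["polarity"], vert=False)
plt.xlabel("Sentiment polarity")
plt.title("Distribution of Sentiment Polarity Values")
plt.show()
```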

The plot shows that while there are plenty of outliers, most of the data clearly lies between 0.00 and 0.25, with the mean close to 0. So while the majority of the tweets are positive, they are not very positive; they are close to neutral. Note that since neutral was defined as one exact value rather than a range like positive and negative, this could influence how the previous plot is interpreted.

2.5.3 Percent of Tweets Per Standardized Sentiment Value

To better understand how the data is laid out, all the polarity values are standardized using the z-score method, which shows how far each value deviates from the mean. The standardized values are then graphed as a histogram.
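A sketch of the standardization and histogram, where the z-score is (value − mean) / standard deviation and the bin count is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import numpy as np

# z-score standardization of the polarity values
z = (df["polarity"] - df["polarity"].mean()) / df["polarity"].std()

# Weight each tweet by 100/N so the bar heights read as percentages of tweets
weights = np.full(len(z), 100.0 / len(z))
plt.hist(z, bins=30, weights=weights)
plt.xlabel("Standardized sentiment (z-score)")
plt.ylabel("Percent of tweets")
plt.show()
```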

As expected, most of the data lies within one standard deviation of the mean. Toward three standard deviations in either direction and beyond, the percent of tweets drops to nearly 0. Note that this analysis works purely on the numbers, so the labels 'positive', 'negative', and 'neutral' do not bias the data.

Part 3 - Machine Learning

3.1 Modifying the Tweets for ML Models

This part focuses on tokenizing each tweet, breaking it up into individual words. Each word is then stemmed to reduce it to its base/root form, which makes it easier for the models to work with the words and get better results. Each word is then lemmatized, which helps bring each word back to a meaningful form for the ML models to run on.
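A minimal version of this preprocessing with NLTK, assuming its PorterStemmer and WordNetLemmatizer (the normalize helper and tokens column are illustrative names):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize(text):
    """Tokenize a cleaned tweet, then stem and lemmatize each word."""
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(stemmer.stem(w)) for w in tokens]

df["tokens"] = df["clean_tweet"].apply(normalize)
```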

3.2 TF-IDF Vectorization

This part joins the words back together for each tweet so the data can be split into training and testing sets. Then a TF-IDF vectorizer is used to transform the text according to how it weighs terms: TF-IDF assigns each term a value based on how important it is within a document, scaled by how common the term is across all documents. A TF-IDF table is displayed below.
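A sketch of the split and vectorization with scikit-learn (the 80/20 split and random_state are assumptions, not necessarily the notebook's settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Rejoin the processed tokens into strings before vectorizing
docs = df["tokens"].apply(" ".join)
X_train, X_test, y_train, y_test = train_test_split(
    docs, df["sentiment"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn the vocabulary and weights on train only
X_test_tfidf = vectorizer.transform(X_test)        # apply the same weights to test
```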

3.3 Top Words

The dataframe displayed here shows the top 15 words used by POTUS's Twitter account. This gives a sense of which words are heavily emphasized compared to everything else he has tweeted.
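One way to build such a table is to sum each term's TF-IDF weight over the training documents and keep the 15 largest; a sketch, assuming a recent scikit-learn with get_feature_names_out:

```python
import numpy as np
import pandas as pd

# Total TF-IDF weight per term across all training tweets, largest 15 first
totals = np.asarray(X_train_tfidf.sum(axis=0)).ravel()
top15 = (pd.Series(totals, index=vectorizer.get_feature_names_out())
           .sort_values(ascending=False)
           .head(15))
print(top15)
```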

It can be seen that 'American' is the top word used by the President. As the President of the United States of America, it is not surprising that this is the top term, since it is a term that helps bring unity to all citizens. Briefly looking at the rest of the words, they read like words of encouragement about coming back from something. Terms like 'build', 'build back', 'job', 'plan', and 'better' all suggest that rebuilding is a huge focus of his tweets. For perspective, at the time these tweets were collected and this code was written, the world was still dealing with COVID-19, and the President was likely trying to promote a recovering economy and a return to normal for Americans. In that light, these top 15 words are not surprising.

3.4 Machine Learning Models

X_train and X_test are both transformed with the TF-IDF weights. The machine learning models used are the Linear Support Vector Classifier (SVC), Bernoulli Naive Bayes, and Logistic Regression. Linear SVC finds a best-fit hyperplane that separates the data into categories. Bernoulli Naive Bayes assumes independence among features and gives each feature equal importance. For each model, a confusion matrix is displayed to show the true positives, true neutrals, true negatives, and so on. In addition, each model reports its accuracy score and F1 score; the F1 score measures a model's accuracy on a dataset by combining the model's precision and recall.
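A sketch of fitting and evaluating the three models on the TF-IDF features (default hyperparameters assumed; classification_report is used here as one way to get the per-class F1 scores):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

models = {
    "Linear SVC": LinearSVC(),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    preds = model.predict(X_test_tfidf)
    print(name)
    print(confusion_matrix(y_test, preds))
    print("accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))  # includes per-class precision, recall, and F1
```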

As seen above, Linear SVC performed the best, with an accuracy score higher than those of the other machine learning models; the accuracy scores ranged from 0.66 to 0.72. Its F1 score was also higher for each category than any other model's. One interesting thing that can be observed is that Linear SVC correctly predicted the most negative tweets when the true label was negative and the most neutral tweets when the true label was neutral, but Bernoulli Naive Bayes correctly predicted the most positive tweets when the true label was positive. For this project, the Linear SVC machine learning model best fit the sentiments of the tweets by the President of the United States of America.