Over the last couple of years, a lot of change has happened within the United States. Everything from the Capitol riot and COVID-19 to social reform movements has changed how we function as a society. Especially during these times, Americans look to the President to lead the country and show support. In this project, the sentiment of the President's tweets (Twitter: @POTUS) will be analyzed. The objective is to see how the President addresses the nation through the use of Twitter. Machine learning models will be used toward the end of this project to help determine which one best models the President's tweets based on sentiment.
To get the data, the Python library Tweepy was used. It utilizes Twitter's API to retrieve tweets or even post them; for this project, the focus is on retrieval. The hope was to retrieve around 5000 tweets from the President's Twitter account; however, due to Tweepy's limitations and the restrictions on how many tweets a person can retrieve at once, only about 2243 tweets could be pulled. To avoid pulling data every time the code was rerun, and to avoid hitting the retrieval limit, the pulled data was saved to a CSV file called tweets.csv. Since tweets can contain emojis, @ or # symbols, and URLs, these were removed before the tweets were added to the CSV file. Since TweetRetrieval.ipynb contains sensitive authentication codes, they have been removed and replaced with "xxxx", but the general logic and the rest of the code remain intact to be viewed.
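Since the retrieval notebook itself is not shown here, the pre-save cleaning described above can be sketched roughly as follows. The function name and exact regex patterns are illustrative assumptions, not the actual code from TweetRetrieval.ipynb:

```python
import re

def clean_for_csv(text):
    """Roughly the kind of cleaning applied before saving tweets to tweets.csv."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'[@#]\w+', '', text)                # drop @mentions and #hashtags
    text = text.encode('ascii', 'ignore').decode()     # drop emojis / non-ASCII
    return ' '.join(text.split())                      # collapse leftover whitespace

print(clean_for_csv("Great news 🎉 @POTUS https://t.co/abc #Biden today"))
# Great news today
```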
The cell right below lists the two libraries that need to be installed before anything else is done. The cell below that contains all the imports from the Python libraries used in this project. The CSV file is read in, all duplicate tweets are dropped, and the index is reset to make the dataframe we are working with much cleaner.
#!pip install plotly
#!pip install textblob
import re
import string
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
#textblob
from textblob import TextBlob
# nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk import tokenize
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, ConfusionMatrixDisplay
data = pd.read_csv('tweets.csv')
data.drop_duplicates(subset = 'Tweets', keep = 'first', inplace = True)
data.reset_index(drop = True, inplace = True)
data
| | Tweets |
|---|---|
0 | January 6, 2021 made it clear: there is a dagg... |
1 | Today, my administration announced that health... |
2 | We must be firm, resolute, and unyielding in o... |
3 | Today, on National Law Enforcement Appreciatio... |
4 | My message to everyone impacted by the Marshal... |
... | ... |
2225 | RT : Tune in for the first press briefing of t... |
2226 | After taking the oath of office this afternoon... |
2227 | The time to move forward is now. |
2228 | There is no time to waste when it comes to tac... |
2229 | Folks — This will be the account for my offici... |
2230 rows × 1 columns
Everything unnecessary is removed from the tweets: the 'RT' marker, punctuation, digits, and linked websites are removed, and all characters are lowercased for uniformity. Stop words are commonly used words that carry very little meaning; since they hold little value for the sentiment of a sentence, they can be removed as well.
#Removes the leading RT (retweet marker) from tweets
data['Tweets'] = data['Tweets'].apply(lambda x: re.sub(r'^\s*RT\b', '', str(x)))
#Forces all characters to be lowercase for uniformity
data['Tweets'] = data['Tweets'].str.lower()
#Removes all punctuations
def clean_punc(txt):
return txt.translate(str.maketrans('', '', string.punctuation))
data['Tweets'] = data['Tweets'].apply(lambda x: clean_punc(x))
data['Tweets'] = data['Tweets'].apply(lambda x: re.sub(r'\d+', '', x))
data['Tweets'] = data['Tweets'].apply(lambda x: re.sub(r'[^\w\d\s]+', '', x))
#Removes websites
data['Tweets'] = data['Tweets'].apply(lambda x: re.sub(r'www\.\S+', '', x))
#Removes all the common stop words in the English language
sw = stopwords.words("english")
data['Tweets'] = data['Tweets'].apply(lambda x: " ".join([word for word in str(x).split() if word not in sw]))
data
| | Tweets |
|---|---|
0 | january made clear dagger throat democracy tom... |
1 | today administration announced health insurers... |
2 | must firm resolute unyielding defense right vo... |
3 | today national law enforcement appreciation da... |
4 | message everyone impacted marshall fire intend... |
... | ... |
2225 | tune first press briefing bidenharris administ... |
2226 | taking oath office afternoon got right work ta... |
2227 | time move forward |
2228 | time waste comes tackling crises face thats to... |
2229 | folks account official duties president pm jan... |
2230 rows × 1 columns
By applying TextBlob's sentiment.polarity to each tweet, every tweet is given a sentiment polarity value. This will be used to analyze how the President tweets. The code then interprets the polarity (<0 is Negative, 0 is Neutral, and >0 is Positive), which makes analysis easier since the tweets are now placed into categories.
#Uses TextBlob's sentiment polarity function to get the sentiment of each tweet
sentiment_lst = data['Tweets'].apply(lambda x: TextBlob(x).sentiment.polarity)
data['Sentiment'] = sentiment_lst
#Interprets the Sentiment into English (<0 is Negative, 0 is Neutral, and >0 is Positive)
data['Meaning'] = [None]*len(data)
for index, row in data.iterrows():
if data.loc[index,'Sentiment'] < 0:
data.loc[index,'Meaning'] = "Negative"
elif data.loc[index,'Sentiment'] > 0:
data.loc[index,'Meaning'] = "Positive"
else:
data.loc[index,'Meaning'] = "Neutral"
data
| | Tweets | Sentiment | Meaning |
|---|---|---|---|
0 | january made clear dagger throat democracy tom... | 0.100000 | Positive |
1 | today administration announced health insurers... | 0.000000 | Neutral |
2 | must firm resolute unyielding defense right vo... | 0.042857 | Positive |
3 | today national law enforcement appreciation da... | 0.650000 | Positive |
4 | message everyone impacted marshall fire intend... | -0.025000 | Negative |
... | ... | ... | ... |
2225 | tune first press briefing bidenharris administ... | 0.250000 | Positive |
2226 | taking oath office afternoon got right work ta... | 0.195238 | Positive |
2227 | time move forward | 0.000000 | Neutral |
2228 | time waste comes tackling crises face thats to... | 0.103810 | Positive |
2229 | folks account official duties president pm jan... | -0.500000 | Negative |
2230 rows × 3 columns
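As a side note, the row-by-row labeling loop above can also be written in a vectorized form. A minimal sketch on a toy dataframe using numpy.select:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sentiment': [-0.5, 0.0, 0.1]})
# Conditions are checked in order; rows matching neither become the default
df['Meaning'] = np.select(
    [df['Sentiment'] < 0, df['Sentiment'] > 0],
    ['Negative', 'Positive'],
    default='Neutral')
print(list(df['Meaning']))  # ['Negative', 'Neutral', 'Positive']
```

On a dataframe of a few thousand rows this makes little practical difference, but it avoids the per-row `.loc` writes of the loop.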
Here, the tweets tied to each unique meaning (Positive, Negative, Neutral) are counted. Each count is stored alongside its meaning in a dataframe, which is then plotted as a pie chart.
values = []
meaning = []
#Goes through and gets all the values tied to each unique meaning to be added to a dataframe
for i in np.unique(data['Meaning']):
values.append(len(data[data['Meaning']==i]))
meaning.append(i)
#Dataframe created and values are added to be plotted below
df = pd.DataFrame()
df['Meaning'] = meaning
df['Value'] = values
fig = px.pie(df, values='Value', names='Meaning', title='Overall Sentiment of POTUS\'s Tweets')
fig.update_traces(textinfo='percent+label+value', textposition='inside')
fig.show()
The results shown by the graph above were not too surprising. A majority of the President's tweets are positive, the next largest category is neutral, and the smallest is negative. This makes sense: a leader of a nation is generally expected to convey hope and facts, which tend to read as positive or neutral. Only when the President conveys something as serious as a loss of life is it usually interpreted as negative. The sentiment picked up by this code therefore met expectations.
Since the previous plot showed what was expected of the three categories, the data needed to be explored from a different perspective. To see where most of the sentiment values lie, a boxplot was plotted to show how the data is distributed.
sns.set_theme(style="whitegrid")
ax = sns.boxplot(x=data["Sentiment"])
This plot shows that while there are plenty of outliers, most of the data clearly lies between 0.00 and 0.25, with the median close to 0. So while the majority of the data is positive, it is only mildly so; it is very close to neutral. It should also be noted that since Neutral was assigned a single exact value (0) rather than a range like Positive and Negative, this could influence how the previous plot is interpreted.
To better understand how the data is laid out, all the values are standardized using the z-score method, which shows how each value deviates from the mean. The result is then graphed as a histogram to be analyzed.
#Gets the mean and std of the data
mean = data['Sentiment'].mean()
std = data['Sentiment'].std()
standardized_value = []
#Goes through the data and gets each sentiment value and subtracts it from the mean and divides by the std to
#standardize the values
for index,row in data.iterrows():
standardized_value.append((data.loc[index,'Sentiment'] - mean)/std)
data['StandardizedValue'] = standardized_value
#Plots a histogram
ax = data['StandardizedValue'].plot(kind='hist', color='g')
ax.yaxis.set_major_formatter(PercentFormatter(xmax=len(data)))
ax.set_xlabel("Standardized Sentiment Values")
ax.set_ylabel("Percent of Tweets")
ax.set_title("Percent of Tweets Per Standardized Sentiment Value")
plt.show()
As expected, most of the data lies within one standard deviation of the mean. Approaching three standard deviations in either direction and beyond, the percentage of tweets drops to close to 0. Note that this analysis uses the numeric values only, so the labels 'Positive', 'Negative', and 'Neutral' do not bias it.
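For reference, the standardization loop above can be replaced with a single vectorized pandas expression. A minimal sketch on a toy series:

```python
import pandas as pd

s = pd.Series([0.1, 0.0, -0.5, 0.65])
z = (s - s.mean()) / s.std()  # pandas .std() uses the sample std (ddof=1)
# A standardized series has mean 0 and std 1 (up to floating-point error)
print(abs(z.mean()) < 1e-9, abs(z.std() - 1) < 1e-9)  # True True
```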
This part tokenizes each tweet, breaking it up into words. Each word is then stemmed to reduce it to its base/root form, which makes it easier for the models to work with the words and produces better results. Each word is also lemmatized, which maps it to a meaningful dictionary form for the ML models to run on.
#Tokenizes each tweet
data['Modified'] = data['Tweets'].apply(RegexpTokenizer(r'\w+').tokenize)
#Stems each word to put them in their base/root form
data['Modified'] = data['Modified'].apply(lambda x: [nltk.PorterStemmer().stem(c) for c in x])
#Lemmatizes each word to bring meaning
data['Modified'] = data['Modified'].apply(lambda x: [nltk.WordNetLemmatizer().lemmatize(c) for c in x])
data
| | Tweets | Sentiment | Meaning | StandardizedValue | Modified |
|---|---|---|---|---|---|
0 | january made clear dagger throat democracy tom... | 0.100000 | Positive | -0.078124 | [januari, made, clear, dagger, throat, democra... |
1 | today administration announced health insurers... | 0.000000 | Neutral | -0.515866 | [today, administr, announc, health, insur, req... |
2 | must firm resolute unyielding defense right vo... | 0.042857 | Positive | -0.328262 | [must, firm, resolut, unyield, defens, right, ... |
3 | today national law enforcement appreciation da... | 0.650000 | Positive | 2.329459 | [today, nation, law, enforc, appreci, day, jil... |
4 | message everyone impacted marshall fire intend... | -0.025000 | Negative | -0.625302 | [messag, everyon, impact, marshal, fire, inten... |
... | ... | ... | ... | ... | ... |
2225 | tune first press briefing bidenharris administ... | 0.250000 | Positive | 0.578490 | [tune, first, press, brief, bidenharri, admini... |
2226 | taking oath office afternoon got right work ta... | 0.195238 | Positive | 0.338774 | [take, oath, offic, afternoon, got, right, wor... |
2227 | time move forward | 0.000000 | Neutral | -0.515866 | [time, move, forward] |
2228 | time waste comes tackling crises face thats to... | 0.103810 | Positive | -0.061448 | [time, wast, come, tackl, crise, face, that, t... |
2229 | folks account official duties president pm jan... | -0.500000 | Negative | -2.704578 | [folk, account, offici, duti, presid, pm, janu... |
2230 rows × 5 columns
This part joins the words back together for each tweet so the data can be split into training and testing sets. A TF-IDF vectorizer is then used to transform the text by weighting its terms: TF-IDF assigns a value to a term according to its importance within a document, scaled by how common the term is across all documents. The TF-IDF table is displayed below.
#Combines the words to be ready for the data to be split into a training and testing set
data['Combined'] = data['Modified'].apply(lambda x: " ".join(c for c in x))
#Splits the data into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(np.array(data['Combined']),np.array(data['Meaning']),test_size = 0.3)
#Uses a TF-IDF conversion for the data and prints the data
vectoriser = TfidfVectorizer(ngram_range=(1,2))
vect = vectoriser.fit_transform(X_train)
table = pd.DataFrame(vect.T.todense(), index=vectoriser.get_feature_names_out())
table
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | 1558 | 1559 | 1560 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aanhpi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
aanhpi equal | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
aapi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
aapi commun | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
aaron | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
zerocarbon renew | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zeroemiss | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zeroemiss unveil | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zip | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zip code | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
19739 rows × 1561 columns
The dataframe below shows the top 15 stemmed terms used by POTUS's Twitter account, ranked by total TF-IDF weight. This gives a sense of which words are heavily emphasized compared to everything else he has tweeted.
#Displays the top 15 words used
count = pd.DataFrame(table.sum(axis=1))
countdf = count.sort_values(0,ascending=False).head(15)
countdf
| | 0 |
|---|---|
american | 31.905965 |
get | 25.173573 |
vaccin | 23.665318 |
infrastructur | 20.577519 |
america | 20.527632 |
back | 20.394797 |
build | 19.753718 |
job | 19.453658 |
work | 19.251240 |
plan | 19.174209 |
better | 17.859784 |
tune | 17.676136 |
today | 17.329147 |
peopl | 16.830516 |
nation | 16.512541 |
It can be seen that 'american' is the top term used by the President. As the President of the United States of America, it is not surprising that this is the top term, since it helps bring unity to all citizens. Briefly looking at the rest of the words, they read like terms encouraging a comeback: 'build', 'back', 'job', 'plan', and 'better' all suggest that rebuilding is a huge focus of his tweets. For perspective, at the time these tweets were collected and this code was written, the world was still dealing with COVID-19, and the President was likely trying to promote a recovering economy and a return to normal for Americans. In that light, these top 15 words are not surprising.
The X_train and X_test sets are both transformed to TF-IDF weights. The machine learning models that will be used are Linear Support Vector Classifier (SVC), Bernoulli Naive Bayes, and Logistic Regression. Linear SVC finds a best-fit hyperplane that separates the categories. Bernoulli Naive Bayes assumes independence among features and gives each feature equal importance. Logistic Regression models the probability of each class from a linear combination of the features. For each model, a confusion matrix will be displayed to show the true positives, true neutrals, true negatives, and so on. In addition, each model will report its accuracy score and F1 score; the F1 score measures a model's accuracy on a dataset by combining the model's precision and recall.
#Uses the TF-IDF vectoriser on the data to transform it
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
#Function to predict values based on the data and model. Displays a confusion matrix to show true positives, neutrals,
#and negatives. Prints out the accuracy and f1 scores of the model.
def evaluate(model):
clf = model.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()
plt.show()
print(str(model) + ' accuracy score: ' + str(accuracy_score(y_test, y_pred)))
print(str(model) + ' f1 score: ' + str(f1_score(y_test, y_pred,average=None,zero_division=0)) + '\n')
#Models to be used
models = [LinearSVC(), BernoulliNB(), LogisticRegression()]
#Goes through each model and runs the function above using the model.
for i in models:
evaluate(i)
LinearSVC() accuracy score: 0.7144992526158446
LinearSVC() f1 score: [0.4295302 0.58641975 0.81156069]

BernoulliNB() accuracy score: 0.5665171898355755
BernoulliNB() f1 score: [0. 0.06122449 0.72147002]

LogisticRegression() accuracy score: 0.6457399103139013
LogisticRegression() f1 score: [0.15384615 0.43243243 0.76299376]
As seen above, the Linear SVC performed the best, with an accuracy score of about 0.71 compared to roughly 0.57 and 0.65 for the other models. Its F1 score was also higher in every category than any other model's. One interesting thing can be observed in the confusion matrices: when the true label was negative, Linear SVC correctly predicted the most negative tweets, and when the true label was neutral, it correctly predicted the most neutral tweets; however, when the true label was positive, Bernoulli Naive Bayes correctly predicted the most positive tweets. For this project, the Linear SVC machine learning model best fit the sentiments of the tweets by the President of the United States of America.
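As a quick sanity check on the per-class F1 scores reported above, averaging Linear SVC's three values gives its macro F1, the same number sklearn's f1_score would return with average='macro':

```python
import numpy as np

# Per-class f1 scores printed above for LinearSVC, in sorted label order
# [Negative, Neutral, Positive]
f1_linear_svc = np.array([0.4295302, 0.58641975, 0.81156069])
macro_f1 = f1_linear_svc.mean()  # unweighted average over the three classes
print(round(macro_f1, 3))  # 0.609
```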