Sentiment Measurement at the Word Level
We will be using the tidytext package in this project, following the approach shown in the “Introduction to tidytext” vignette by Julia Silge and David Robinson.
First load the needed packages and the translated tweets from part one of this project.
#load libraries
suppressPackageStartupMessages({
library(tidyverse)
library(tidytext)
library(hunspell)
library(sentimentr)
library(magrittr)
})
# set up some chart defaults
tan1 <- "#FDECCD"
yellow1 <- "#FFBF00"
green1 <- "#007000"
theme_afri <- function(...){
# making a function allows passing further theme elements
ggplot2::theme(
plot.background = element_rect(fill = tan1, color = NA),
panel.background = element_blank(),
legend.background = element_blank(),
legend.box.background = element_blank(),
legend.key = element_blank(),
...
)
}
afrisenti_translated <- readr::read_csv('https://raw.githubusercontent.com/apsteinmetz/tidytuesday/master/2023-02-28_african_language/data/afrisenti_translated.csv',
show_col_types = FALSE) |>
select(tweet_num,assigned_long,translatedText,label) |>
mutate(label = as.factor(label))
afrisenti_translated
# A tibble: 111,720 × 4
tweet_num assigned_long translatedText label
<dbl> <chr> <chr> <fct>
1 1 Amharic Amhara region needs moderation!!! He hates Mus… nega…
2 2 Amharic Having a mood in a way that annoys someone.... nega…
3 3 Amharic Domestic violence - without hearing - without … nega…
4 4 Amharic Ethiopia, what was your role in overthrowing T… nega…
5 5 Amharic No matter how Oromo learns, they don't realize… nega…
6 6 Amharic Tish, dark powder, are you standing at 97? Tim… nega…
7 7 Amharic Local residents said that the internet and tel… nega…
8 8 Amharic What happened to Schimmels is sad. By the word… nega…
9 9 Amharic An unstable country does not need elections nega…
10 10 Amharic The idiot and the bubble, the Ethiopian PM Abi… nega…
# … with 111,710 more rows
The tidytext package makes it easy to divide each tweet into separate words and measure the sentiment valence of each one. The tidytext package has a few sentiment lexicons. Here we score the sentiment of each word using the “afinn” lexicon. We chose this because it has 11 shades of sentiment and we hope the finer granularity will be helpful. If a word is not in the lexicon we code it as “0”, or neutral. Further, we need to use the stems of words: “idiots” is not in our sentiment lexicon, but “idiot” is, and word stemming will fix that. The hunspell package’s hunspell_stem() function will do the trick. It returns a list of possible stems (usually just one, but not always), so we have to unnest() the list column. The trade-off is that if a word is not in hunspell’s dictionary, the word is dropped. Fortunately, there are over 49,000 words in the dictionary.
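As a quick aside, here is what hunspell_stem() does with a few words; the nonsense word is included only to illustrate the drop-a-word behavior just described:
# hunspell_stem() returns one list element of candidate stems per input word;
# a word the dictionary doesn't know comes back empty and unnest() drops it later
hunspell_stem(c("idiots", "unforgivable", "qwxzzy"))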
tweet_word_sentiment <- afrisenti_translated %>%
select(tweet_num, translatedText, label) %>%
unnest_tokens(word, translatedText) |>
anti_join(stop_words, by = join_by(word)) %>%
mutate(word = hunspell_stem(word)) |>
unnest(word) |>
left_join(get_sentiments("afinn"),
multiple = "all",
by = join_by(word)) %>%
mutate(value = replace_na(value, 0)) %>%
rename(sentiment_afinn = value)
I don’t know who decided that “superb” gets a “+5” and “euphoric” gets a “+4,” but there you are. We can see how the valences are distributed in our tweets in the histogram below. The vast majority of words are not in the lexicon and are therefore coded as zero (neutral).
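If you want to check those scores yourself, a quick look-up against the lexicon (not part of the pipeline above) does it:
# peek at the afinn entries for the two words mentioned above
get_sentiments("afinn") |>
  filter(word %in% c("superb", "euphoric"))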
tweet_word_sentiment |>
ggplot(aes(sentiment_afinn)) +
geom_histogram(binwidth = 1, fill = yellow1, color = tan1) +
scale_x_continuous(n.breaks = 10, limits = c(-5,+5)) +
theme_afri() +
labs(title = '"Afinn" Sentiment Database Allows More Nuance')
Now let’s look at how stemming works. The tweet below is coded as negative in the Afrisenti data set.
afrisenti_translated$translatedText[45]
[1] "These idiots are cannibals and unforgivable renegades of the Eritrean people"
tweet_word_sentiment |>
filter(tweet_num == "45") |>
select(word,sentiment_afinn)
# A tibble: 5 × 2
word sentiment_afinn
<chr> <dbl>
1 idiot -3
2 cannibal 0
3 forgivable 0
4 renegade 0
5 people 0
The benefits and drawbacks of stemming are apparent. We can find the stems of “idiots” and “cannibals,” but “unforgivable” gets changed to “forgivable,” which flips its sentiment. It doesn’t matter here, because neither word is in our sentiment lexicon. There are no positive words in this tweet, so the sum of all the valences is negative, which matches the assigned sentiment.
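As a quick check, we can confirm that the word-level valences of this tweet sum to a negative number:
# net valence of tweet 45: only "idiot" (-3) contributes, so the sum is -3
tweet_word_sentiment |>
  filter(tweet_num == 45) |>
  summarise(net_valence = sum(sentiment_afinn))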
Next we add up the valences for each tweet to arrive at the net sentiment.
tweet_sentiment <- tweet_word_sentiment %>%
group_by(tweet_num) %>%
summarise(sentiment_afinn = as.integer(sum(sentiment_afinn))) %>%
ungroup() |>
full_join(afrisenti_translated,by = join_by(tweet_num)) |>
mutate(language = as_factor(assigned_long)) |>
rename(label_original = label) |>
# set NA sentiment to neutral
mutate(across(contains("sentiment"),~replace_na(.x,0)))
We have numeric valences. If we want to compare our measurement to the original we have to decide how to label each number. Obviously, zero is neutral, but should we expand the neutral range to include, say, -1 and +1? Sadly, no. The summary below shows that we have already assigned far more neutral labels than the original data set has. This is not a good omen. We wish we had more granularity.
tweet_sentiment <- tweet_sentiment |>
mutate(label_afinn = as.factor(cut(sentiment_afinn,
breaks = c( -Inf,-1,0,Inf), labels=c("negative","neutral","positive")))) |>
select(language,tweet_num,translatedText,label_original,label_afinn)
summary(select(tweet_sentiment,label_original,label_afinn))
label_original label_afinn
negative:36564 negative:27564
neutral :39151 neutral :51688
positive:36005 positive:32468
It’s not me, it’s you.
As we look at the cross-tab, there are many, many incorrect classifications. The accuracy for both of these measures is scarcely more than 50%. The “Kappa” statistic shows that we are only about 26% better than random chance. It’s not zero, but it’s not good. Why the disappointing result? First of all, our valence measure isn’t opinionated enough. There are far too many neutral tweets.
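The cross-tab itself isn’t reproduced above, but a minimal sketch of how the accuracy and Kappa figures can be computed looks something like this (the hand-rolled Kappa is for illustration and may not be the exact code behind the numbers quoted):
# confusion matrix of original labels vs. our afinn-based labels
xtab <- table(original = tweet_sentiment$label_original,
              afinn = tweet_sentiment$label_afinn)
xtab
# accuracy: share of tweets on the diagonal
accuracy <- sum(diag(xtab)) / sum(xtab)
# Cohen's Kappa: observed agreement relative to agreement expected by chance
p_expected <- sum(rowSums(xtab) * colSums(xtab)) / sum(xtab)^2
c(accuracy = accuracy, kappa = (accuracy - p_expected) / (1 - p_expected))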
avgs <- tweet_sentiment |>
group_by(language) |>
summarise(across(contains("label"),
~mean(as.numeric(.x)-2),
.names = "mean_{str_remove(.col,'label_')}")) |>
pivot_longer(cols = contains("mean"), names_to = "source",values_to = "average") |>
mutate(source = str_remove(source,"mean_"))
# plot the sentiment distribution by language
avgs |> ggplot(aes(y = average,x = language)) +
geom_point(aes(color = source,shape = source),
size = 4) +
geom_line() +
scale_y_continuous(n.breaks = 5) +
scale_shape_manual(values = c(afinn = 15,
sentence=19,original =3)) +
scale_color_manual(values = c(afinn = "red",
sentence=green1,original ="black")) +
scale_fill_manual(values = c(afinn = "red",
sentence=green1,original ="black")) +
theme_afri() +
labs(y="Sentiment",x="Language",
title = "There is Disagreement About The Average Sentiment",
subtitle = "Why is English the most divergent?") +
ggplot2::coord_flip()
You may wonder if one of the other sentiment lexicons would produce a better result. I have tried the others but I don’t include them here because the results are substantially identical.
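For the curious, here is roughly how one of them, the “bing” lexicon, could be swapped in. This is a sketch of the idea rather than the exact comparison I ran; “bing” only labels words positive or negative, so the labels are mapped to +1/-1 before summing:
# hypothetical swap-in of the "bing" lexicon, analogous to the afinn pipeline above
tweet_sentiment_bing <- afrisenti_translated %>%
  select(tweet_num, translatedText) %>%
  unnest_tokens(word, translatedText) %>%
  anti_join(stop_words, by = join_by(word)) %>%
  inner_join(get_sentiments("bing"), by = join_by(word)) %>%
  mutate(value = if_else(sentiment == "positive", 1, -1)) %>%
  group_by(tweet_num) %>%
  summarise(sentiment_bing = sum(value))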
In our defense, I’m not sure the Afrisenti sentiment assignments are better, as we saw above. But maybe that just means Google Translate has stripped out some of the emotion that is present in the original language. I don’t know, but if so, Google Cloud Translate doesn’t work well for this purpose.
Here’s the puzzle, though. The biggest disagreement about sentiment is in English-language tweets, where no translation is needed, so we can’t blame Google for this. A look at some English tweets reveals that they are mostly in Pidgin, so the vocabulary is not what we would expect in the sentiment sources we’re using. Here are some random English tweets:
afrisenti_translated |>
filter(assigned_long == "English") |>
slice_sample(n=5) |>
select(translatedText)
# A tibble: 5 × 1
translatedText
<chr>
1 jesus taught i had sef
2 better all those figures they are generating if u never dey vex reach na the …
3 people don dey find themselves from twitter oo and that ilorin where everyone…
4 it's sweet soul abeg
5 as a guyman used to paying the lump sum and calling it a day with the mindset…
So, paradoxically, the translated tweets of truly foreign languages are rendered into more standard English.
Maybe we’ll get more agreement if we address negation, as mentioned above. The tweet below illustrates the problem.
afrisenti_translated$translatedText[28670]
[1] "@user Why don't America want the world to be peaceful 🤔🤔🤔"
tweet_word_sentiment |>
filter(tweet_num == "28670")
# A tibble: 3 × 4
tweet_num label word sentiment_afinn
<dbl> <fct> <chr> <dbl>
1 28670 negative user 0
2 28670 negative world 0
3 28670 negative peaceful 2
Unless the negation is itself a negative-valence word, many negative tweets will test as positive. In this example, “peaceful” is positive but “don’t want” clearly flips it.
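One simple remedy, previewed here with a hypothetical helper (combine_negations() is not used anywhere in this analysis), is to glue a negator onto the word that follows it so the pair can be scored as a single token:
# hypothetical helper: fuse common negators to the next word, so that
# "not peaceful" becomes the single token "not_peaceful"
combine_negations <- function(text) {
  gsub("\\b(no|not|never)\\s+(\\w+)", "\\1_\\2", text, ignore.case = TRUE)
}
combine_negations("the world is not peaceful")  # "the world is not_peaceful"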
A More Sophisticated Approach
There are a number of approaches to solving the negation problem. A simple one is to combine negation words like “no” and “not” with the word that follows, as sketched above; we will use that approach in our next post, which attempts machine learning. Here we’ll try sentence-level measurement using the sentimentr package, which understands negation better. Our test tweet from above is a challenge because the negation of “peaceful” comes eight words before it, and the default of n.before = 5 doesn’t give us the right answer.
mytweets <- "Why don't America want the world to be peaceful"
sentimentr::sentiment(mytweets,n.before = 5)
element_id sentence_id word_count sentiment
1: 1 1 9 0.25
Setting n.before = Inf captures this but comes at the expense of slower processing speed and potentially more false negatives.
sentimentr::sentiment(mytweets,n.before = Inf)
element_id sentence_id word_count sentiment
1: 1 1 9 -0.25
With that noted, let’s forge ahead.
As before, we start by breaking the tweets up into their constituent parts, in this case sentences; most tweets are just one sentence, of course. Again we compute a sentiment score, this time with the sentiment_by() function. It yields a finer-grained sentiment score than our earlier measures.
tweet_sentence <- afrisenti_translated |>
mutate(language = as.factor(assigned_long)) |>
mutate(sentence = get_sentences(translatedText)) %$%
sentiment_by(sentence,list(language,tweet_num),n.before = Inf)
In the code above I use a less common pipe operator, %$% from the magrittr package, which exposes the columns of the data frame, including the list-column created by get_sentences(), to sentiment_by(). Normally I would use tidyr::unnest() to do this, but that loses the special object class that sentiment_by() needs. The sentimentr package uses the fast data.table vernacular, not the dplyr one that I mostly use here.
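If the exposition pipe is unfamiliar, a one-liner shows what it does; mtcars is just a stand-in data set:
# %$% exposes the columns of the left-hand data frame by name to the right-hand call
mtcars %$% cor(cyl, mpg)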
Before we look at the tweet-level sentiment, let’s group by language. This lets us see whether the sentiment measure is consistent across languages.
avgs <- tweet_sentence |>
group_by(language) |>
summarise(ave_sentiment = mean(ave_sentiment))
avgs_orig <- afrisenti_translated |>
group_by(assigned_long) |>
summarise(ave_sentiment = mean(as.numeric(label)-2))
# plot the sentiment distribution by language
tweet_sentence |>
as_tibble() |>
# make the size manageable
slice_sample(n=1000) |>
group_by(language) |>
rename(sentiment = ave_sentiment) |>
ggplot() + geom_boxplot(
aes(y = sentiment, x = language),
fill = NA,
color = "grey70",
width = 0.55,
size = 0.35,
outlier.color = NA
) +
geom_jitter(
aes(y = sentiment, x = language),
width = 0.35,
height = 0,
alpha = 0.15,
size = 1.5
) +
scale_y_continuous(n.breaks = 5) +
# theme_bw() +
theme_afri() +
geom_point(data = avgs,
aes(y = ave_sentiment, x = language),
colour = "red",
shape = 18,
size = 4) +
geom_point(data = avgs_orig,
aes(y = ave_sentiment, x = assigned_long),
colour = green1,
shape = 19,
size = 4) +
ylab("Sentiment") + xlab("Language") +
labs(title = "Average Sentiment at the Sentence Level is Consistently Positive",
subtitle = "There is wide disagreement with English tweets.") +
annotate("text",y = .7,x=7.5,label = "Red Diamond is Measured") +
annotate("text",y = .7,x=6.5,label = "Green Circle is Original") +
coord_flip()

The dots are the measured sentiment of a sample of tweets. The markers are the average measured sentiment (red diamond) and the average original sentiment (green circle) across all tweets. The measured sentiment looks more consistent than the original. By our calculations, tweets in every language score positive on average, so we clearly disagree with the original. Note that the original 3-level sentiment has been converted to a numeric range of -1 to 1 to compute the green markers, whereas the range of our measurements is much wider.
tweet_sentence |>
ggplot(aes(ave_sentiment)) + geom_histogram(fill=yellow1,bins=50) +
scale_x_continuous(limits = c(-3,3))+
theme_afri()
We no longer have the problem of too many tweets being neutral. The median of the distribution above is zero, but far fewer tweets have a valence of exactly zero. This lets us expand the range of neutral above and below zero. I tried to balance the data to match the distribution of sentiment in the original data as closely as possible.
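One way to arrive at cut points like the ones used below (a sketch of the balancing idea, not necessarily how -0.01 and 0.12 were chosen) is to take quantiles of the measured sentiment at the original class proportions:
# hypothetical: place the class boundaries so the share of tweets in each class
# roughly matches the original negative/neutral/positive proportions
orig_props <- prop.table(table(afrisenti_translated$label))
quantile(tweet_sentence$ave_sentiment, probs = cumsum(orig_props)[1:2])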
While we’re at it, let’s put our results together.
# picked these levels to balance the data set
low = -0.01
high = 0.12
tweet_sentence <- tweet_sentence |>
as_tibble() |>
mutate(label_sentence = cut(ave_sentiment,
breaks = c( -Inf,low,high,Inf), labels=c("negative","neutral","positive")))
tweet_sentiment <-
left_join(tweet_sentiment, tweet_sentence,
by = join_by(language, tweet_num))
Compare our distribution to the original.
tweet_sentiment |>
pivot_longer(c(label_original,label_sentence),
names_to = "source",values_to = "sentiment") |>
group_by(source,sentiment) |>
count() |>
group_by(source) |>
reframe(sentiment,proportion = n/sum(n)) |>
ungroup() |>
ggplot(aes(sentiment,proportion,fill=source)) + geom_col(position = "dodge") +
scale_fill_manual(values = c(yellow1,green1)) +
theme_afri() +
labs(title = 'Choose Valence Range of "Neutral" to Balance Outcomes')