How many words do you need to understand Donald Trump on Twitter?
Learning English: How big does your vocabulary need to be to roughly understand someone? How can you reach this point the fastest?
5 min read · published August 27, 2020 · last update November 17, 2020
Here are the highlights of this article:
- With a vocabulary of Donald Trump's most used 500 words, you'll roughly understand his tweets.
- If you start learning a language, learn the most used words first. This is up to 150 times more effective than learning rarely used words.
- Trump's most used words largely overlap with the most used English words.
Learning a new language can be very frustrating. At the beginning, you don't understand anything. It takes a long time to get to the point where you understand a conversation. So, is there any way to speed it up?
Let's dive right in.
Why analyze Donald Trump's tweets?
Donald Trump is the current US president. With more than 85 million followers, his Twitter account is among the top 10 accounts worldwide, and his tweets often make international news headlines. Most of his tweets use simple words, which makes them easy to understand for English learners. Also, the content of his tweets makes for interesting discussions with almost anyone.
This article analyzes his tweets to illustrate the troubles of someone learning a new language. Debates on Donald Trump's politics should be held in other forums.
Donald Trump's most used words
The word frequencies were computed from the last 30,000 English words Donald Trump tweeted himself (retweets excluded), which came from 1144 tweets. When we refer to Trump's top or most used words in the rest of this article, we always mean the top words according to the frequency distribution of these 30,000 words.
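A minimal sketch of how such a frequency count could be produced in Python. The tokenizer (lowercasing and keeping only runs of letters and apostrophes) and the two sample tweets are assumptions for illustration, not the method or data used for this article.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as words.

    This is a deliberately simple, assumed tokenizer: it does not treat
    URLs, hashtags or @mentions specially.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tweets):
    """Count how often each word occurs across all given tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


# Hypothetical sample tweets, made up for this example.
sample_tweets = [
    "The news is fake and the people know it",
    "The military is great and the police are great",
]

freq = word_frequencies(sample_tweets)

# The "top words" are simply the most frequent entries of the counter.
top_words = [word for word, _ in freq.most_common(3)]
print(freq.most_common(3))
```

Each word form is counted separately here, so "is" and "are" end up as two different entries.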
Here are Donald Trump's top 100 words:
We can compare this list to the list of top English words, which we created in a previous article.
- There's a large overlap: 63 of Trump's top 100 words also appear in the top 100 English words. 82 appear in the top 300 English words.
- Specific words such as joe, biden, fake, news, military, police, china, … are used much more frequently by Donald Trump than by the average English speaker. This is partly a reflection of his personal vocabulary, and partly due to current American politics.
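The overlap between two top lists boils down to a set intersection. The following Python sketch shows the idea. Both word lists are short, made-up stand-ins, not the actual top 100 lists, and `overlap_count` is a hypothetical helper name.

```python
def overlap_count(top_a, top_b):
    """Return how many words appear in both lists."""
    return len(set(top_a) & set(top_b))


# Hypothetical, shortened stand-ins for the two top lists.
trump_top = ["the", "and", "biden", "fake", "news"]
english_top = ["the", "and", "of", "to", "a"]

shared = overlap_count(trump_top, english_top)
print(f"{shared} of {len(trump_top)} words overlap")
```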
A note about counting words:
- Some researchers count word roots, or lemmas, and not single words. E.g. in the list above is, be, are, was, been would all be grouped under to be. In this article, all words are considered separately.
Knowing half the words isn't enough
So what happens when we only understand these words? How much of a tweet do we actually "get"? Here's a simulation of what we see if our vocabulary consists only of the 100 words above. (The two tweets were picked arbitrarily: they are simply the last tweets I analyzed.)
The big number on top is the percentage of words you understand. E.g. in the first tweet, you understand 20 out of 37 words. That's 54%.
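The percentage can be reproduced with a few lines of Python. This is only a sketch: the function name `understood_share` and the placeholder words are made up for illustration. Only the numbers 20 and 37 come from the example above.

```python
def understood_share(tweet_words, vocabulary):
    """Return the fraction of words in a tweet that are in the vocabulary."""
    if not tweet_words:
        return 0.0
    known = sum(1 for word in tweet_words if word in vocabulary)
    return known / len(tweet_words)


# Placeholder tweet of 37 distinct words, of which we "know" the first 20.
tweet_words = [f"word{i}" for i in range(37)]
vocabulary = set(tweet_words[:20])

share = understood_share(tweet_words, vocabulary)
print(f"{share:.0%}")  # 20 / 37, shown as a rounded percentage
```

Repeated words count every time they occur, so a frequent word like "the" contributes more to the percentage than a rare one.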
First the good news: we understand about 50% of the words in Donald Trump's tweets by learning just 100 words. Now the bad news: although we understand 50% of the words, we do not understand the message.
How does this change when we increase our vocabulary to the 500 most used words of Donald Trump?
Going from 100 to 500 words, we understand another 25% of the words. Please read this simulation and compare it with the previous one. This additional 25% seems to be crucial. Knowing only 50% of the words was not enough to understand the tweets. But now, knowing 75% of the words, we are able to guess the meaning of each tweet.
Does the percentage of words we understand in each tweet double if we double our vocabulary? Let's look at a simulation with the 1000 most frequent words:
It would seem logical that twice the vocabulary leads to us understanding twice as many words. However, going from a vocabulary of 500 to 1000 words only increases the percentage of understood words by 10 percentage points. Chances are that the additional words you learned only appear in specific contexts.
So what does this mean? We see that the more words we know, the more we understand. But there are also clearly diminishing returns: 100 words let us understand 49%, 500 words give us 75% and 1000 words give us only 85%.
Let's explore this further. What is the vocabulary size with the steepest learning curve? Analyzing all words over all tweets, we get the following curve:
The percentage numbers in the curve differ slightly from our simulations before, as they now are an average over all 1144 tweets and not only 2.
The curve is very steep at the beginning, but it flattens the more our vocabulary size increases.
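A curve of this kind can be sketched in Python by sorting words by frequency and accumulating their share of all word occurrences. Note the assumptions: the counts below are invented, and this sketch measures coverage over the whole corpus at once, which may differ slightly from averaging per-tweet percentages.

```python
from collections import Counter


def coverage_curve(counts):
    """Return cumulative coverage for vocabulary sizes 1, 2, 3, ...

    Entry k-1 is the fraction of all word occurrences covered
    when we know the k most used words.
    """
    total = sum(counts.values())
    running = 0
    curve = []
    for _, n in counts.most_common():
        running += n
        curve.append(running / total)
    return curve


# Invented word counts, only to show the shape of the curve.
counts = Counter({"the": 5, "great": 3, "china": 1, "tariff": 1})
curve = coverage_curve(counts)

# Gain contributed by each additional word: it can only shrink or stay equal,
# because words are added in order of decreasing frequency.
gains = [curve[0]] + [b - a for a, b in zip(curve, curve[1:])]
print(curve)
```

Because each newly learned word is at most as frequent as the one before it, the curve is steepest at the start and flattens out afterwards.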
Looking at the curve and our simulations we can draw the following conclusion:
Learning 500 words allows us to understand 75% of the words. As we saw in the simulation before, that is enough to roughly get the meaning of the tweets (while still not understanding some details). When our vocabulary size increases further, the words we learn might only appear in a specific context.
Or, more generally:
To roughly understand Donald Trump's tweets, we need to know at least his 500 most used words.
Does it matter which words we learn?
Until now we always assumed that our vocabulary is equal to Trump's most used words. But what happens if it is not? To answer this, let's try the opposite: let's see how much we understand if our vocabulary consists of Trump's least used words.
The result is quite radical. Even if we know 1000 words, we understand only 4% of the words in each tweet. And we don't understand the meaning of any tweet. This means that which words we know matters much more than how many words we know.
Again, we can draw a curve for all words over all analyzed tweets. Using this curve, we see that if we know the least used 100 words, we only understand 0.3%. And we saw in the previous curve that if we know the 100 most used words, we understand 49%.
Think about it: with the right 100 words we understand 49% of the words, but with the wrong 100 words only 0.3%. This means learning the most used words is roughly 150x more effective.
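The comparison between the most used and the least used words can be sketched in Python as follows. The word counts are invented and the helper names are hypothetical. With the rounded figures quoted above, 49% divided by 0.3% comes out at roughly 160, the same order of magnitude as the 150x claim.

```python
from collections import Counter


def coverage(counts, vocabulary):
    """Fraction of all word occurrences that are in the vocabulary."""
    total = sum(counts.values())
    return sum(counts[word] for word in vocabulary) / total


def most_used(counts, n):
    """The n most frequent words."""
    return [word for word, _ in counts.most_common(n)]


def least_used(counts, n):
    """The n least frequent words."""
    return [word for word, _ in counts.most_common()[:-n - 1:-1]]


# Invented counts, only for illustration.
counts = Counter({"the": 50, "great": 30, "china": 15, "tariff": 4, "hoax": 1})

top = coverage(counts, most_used(counts, 2))      # "the" + "great"
bottom = coverage(counts, least_used(counts, 2))  # "hoax" + "tariff"
ratio = top / bottom
print(f"top: {top:.0%}, bottom: {bottom:.0%}, ratio: {ratio:.0f}x")

# The same ratio with the article's rounded figures.
article_ratio = 0.49 / 0.003
```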
- To roughly understand Donald Trump's tweets, we need to understand about 75% of the words he uses in each tweet. We should reach that point when we know his 500 most used words.
- Learning the most used words is 150 times more effective than learning his least used words.
- There are different ways of counting words. This article counted single words, not word roots.
- Donald Trump's most frequently used vocabulary largely overlaps with the top English words, but there are differences. This means many of this article's conclusions are specific to Donald Trump's Twitter feed. In the following articles we will see if we can draw general conclusions to speed up understanding anyone.