Analyzing tweets about BYU Football and BYU Basketball

Posted by Allyson Irwin, Avery McCusker, and Colby Nelson on June 18, 2020


  • Social media chatter centers around game time events, merchandise and experiences, and sports programs
  • Teams can use tweet analysis to understand how they can improve social media outreach with fans and the community
  • Focusing on major games could help improve social media outreach

The Background

Sports teams around the world compete on two fronts: we watch them struggle for victory on the playing field, but we sometimes forget they are also competing in the business world. In a world of increasingly rich data sources, teams have the opportunity to improve their odds of winning, at least on the business side, by leveraging the data around them to better understand their fans and other customers. With publicly available social media data and some practice with data analysis tools, teams can take their first steps toward understanding their customers better.

So we wanted to figure out:

What can sports teams learn by analyzing social media activity about their games?

Gathering the Data

We gathered Twitter data for Brigham Young University’s football and basketball teams for the 2019-2020 seasons. We scraped tweets and comments related to ‘byu football’ and ‘byu basketball’ from the week before each home game. The tweets we gathered represent what BYU’s fans were talking about at each point in time. Social media remains largely untapped in teams’ efforts to connect with fans and improve their business models.

On the left is a sample tweet with a few of the features we collected about it.

Gathering Additional Data

To add to our Tweet dataset, we also gathered official and semi-official team Twitter handles (e.g. coaches, players, and BYU-sponsored accounts) and added a column to track whether each Tweet was shared by a team-related account.

We also gathered game-related data from ESPN, with fields including schedule position (e.g. first game of the season), game date, game start time, sport, rival opponent (i.e. in-conference or in-state opponents), and ranked opponent (i.e. whether the opponent was ranked in the top 25 nationally as of game day).
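As a sketch of what one game-level record looked like (field names and values here are illustrative, not our exact schema):

```python
# One illustrative game-level record; field names are hypothetical stand-ins.
game = {
    "game_number": 1,           # schedule position (first home game of the season)
    "game_date": "2019-08-29",
    "start_time": "20:15",      # local start time
    "sport": "football",
    "rival_opponent": True,     # in-conference or in-state opponent
    "ranked_opponent": False,   # opponent in the top 25 nationally as of game day
}
```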

Cleaning Web Scraped Social Media Data

We used Python in Google Colab to clean and combine the tweets from all the games and the game data from ESPN. We converted columns that were “missing not at random” and contained unique strings into booleans, such as turning “image_URL” into “has_image” to show whether the original tweet contained an image. Where a null value meant “none,” such as in the retweet or like counts, we replaced the nulls with zeros. Instead of leaving the game opponents as categorical data, we converted that field into dummy variables, one per opponent.
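A minimal sketch of these cleaning steps with pandas (the toy frame and column names are illustrative):

```python
import pandas as pd

# Toy frame standing in for the combined tweet data (columns are illustrative)
tweets = pd.DataFrame({
    "image_URL": ["http://pic.example/1.jpg", None, None],
    "retweet_count": [3, None, 1],
    "opponent": ["Utah", "USC", "Utah"],
})

# "Missing not at random" string column -> boolean flag
tweets["has_image"] = tweets["image_URL"].notna()

# Null retweet/like counts mean zero
tweets["retweet_count"] = tweets["retweet_count"].fillna(0)

# Categorical opponent -> one dummy column per opponent
tweets = pd.get_dummies(tweets, columns=["opponent"])
```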

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

stemmer = PorterStemmer()
stopwords_english = stopwords.words('english')

def clean_tweets(tweet):
  # remove old style retweet text "RT"
  tweet = re.sub(r'^RT[\s]+', '', tweet)
  # remove hyperlinks
  tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
  # remove the hash # sign from hashtags (keeping the word itself)
  tweet = re.sub(r'#', '', tweet)
  # tokenize the tweet
  tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
  tweet_tokens = tokenizer.tokenize(tweet)
  tweets_clean = []
  for word in tweet_tokens:
      if (word not in stopwords_english and    # remove stopwords
              word not in emoticons and        # remove emoticons (custom list)
              word not in string.punctuation): # remove punctuation
          stem_word = stemmer.stem(word)       # stem the word
          tweets_clean.append(stem_word)       # keep the cleaned token
  return tweets_clean

Analyzing Tweet Sentiment

For our first step, we analyzed the sentiment (positive vs. negative) of the tweets.

Referencing articles by Mukesh Chapagain, we used the NLTK and Scikit-Learn Python packages to train a sentiment analysis model on NLTK sample tweets and then applied the classification models to our data.

We specifically used the ‘twitter_samples’ corpus from NLTK. We then pre-processed the corpus training set using NLTK’s ‘TweetTokenizer’ and custom regular expressions. We handled emoticons with a custom list so as not to lose the sentiment significance they carry, and created a custom function for preprocessing and tokenizing each tweet:

# feature extractor function
def bag_of_words(tweet):
  words = clean_tweets(tweet)
  words_dictionary = dict([word, True] for word in words)    
  return words_dictionary
# positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
  pos_tweets_set.append((bag_of_words(tweet), 'pos'))    
# negative tweets feature set 
neg_tweets_set = []
for tweet in neg_tweets:
  neg_tweets_set.append((bag_of_words(tweet), 'neg'))

Next, we created a feature extractor function using a simple ‘bag of words’ methodology to capture the relative frequency of words in the tweets (shown above).

We used a randomized 80-20 split to create training and test sets for scoring tweet sentiment. We then trained classifiers using the NLTK ‘NaiveBayesClassifier’, and Scikit-Learn’s ‘MultinomialNB’, ‘BernoulliNB’, ‘LogisticRegression’, ‘SGDClassifier’, ‘LinearSVC’, and ‘NuSVC’ models on our training data set of positive and negative tweets.
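The randomized 80-20 split can be sketched in plain Python (the labeled feature sets here are stand-ins for the ones built with `bag_of_words` above):

```python
import random

# Stand-in labeled feature sets; the real ones come from bag_of_words
pos_tweets_set = [({"great": True}, "pos")] * 50
neg_tweets_set = [({"awful": True}, "neg")] * 50

all_tweets = pos_tweets_set + neg_tweets_set
random.seed(42)            # fixed seed so this sketch is reproducible
random.shuffle(all_tweets)

# 80% of the shuffled data for training, 20% held out for testing
split = int(0.8 * len(all_tweets))
train_set, test_set = all_tweets[:split], all_tweets[split:]
```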

Lastly, we created a voting function to implement a voting system among the seven trained models, and we applied this combined model to the tweets and comments that we had scraped to show if they had a positive or negative sentiment. 
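The voting system can be sketched as a simple majority vote over the individual classifiers’ labels (a pure-Python sketch; in our pipeline the labels come from the seven trained models):

```python
from collections import Counter

def vote(labels):
    """Return the majority label among the individual models' predictions."""
    return Counter(labels).most_common(1)[0][0]

# e.g. seven models each predicting one tweet's sentiment
predictions = ["pos", "pos", "neg", "pos", "neg", "pos", "pos"]
sentiment = vote(predictions)   # 'pos' wins the vote 5-2
```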

The full Jupyter Notebook file for our sentiment analysis can be seen on Google Colab here.

Clustering Tweet Topics

Next, we wanted to identify the topics people focus on when talking about BYU sports on Twitter. We used a Latent Dirichlet Allocation statistical model for organizing these clusters and understanding them, which can be found on Google Colab here.

For the topic clustering process, we tokenized and lemmatized the words from the tweets. We removed stop words and used the Python Gensim library to split the tweets and comments into n-grams. We performed LDA analysis separately for the tweets and the comments in our dataset to understand the differences between them.

We used the Python Gensim package’s ‘LdaModel’ to perform topic clustering for each group, varying the number of topics between 2 and 8 to find an optimal number that produced logical word groups with understandable relationships.

The visualizations below were created with pyLDAvis to show the final modeled groups, with 3 topics for the tweets and 4 topics for the comments in our dataset.

Labeling Tweet Topics

To find the best topic label, we analyzed the words within each topic to find common trends. We labeled Topic 0 as 'Time' because most of its words refer to time-specific things. Topic 1 is 'Merchandise and Experience' because its words relate to merchandise and the game-day experience. We chose 'Sports Programs' for Topic 2 because its tweets refer to the overall program.

Labeling Comment Topics

Although the comments contained words and topics similar to the tweets, we labeled them separately for a more accurate representation. Topic 0 is 'Team Players' because it talks about the players and their actions. Topic 1 did not have an overarching theme, so we labeled it 'Other.' Topic 2 is 'Game' because the words relate to the game itself. 'Experience' is the label for Topic 3 because it represents the fan experience.

Understanding Trends in Tweets for Each Game

We summarized the data for each game so we could see fields such as total number of tweets and comments, total number of likes and retweets for tweets and comments, total number of official BYU authors, and total number of images and videos in tweets and comments for each game. As you can see in the figure, those values are all generally correlated and follow similar trends.
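A sketch of that per-game roll-up with pandas (the toy data and column names are illustrative):

```python
import pandas as pd

# Toy tweet-level data (columns are illustrative)
tweets = pd.DataFrame({
    "game_id": [1, 1, 2, 2, 2],
    "likes": [10, 3, 7, 1, 4],
    "retweets": [2, 0, 5, 0, 1],
    "is_official": [True, False, False, True, False],
    "has_image": [True, True, False, False, True],
})

# Summarize each game: counts and totals across its tweets
per_game = tweets.groupby("game_id").agg(
    total_tweets=("likes", "size"),
    total_likes=("likes", "sum"),
    total_retweets=("retweets", "sum"),
    official_authors=("is_official", "sum"),
    images=("has_image", "sum"),
)
```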

Predicting Retweet Count for Future Tweets

After analyzing tweet sentiment, common tweet topics, and game-level data, we decided to see if we could predict each tweet’s retweet count. For our predictions, we started by only including features that would be available when the tweet is posted, which excludes fields such as “comment_retweets” or “comment_sentiment.” We used Azure ML to create a predictive experiment with our data to see which fields most directly correlated with the retweet count.

We started with all of the attributes of a tweet and ran various algorithms to make the prediction as accurate as possible, with Azure Machine Learning as our tool of choice. Linear regression gave us the best prediction, with an R^2 of 0.4633. Interestingly, the best R^2 occurred when we left all of the columns in our data set.
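The fit itself can be sketched as ordinary least squares in NumPy (toy data; our actual experiment ran in Azure ML):

```python
import numpy as np

# Toy feature matrix (e.g. has_image, is_official, sentiment score) and retweet counts
X = np.array([[1, 0, 0.9],
              [0, 1, 0.1],
              [1, 1, 0.8],
              [0, 0, 0.3],
              [1, 0, 0.5]], dtype=float)
y = np.array([12.0, 3.0, 20.0, 1.0, 8.0])

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 = 1 - SS_res / SS_tot measures how much variance the model explains
pred = A @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```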

Implementing the Model in a Live Prediction Calculator

The Retweet Calculator was built as a webform in C# and connects to Microsoft Azure Machine Learning through an API. All the factors listed on the Retweet Calculator are sent to Azure and are run through our predictive models. The results are then returned and shown at the bottom of the form. The website is deployed through Azure as well, which supplies the SSL certificate. This was done through Web Deploy incorporated into Microsoft Visual Studio.

The following button will take you to a site to see our calculator in action, so you can predict how many retweets a tweet about BYU basketball or football will get.

Retweet Calculator

Author Bios

Allyson Irwin

LinkedIn | Her Website

Allyson is a first year Master of Information Systems Management student at BYU. She is pursuing a career in data engineering to use her programming and analytical skills together.

Colby Nelson


Colby is a Master of Information Systems Management student at BYU graduating April 2021. He is from Billings, Montana, and is interested in leveraging data to solve problems and improve our world.

Avery McCusker


Avery is an Information Systems student at Brigham Young University aspiring to use his technical background in a sales role.