November 16, 2023 in Five-Minute Analyst
Wheel of Words: A quick and fun use of an NLP model
https://doi.org/10.1287/LYTX.2023.04.16
I recently found myself eagerly awaiting the seventh episode of the second season of Amazon’s adaptation of Robert Jordan’s “Wheel of Time.” For those unfamiliar with the Wheel of Time series of novels, it consists of 14 tomes and one prequel. Many of the books are very long, and the cast runs to literally thousands of characters. Robert Jordan created a fantasy world that has attracted a very devoted fan base. The seasons of Amazon’s adaptation loosely sync with the first and second books in the series, “The Eye of the World” and “The Great Hunt,” respectively. At the time of this writing, the final episode of Season 2 has yet to air, so I am focusing on analyzing Season 1 against the first novel.
The adaptation continues to shift away from the books but somehow, overall, still maintains the big-picture storyline. Not surprisingly, there are numerous fan websites that seek to compare the show with the novels. I decided I would dive in and see what I could find.
Even for a five-minute analysis, I always first check to see what has already been analyzed. On WoT.fandom.com, I was able to find point-of-view (POV) breakdowns per book [1]. There are technically seven POVs in the first novel, but for the sake of comparison with the show, I chose to look at only five. Those five POVs correspond to the following characters: Egwene, Rand, Perrin, Nynaeve and Moiraine. Almost 80% of the first novel is from Rand’s POV.
Staying focused on those same five characters but shifting to the Amazon series, I found the website thegreatblight.com to have a great Season 1 breakdown that includes published Google Sheets with data [2]. Specifically, they include three measures of interest: screen time, speaking time and word count. I combine data from all four of the data sets into Figure 1. Moiraine speaks more on average when on screen than the other characters. However, she barely registers for novel POV time. The POV column on the far right of Figure 1 is dominated by Rand. But wait! I hope the superb analysts reading this are pointing out that POV and these other metrics from the show are not the same thing! You, of course, would be correct. This is why I spent fruitless time searching for dialogue word counts by character in the novels – fruitless because, try as I might, I was unable to find that type of analysis. And to be honest, until more recently, that kind of work would indeed be tedious.
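Screen time, speaking time, word count and POV length are all measured in different units, so one simple way to put them side by side is to convert each measure into every character's share of that measure's total. The sketch below shows that normalization. It is an assumption about how such a comparison could be set up, not a description of how Figure 1 was built. The function name `to_shares` and the numbers are placeholders.

```python
def to_shares(totals):
    """Convert raw per-character totals into percentages of the overall total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Avoid dividing by zero when a measure has no data at all.
        return {name: 0.0 for name in totals}
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


# Placeholder numbers for illustration only; they are NOT the article's data.
example_word_counts = {"Character A": 300, "Character B": 100}
print(to_shares(example_word_counts))
```

Applying the same function to each measure gives columns that each sum to 100, which makes a show metric and a novel metric directly comparable even though their raw units differ.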
Feeling unsatisfied with the results of my search, I thought I might try to do a very quick analysis of the text and simply extract the word counts for myself. After all, I have a PDF copy of the first book. How hard could this be? And having little personal experience in this space beyond some basic text mining, I first asked both OpenAI’s ChatGPT 3.5 (yes, the free version) and Google Bard for some direction. My previous experience with these tools led me to be specific in my requests. This produced some less-than-stellar recommendations that led me down a path into what I can only assume to be some level of hell, involving increasingly complex ways to deal with regular expressions. The challenge was that my PDF file of the novel was simply not formatted in a nice and easy way to parse out the words spoken by the various characters. And thus, I moved on from this approach and am not counting it as part of the five minutes.
Switching tactics in my prompt engineering, I asked a more generic question: Is there a natural language processing (NLP) model that can extract less structured word counts by character from text? This prompt gave a more humanlike reply with not one, but three options to consider. Option 3 was a rule-based approach that sounded too much like the aforementioned regular expression level of hell. Option 2 was using advanced contextual language models, which sounded great but might be a little more horsepower than what I need here. Option 1, and ultimately the option I chose, was specific and suggested using pretrained Named Entity Recognition (NER) models. Specifically, it recommended spaCy’s “en_core_web_sm” model to identify the character names in the text and then extract the dialogue.
Off to a Jupyter notebook I went. Of note, I use PyPDF2 to read in the PDF file and extract the text. Once done, there is still one hiccup: the text is longer than the model’s default max_length, but that limit can be raised. Then, with some help from ChatGPT on splitting up the dialogue and creating a dictionary of word counts by character, I run the NLP model in literally seconds and produce results. If you’d like to try on your own, I’ve included the Jupyter notebook [3].
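For readers who want a starting point before opening the notebook, here is a minimal sketch of that workflow. It is not the notebook's code. The function names are illustrative, the file name in the usage comment is hypothetical, and the speaker-attribution rule is a deliberately crude assumption: each quoted passage is credited to the first known character named in the unquoted part of the same paragraph. It also assumes paragraphs are separated by blank lines, which extracted PDF text does not always preserve. Raising `max_length` can also use a lot of memory on long texts.

```python
import re
from collections import Counter

# Matches a passage in double quotes, straight or curly.
QUOTE_RE = re.compile(r'[“"]([^”"]+)[”"]')


def read_pdf_text(path):
    """Concatenate the extracted text of every page in a PDF."""
    from PyPDF2 import PdfReader  # lazy import: only needed for PDF input
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def load_model(text_length):
    """Load spaCy's small English model with max_length raised to fit the text."""
    import spacy  # lazy import: only needed for the NER step
    nlp = spacy.load("en_core_web_sm")
    nlp.max_length = max(nlp.max_length, text_length + 1)
    return nlp


def person_names(doc):
    """Return the set of entity strings the model labels as PERSON."""
    return {ent.text for ent in doc.ents if ent.label_ == "PERSON"}


def dialogue_word_counts(text, names):
    """Count quoted words per speaker using the crude rule described above."""
    counts = Counter()
    for paragraph in re.split(r"\n\s*\n", text):
        quotes = QUOTE_RE.findall(paragraph)
        if not quotes:
            continue
        # Narration is whatever remains once the quoted passages are removed.
        narration = QUOTE_RE.sub(" ", paragraph)
        hits = []
        for name in names:
            match = re.search(r"\b" + re.escape(name) + r"\b", narration)
            if match:
                hits.append((match.start(), name))
        if not hits:
            continue  # no known character named; leave the quote unattributed
        speaker = min(hits)[1]  # earliest-mentioned name wins
        counts[speaker] += sum(len(quote.split()) for quote in quotes)
    return counts


# Usage, assuming a hypothetical file "book.pdf" and an installed model:
#   text = read_pdf_text("book.pdf")
#   nlp = load_model(len(text))
#   counts = dialogue_word_counts(text, person_names(nlp(text)))
```

The small model was not trained on fantasy names, so its PERSON labels may miss or mislabel some characters. Passing a hand-written set of names to `dialogue_word_counts` instead is a reasonable fallback.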
Results from this very quick use of an NLP model show that the disparity between the show and the novel is not so extreme. Figure 2 shows us that although Rand still dominates the dialogue in the novel, it is not to the same extent as when only considering the POV word count. The next step will be to analyze the word counts of Season 2 against Book 2, “The Great Hunt.” As the two diverge further, it will be interesting to see how this adaptation compares with the author’s original version.
I found this exploration of an NLP model useful and relatively quick to learn and implement. Clearly, more applications of this are possible, and I hope that this encourages everyone to take five minutes and try something new.
References
Nick Ulmer, CAP, has been an operations research analyst since 2014. He is the inaugural chair of the INFORMS Military Veterans Interest Forum and a Principal Operations Research Analyst for CANA LLC, leading teams of analytics professionals to produce high-level analytics products across federal and commercial domains.