December 13, 2022 in Five-Minute Analyst
Analytic Wizardry
https://doi.org/10.1287/LYTX.2023.01.03
COVID-19 and the subsequent quarantine became a reason to buy more Lego bricks, as if such a reason were necessary. The primary collection theme, along with the illustrated books, became Harry Potter. The kids are watching the third movie as I type this article. As an early Christmas (and birthday) present, my spouse received one of the larger sets, Diagon Alley, for two reasons. First, a Thanksgiving tradition that we started two years ago (thanks again, COVID) made getting the set enticing. Second, we know anecdotally that sets are phased out and soon become ridiculously more expensive. Figure 1 shows we are no longer able to fit our sets within their designated space. However, one does not complain about too many Lego sets.
I do not like to live my life based on anecdotal evidence, so a little exploration of data on Lego sets sounded like a nice distraction, and I was pleased to find a large amount of available data on Brickset.com. For this excursion, I targeted the Harry Potter collections. The data I used started with 134 unique Harry Potter Lego sets. After some cleaning, I reduced it to 95 sets. The 39 sets I removed were often promotional rather than retail sets and/or had missing data. For a five-minute analysis, I was not going to attempt to track down the missing data.
To get a visualization to help me understand what was going on, I first grouped the sets by launch year and looked at the mean retail price. Figure 2 shows that as a line chart made with ggplot2. It is only somewhat helpful: it shows that the overall trend is for Lego sets to get more expensive over time. For a better look, I produced box plots sorted by launch year. Figure 3 shows that the box plots follow a similar trend but also that the range of prices has increased, and in later years there are some outliers with high price points. Perhaps Lego knows that the Harry Potter generation can now afford them?! Or maybe they have hopeless addictions to Lego and Harry Potter.
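The grouping step behind a chart like Figure 2 can be sketched in a few lines of R. This is only an illustration: the data frame `sets` and its columns `LaunchYear` and `RetailPrice` are hypothetical names, and the prices are made up rather than taken from Brickset.com. The plots are built only if ggplot2 is installed.

```r
# Toy data only: years and prices are invented for illustration.
sets <- data.frame(
  LaunchYear  = c(2001, 2001, 2001, 2010, 2010, 2010, 2020, 2020, 2020),
  RetailPrice = c(10, 20, 30, 20, 40, 60, 30, 80, 400)
)

# Mean retail price per launch year (the quantity a line chart would show).
mean_by_year <- aggregate(RetailPrice ~ LaunchYear, data = sets, FUN = mean)
print(mean_by_year)

# Price range per launch year (the spread that box plots make visible).
range_by_year <- aggregate(RetailPrice ~ LaunchYear, data = sets,
                           FUN = function(x) max(x) - min(x))
print(range_by_year)

# Build the plots only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  line_plot <- ggplot(mean_by_year, aes(x = LaunchYear, y = RetailPrice)) +
    geom_line() +
    geom_point() +
    labs(x = "Launch year", y = "Mean retail price")
  # factor() gives one box per launch year instead of a continuous axis.
  box_plot <- ggplot(sets, aes(x = factor(LaunchYear), y = RetailPrice)) +
    geom_boxplot() +
    labs(x = "Launch year", y = "Retail price")
}
```

In the toy data the 2020 group contains one very expensive set, which pulls the mean up sharply. That is the kind of outlier a box plot exposes and a line of means hides.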
So how does this compare to the value of the sets? Value here is based on current prices rather than the retail price at launch. Note that the longest time a set has been officially for sale is about three years. Figure 4 uses facet_wrap() to show side-by-side box plots of retail price and new value by year. The 2003, 2005 and 2007 sets have quite a gap between retail and current value. Not surprisingly, the sets from the last few years have not gained as much in value. The y-axis scales are not uniform, so the gap in 2003 looks large even though the actual difference in value is not as large as in some other years. In this situation, I would not make the y-axis scales the same.
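A faceted comparison along these lines might look like the sketch below. It assumes one panel per launch year with independent y-axes (`scales = "free_y"`), which matches the non-uniform scales described above but is a guess at the actual plotting call. The column names and all numbers are hypothetical.

```r
# Toy data only: years and prices are invented, not taken from Brickset.
sets <- data.frame(
  LaunchYear  = c(2003, 2003, 2003, 2020, 2020, 2020),
  RetailPrice = c(10, 20, 30, 50, 100, 150),
  ValueNew    = c(40, 90, 110, 55, 110, 160)
)

# Stack the two price columns into long format: one row per set per price type.
long <- data.frame(
  LaunchYear = rep(sets$LaunchYear, times = 2),
  PriceType  = rep(c("Retail price", "New value"), each = nrow(sets)),
  Price      = c(sets$RetailPrice, sets$ValueNew)
)

# Median gap between current new value and retail price, per launch year.
sets$Gap <- sets$ValueNew - sets$RetailPrice
gap_by_year <- aggregate(Gap ~ LaunchYear, data = sets, FUN = median)
print(gap_by_year)

# Build the plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  facet_plot <- ggplot(long, aes(x = PriceType, y = Price)) +
    geom_boxplot() +
    facet_wrap(~ LaunchYear, scales = "free_y") +  # each year gets its own y-axis
    labs(x = NULL, y = "Price")
}
```

With free scales, each panel fills its own vertical space, so a gap can look equally dramatic in two panels while differing a lot in actual currency. Printing a summary such as `gap_by_year` alongside the plot guards against misreading it.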
From Figure 4, we can see that there is a little grace period after a set is discontinued, during which someone might still find it for a price close to the original. That is good news, but anecdotal evidence suggests that is not always the case. It is also important to note that finding older sets can be a challenge in itself.
Next, I explore what might be a factor in the price. The data set I used provided several potentially helpful fields. In Figure 5, I used a ggplot2 pairs plot to look at some of those fields, including the number of mini figurines, pieces, years available, and a few of the pricing fields. Some of the high correlations can be attributed to obvious connections between the fields. The number of mini figurines and the number of pieces correlate with the various price points, most strongly with retail price. Somewhat surprisingly, the value of used sets is a little more correlated with those fields than the value of new sets is. Years since launch is also slightly correlated with the value of new sets, and much more so than with the value of used sets.
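The numbers in a pairs plot are pairwise correlations, which base R can compute directly with cor(). The sketch below runs on simulated data. `ValueNew`, `Minifigs`, `Pieces` and `YearsSinceAvail` follow the names used in the model formula, `RetailPrice` and `ValueUsed` are hypothetical column names, and none of the values come from the real data set.

```r
# Simulated data only: the relationships below are invented for illustration.
set.seed(42)
n <- 95
Pieces          <- round(runif(n, min = 30, max = 2000))
Minifigs        <- rpois(n, lambda = 1 + Pieces / 250)
YearsSinceAvail <- sample(1:20, n, replace = TRUE)
RetailPrice     <- 0.10 * Pieces + rnorm(n, sd = 10)
ValueNew        <- RetailPrice * (1 + 0.08 * YearsSinceAvail) + rnorm(n, sd = 20)
ValueUsed       <- 0.6 * ValueNew + rnorm(n, sd = 10)

HPLegoData <- data.frame(Minifigs, Pieces, YearsSinceAvail,
                         RetailPrice, ValueNew, ValueUsed)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(HPLegoData, method = "pearson")
print(round(cor_matrix, 2))

# Correlation of each field with retail price, strongest first.
retail_cor <- sort(cor_matrix[, "RetailPrice"], decreasing = TRUE)
print(round(retail_cor, 2))
```

Sorting one column of the matrix is a quick way to see which fields track a given price most closely without scanning the whole grid.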
I built a model using the built-in lm() function. I am more interested in how much various fields contribute to the value of new sets than in out-of-sample prediction, so rather than splitting the data into training and test sets, I used all 95 rows to build the model. Here’s my initial model: HPlm <- lm(ValueNew ~ Minifigs + Pieces + YearsSinceAvail, data = HPLegoData). There is some evidence of heteroskedasticity, and R-squared is 0.3248. Perhaps there is a better model. In this model, the Minifigs count had a p-value of 0.0824, so perhaps we could drop that variable and bring in some other variables.
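For readers who want to try the same workflow, here is a minimal sketch that fits the formula above to simulated data, so its output will not match the R-squared or p-value quoted in the text. The heteroskedasticity check is one common approach (a studentized Breusch-Pagan-style statistic computed by hand); the article does not say which diagnostic was actually used. Dropping `Minifigs` with update() is likewise only an illustration of the suggested next step.

```r
# Simulated stand-in for HPLegoData: these numbers are NOT the Brickset data.
set.seed(2022)
n <- 95
HPLegoData <- data.frame(
  Minifigs        = rpois(n, lambda = 4),
  Pieces          = round(runif(n, min = 30, max = 2000)),
  YearsSinceAvail = sample(1:20, n, replace = TRUE)
)
HPLegoData$ValueNew <- 20 + 3 * HPLegoData$Minifigs + 0.1 * HPLegoData$Pieces +
  5 * HPLegoData$YearsSinceAvail + rnorm(n, sd = 60)

# Same formula as in the text, fit on all rows (no train/test split).
HPlm <- lm(ValueNew ~ Minifigs + Pieces + YearsSinceAvail, data = HPLegoData)
fit_summary <- summary(HPlm)
r_squared <- fit_summary$r.squared
p_values  <- coef(fit_summary)[, "Pr(>|t|)"]
print(r_squared)
print(round(p_values, 4))

# Rough heteroskedasticity check: regress squared residuals on the predictors.
# n * R^2 of this auxiliary fit is compared with a chi-square distribution
# whose degrees of freedom equal the number of predictors (3 here).
aux     <- lm(resid(HPlm)^2 ~ Minifigs + Pieces + YearsSinceAvail, data = HPLegoData)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 3, lower.tail = FALSE)
print(bp_p)

# Candidate model without Minifigs, compared with the full model by an F-test.
HPlm2 <- update(HPlm, . ~ . - Minifigs)
print(anova(HPlm2, HPlm))
```

A small `bp_p` would suggest the residual variance changes with the predictors. Plotting residuals against fitted values with `plot(HPlm, which = 1)` is a quick visual complement.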
Lastly, I was curious about the average number of pieces in sets. The trend for Harry Potter Lego sets has been toward larger and larger releases. This makes sense, as the primary audience is aging and the franchise has not yet had to reintroduce the stories to a new generation. Another data pull that might highlight a multigenerational story of fans would be Star Wars. Perhaps another day, and another five-minute analysis. It is hard to tire of combining Lego and data analysis.
Regardless of various trends, I am happy that my spouse, who due to sociocultural norms of her youth was given dolls instead of Lego sets, is now able to embrace and share the Lego love with our kids. And me too! Thanks for reading.
Nick Ulmer, CAP, has been an operations research analyst since 2014. He is the inaugural chair of the INFORMS Military Veterans Interest Forum and a Principal Operations Research Analyst for CANA LLC, leading teams of analytics professionals to produce high-level analytics products across federal and commercial domains.