January 21, 2025 in Forum
Justice for JetBlue? Analytics as a Gateway to Fairer Rankings
https://doi.org/10.1287/LYTX.2025.01.03
Rankings of competitors have become pervasive in recent years, from universities to sports cars to washing machines. Some ranking methods have obvious flaws, but their problems can be greatly reduced using analytics.
For example, consider the plight of beleaguered JetBlue Airways. In 2022 and 2023, The Wall Street Journal ranked the airline dead last in service quality among nine major U.S. carriers, based largely on delay and cancellation rates as calculated by the U.S. Department of Transportation (DOT). Its CEO has protested the rating because, unlike its competitors, its operations are centered in the U.S. Northeast, where air traffic congestion is the nation’s worst. Yet, the rankings treat route structure as irrelevant and make no distinction between a flight to troubled New York LaGuardia and another to weather-free Phoenix.
It isn’t clear that JetBlue’s complaint is compelling: Although its “northeast exposure” might explain its weaker punctuality, it’s not certain that this is the case. But can we move beyond “maybe” by adjusting the rankings to achieve “geographic neutrality” so that different airlines are compared on a “level flying field?” Doing so requires some swerving to avoid mathematical potholes, but the endeavor is quite feasible for an analytics professional. Let’s discuss.
The DOT defines a flight as on time if it reaches the gate at its destination no more than 14 minutes after its scheduled arrival time. A given airline’s on-time rate is the percentage of its flights that meet the 14-minute standard. Mathematically, this rate is a weighted average of its on-time rates at the airports it serves, the weighting factor for each airport being the fraction of the airline’s flights that arrive there. For calendar year 2022, the DOT on-time rates for the 10 largest U.S. airlines are shown in Table 1.
Table 1: DOT On-Time Rates for 2022 for Large U.S. Airlines
| Airline | 2022 On-time Statistic |
| --- | --- |
| Delta Airlines* | 81.79% |
| Alaska Airlines* | 79.13% |
| United Airlines* | 79.10% |
| American Airlines* | 77.38% |
| Hawaiian Airlines | 75.80% |
| Spirit Airlines | 73.18% |
| Southwest Airlines | 71.61% |
| Frontier Airlines | 66.18% |
| JetBlue Airways | 63.90% |
| Allegiant Air | 63.22% |
Note. A flight is considered on time if it arrives no more than 14 minutes after its scheduled arrival time.
*For these airlines, the statistic includes both their mainline operations and those of their affiliated commuter carriers.
As a first step toward geographic neutrality, we might focus on arrivals at the 25 largest U.S. airports, which collectively account for 64% of all U.S. passengers. (Most of the 36% of flights that arrive elsewhere originated at one of these 25 airports, so those other flights offer little new information.) Suppose each of the 10 largest airlines is assigned as its on-time statistic the simple average of its on-time rates at the 25 airports. Then, each airport would get a weighting of 4%, and all airlines would suffer equally from LaGuardia’s troubles and benefit equally from Phoenix’s desert calm. JetBlue’s disadvantage would thus seem to disappear. That was easy.
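To make the contrast concrete, here is a minimal Python sketch of the two averages for a single airline. The airport codes, on-time rates and arrival counts are invented for illustration and are not DOT data.

```python
# Hypothetical on-time rates (%) and arrival counts for one airline at three airports.
on_time = {"LGA": 60.0, "PHX": 85.0, "ATL": 80.0}
arrivals = {"LGA": 5000, "PHX": 1000, "ATL": 4000}

total_arrivals = sum(arrivals.values())

# DOT-style rate: each airport is weighted by the airline's share of arrivals there.
dot_style = sum(on_time[a] * arrivals[a] / total_arrivals for a in on_time)

# Geographically neutral first attempt: every airport gets the same weight.
simple = sum(on_time.values()) / len(on_time)

print(f"flight-weighted: {dot_style:.1f}%  equal-weighted: {simple:.1f}%")
```

In this made-up case the airline concentrates its flights at its weakest airport, so the flight-weighted figure sits well below the equal-weighted one.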
Alas, too easy. The problem is that not all U.S. airlines serve all 25 airports. For example, Southwest Airlines does not fly to Newark, and Frontier Airlines does not fly to Chicago O’Hare. If an airline avoids the most delay-prone airports, then the simple average of its scores for the airports it serves might be artificially impressive.
Ah, but we can turn to “curved grading” to circumvent the difficulty. For each large airport, we can define a large airline’s score as the difference between its on-time percentage and the simple average of those percentages for all Top 10 carriers that serve the airport. (At each airport, these differences necessarily average to zero.) An airline’s simple average of its various “normalized” scores could be its performance metric. Under this convention, avoiding delay-plagued airports no longer confers a benefit.
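A small Python sketch of curved grading follows. The airlines A, B and C, the airports X and Y, and all rates are hypothetical; the function name `curved_scores` is ours.

```python
from statistics import mean

# Hypothetical on-time rates (%) by airport and airline.
rates = {
    "X": {"A": 80.0, "B": 70.0, "C": 75.0},
    "Y": {"A": 70.0, "B": 60.0},  # airline C avoids delay-prone airport Y
}

def curved_scores(rates):
    """Average, per airline, of (its rate minus the airport's simple average)."""
    diffs = {}
    for airport, by_airline in rates.items():
        benchmark = mean(by_airline.values())  # simple average over carriers serving the airport
        for airline, rate in by_airline.items():
            diffs.setdefault(airline, []).append(rate - benchmark)
    return {airline: mean(d) for airline, d in diffs.items()}

scores = curved_scores(rates)
print(scores)
```

Here airlines A and C have the same raw simple average (75%), but only because C skips airport Y. On the curved scale A comes out ahead of C, which is the intended effect.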
Are we done? Afraid not. Imagine a hypothetical airport served by only two airlines, both very poor in punctuality. Then, they might “besmirch” the average on-time score for that airport (which the DOT also calculates), while benefiting because their own performances would not look bad compared with that average. They would be setting the very airport benchmark on which they would be judged! And their final normalized score would be inappropriately high.
Many people might get discouraged at this point, but not professionals versed in analytics. They could adjust a given airport’s punctuality benchmark by answering the question: If this airport were served by airlines of average punctuality rather than those that actually fly here, what would its average on-time performance be?
For example, suppose the airlines that serve airport X have an average “curved grading” score of -2%, while the average on-time rate there is 78%. One could revise the airport’s punctuality benchmark to 78% + 2% = 80% and then compare each airline’s DOT score there with 80% rather than 78%. (Note that this revision eliminates the artificial benefit when two weak airlines monopolize an airport.) When the adjustment is made at all airports, each airline gets a new punctuality score: The curved grading has been curved again. Whereas some airlines would look worse under the adjustment, others would look better: A carrier that flies to airports served by unusually prompt airlines would see its score go up.
Are we there yet? Well, no. Once we revise the scores for individual airlines, we need to revise again the benchmarks at individual airports. For example, suppose that the airlines that serve airport X – at an average on-time rate of 78% – now have adjusted curved-ratings scores averaging -1% rather than -2%. Then, the new benchmark for airport X should be 79% rather than 80%. With the new airport benchmarks, the airline-specific scores would be revised yet again.
We find ourselves in an iterative process: When we adjust an airport benchmark, we adjust the individual airline scores, which justifies a further change in the benchmark, which alters yet again the airline on-time scores. And so on. But an iterative process is a tasty intellectual snack to the specialist in analytics, who can repeat the adjustments until the scores converge, as they rapidly do for the airline on-time scores. (See Caulkins et al. (1993) for an algorithmic formulation in the context of airline punctuality.)
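One plausible reading of this back-and-forth can be sketched in a few lines of Python. This is our own illustration under stated assumptions, not necessarily the formulation in Caulkins et al. (1993): each airport's raw average is taken as the simple mean over the carriers serving it, all airline scores start at zero, and the names (`neutral_scores`, airlines A to D, airports X and Y) and rates are hypothetical.

```python
from statistics import mean

def neutral_scores(rates, tol=1e-10, max_iter=1000):
    """Alternate between airport benchmarks and airline scores until the scores settle."""
    airlines = sorted({a for by_airline in rates.values() for a in by_airline})
    scores = {a: 0.0 for a in airlines}  # first pass reproduces plain curved grading
    for _ in range(max_iter):
        # Benchmark = airport's raw average minus the average score of its carriers
        # (e.g. 78% - (-2%) = 80% in the example from the text).
        benchmarks = {
            airport: mean(rate - scores[a] for a, rate in by_airline.items())
            for airport, by_airline in rates.items()
        }
        # Airline score = average gap between its rate and the adjusted benchmarks.
        new_scores = {
            a: mean(by_airline[a] - benchmarks[airport]
                    for airport, by_airline in rates.items() if a in by_airline)
            for a in airlines
        }
        change = max(abs(new_scores[a] - scores[a]) for a in airlines)
        scores = new_scores
        if change < tol:
            break
    return scores, benchmarks

# Hypothetical data: airport Y is served only by the two weakest carriers.
rates = {
    "X": {"A": 85.0, "B": 80.0, "C": 75.0},
    "Y": {"C": 65.0, "D": 65.0},
}
scores, benchmarks = neutral_scores(rates)
print({a: round(s, 2) for a, s in scores.items()})
print({p: round(b, 2) for p, b in benchmarks.items()})
```

In this toy case the benchmark for airport Y rises from its raw average of 65% to 68%, so carriers C and D no longer profit from setting their own yardstick, and D ends up tied with C rather than looking average.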
But what does all this mean for JetBlue? The picture is mixed. For the DOT on-time statistic, the geographically neutral scores by airline are seen in Table 2.
Table 2: Final DOT-Rule Punctuality Scores for 2022, Based on Stabilized Airport Adjustments
| Rank and Airline | Relative On-Time Score Under “Curved Grading” (a) | Absolute On-Time Score Based on 25-Airport Average (b) |
| --- | --- | --- |
| 1. Delta | +7.75% | 81.34% |
| 2. United | +7.06% | 80.65% |
| 3. Hawaiian | +4.66% | 78.24% |
| 4. American | +4.17% | 77.76% |
| 5. Alaska | +3.83% | 77.41% |
| 6. Spirit | +0.24% | 73.83% |
| 7. Southwest | -4.89% | 68.69% |
| 8. Frontier | -8.12% | 65.47% |
| 9. JetBlue | -8.43% | 65.15% |
| 10. Allegiant | -9.45% | 64.14% |
(a) “Curved grading” tied to neutrality adjustments described in the text.
(b) Using 73.59% as a baseline because that is the simple average of the adjusted on-time arrival rates for the 25 largest U.S. airports.
We see that JetBlue hasn’t appreciably moved from its original DOT position in either absolute or relative terms. Actually, very few airlines did. The correlation between airline rankings in Tables 1 and 2 is a hefty 0.915, whereas the corresponding correlation for airline-specific on-time scores is 0.972. Seeing this outcome, the reader might wonder whether the exercise described here was essentially a waste of time. But such an ex post facto assessment is too harsh: It was certainly conceivable that airport adjustments would be consequential, and it was useful to replace speculation on that point with a rigorous treatment of the actual data.
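The text does not say which correlation coefficients were used. As a check, the Python sketch below assumes a Spearman (rank) correlation for the rankings and a Pearson correlation for the scores, applied to the rounded values printed in Tables 1 and 2. Under those assumptions the results come out close to the figures quoted above.

```python
from math import sqrt

# Same airline order in both lists: Delta, Alaska, United, American, Hawaiian,
# Spirit, Southwest, Frontier, JetBlue, Allegiant.
table1 = [81.79, 79.13, 79.10, 77.38, 75.80, 73.18, 71.61, 66.18, 63.90, 63.22]  # Table 1 rates
table2 = [81.34, 77.41, 80.65, 77.76, 78.24, 73.83, 68.69, 65.47, 65.15, 64.14]  # Table 2 absolute scores

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = highest value; assumes no ties, which holds for these tables."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rank_corr = pearson(ranks(table1), ranks(table2))  # Spearman = Pearson on ranks
score_corr = pearson(table1, table2)
print(round(rank_corr, 3), round(score_corr, 3))
```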
The analytics profession puts great stock in sensitivity analysis. A natural question arises: What if the threshold for lateness is not 15 minutes but something larger? After all, a 20-minute delay is far less disruptive than a two-hour delay, yet the DOT treats the difference as immaterial. To put it briefly, JetBlue remains in ninth place when the threshold is raised to 45 minutes, or with respect to mean delay (using the same iterative approach as before to achieve geographic neutrality).
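For readers who want to try such a sensitivity check, a minimal Python sketch is given below. The arrival delays are invented, and the helper name `on_time_rate` is ours. In practice the per-airport rates computed this way would then be fed through the same neutrality adjustments as before.

```python
# Hypothetical arrival delays in minutes for a handful of flights.
delays = [0, 5, 14, 15, 20, 44, 45, 120]

def on_time_rate(delays, threshold):
    """Percentage of flights arriving fewer than `threshold` minutes late."""
    return 100.0 * sum(d < threshold for d in delays) / len(delays)

rate_15 = on_time_rate(delays, 15)   # DOT convention: late means 15 minutes or more
rate_45 = on_time_rate(delays, 45)   # looser convention discussed in the text
mean_delay = sum(delays) / len(delays)
print(rate_15, rate_45, mean_delay)
```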
Importantly, however, the DOT also determines the percentage of flights canceled by each airline at each airport. Processing the cancellation data to reach geographic neutrality yields Table 3.
Table 3: 2022 Flight Cancellation Rates for 10 Largest U.S. Airlines Under Two Methods of Calculation
| Rank | Airline, Overall % of Flights Canceled (a) | Airline, Airport-Adjusted % of Flights Canceled (b) |
| --- | --- | --- |
| 1 | Hawaiian, 0.92 | Hawaiian, 0.64 |
| 2 | Delta, 1.60 | Delta, 1.83 |
| 3 | United, 1.90 | United, 1.94 |
| 4 | Alaska, 2.72 | Frontier, 2.82 |
| 5 | Frontier, 2.87 | JetBlue, 2.83 |
| 6 | American, 2.96 | Alaska, 2.90 |
| 7 | Spirit, 3.00 | American, 2.91 |
| 8 | Southwest, 3.26 | Spirit, 2.94 |
| 9 | Allegiant, 3.52 | Allegiant, 3.72 |
| 10 | JetBlue, 3.74 | Southwest, 3.56 |
(a) For all scheduled flights at all airports.
(b) For flights arriving at 25 largest U.S. airports, with neutrality adjustments described in the text.
Here, JetBlue does gain substantially from the effort to improve comparability. In the adjusted cancellation rates, JetBlue moves from the worst performer among the airlines considered into the better half, ranking 5 of 10 with a revised cancellation rate of 2.83%, which is 24% lower than its original rate of 3.74%.
Because a cancellation can be far more disruptive than a delay, JetBlue’s gain in this regard deserves real attention. JetBlue’s high cancellation rate is tied to the large fraction of its flights at the New York airports, where other airlines also suffer more cancellations. (Of the 25 largest airports, the two with the worst cancellation rates are LaGuardia and Newark.) For those other airlines, however, New York does less damage to their overall cancellation rates because smaller fractions of their flights arrive there.
In short, JetBlue’s complaint about misleading comparisons was partially justified. More generally, this situation illustrates how an apparent drawback in a ranking system need not be immutable and how a little analytics can go a long way to make things fairer. In spirit, the revised scores and rankings are similar to seasonally adjusted unemployment rates or inflation-adjusted dollar prices.
Airline reliability scores are far from the only statistical indicators that suffer blind spots. For example, the Consumer Price Index for U.S. leisure airline fares dropped from 311.21 to 253.35 between July 2022 and July 2023, a decline of 18.6%. But might the drop at least partially reflect a changing mix of routes flown, which could have reduced the mean fare paid even if fares on individual routes stayed the same?
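The mix effect is easy to demonstrate with invented numbers. In the Python sketch below, the two route types, their fares and their passenger shares are all hypothetical. Fares on each route are held fixed and only the mix changes.

```python
# Hypothetical fares (dollars) that do not change between the two years.
fares = {"long_haul": 400.0, "short_haul": 200.0}

# Hypothetical share of passengers on each route type in each year.
mix_2022 = {"long_haul": 0.6, "short_haul": 0.4}
mix_2023 = {"long_haul": 0.3, "short_haul": 0.7}

def mean_fare(fares, mix):
    """Average fare paid, weighting each route type by its passenger share."""
    return sum(fares[route] * mix[route] for route in fares)

fare_2022 = mean_fare(fares, mix_2022)
fare_2023 = mean_fare(fares, mix_2023)
decline_pct = 100.0 * (fare_2022 - fare_2023) / fare_2022
print(f"{fare_2022:.2f} -> {fare_2023:.2f}, a decline of {decline_pct:.2f}%")
```

The mean fare falls by nearly a fifth in this toy example even though no individual fare moved, which is the kind of blind spot a route-neutral adjustment would expose.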
Although blind spots arise in all sorts of comparisons, analytics often provides a mechanism to see beyond those spots. The result can be rankings that are fairer and more illuminating, in which those entities ranked at the top genuinely deserve to be there.
Reference
- Caulkins, J., A. Barnett, P. Larkey, Y. Yuan and J. Goranson, 1993, “The On-Time Machines: Some Analyses of Airline Punctuality,” Operations Research, Vol. 41, No. 4, pp. 710-720.
Arnold Barnett is the George Eastman Professor of Management Science and professor of statistics at the MIT Sloan School of Management. Jan Reig Torra is an AI Research Scientist who graduated from the Master of Business Analytics Program at the MIT Sloan School.