A Bandit-Based Approach to Educational Recommender Systems: Contextual Thompson Sampling for Learner Skill Gain Optimization

Published Online: https://doi.org/10.1287/ited.2025.0174

Abstract

In recent years, instructional practices in operations research, management science, and analytics have increasingly shifted toward digital environments, where large and diverse groups of learners make it difficult to provide practice that adapts to individual needs. This paper introduces a method that generates personalized sequences of exercises by selecting, at each step, the exercise most likely to advance a learner’s understanding of a targeted skill. The method uses information about the learner and their past performance to guide these choices, and learning progress is measured as the change in estimated skill level before and after each exercise. Using data from an online mathematics tutoring platform, we find that the approach recommends exercises associated with greater skill improvement and adapts effectively to differences across learners. From an instructional perspective, the framework enables personalized practice at scale, highlights exercises with consistently strong learning value, and helps instructors identify learners who may benefit from additional support.

1. Introduction

Over recent years, active learning has been increasingly adopted in operations research (OR), management science (MS), and analytics education, transforming how students engage with learning materials, receive feedback, and develop conceptual understanding (Fornasiero et al. 2021, Reeves et al. 2021, Maclean and Bayley 2024). One way active learning strategies have been implemented is through massive open online courses (MOOCs) and other digital learning environments that provide online access to instructional materials, practice activities, and assessments. Yet despite this digital shift in instructional practice, many courses continue to rely on standardized learning paths in which all learners progress through the same predetermined sequence of exercises, offering limited opportunity for personalized learning experiences. The need for adaptive sequencing becomes even more difficult to address in MOOCs, where the sheer number of learners severely constrains instructors’ ability to provide individualized learning trajectories. Moreover, students in OR/MS/analytics courses enter with widely differing levels of quantitative skills, yet scalable mechanisms for adapting practice to these varied skill levels remain limited.

Educational recommender systems (ERSs) offer a promising approach for supporting active learning by tailoring practice opportunities to learners’ evolving needs and guiding them along personalized learning paths. Within such systems, personalized recommendations function as a form of individualized feedback through scaffolding. Scaffolding is traditionally understood as an instructional practice in which a teacher provides structured support and gradually removes guidance as learners develop greater competence (van de Pol et al. 2010). It involves sequencing tasks so that each exercise offers an appropriate level of challenge relative to the learner’s current understanding. In our setting, the ERS operationalizes this principle by continuously updating its estimates of a learner’s skill state after each interaction and selecting subsequent exercises whose difficulty is aligned with that learner’s evolving proficiency. In doing so, the ERS provides individualized feedback in the form of targeted guidance about what to practice next.

In the ERS literature, the most commonly used approach for generating recommendations has been collaborative filtering (CF) techniques (Khanal et al. 2020). CF methods identify patterns in interaction logs to recommend relevant exercises to users, typically through the detection of similarity either between users (UserCF) or between exercises (ItemCF). Nevertheless, CF methods face important limitations in educational contexts. First, CF methods are not inherently personalized: recommendations are derived from aggregated behavioral patterns rather than the learner’s unique profile. User-based CF, for example, assumes that learners with similar past interactions will benefit from similar exercises, overlooking individual heterogeneity in learning needs or cognitive skill levels. Second, learners’ preferences and knowledge states evolve over time, yet CF approaches typically rely on static similarity measures. This makes CF ill suited to capture temporal dependencies or to adjust recommendations as learners progress. Third, without an explicit mechanism for exploration, CF also tends to reinforce historically popular exercises, limiting opportunities to identify exercises better aligned with a learner’s evolving profile.

Contextual bandit algorithms, by contrast, are inherently adaptive and address these shortcomings directly. They frame recommendation as a sequential decision problem in which the effectiveness of an exercise is uncertain ex ante and heterogeneous across learners, skills, and time. In a bandit setting, each recommendation step selects an exercise for a particular learner. The “context” is the information available at decision time about the learner and the candidate exercise (e.g., recent performance, affective state, topic, difficulty), and the resulting “reward” is the observed learning benefit after engagement, here operationalized as skill gain, that is, the change in the learner’s estimated skill level. The exploration–exploitation mechanism at the heart of bandit methods enables deliberate trialing of uncertain but potentially valuable exercises (exploration) while selecting those exercises expected to generate the greatest learning gains (exploitation). In this way, bandit algorithms can contribute to the construction of adaptive learning paths that respond to evolving learner needs. The literature review by Da Silva et al. (2023) highlights that, despite their promise, bandit-based methods remain underexplored in ERSs, underscoring a key avenue for future research.

This study addresses the identified literature gap by proposing a bandit-based framework for ERS and, to the best of our knowledge, is the first empirical evaluation of Thompson sampling (TS) for educational recommendation. TS is a Bayesian posterior sampling algorithm (Thompson 1933) with strong theoretical regret guarantees (Agrawal and Goyal 2012, 2013) and robust empirical performance across sequential decision-making tasks (Ferreira et al. 2018, Aramayo et al. 2023). At each decision step, TS maintains a posterior distribution over the expected utility of the available exercises and selects the next exercise by sampling from these distributions. In doing so, exercises are chosen in proportion to their probability of being optimal, which provides a principled mechanism for balancing exploration of uncertain exercises with exploitation of those exercises expected to yield high utility.

The proposed framework applies linear TS (LinTS) as the bandit strategy for educational recommendation. LinTS specifies a separate linear model for each available exercise, where the expected reward is expressed as a linear function of learner features. This formulation enables recommendations to adapt to evolving learner profiles and knowledge states. For benchmarking, we also consider standard TS, which operates in a noncontextual setting, alongside conventional CF approaches. This comparative design enables a systematic evaluation of the contribution of contextual modeling to sequential recommendation in ERSs.

The framework employs a reward signal based on learner skill gain, defined as the learner’s improvement in estimated knowledge state of a particular cognitive skill as computed by a Bayesian knowledge tracing (BKT) model (Corbett and Anderson 1994). BKT models skill acquisition as a latent probabilistic process, updating a learner’s knowledge state with each interaction based on observed responses, thereby providing a dynamic estimate of learning progress. The use of skill gain as reward contrasts with the predominant evaluation metrics in recommender studies, which rely on the correctness of recommended exercises (Manickam et al. 2017), ratings of recommended exercises (Nafea et al. 2019), or user satisfaction levels (Tarus et al. 2017). Correctness may lead to inflated performance estimates, as systems can achieve high accuracy by recommending exercises the learner is already able to solve, without necessarily promoting new learning. Similarly, ratings and satisfaction scores capture subjective perceptions of difficulty or enjoyment but may diverge from actual knowledge acquisition. These metrics are therefore limited as proxies for genuine learning progress, as they do not fully capture the cognitive development of a learner. Accordingly, skill gain is adopted as the reward signal because it more directly aligns the optimization process with the pedagogical objective of competence development.

Our experiment is performed on the ASSISTments data set (Patikorn et al. 2020), which provides interaction data from an online secondary-level tutoring system (Heffernan and Heffernan 2014). The reward signal is defined as the change in estimated skill between two consecutive interactions, based on the BKT proficiency measures included in the data set. Results show that LinTS achieves the highest performance, yielding a 15.2% improvement in average skill gain over the noncontextual TS baseline, as well as 16.5% and 20.7% improvements over CF baselines. This underscores the effectiveness of contextual modeling for adaptive educational recommendation.

The remainder of this paper is organized as follows. Section 2 reviews related work on ERS and bandit methods. Section 3 formalizes the problem setting and details the baselines (UserCF, ItemCF) and the bandit policies (TS, LinTS). Section 4 describes the experimental setup, including data set description, data preprocessing, data splitting, and validation strategy. Section 5 reports empirical results, and Section 6 discusses practical takeaways for OR/MS/analytics instructors. Section 7 ends with a conclusion and directions for future work.

2. Related Work

Research on active learning in OR/MS/analytics education has primarily focused on improving in-class teaching practices. For example, Maclean and Bayley (2024) examined efficient approaches for assessing higher-order thinking skills in an undergraduate business analytics course. Similarly, Fornasiero et al. (2021) introduced a puzzle-based learning method to support the development of optimization skills in high school students, whereas Reeves et al. (2021) designed hands-on, constructivist exercises to enhance learning in an introductory probability and statistics course. All these interventions remain largely classroom based. Importantly, none of these studies consider how active learning might be supported through personalized learning paths in digital environments, a domain that is becoming increasingly relevant as OR/MS/analytics courses move online.

Several approaches have been proposed to optimize sequences of exercises in ERSs, each aiming to guide learners through personalized educational trajectories. CF remains the most dominant approach in ERSs, with 5 of the 16 studies analyzed in the literature review by Da Silva et al. (2023) employing either pure CF methods or hybrid strategies that incorporate CF components. CF approaches are typically classified into two main variants: user-based and item-based filtering. The user-based approach infers preferences by identifying patterns of similarity among users, generating recommendations based on exercises previously interacted with by users exhibiting comparable behavioral profiles. Conversely, the item-based variant focuses on relationships between exercises, recommending exercises that are similar in content or usage patterns to those a given learner has already engaged with (Da Silva et al. 2023). Among CF approaches, user-based variants are more commonly adopted in the recommendation process, reflecting the growing pedagogical emphasis on student-centered learning (Krahenbuhl 2016). In the reviewed studies, various similarity metrics have been applied to operationalize these patterns, including cosine similarity (Wu et al. 2015, Tarus et al. 2017, Huang et al. 2019), Euclidean distance (Sergis and Sampson 2016), and Pearson correlation coefficient (Nafea et al. 2019), each offering different computational strategies for capturing relational proximity within interaction data.

Although CF has shown effectiveness in static recommendation scenarios, its reliance on historical similarity patterns constrains its pedagogical applicability. Recommendations are derived from aggregated interaction behavior rather than being informed by the learner’s evolving knowledge state, and the absence of an explicit exploration mechanism leads CF to reinforce familiar or popular exercises instead of identifying exercises that may more effectively promote sustained learning progress. These limitations create the need for methods that explicitly model sequential decision making.

Contextual bandit algorithms provide such a framework by conditioning recommendations on learner-specific features (personalization) and dynamically balancing the tradeoff between exploiting known effective exercises and exploring uncertain but potentially more beneficial ones. As highlighted in the systematic literature review by Da Silva et al. (2023), despite the prominence of bandit algorithms in general recommender system research, such methods remain underexplored in educational contexts, revealing a notable mismatch between the two domains and underscoring the need for further research.

Most existing applications of multiarmed bandit algorithms within the educational domain are situated in the context of educational games, where their capacity for adaptive, sequential decision making aligns well with the dynamic and interactive nature of game-based learning environments. Liu et al. (2014) apply an upper confidence bound (UCB)-explore strategy, a noncontextual bandit algorithm based on the UCB algorithm, in a physics-based educational game involving different ways of displaying number lines. Clément et al. (2014) employ the exponential-weight algorithm called EXP4, a context-free bandit strategy that incorporates expert knowledge to narrow the set of possible exercises when training a bandit policy. This method is applied within an educational game for seven- to eight-year-old schoolchildren, designed to support the development of numerical decomposition skills in the context of manipulating money. These studies are limited in two respects. First, they target narrow game-based case studies with bespoke mechanics and short interaction horizons; as a result, their findings may not transfer to broader settings such as MOOCs or large-scale tutoring platforms that involve heterogeneous content, longer trajectories, and diverse learner populations. Second, both methods are context free and therefore cannot condition recommendations on learner features, precluding meaningful personalization.

To date, only two studies have applied contextual bandits within the ERS domain. The first study by Manickam et al. (2017) employs a contextual linear UCB algorithm with context vectors representing latent concept knowledge profiles inferred from learners’ interaction histories in a college-level physics setting. The second study by Intayoad et al. (2020) incorporates past student behaviors and current learner state into a correlation analysis to preselect candidate exercises. As such, we argue that this approach cannot be considered a true contextual bandit model, as contextual information is confined to the prefiltering stage rather than being embedded within the bandit algorithm itself. The recommendation policy relies on a noncontextual ϵ-greedy strategy, with reward defined as whether the learner clicked on the suggested exercise. Although these approaches provide valuable empirical insights, the use of correctness on the next exercise in the first study and click-based feedback in the second as optimization criteria may be limiting for ERS. Such measures primarily capture short-term task performance or engagement and may fail to account for sustained learning gains or the incremental development of a learner’s knowledge state, which are particularly critical for constructing effective learning paths. In particular, the first study by Manickam et al. (2017) incorporated knowledge estimates as input features to guide recommendations. However, we argue that it would be more appropriate to optimize directly for improvements in these knowledge states, thereby aligning system objectives with sustained learning progress rather than immediate task performance.

To date, no studies on multiarmed bandits in ERS have incorporated TS (Thompson 1933), despite its well-documented effectiveness in general recommender system research (Ferreira et al. 2018, Aramayo et al. 2023, De Kerpel and Benoit 2025) and its empirically demonstrated superiority over frequentist strategies such as UCB (Chapelle and Li 2011). Its probabilistic exercise-selection mechanism facilitates a principled balance between exploration and exploitation while inherently accounting for uncertainty in reward estimation. The absence of TS in prior ERS research therefore constitutes a notable literature gap, which the present study seeks to address. In this study, we implement LinTS, a contextual bandit algorithm that models each exercise as a linear function of learner features, and evaluate it alongside the noncontextual TS baseline. Unlike prior approaches, we define the reward as skill gain, thereby directly optimizing for improvements in learners’ knowledge states rather than correctness on the next exercise.

3. Methodology

This section describes the algorithms and decision-making framework implemented in our ERS experiment. We first introduce the multiarmed bandit (MAB) problem and its contextual variant. We then detail the CF baselines (UserCF and ItemCF) and the two bandit-based methods TS and LinTS.

3.1. MABs in Educational Recommendation

In the MAB framework, a recommendation session unfolds over a sequence of $T$ discrete interaction rounds. At each round $t \in \{1, \dots, T\}$, the ERS selects an exercise $a_t$ from a finite set of available exercises $\mathcal{A}$. Each exercise corresponds to a distinct learning activity, such as a practice exercise, instructional video, or interactive simulation. Once the learner engages with the recommended exercise, the system observes a reward $r_{t,a_t}$, drawn from an unknown probability distribution specific to that exercise.

In this work, the reward is defined as the skill gain associated with the specific cognitive skill $s$ targeted by the recommended exercise $a_t$. Let $K_{t-1}(s)$ denote the learner’s estimated mastery of skill $s$ immediately prior to the interaction, and $K_t(s)$ the mastery estimate immediately after, both estimated using a BKT model (Corbett and Anderson 1994). The reward is then computed as

$$r_{t,a_t} = K_t(s) - K_{t-1}(s), \tag{1}$$
ensuring that the gain measure reflects changes in the learner’s knowledge state for the relevant skill only rather than general performance or unrelated knowledge. This continuous-valued reward formulation aligns the optimization objective with long-term pedagogical effectiveness, as it directly quantifies the incremental learning benefit rather than short-term correctness.
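To make Equation (1) concrete, the following minimal sketch applies the standard BKT update (posterior given the observed response, followed by the learning transition) and computes the resulting skill gain. It is an illustration only: the class and function names and the parameter values are assumptions, and in our experiment the mastery estimates are taken from the data set rather than refit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    """Standard BKT parameters for one skill (values below are illustrative)."""
    p_learn: float  # probability of moving from unmastered to mastered after a step
    p_guess: float  # probability of a correct answer without mastery
    p_slip: float   # probability of an incorrect answer despite mastery


def bkt_update(p_mastery: float, correct: bool, params: BKTParams) -> float:
    """Return K_t(s) given K_{t-1}(s) and one observed response."""
    if correct:
        num = p_mastery * (1.0 - params.p_slip)
        den = num + (1.0 - p_mastery) * params.p_guess
    else:
        num = p_mastery * params.p_slip
        den = num + (1.0 - p_mastery) * (1.0 - params.p_guess)
    posterior = num / den  # belief in mastery after seeing the response
    # Learning transition: an unmastered skill may become mastered.
    return posterior + (1.0 - posterior) * params.p_learn


def skill_gain(k_before: float, k_after: float) -> float:
    """Reward of Equation (1): r = K_t(s) - K_{t-1}(s)."""
    return k_after - k_before


# Hypothetical parameter values and prior mastery, chosen only for illustration.
params = BKTParams(p_learn=0.1, p_guess=0.2, p_slip=0.1)
k_prev = 0.4
k_next = bkt_update(k_prev, correct=True, params=params)
reward = skill_gain(k_prev, k_next)
```

Under these assumed parameters a correct response raises the mastery estimate, whereas an incorrect response lowers it, so the reward can be negative as well as positive.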

The goal of the ERS is to choose exercises $\{a_t\}_{t=1}^{T}$ such that the sum of the learner’s realized rewards is maximized:

$$\max_{a_1, \dots, a_T} \sum_{t=1}^{T} r_{t,a_t}, \tag{2}$$
subject to the constraint that the expected reward associated with a particular exercise is initially unknown and must be inferred through interaction. This creates a fundamental tradeoff between exploration (selecting exercises to improve knowledge about their effectiveness) and exploitation (selecting exercises currently believed to yield the highest educational benefit).

The contextual extension of MABs (CMAB) allows the ERS to incorporate side information about the learner and the exercise. At round $t$, the system observes a context vector $x_t \in \mathbb{R}^{d}$ that includes learner features such as demographic attributes, historical performance, or emotional state. The expected reward for an exercise is then modeled as

$$\mu_a(x_t) = f(x_t, a), \tag{3}$$
where $f$ is an unknown reward function. By leveraging context, CMAB algorithms can tailor recommendations to individual learners, potentially improving personalization and long-term learning outcomes.

Figure 1 summarizes the contextual bandit interaction in our ERS. At time $t$, the learning platform (environment) provides a context vector $x_t$ describing the current learner state. The bandit policy (agent) selects an exercise $a_t$, which is delivered to the learner. After the interaction, the platform returns a reward $r_{t,a_t}$, and the tuple $(x_t, a_t, r_{t,a_t})$ is logged to update the policy.

Figure 1. Bandit Feedback Process in an ERS
Notes. The environment (learning platform) emits context $x_t$; the agent (bandit policy) recommends an exercise $a_t$; after the learner engages, the environment returns reward $r_{t,a_t}$ (skill gain). The resulting tuples $(x_t, a_t, r_{t,a_t})$ support online learning and evaluation.
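The feedback loop described above can be sketched as a short interaction routine. The `BanditPolicy` interface, the `UniformRandomPolicy` placeholder, and the `toy_reward` function are hypothetical names introduced for illustration; they stand in for a real policy and for the platform's skill-gain signal.

```python
import random
from typing import Protocol, Sequence

Context = Sequence[float]


class BanditPolicy(Protocol):
    """Interface assumed in this sketch for any policy (TS, LinTS, ...)."""

    def select(self, context: Context) -> int: ...

    def update(self, context: Context, action: int, reward: float) -> None: ...


class UniformRandomPolicy:
    """Placeholder agent: recommends uniformly at random and learns nothing."""

    def __init__(self, n_exercises: int, seed: int = 0) -> None:
        self.n_exercises = n_exercises
        self.rng = random.Random(seed)

    def select(self, context: Context) -> int:
        return self.rng.randrange(self.n_exercises)

    def update(self, context: Context, action: int, reward: float) -> None:
        pass  # a learning policy would update its posterior here


def run_session(policy, contexts, reward_fn):
    """Iterate over contexts and return the logged (x_t, a_t, r_t) tuples."""
    log = []
    for x_t in contexts:
        a_t = policy.select(x_t)        # agent recommends an exercise
        r_t = reward_fn(x_t, a_t)       # environment returns the skill gain
        policy.update(x_t, a_t, r_t)    # policy learns from the new tuple
        log.append((x_t, a_t, r_t))
    return log


def toy_reward(x, a):
    """Made-up reward: exercise 0 pays off in proportion to the first feature."""
    return 0.1 * x[0] if a == 0 else 0.05


contexts = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
log = run_session(UniformRandomPolicy(n_exercises=2), contexts, toy_reward)
```

Any policy exposing `select` and `update` can be passed to `run_session` unchanged, which is the property that allows contextual and noncontextual policies to be compared on the same logged interactions.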

3.2. Collaborative Filtering Baselines

3.2.1. UserCF.

The user-based CF baseline estimates the expected effectiveness of a candidate exercise for a target learner by leveraging similarity in historical interaction profiles across the entire learner population. Let $\mathcal{U}$ denote the set of all learners, and let $R_{u,a}$ represent the observed skill-gain reward for learner $u$ on exercise $a$. For a given target learner $u$, a similarity score $\mathrm{sim}(u,u')$ is computed with every other learner $u' \in \mathcal{U} \setminus \{u\}$ based on the cosine similarity between their respective interaction vectors in the learner–exercise space. Cosine-based similarity is the most common choice in distance-based CF, although alternative measures such as the Pearson correlation or the dot product are also used (Liu et al. 2017).

The predicted effectiveness of an exercise $a$ is then obtained as a similarity-weighted average of the recorded rewards from all other learners:

$$\hat{R}_{u,a} = \frac{\sum_{u' \in \mathcal{U} \setminus \{u\}} \mathrm{sim}(u,u') \cdot R_{u',a}}{\sum_{u' \in \mathcal{U} \setminus \{u\}} |\mathrm{sim}(u,u')|}. \tag{4}$$

The system then selects the exercise $a$ with the maximal predicted value $\hat{R}_{u,a}$. The UserCF algorithm is formally defined in Algorithm 1.

Algorithm 1

(User-Based Collaborative Filtering)

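  • 1: Input: Target learner $u$, candidate exercise set $\mathcal{A}$, interaction matrix $R$ (skill gain)

  • 2: Compute $\mathrm{sim}(u,u')$ for all $u' \in \mathcal{U} \setminus \{u\}$

  • 3: for each $a \in \mathcal{A}$ do

  • 4:   $\hat{R}_{u,a} \leftarrow \frac{\sum_{u' \in \mathcal{U} \setminus \{u\}} \mathrm{sim}(u,u') \, R_{u',a}}{\sum_{u' \in \mathcal{U} \setminus \{u\}} |\mathrm{sim}(u,u')|}$

  • 5: end for

  • 6: Return $a^{*} \leftarrow \arg\max_{a \in \mathcal{A}} \hat{R}_{u,a}$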
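Algorithm 1 can be sketched in a few lines of NumPy. The sketch assumes that the interaction matrix stores one row per learner and one column per exercise, and that unobserved learner–exercise pairs are stored as 0; both conventions, as well as the function names, are assumptions made for illustration.

```python
import numpy as np


def cosine_to_target(R: np.ndarray, u: int) -> np.ndarray:
    """Cosine similarity between learner u and every learner (0 for empty rows)."""
    norms = np.linalg.norm(R, axis=1)
    dots = R @ R[u]
    denom = norms * norms[u]
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)


def user_cf_recommend(R, u: int, candidates: list[int]):
    """Return the candidate with the highest predicted skill gain, plus all scores."""
    R = np.asarray(R, dtype=float)
    sim = cosine_to_target(R, u)
    sim[u] = 0.0  # exclude the target learner from the weighted average
    weight = np.abs(sim).sum()
    # Equation (4): similarity-weighted average of other learners' rewards.
    scores = sim @ R / weight if weight > 0 else np.zeros(R.shape[1])
    best = max(candidates, key=lambda a: scores[a])
    return best, scores


# Toy skill-gain matrix (3 learners x 3 exercises); values are made up.
R = [[0.2, 0.0, 0.0],
     [0.2, 0.0, 0.3],
     [0.0, 0.4, 0.0]]
best, scores = user_cf_recommend(R, u=0, candidates=[1, 2])
```

In this toy example, learner 0 overlaps only with learner 1, so the prediction for each candidate is driven entirely by learner 1's recorded gains.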

3.2.2. ItemCF.

The item-based CF baseline instead exploits similarities between exercises, as derived from their historical usage patterns across learners. As in the user-based variant, $R_{u,a}$ stores the skill-gain reward for learner $u$ on exercise $a$. For a target learner $u$ and a candidate exercise $a$, the cosine similarity $\mathrm{sim}(a,a')$ is computed with every other exercise $a' \in \mathcal{A} \setminus \{a\}$. The predicted skill-gain reward for $a$ is then given by the similarity-weighted average of the learner’s own past rewards on other exercises:

$$\hat{R}_{u,a} = \frac{\sum_{a' \in \mathcal{A} \setminus \{a\}} \mathrm{sim}(a,a') \cdot R_{u,a'}}{\sum_{a' \in \mathcal{A} \setminus \{a\}} |\mathrm{sim}(a,a')|}. \tag{5}$$

The algorithm then recommends the exercise $a$ with the highest predicted value. The ItemCF algorithm is formally defined in Algorithm 2.

Algorithm 2

(Item-Based Collaborative Filtering)

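  • 1: Input: Target learner $u$, candidate exercise set $\mathcal{A}$, interaction matrix $R$ (skill gain)

  • 2: Compute $\mathrm{sim}(a,a')$ for all $a' \in \mathcal{A} \setminus \{a\}$

  • 3: for each $a \in \mathcal{A}$ do

  • 4:   $\hat{R}_{u,a} \leftarrow \frac{\sum_{a' \in \mathcal{A} \setminus \{a\}} \mathrm{sim}(a,a') \, R_{u,a'}}{\sum_{a' \in \mathcal{A} \setminus \{a\}} |\mathrm{sim}(a,a')|}$

  • 5: end for

  • 6: Return $a^{*} \leftarrow \arg\max_{a \in \mathcal{A}} \hat{R}_{u,a}$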
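A corresponding sketch of Algorithm 2 is given below. As before, it assumes a learner-by-exercise matrix with 0 for unobserved pairs, and the function name and toy values are illustrative assumptions rather than part of the experimental pipeline.

```python
import numpy as np


def item_cf_recommend(R, u: int, candidates: list[int]):
    """Return the candidate with the highest predicted skill gain, plus all scores."""
    R = np.asarray(R, dtype=float)
    # Cosine similarity between exercise columns (0 for never-used exercises).
    norms = np.linalg.norm(R, axis=0)
    gram = R.T @ R
    denom = np.outer(norms, norms)
    sim = np.divide(gram, denom, out=np.zeros_like(gram), where=denom > 0)
    np.fill_diagonal(sim, 0.0)  # exclude a' = a from both sums
    # Equation (5): weight the learner's own past rewards by exercise similarity.
    raw = sim @ R[u]
    weights = np.abs(sim).sum(axis=1)
    scores = np.divide(raw, weights, out=np.zeros_like(raw), where=weights > 0)
    best = max(candidates, key=lambda a: scores[a])
    return best, scores


# Toy skill-gain matrix (3 learners x 3 exercises); values are made up.
R = [[0.3, 0.0, 0.0],
     [0.3, 0.2, 0.0],
     [0.0, 0.0, 0.4]]
best, scores = item_cf_recommend(R, u=0, candidates=[1, 2])
```

Here exercise 1 co-occurs with the exercise that learner 0 has already completed, whereas exercise 2 shares no learners with it and therefore receives a score of zero.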

3.3. Bandit Policies

3.3.1. TS.

TS is a Bayesian algorithm based on probability matching (Thompson 1933), where exercises are selected proportionally to their probability of being optimal given the current belief state. At each round t, the algorithm samples a reward value for each exercise from its posterior distribution and selects the exercise with the highest sampled value. This approach naturally balances exploration and exploitation by favoring exercises with high estimated rewards while still exploring uncertain options.

In the standard formulation of TS, the reward is binary, and the posterior distribution over the mean reward of each exercise follows a Beta distribution. In contrast, in the proposed ERS, the reward signal is continuous, representing the learner’s skill gain after engaging with an exercise. We therefore model rewards as Gaussian-distributed with unknown mean and variance. The conjugate prior for this case is the Normal–Inverse–Gamma distribution:

$$P(\mu_a, \sigma_a^2) \sim \mathrm{N}\Gamma^{-1}(m_a, \nu_a, \alpha_a, \beta_a), \tag{6}$$
where $m_a$ denotes the prior mean, $\nu_a > 0$ the prior precision (or pseudo-count), $\alpha_a > 0$ the prior shape parameter, and $\beta_a > 0$ the prior scale parameter. This prior is specified for each exercise $a$, allowing every exercise to maintain its own posterior belief about the expected reward. Moreover, the prior is conjugate to the Gaussian likelihood with unknown mean and variance, ensuring that the posterior distribution remains in the Normal–Inverse–Gamma family after each update. A theoretical analysis of this policy is given in Honda and Takemura (2014).

At each round, TS samples from the posterior for every exercise a, selects the exercise with the highest sampled mean, observes the reward signal, and updates the corresponding hyperparameters. The TS algorithm is formally defined in Algorithm 3.

Algorithm 3

(TS with Normal–Inverse–Gamma Prior)

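  • 1: Input: Exercise set $\mathcal{A}$, hyperparameters $\{m_a, \nu_a, \alpha_a, \beta_a\}_{a \in \mathcal{A}}$

  • 2: for $t = 1, 2, \dots, T$ do

  • 3:  for each $a \in \mathcal{A}$ do

  • 4:   Sample $\hat{\sigma}_a^2 \sim \mathrm{InverseGamma}(\alpha_a, \beta_a)$

  • 5:   Sample $\hat{\mu}_a \sim \mathcal{N}(m_a, \hat{\sigma}_a^2 / \nu_a)$

  • 6:  end for

  • 7:  Select $a_t \leftarrow \arg\max_{a \in \mathcal{A}} \hat{\mu}_a$

  • 8:  Recommend $a_t$ and observe continuous reward $r_{t,a_t}$

  • 9:  Update for $a_t$:

  • 10:   $\tilde{m} \leftarrow m_{a_t}, \quad \tilde{\nu} \leftarrow \nu_{a_t}$

  • 11:   $\nu_{a_t} \leftarrow \tilde{\nu} + 1$

  • 12:   $m_{a_t} \leftarrow \frac{\tilde{\nu}\tilde{m} + r_{t,a_t}}{\tilde{\nu} + 1}$

  • 13:   $\alpha_{a_t} \leftarrow \alpha_{a_t} + \frac{1}{2}$

  • 14:   $\beta_{a_t} \leftarrow \beta_{a_t} + \frac{\tilde{\nu}\,(r_{t,a_t} - \tilde{m})^2}{2(\tilde{\nu} + 1)}$

  • 15: end for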
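The sampling and update steps of Algorithm 3 can be sketched with the Python standard library alone. The default hyperparameters, the simulated reward means, and the noise level below are assumptions chosen for illustration, not the values used in our experiment.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class NIGArm:
    """Normal-Inverse-Gamma belief for one exercise (defaults are illustrative)."""
    m: float = 0.0      # mean
    nu: float = 1.0     # precision (pseudo-count)
    alpha: float = 1.0  # shape
    beta: float = 1.0   # scale

    def sample_mean(self, rng: random.Random) -> float:
        # If G ~ Gamma(shape=alpha, scale=1/beta), then 1/G ~ InverseGamma(alpha, beta).
        sigma2 = 1.0 / rng.gammavariate(self.alpha, 1.0 / self.beta)
        return rng.gauss(self.m, math.sqrt(sigma2 / self.nu))

    def update(self, r: float) -> None:
        # Single-observation conjugate update (steps 10 to 14 of Algorithm 3).
        m_old, nu_old = self.m, self.nu
        self.nu = nu_old + 1.0
        self.m = (nu_old * m_old + r) / (nu_old + 1.0)
        self.alpha += 0.5
        self.beta += nu_old * (r - m_old) ** 2 / (2.0 * (nu_old + 1.0))


def thompson_select(arms, rng: random.Random) -> int:
    """Sample a mean for every exercise and return the index of the largest."""
    samples = [arm.sample_mean(rng) for arm in arms]
    return max(range(len(arms)), key=samples.__getitem__)


# Simulated environment with two exercises; the true means are made up.
rng = random.Random(0)
true_means = [0.05, 0.30]
arms = [NIGArm(), NIGArm()]
pulls = [0, 0]
for _ in range(2000):
    a = thompson_select(arms, rng)
    r = rng.gauss(true_means[a], 0.05)  # noisy skill gain
    arms[a].update(r)
    pulls[a] += 1
```

In this simulation the policy should concentrate its recommendations on the second exercise as its posterior narrows, while the first exercise continues to be sampled occasionally because its posterior remains comparatively wide.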

3.3.2. LinTS.

LinTS (Agrawal and Goyal 2013) extends TS to contextual bandits by assuming that the expected reward of each exercise is a linear function of the context vector. It balances exploration and exploitation by sampling parameter vectors from a Bayesian posterior distribution that captures uncertainty in the estimated parameters.

In LinTS, each exercise $a$ is associated with an information matrix $A_a \in \mathbb{R}^{d \times d}$ and a reward vector $b_a \in \mathbb{R}^{d}$. The posterior mean parameter vector $\omega_{a,t}$ is given by

$$\omega_{a,t} = A_a^{-1} b_a, \tag{7}$$
which is updated iteratively. The posterior covariance of the parameter estimates is proportional to $A_a^{-1}$, shrinking as more context–reward pairs are observed for that exercise.

Exploration in LinTS arises naturally from the parameter sampling process. At each time step $t$, a parameter vector $\theta_{a,t}$ is sampled for each exercise $a$ according to

$$\theta_{a,t} \sim \mathcal{N}(\omega_{a,t}, v^2 A_a^{-1}), \tag{8}$$
where $v > 0$ is a fixed exploration scaling factor. This sampling encourages exploration of exercises with high uncertainty in their reward estimates, whereas exploitation is driven by the posterior mean $\omega_{a,t}$. Over time, as more evidence is gathered and hence $A_a^{-1}$ shrinks, the algorithm shifts naturally toward exploitation.

In this work, we adopt a Gaussian LinTS variant as formalized in Algorithm 4. Consistent with the approach of Agrawal and Goyal (2013), the posterior variance scaling factor v is held fixed rather than estimated adaptively.

Algorithm 4

(LinTS Algorithm with Fixed Exploration Scaling Factor (Agrawal and Goyal 2013))

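  • 1: Input: Regularization parameter $\lambda$, exploration scaling factor $v$, context space $\mathcal{X}$, set of exercises $\mathcal{A}$

  • 2: Initialize: For each exercise $a \in \mathcal{A}$,

    $A_a = \lambda I_d, \quad b_a = 0_d, \quad \omega_{a,0} = 0_d$

  • 3: for each time step $t = 1, 2, \dots$ do

  • 4:  Observe context vector $x_t \in \mathcal{X}$

  • 5:  for each exercise $a \in \mathcal{A}$ do

  • 6:   Sample parameter vector:

    $\theta_{a,t} \sim \mathcal{N}(\omega_{a,t}, v^2 A_a^{-1})$

  • 7:   Compute the expected reward:

    $\hat{r}_{a,t} = x_t^{\top} \theta_{a,t}$

  • 8:  end for

  • 9:  Select exercise $a_t = \arg\max_{a} \hat{r}_{a,t}$

  • 10:  Recommend exercise $a_t$ and observe reward $r_{t,a_t}$

  • 11:  Update the selected exercise:

    $A_{a_t} = A_{a_t} + x_t x_t^{\top}, \quad b_{a_t} = b_{a_t} + r_{t,a_t} x_t, \quad \omega_{a_t,t+1} = A_{a_t}^{-1} b_{a_t}$

  • 12: end for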
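A compact NumPy sketch of Algorithm 4 follows. The class name, the two-dimensional one-hot contexts, the true parameter vectors, and the values of $\lambda$ and $v$ are all assumptions chosen to keep the example small; they do not correspond to the tuned configuration of our experiment.

```python
import numpy as np


class LinTSArm:
    """Per-exercise linear model with Gaussian posterior N(omega, v^2 A^{-1})."""

    def __init__(self, d: int, lam: float = 1.0) -> None:
        self.A = lam * np.eye(d)  # information matrix A_a
        self.b = np.zeros(d)      # reward vector b_a

    def posterior_mean(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)  # omega = A^{-1} b

    def sample_theta(self, v: float, rng: np.random.Generator) -> np.ndarray:
        cov = v ** 2 * np.linalg.inv(self.A)
        return rng.multivariate_normal(self.posterior_mean(), cov)

    def update(self, x: np.ndarray, r: float) -> None:
        self.A += np.outer(x, x)
        self.b += r * x


def lints_select(arms, x: np.ndarray, v: float, rng: np.random.Generator) -> int:
    """Sample one parameter vector per exercise and pick the best predicted reward."""
    scores = [x @ arm.sample_theta(v, rng) for arm in arms]
    return int(np.argmax(scores))


# Simulated environment: exercise 0 helps in context [1, 0], exercise 1 in [0, 1].
rng = np.random.default_rng(0)
true_theta = np.array([[0.3, 0.0], [0.0, 0.3]])  # made-up ground truth
arms = [LinTSArm(d=2), LinTSArm(d=2)]
hits = []
for t in range(500):
    x = np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 1.0])
    a = lints_select(arms, x, v=0.1, rng=rng)
    r = float(x @ true_theta[a]) + rng.normal(0.0, 0.01)  # noisy skill gain
    arms[a].update(x, r)
    hits.append(a == int(np.argmax(true_theta @ x)))
late_accuracy = float(np.mean(hits[-100:]))
```

Because the best exercise depends on the context in this simulation, a noncontextual policy could at best learn a single preferred exercise, whereas the sampled linear models can switch between exercises as the context changes.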

4. Experimental Setup

4.1. Data Set

The data set used for this experiment is the ASSISTments 2017 data set (Patikorn et al. 2020), a large-scale clickstream corpus collected from the web-based ASSISTments tutoring system (Heffernan and Heffernan 2014), which records middle-school students’ mathematics exercise-solving activities between 2004 and 2006. The data set is widely used in learning analytics research, for example, in learner performance prediction (Hakkal and Lahcen 2024) and knowledge tracing estimation (Cully and Demiris 2020, Neshaei et al. 2024). It contains 935,638 interaction records from 1,708 unique learners across 3,162 distinct exercises, with 37.4% of all attempts answered correctly.

The data set provides three complementary types of inputs. First, clickstream records capture the raw sequence of learner–system interactions. Second, exercise information describes the exercises themselves, including the associated cognitive skill(s) and exercise type (e.g., multiple choice, open response). Third, student profiles encode both background information and behavioral characteristics. Background attributes include sociodemographic indicators such as gender, the middle school attended, and the academic year in which the system was used. Behavioral characteristics are derived from historical interaction logs and summarized over the learner’s past activity. These include the following:

  1. Academic proficiency indicators, which include the learner’s average knowledge mastery across all mathematical skills targeted by the system, performance on the Massachusetts Comprehensive Assessment System (MCAS) mathematics test and the learner’s overall correct response rate. These academic proficiency indicators capture complementary dimensions of learner ability. The MCAS mathematics score provides a stable, externally validated measure of baseline competence, whereas average knowledge mastery reflects broader conceptual understanding as inferred during system interaction. The overall correctness rate indicates how efficiently and accurately learners apply their knowledge.

  2. Affective state indicators that capture internal emotions or psychological states that can influence learning, which include the averaged tendencies toward confusion, frustration, boredom, and engaged concentration.

  3. Disengaged behavior indicators that reflect behavioral patterns that indicate that a learner is not productively engaged, which include the averaged tendencies for carelessness (e.g., slipping an exercise), gaming the system, and disengaging from the learning task.

The affective and disengagement indicators are obtained via a two-stage process: manual labeling through in-class field observations on a representative subsample, followed by the training of automated detectors with supervised machine learning methods, which are then applied to the full data set (Pardos et al. 2014). In addition, cognitive measures in the form of BKT estimates of mastery are included in the data set. After each exercise attempt, the system recalculates the learner’s probability of mastery for the specific cognitive skill targeted by the exercise, thereby providing a fine-grained, time-varying representation of the learner’s knowledge state. In this study, reward is defined as skill gain, computed as the difference between the BKT mastery estimates after and before the interaction for the specific skill targeted by the exercise. This continuous-valued signal measures the incremental change in the learner’s estimated mastery due to engaging with the recommended exercise, providing a pedagogically grounded target for optimization.

4.2. Data Preprocessing

Prior to partitioning the data set into training, validation, and test subsets, we apply the following preprocessing pipeline.

  1. Reward calculation and filtering. Any interaction in which either the preinteraction mastery estimate or the postinteraction mastery estimate for the skill associated with the attempted exercise is missing is removed. Only interactions with strictly positive rewards are retained, which focuses the learning process on exercises that have demonstrably advanced a learner’s mastery of the targeted skill and also reduces computation time. The empirical distribution of these computed rewards is shown in Figure 2(a). The pronounced peak around zero reflects the large share of interactions that do not yield measurable improvements in estimated mastery. Moreover, the distribution is positively skewed: most interactions correspond to small to moderate gains, with a long right tail representing larger skill gains.

  2. Duplicate user–exercise interactions. For any user–exercise pair with multiple recorded interactions, only the chronologically most recent attempt is retained. This situation often arises on the ASSISTments platform when learners request hints, retry after incorrect responses, or reopen an exercise within the same session. Retaining only the final attempt ensures that the postinteraction mastery estimate reflects the learner’s ultimate knowledge state for that exercise, avoiding inflated counts from partial or intermediate states.

  3. Learner activity threshold. Learners with fewer than 50 interactions are excluded to preserve sufficient historical data for personalized modeling, because shorter histories produce highly unstable skill gain estimates and provide too little signal for meaningful contextual differentiation. The resulting student activity levels, measured as the number of retained interactions per learner, are summarized in Figure 2(b).

  4. Warm-start enforcement. After the temporal split, any validation or test interactions involving a user or exercise unseen in the training set are removed. This ensures that evaluation occurs in a warm-start setting, where all entities at test time have prior representation in the training data, thereby avoiding cold-start scenarios.
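Steps 1 to 3 of the pipeline above can be sketched in plain Python as follows. This is a minimal illustration, assuming each interaction is a dictionary with the hypothetical keys `user`, `exercise`, `timestamp`, `p_before`, and `p_after`, and assuming the steps are applied in the order listed. Step 4 depends on the temporal split and is therefore not covered here.

```python
from collections import Counter

MIN_INTERACTIONS = 50  # learner activity threshold stated in the text


def preprocess(rows, min_interactions=MIN_INTERACTIONS):
    """Apply reward filtering, deduplication, and the activity threshold.

    `rows` is an iterable of dicts with the hypothetical keys
    user, exercise, timestamp, p_before, p_after.
    """
    # Step 1: drop rows with a missing mastery estimate, compute the reward,
    # and keep only strictly positive rewards.
    scored = []
    for row in rows:
        if row["p_before"] is None or row["p_after"] is None:
            continue
        reward = row["p_after"] - row["p_before"]
        if reward > 0:
            scored.append({**row, "reward": reward})

    # Step 2: keep only the most recent attempt per (user, exercise) pair.
    latest = {}
    for row in scored:
        key = (row["user"], row["exercise"])
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    deduped = list(latest.values())

    # Step 3: drop learners with fewer than `min_interactions` retained rows.
    counts = Counter(row["user"] for row in deduped)
    kept = [row for row in deduped if counts[row["user"]] >= min_interactions]
    return sorted(kept, key=lambda r: (r["user"], r["timestamp"]))


rows = [
    {"user": "u1", "exercise": "e1", "timestamp": 1, "p_before": 0.2, "p_after": 0.3},
    {"user": "u1", "exercise": "e1", "timestamp": 2, "p_before": 0.3, "p_after": 0.5},
    {"user": "u1", "exercise": "e2", "timestamp": 3, "p_before": 0.5, "p_after": 0.4},
    {"user": "u1", "exercise": "e3", "timestamp": 4, "p_before": 0.4, "p_after": None},
    {"user": "u1", "exercise": "e4", "timestamp": 5, "p_before": 0.4, "p_after": 0.6},
    {"user": "u2", "exercise": "e1", "timestamp": 1, "p_before": 0.1, "p_after": 0.2},
]
# A threshold of 2 is used only so that the toy data set is not emptied.
clean = preprocess(rows, min_interactions=2)
print([(r["user"], r["exercise"], r["timestamp"]) for r in clean])
```

In the toy example, the negative-reward and missing-estimate rows are dropped, the earlier attempt on `e1` is discarded in favor of the later one, and learner `u2` is removed for falling below the (reduced) activity threshold.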

Figure 2. Overview of Data Set Characteristics
Notes. (a) Distribution of skill-gain rewards. (b) Variability in student activity levels.

A summary of the final preprocessed data set, including the number of unique users, exercises, skills, and total number of interactions, is provided in Table 1.

Table 1. Summary Statistics of the Preprocessed ASSISTments 2017 Data Set

| Statistic | Value | Description |
| --- | --- | --- |
| Unique users | 1,250 | Number of distinct learners in the data set. |
| Unique exercises | 2,600 | Number of distinct exercises available for recommendation. |
| Interactions | 167,585 | Total number of recorded learner–exercise interactions after preprocessing. |
| Number of skills | 102 | Number of distinct knowledge concepts (e.g., supplementary angles, Pythagorean theorem). |

For the contextual bandit setting, we construct a context vector for each interaction by concatenating user features as listed in Table 2. Categorical variables are one-hot encoded, and continuous features are standardized.

Table 2. Features Used to Construct the Context Vector $x_t$

| Category | Feature | Description |
| --- | --- | --- |
| Sociodemographic characteristics | Academic year | Year(s) during which the learner used the platform (categorical). |
| Sociodemographic characteristics | School | Anonymized middle-school identifier (categorical). |
| Sociodemographic characteristics | Gender | Gender of the learner (categorical). |
| Academic proficiency | Average knowledge mastery | Average student knowledge level across all skills the learner has attempted. |
| Academic proficiency | Overall correctness rate | Fraction of correct responses across all attempted exercises. |
| Academic proficiency | MCAS mathematics score | Standardized MCAS math assessment score. |
| Affective state | Confusion | Mean predicted probability of confusion over past interactions. |
| Affective state | Frustration | Mean predicted probability of frustration over past interactions. |
| Affective state | Boredom | Mean predicted probability of boredom over past interactions. |
| Affective state | Engaged concentration | Mean predicted probability of being focused/engaged. |
| Disengaged behavior | Carelessness | Mean predicted probability of careless errors. |
| Disengaged behavior | Gaming the system | Mean predicted probability of exploiting system loopholes. |
| Disengaged behavior | Off-task | Mean predicted probability of disengagement from the learning task. |
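The encoding described above (one-hot for categorical features, standardization for continuous ones, then concatenation) might look as follows. This is a minimal sketch with only two features of each kind; the column names are hypothetical, and fitting the vocabularies and standardization statistics on training rows only is an assumption rather than something the text specifies.

```python
from statistics import mean, pstdev

CATEGORICAL = ["school", "gender"]            # hypothetical column names
CONTINUOUS = ["avg_mastery", "correct_rate"]  # hypothetical column names


def fit_encoder(train_rows):
    """Learn category vocabularies and standardization statistics."""
    vocab = {c: sorted({row[c] for row in train_rows}) for c in CATEGORICAL}
    stats = {}
    for c in CONTINUOUS:
        values = [row[c] for row in train_rows]
        std = pstdev(values)
        # Guard against constant columns, which would otherwise divide by zero.
        stats[c] = (mean(values), std if std > 0 else 1.0)
    return vocab, stats


def encode(row, vocab, stats):
    """Concatenate one-hot categorical and standardized continuous features."""
    x = []
    for c in CATEGORICAL:
        x.extend(1.0 if row[c] == level else 0.0 for level in vocab[c])
    for c in CONTINUOUS:
        mu, sigma = stats[c]
        x.append((row[c] - mu) / sigma)
    return x


train = [
    {"school": "A", "gender": "F", "avg_mastery": 0.2, "correct_rate": 0.5},
    {"school": "B", "gender": "M", "avg_mastery": 0.6, "correct_rate": 0.5},
]
vocab, stats = fit_encoder(train)
x = encode(train[0], vocab, stats)
print(x)
```

A category level that never appears in the training rows is encoded here as all zeros, which is one simple convention among several.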

4.3. Data Splitting

We adopt a temporal user split strategy, a commonly used evaluation approach that splits the historical interactions by percentage based on the interaction timestamps (Meng et al. 2020). For each learner, interactions are ordered chronologically, with the first 70% assigned to training, the next 15% to validation, and the final 15% to test. This preserves the natural temporal sequence of interactions, ensures user overlap across splits, and mirrors real-world online deployment where future learner states are unknown at recommendation time.
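The per-learner 70/15/15 split, together with the warm-start filter from the preprocessing pipeline, can be sketched as below. The record keys (`user`, `exercise`, `timestamp`) are hypothetical, and the integer rounding of split sizes is an assumption because the text does not state how fractional counts are handled.

```python
from collections import defaultdict


def temporal_user_split(rows, train_pct=70, val_pct=15):
    """Split each learner's chronologically ordered interactions 70/15/15."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user"]].append(row)

    train, val, test = [], [], []
    for history in by_user.values():
        history.sort(key=lambda r: r["timestamp"])
        n = len(history)
        n_train = n * train_pct // 100  # integer arithmetic avoids float rounding
        n_val = n * val_pct // 100
        train.extend(history[:n_train])
        val.extend(history[n_train:n_train + n_val])
        test.extend(history[n_train + n_val:])  # remainder goes to test
    return train, val, test


def enforce_warm_start(train, held_out):
    """Drop held-out rows whose user or exercise never appears in training."""
    users = {r["user"] for r in train}
    exercises = {r["exercise"] for r in train}
    return [r for r in held_out if r["user"] in users and r["exercise"] in exercises]


# Toy history: 20 interactions; the last two involve exercises unseen earlier.
rows = [
    {"user": "u1", "timestamp": t, "exercise": f"e{t}" if t >= 18 else f"e{t % 6}"}
    for t in reversed(range(20))
]
train, val, test = temporal_user_split(rows)
warm_test = enforce_warm_start(train, test)
print(len(train), len(val), len(test), len(warm_test))
```

In the toy history, two of the three test interactions involve exercises that never occur in the training portion, so the warm-start filter keeps only one of them.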

4.4. Algorithms

Both UserCF and ItemCF maintain a user–exercise reward matrix that is updated in buffered batches every 1,000 interactions, with pending updates applied at the end of training. UserCF estimates candidate effectiveness by computing cosine similarity between the target learner and all others. ItemCF instead relies on similarities between candidate exercises and those previously attempted by the learner. For the bandit models, TS represents each exercise with a Gaussian reward distribution under a Normal–Inverse–Gamma prior initialized with noninformative hyperparameters, and it updates exercise-specific statistics incrementally after each interaction. LinTS maintains a separate linear model per exercise with ridge regularization fixed at λ=1. To reduce computational overhead, matrix inversions and parameter estimates are recomputed only every 1,000 steps. Both TS-based models (TS and LinTS) include a short warm-start phase of random exercises to ensure initial coverage. Finally, in all methods, once a learner has attempted an exercise, it is excluded from future recommendations, reflecting realistic tutoring scenarios where repeating the same exercise yields negligible learning gains.
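As an illustration of the noncontextual TS model, the sketch below keeps a Normal–Inverse–Gamma posterior per exercise, updates it one observation at a time with the standard conjugate formulas, and selects the untried exercise with the largest sampled mean. The class and function names are hypothetical. The default hyperparameters are the best TS configuration reported in Section 5, and the warm-start phase and buffered updates described above are omitted for brevity.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class NIGArm:
    """Normal-Inverse-Gamma posterior over one exercise's mean reward."""
    m: float = 0.0      # posterior mean of the reward mean
    v: float = 0.01     # precision scaling (pseudo-count) on the mean
    alpha: float = 1.0  # Inverse-Gamma shape
    beta: float = 2.0   # Inverse-Gamma scale

    def sample_mean(self, rng: random.Random) -> float:
        # sigma^2 ~ InvGamma(alpha, beta), drawn as 1 / Gamma(shape=alpha, scale=1/beta).
        sigma2 = 1.0 / rng.gammavariate(self.alpha, 1.0 / self.beta)
        # mu | sigma^2 ~ Normal(m, sigma^2 / v).
        return rng.gauss(self.m, math.sqrt(sigma2 / self.v))

    def update(self, reward: float) -> None:
        # Conjugate update for a single observed reward.
        v_new = self.v + 1.0
        self.beta += self.v * (reward - self.m) ** 2 / (2.0 * v_new)
        self.m = (self.v * self.m + reward) / v_new
        self.alpha += 0.5
        self.v = v_new


def select_exercise(arms, attempted, rng):
    """Thompson step: sample each untried exercise's mean and pick the largest."""
    candidates = [e for e in arms if e not in attempted]
    return max(candidates, key=lambda e: arms[e].sample_mean(rng))


rng = random.Random(0)
arms = {"e1": NIGArm(), "e2": NIGArm()}
choice = select_exercise(arms, attempted={"e1"}, rng=rng)
arms[choice].update(0.25)  # observed skill gain for the chosen exercise
print(choice, arms[choice])
```

Excluding already attempted exercises from `candidates` mirrors the no-repetition rule stated above; the function raises `ValueError` once a learner has exhausted all exercises, which a full implementation would need to handle.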

4.5. Validation Strategy

We tune the bandit hyperparameters by grid search on the validation split, with mean instantaneous reward as the evaluation criterion. LinTS is tuned over the variance-scaling parameter $v$, whereas TS is tuned over the Normal–Inverse–Gamma prior parameters $(v_0, \alpha_0, \beta_0)$ with a fixed prior mean $m_0 = 0$. Table 3 summarizes the respective search spaces. The best-performing configuration for each model is then retrained on the combined training and validation data and evaluated once on the held-out test split, without further adaptation during testing. The CF baselines have no hyperparameters and are trained directly on the combined data set before final evaluation, whereas the TS-based strategies require hyperparameter tuning, which incurs additional computational cost.

Table 3. Hyperparameter Grids Used for LinTS and TS

| Model | Hyperparameter | Candidate values |
| --- | --- | --- |
| LinTS | $v$ | {0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0} |
| TS | $m_0$ | {0.0} |
| TS | $v_0$ | {0.01, 0.1, 0.5, 1.0, 5.0} |
| TS | $\alpha_0$ | {0.1, 1.0, 2.0} |
| TS | $\beta_0$ | {0.1, 1.0, 2.0} |
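The grids in Table 3 can be enumerated with a few lines of Python, which also makes the size of each search explicit (9 LinTS configurations and 45 TS configurations). The `evaluate` callable is a hypothetical placeholder for training a model on the training split and returning its mean instantaneous reward on the validation split; the toy objective in the usage lines is a stand-in and does not reproduce any reported result.

```python
from itertools import product

LINTS_GRID = {"v": [0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]}
TS_GRID = {
    "m0": [0.0],
    "v0": [0.01, 0.1, 0.5, 1.0, 5.0],
    "alpha0": [0.1, 1.0, 2.0],
    "beta0": [0.1, 1.0, 2.0],
}


def expand(grid):
    """Yield every hyperparameter combination in a grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, evaluate):
    """Return (best_config, best_score) under the hypothetical `evaluate`."""
    scored = ((cfg, evaluate(cfg)) for cfg in expand(grid))
    return max(scored, key=lambda pair: pair[1])


print(len(list(expand(LINTS_GRID))), len(list(expand(TS_GRID))))

# Toy stand-in objective, used only to show the call pattern.
toy_best, toy_score = grid_search(LINTS_GRID, lambda cfg: -abs(cfg["v"] - 0.5))
print(toy_best, toy_score)
```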

5. Results

All model hyperparameters were selected by maximizing the average instantaneous reward on the validation set. The best configuration for TS was $(m_0, v_0, \alpha_0, \beta_0) = (0.0, 0.01, 1.0, 2.0)$. This corresponds to a neutral prior mean $m_0 = 0$, a low prior precision $v_0 = 0.01$ that yields a diffuse prior over exercise means, and hyperparameters $(\alpha_0, \beta_0) = (1.0, 2.0)$ that maintain posterior uncertainty during the initial learning phase. For LinTS, the optimal exploration scale was $v = 0.05$. Because $v$ scales the posterior covariance of the sampled coefficients, this relatively small value reduces injected sampling noise and thus favors exploitation of the informative learner context once sufficient evidence has accumulated.
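The role of the exploration scale can be seen in a minimal LinTS sketch in the style of Agrawal and Goyal (2013), with one ridge-regularized linear model per exercise. It assumes NumPy is available and that coefficients are sampled from a Gaussian with covariance $v^2 A^{-1}$; whether the implementation evaluated here scales by $v$ or $v^2$ is not stated, so that detail is an assumption. The class and function names are hypothetical, and the matrix is inverted on every call rather than every 1,000 steps, purely for readability.

```python
import numpy as np


class LinTSArm:
    """Per-exercise Bayesian ridge model for linear Thompson sampling."""

    def __init__(self, dim: int, lam: float = 1.0):
        self.A = lam * np.eye(dim)  # ridge-regularized precision matrix
        self.b = np.zeros(dim)      # reward-weighted sum of contexts

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

    def posterior_mean(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)

    def sample_score(self, x: np.ndarray, v: float, rng: np.random.Generator) -> float:
        cov = np.linalg.inv(self.A)
        cov = (cov + cov.T) / 2.0  # symmetrize against round-off
        # A smaller v shrinks the sampling noise around the posterior mean.
        theta = rng.multivariate_normal(self.posterior_mean(), (v ** 2) * cov)
        return float(x @ theta)


def select_exercise(arms, x, attempted, v, rng):
    """Pick the untried exercise with the largest sampled predicted reward."""
    candidates = [e for e in arms if e not in attempted]
    return max(candidates, key=lambda e: arms[e].sample_score(x, v, rng))


rng = np.random.default_rng(0)
arms = {"e1": LinTSArm(dim=2), "e2": LinTSArm(dim=2)}
x = np.array([1.0, 0.0])
arms["e1"].update(x, reward=1.0)
print(arms["e1"].posterior_mean())
print(select_exercise(arms, x, attempted={"e1"}, v=0.05, rng=rng))
```

With a very small $v$, the sampled score is essentially the ridge prediction $x^\top A^{-1} b$, which is the sense in which small values favor exploitation.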

Figure 3 presents the evolution of cumulative average reward on the held-out test set across all models. The results show that bandit-based approaches outperform CF baselines. Both TS variants yield higher rewards than UserCF and ItemCF, confirming that exploration–exploitation strategies can generate more effective recommendations than neighborhood-based heuristics.

Figure 3. Cumulative Average Reward on the Held-Out Test Set
Note. LinTS outperforms all noncontextual baselines, including TS and CF baselines, underscoring the value of contextual modeling in adaptive educational recommendation.
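For readers who wish to reproduce a curve of this kind, the following sketch computes a cumulative average, assuming that "cumulative average reward" denotes the running mean of the rewards obtained up to each test interaction. The reward values are made up for illustration.

```python
from itertools import accumulate


def cumulative_average(rewards):
    """Running mean: element t is the average of rewards[0..t]."""
    return [total / (t + 1) for t, total in enumerate(accumulate(rewards))]


curve = cumulative_average([0.2, 0.0, 0.4, 0.2])
print([round(value, 3) for value in curve])
```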

Among the tested models, LinTS achieves the highest performance, converging to a final average reward of 0.198. This corresponds to a 15.2% improvement over standard TS (0.172), a 16.5% improvement over ItemCF (0.170), and a 20.7% improvement over UserCF (0.164). Although standard TS already performs better than both CF baselines, the contextual extension embodied in LinTS produces substantially larger gains, highlighting the added value of incorporating learner features into the exercise-selection process.
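The relative improvements can be checked directly from the final average rewards quoted above. Recomputing from the three-decimal values gives approximately 15.1%, 16.5%, and 20.7%; the small difference from the reported 15.2% for TS is consistent with the quoted averages having been rounded before publication, although that explanation is an assumption.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


lints = 0.198
baselines = {"TS": 0.172, "ItemCF": 0.170, "UserCF": 0.164}
for name, value in baselines.items():
    print(f"LinTS vs {name}: {relative_improvement(lints, value):.1f}%")
```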

Figure 4 shows exercise-selection frequency distributions, with the x axis denoting exercise identifiers and the y axis denoting the number of selections during testing. ItemCF (Figure 4(a)) spreads choices widely across the exercise space, reflecting the absence of adaptive prioritization. UserCF (Figure 4(b)), by contrast, concentrates almost exclusively on a few exercises, illustrating premature convergence and overexploitation. TS (Figure 4(c)) distributes exercises more broadly than UserCF, avoiding the premature lock-in observed in that model. At the same time, its selections are less diffuse than those of ItemCF and hence concentrate more on consistently rewarding exercises. LinTS (Figure 4(d)) goes further by identifying a narrower set of high-value exercises, indicating more effective balancing of exploration and exploitation in the contextual setting.

Figure 4. Exercise Selection Frequency Distributions During Testing Across the Four Best-Performing Agents
Note. Contextual modeling (LinTS) concentrates selections on a narrower set of informative exercises, whereas noncontextual strategies spread choices more diffusely across the exercise space.

To better understand these dynamics, Figure 5 analyzes LinTS behavior during training. In the first 10,000 rounds (Figure 5(a)), the distribution is broad and relatively uniform, reflecting an exploratory phase in which the agent samples widely across the exercise space. In contrast, during the final 10,000 rounds (Figure 5(b)), the frequency distribution becomes highly concentrated on a small subset of exercises, indicating focused exploitation of high-value learning opportunities. These findings highlight that contextual linear modeling not only improves reward performance but also produces qualitatively different exploration–exploitation dynamics, enabling more principled exploration and more focused exploitation of high-reward exercises.

Figure 5. Exercise Selection Frequency Distributions for LinTS (v=0.05) During Training
Notes. (a) Early exploration behavior. (b) Later-stage exploitation dynamics.

6. Discussion

From an instructional perspective, the proposed bandit-based ERS offers three key benefits for OR/MS/analytics courses. First, many such courses still rely on instructor-centered teaching practices in which all learners progress through the same fixed sequence of exercises predetermined by the instructor. This structure limits the ability to adjust difficulty, pacing, or feedback to individual needs and often reproduces typical drawbacks of fixed exercise paths: exercises that are too easy may induce disengagement, whereas overly difficult tasks can cause frustration and reduced persistence. The proposed ERS addresses these limitations by adaptively selecting exercises based on each learner’s evolving skill profile. For example, the system could be embedded into an introductory optimization or probability course to automatically recommend additional practice on duality or conditional probability for students who struggle while advancing more prepared students to more difficult exercises. By personalizing the learning trajectory in this way, LinTS supports scalable learning environments in which instruction becomes responsive rather than prescriptive, enabling individualized practice without requiring instructors to manually construct multiple parallel exercise pathways. This adaptivity is particularly valuable in large enrollment courses and other digital learning environments where instructors cannot feasibly monitor or tailor learning trajectories for all students.

Second, as shown in Figures 4 and 5, LinTS concentrates its recommendations on a relatively small set of exercises. This provides empirically grounded feedback to support course design: The learned policy highlights which exercises consistently generate large skill gains and are therefore strong candidates for in-class discussion, worked examples, or graded assignments. The same mechanism can underpin an instructor dashboard that identifies which prerequisite skills learners are struggling with, such as matrix operations in linear programming or probability rules in stochastic models, and highlights which exercises produce the strongest gains for specific student subgroups. Such information supports targeted intervention during classroom hours or tutorial sessions.

Third, learners in OR/MS/analytics courses often enter with widely varying quantitative competencies in areas such as statistics, linear algebra, and optimization. Because LinTS can condition its recommendations on learner background, it can identify students who struggle with standard exercise sets, for example, because of limited prerequisite skills, and recommend more suitable practice exercises. This enables instructors to provide differentiated remediation. Collectively, these insights underscore the potential of contextual bandit–based ERS to support individualized data-driven instruction and remediation in OR/MS/analytics courses, including large-scale digital learning environments where traditional personalized feedback is difficult to provide at scale.

7. Conclusion

ERS provide a scalable mechanism for supporting active learning in digital OR/MS/analytics settings, where large and heterogeneous learner populations make individualized guidance difficult to provide manually. CF remains widely used in ERS, but its reliance on historical similarity patterns, lack of adaptivity, and absence of an exploration mechanism limit its ability to support effective personalized learning trajectories. This work introduces a contextual bandit framework based on LinTS, which models exercise effectiveness as a function of learner features and optimizes directly for skill gain. Experiments on the ASSISTments 2017 data set show that LinTS outperforms both noncontextual TS and CF baselines, achieving higher average skill gains and exhibiting desirable exploration–exploitation dynamics. The results highlight several instructional benefits: adaptive sequencing that responds to learners’ evolving skill profiles, data-driven insights into which exercises most effectively promote learning, and the ability to identify students who may require targeted support.

Some limitations of this study must be acknowledged. The work relies on simplifying assumptions, which may limit the extent to which the findings generalize beyond the present setting. For example, learners with fewer than 50 interactions were excluded, limiting the applicability of the proposed approach in sparse data settings involving many new or infrequent users. Future work should incorporate richer contextual signals, explore nonlinear model classes, and consider multiobjective formulations that balance learning progress with other pedagogical goals. Such extensions would further enhance the applicability of contextual bandits in adaptive learning systems.

References

  • Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Ann. Conf. Learn. Theory, vol. 23 (PMLR, Cambridge, MA), 39.1–39.26.
  • Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., vol. 28 (PMLR, Cambridge, MA), 127–135.
  • Aramayo N, Schiappacasse M, Goic M (2023) A multiarmed bandit approach for house ads recommendations. Marketing Sci. 42(2):271–292.
  • Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 1–9.
  • Clément B, Roy D, Oudeyer PY, Lopes M (2014) Online optimization of teaching sequences with multi-armed bandits. Stamper J, Pardos ZA, Mavrikis M, McLaren BM, eds. Proc. 7th Internat. Conf. Ed. Data Mining (International Educational Data Mining Society, Worcester, MA), 269–272.
  • Corbett AT, Anderson JR (1994) Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling User-Adapt. Interactions 4(4):253–278.
  • Cully A, Demiris Y (2020) Online knowledge level tracking with data-driven student models and collaborative filtering. IEEE Trans. Knowledge Data Engrg. 32(10):2000–2013.
  • Da Silva FL, Slodkowski BK, Da Silva KKA, Cazella SC (2023) A systematic literature review on educational recommender systems for teaching and learning: Research trends, limitations and opportunities. Ed. Inform. Tech. (Dordrecht) 28(3):3289–3328.
  • De Kerpel L, Benoit D (2025) A reward-informed semi-personalized bandit approach for enhancing accuracy and serendipity in online slate recommendations. ACM Trans. Recommender Systems (ACM, New York).
  • Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using Thompson sampling. Oper. Res. 66(6):1586–1602.
  • Fornasiero M, Malucelli F, Pazzi R, Schettini T (2021) Empowering optimization skills through an orienteering competition. INFORMS Trans. Ed. 22(1):1–8.
  • Hakkal S, Lahcen AA (2024) XGBoost to enhance learner performance prediction. Comput. Ed. Artificial Intelligence 7:100254.
  • Heffernan NT, Heffernan CL (2014) The ASSISTments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. Internat. J. Artificial Intelligence Ed. 24(4):470–497.
  • Honda J, Takemura A (2014) Optimality of Thompson sampling for Gaussian bandits depends on priors. Kaski S, Corander J, eds. Proc. 17th Internat. Conf. Artificial Intelligence Statist., vol. 33 (PMLR, Cambridge, MA), 375–383.
  • Huang L, Wang CD, Chao HY, Lai JH, Yu PS (2019) A score prediction approach for optional course recommendation via cross-user-domain collaborative filtering. IEEE Access 7:19550–19563.
  • Intayoad W, Kamyod C, Temdee P (2020) Reinforcement learning based on contextual bandits for personalized online learning recommendation systems. Wireless Personal Comm. 115(4):2917–2932.
  • Khanal SS, Prasad P, Alsadoon A, Maag A (2020) A systematic review: Machine learning based recommendation systems for e-learning. Ed. Inform. Tech. (Dordrecht) 25(4):2635–2664.
  • Krahenbuhl KS (2016) Student-centered education and constructivism: Challenges, concerns, and clarity for teachers. Clearing House 89(3):97–105.
  • Liu Y, Feng J, Lu J (2017) Collaborative filtering algorithm based on rating distance. Kim CH, Lee HW, Lee DH, Sakurai K, eds. Proc. 11th Internat. Conf. Ubiquitous Inform. Management Comm. (Association for Computing Machinery, New York), 1–7.
  • Liu YE, Mandel T, Brunskill E, Popovic Z (2014) Trading off scientific knowledge and user learning with multi-armed bandits. Accessed August 7, 2025, https://api.semanticscholar.org/CorpusID:4103970.
  • Maclean KDS, Bayley T (2024) That’s incorrect and let me tell you why: A scalable assessment to evaluate higher order thinking skills. INFORMS Trans. Ed. 25(1):23–34.
  • Manickam I, Lan AS, Baraniuk RG (2017) Contextual multi-armed bandit algorithms for personalized learning action selection. Proc. IEEE Internat. Conf. Acoustics Speech Signal Processing, 6344–6348.
  • Meng Z, McCreadie R, Macdonald C, Ounis I (2020) Exploring data splitting strategies for the evaluation of recommendation models. Proc. 14th ACM Conf. Recommender Systems (Association for Computing Machinery, New York), 681–686.
  • Nafea SM, Siewe F, He Y (2019) On recommendation of learning objects using Felder-Silverman learning style model. IEEE Access 7:163034–163048.
  • Neshaei SP, Davis RL, Hazimeh A, Lazarevski B, Dillenbourg P, Käser T (2024) Towards modeling learner performance with large language models. Proc. 17th Internat. Conf. Ed. Data Mining (International Educational Data Mining Society, Worcester, MA), 759–768.
  • Pardos ZA, Baker RS, San Pedro M, Gowda SM, Gowda SM (2014) Affective states and state tests: Investigating how affect and engagement during the school year predict end-of-year learning outcomes. J. Learn. Analytics 1(1):107–128.
  • Patikorn T, Baker RS, Heffernan NT (2020) ASSISTments longitudinal data mining competition special issue: A preface. J. Ed. Data Mining 12(2):i–xi.
  • Reeves KA, Hernandez-Gantes V, Centeno G, Gushi Nurnberg C (2021) Game—Constructivist exercises to enhance teaching of probability and statistics for engineers. INFORMS Trans. Ed. 22(1):55–64.
  • Sergis S, Sampson DG (2016) Learning object recommendations for teachers based on elicited ICT competence profiles. IEEE Trans. Learn. Tech. 9(1):67–80.
  • Tarus JK, Niu Z, Yousif A (2017) A hybrid knowledge-based recommender system for e-learning based on ontology and sequential pattern mining. Future Generation Comput. Systems 72:37–48.
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.
  • van de Pol J, Volman M, Beishuizen J (2010) Scaffolding in teacher–student interaction: A decade of research. Ed. Psych. Rev. 22(3):271–296.
  • Wu D, Lu J, Zhang G (2015) A fuzzy tree matching-based personalized e-learning recommender system. IEEE Trans. Fuzzy Systems 23(6):2412–2426.