We present sparse tree-based and list-based density estimation methods for binary/categorical data. Our density estimation models are higher-dimensional analogies to variable bin-width histograms. In each leaf of the tree (or list), the density is constant, similar to the flat density within the bin of a histogram. Histograms, however, cannot easily be visualized in more than two dimensions, whereas our models can. The accuracy of histograms fades as dimensions increase, whereas our models have priors that help with generalization. Our models are sparse, unlike high-dimensional fixed-bin histograms. We present three generative modeling methods, where the first one allows the user to specify the preferred number of leaves in the tree within a Bayesian prior. The second method allows the user to specify the preferred number of branches within the prior. The third method returns density lists (rather than trees) and allows the user to specify the preferred number of rules and the length of rules within the prior. The new approaches often yield a better balance between sparsity and accuracy of density estimates than other methods for this task. We present an application to crime analysis, where we estimate how unusual each type of modus operandi is for a house break-in.

History: David Martens served as senior editor for this article.

Funding: The authors acknowledge support from NIDA [Grant R01 DA054994].

Data Ethics & Reproducibility Note: There are no ethical issues with this algorithm that we are aware of. Data sets for testing the algorithm are either simulated or publicly available through the UCI Machine Learning Repository (Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository, https://archive.ics.uci.edu). The housebreak data were obtained through the Cambridge Police Department, Cambridge, MA. The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.2985251.v1 and in the e-companion to this article (available at https://doi.org/10.1287/ijds.2021.0001).

1. Introduction

A histogram is a piecewise constant density estimation model. There are good reasons that the histogram is among the first techniques taught to any student dealing with data (Chakrabarti et al. 2006); (i) histograms are easy to visualize, (ii) they are accurate as long as there are enough data in each bin, and (iii) they have a logical structure that most people find interpretable. A downside of the conventional histogram is that all of these properties fail in more than two or three dimensions, particularly for binned binary or categorical data. One cannot easily visualize a conventional higher-dimensional bar plot or histogram. For binary data, this would require us to visualize bins on a high-dimensional hypercube. Worse, there may not be enough data in each bin, so the estimates would cease to be accurate. In terms of interpretability, for a higher-dimensional histogram, a large set of logical conditions characterizing each bin ceases to be an interpretable representation of the data and can easily obscure important relationships between variables. Considering all of the marginals is often useless for binary variables because there are only two bins (zero and one). Not only do histograms become uninterpretable in high dimensions, other high-dimensional density estimation methods are also uninterpretable; flexible nonparametric approaches, such as kernel density estimation (KDE), simply produce a formula, and the estimated density landscape cannot be easily visualized without projecting it to one or two dimensions, in which case we would lose substantial information. The question we ask is how to construct a piecewise constant density estimation model (like a histogram) for categorical data that have the three properties mentioned; (i) it can be visualized, (ii) it is accurate, and (iii) it is interpretable.

In this paper, we present three methods for constructing tree- and list-based density estimation models. These types of models are alternatives to bar plots or variable bin-width histograms (e.g., see Scott 1979, Wand 1997). In our models, a leaf is analogous to a histogram bin (i.e., a probability mass function bin) and is defined by conditions on a subset of variables (e.g., if $x$ is the feature vector, the conditions can be “the second component of $x$ is zero” and “the first component of $x$ is one”), and the density is estimated to be constant with each leaf (that is, regardless of what the other components of $x$ are, the density is constant). Our approaches use only a subset of the variables to define the bins, making them more interpretable.

Our density estimation models can be useful in multiple domains to detect new patterns or errors in the data. For example, Figure 1 shows the sparse density tree for the COCO-Stuff (Caesar et al. 2018) labels. The data set contains 118,000 training samples over 91 stuff categories. Although the labels are sparse, our method finds interesting combinations, such as mirror and blanket or railing, skyscraper, and ground, that are shown in Figure 2.

Figure 1. (Color online) A Sparse Density Tree to Represent the COCO-Stuff Labels
*Notes*. Each leaf shows the density and number of training points that belong to that leaf (because densities are small for large data sets, such as COCO-stuff labels, the number of points might be easier to understand). Leaves 1, 2, and 3 are visualized in Figure 2. This tree contains 20 leaves. It took approximately 25 seconds to create the tree and around 6.4 minutes to run the validation process that tunes parameters and optimizes the tree for each parameter setting. Each run takes approximately 25 seconds; there are five repeats per parameter setting and three parameter settings.

Figure 2. (Color online) Examples of Images That Contain Labels from Leaf 1 (Panel (a)), Leaf 2 (Panel (b)), and Leaf 3 (Panel (c)) in Figure 1
*Notes*. (a) Red, cardboard; green, wall-other; blue, stairs; yellow, ground-other. (b) Magenta, railing; orange, skyscraper; yellow, ground-other. (c) Teal, blanket; aqua, mirror-stuff.

Let us give a second example, this time illustrating how each bin is modeled to be of constant density. We are interested in understanding the set of passengers on the Titanic. Specifically, our goal is to create a density model to understand how common or unusual the particular details of a passenger might be given three categorical features: passenger class (first, second, third, crew), whether someone is an adult (adult, child), and gender (male, female). A leaf (histogram bin) in our model might be if passenger class is third class and a person is child, then the probability of belonging to this leaf is $p (state)$ is 0.038. That is, the total density in the bin where these conditions hold is 0.038, so 3.8% of passengers are children traveling in third class. The density is constant within the leaf, so for an additional variable gender not in the tree (with values male and female), each of these values would be equally probable in the leaf, with probability $0.038 / 2 = 0.019$ . We described just one bin, whereas a full tree is in Figure 3.

Figure 3. (Color online) A Sparse Density Tree to Represent the Titanic Data Set
*Notes*. The probability of belonging to the leaf (P), the density (f), and the volume (V) are specified for each leaf of the sparse tree. The density is estimated to be constant within each leaf. Here, we can see that the volume times the density equals the probability to be in the leaf. We formally define probability, density, and volume in Section 3.

Each of our methods aims to globally optimize a Bayesian posterior possessing a sparsity prior, which acts as a regularization term. (Unlike past work on density trees (Ram and Gray 2011), our three methods are not constructed using greedy tree induction; they are optimized instead.) In this way, Bayesian priors control the shape of the tree or list. This helps with both generalization and interpretability. For the first of our three methods, the prior parameter controls the number of leaves; its value should be set to be the user’s desired number of leaves. For the second method, the prior controls the desired number of branches for nodes of the tree. For the third method, which creates lists (one-sided trees), the prior controls the desired number of leaves and also, the length of the leaf descriptions.

The structure of these sparse tree models aims to fix the three issues with conventional histograms: (i) visualization (we need only write down the conditions we used in the density tree (or density rule list) to visualize the model); (ii) accuracy (the prior encourages the density estimation model to be smaller, which means the bins are larger and generalize better); and (iii) interpretability (the prior encourages sparsity and encourages the model to obey a user-defined notion of interpretability).

In what follows, we provide related works, and then, we describe our three methods in Section 3. In Section 4, we discuss how we optimized the posteriors for three methods. Section 5 provides experiments and examples, Section 6 provides a run-time analysis and a study using simulated data sets. Section 7 provides a consistency proof, and we conclude in Section 8.

2. Related Work

2.1. Nonparametric Density Estimation

Density estimation is a classic topic in statistics and unsupervised machine learning. Without using domain-specific generative assumptions, the most useful techniques have been nonparametric, mainly variants of KDE (Akaike 1954, Rosenblatt et al. 1956, Parzen 1962, Cacoullos 1966, Nadaraya 1970, Rejtö and Révész 1973, Silverman 1986, Devroye 1991, Wasserman 2006, Mahapatruni and Gray 2011, Cattaneo et al. 2020, Varet et al. 2023). KDE is highly tunable, is not domain dependent, and can generalize well, but it does not have the interpretable logical structure of histograms. Similar alternatives include mixtures of Gaussians (Ormoneit and Tresp 1996, 1998; Zhuang et al. 1996; Li and Barron 2000; Chen et al. 2006; Seidl et al. 2009), forest density estimation (Liu et al. 2011), RODEO (Liu et al. 2007), and other nonparametric Bayesian methods (Müller and Quintana 2004), which have been proposed for general purpose (not interpretable per se) density estimation. Jebara (2008) provides a Bayesian treatment of latent directed graph structure for non independent and identically distributed (i.i.d.) data but does not focus on sparsity. Pólya trees have been generated probabilistically for real-valued features (Wong and Ma 2010) and could be used as priors for our method. Friedman et al. (1984) uses a projection pursuit method to perform density estimation. Another task related to density estimation is level set estimation, where the goal is to determine whether the density at a leaf is higher than a prespecified value γ; Willett and Nowak (2007) address the problem using tree representations, and Holmström et al. (2015) use a discretized kernel to construct level set trees. Some works (e.g., Sasaki and Hyvärinen 2018, Liu et al. 2021) use neural networks to perform density estimation that do not aim to be interpretable. In Luo et al. (2019), a smoothing spline is used to perform density estimation. In Rehn et al. (2018), a nonparametric density estimator called “FRONT” segments a data stream through a periodically updated linear transformation.

The most closely related paper to ours is on density estimation trees (DETs) (Ram and Gray 2011) and their extensions. DETs are constructed in a top-down greedy way. This gives them a disadvantage in optimization, often leading to lower-quality trees. They also do not have a generative interpretation, and their parameters do not have a physical meaning in terms of the shape of the trees (unlike the methods defined in this work). DET was used by Wu et al. (2018), leveraging ideas from Lu et al. (2013) with random forests to perform density estimation. DET has also been applied to high-energy physics (Anderlini 2015). Techniques to avoid overfitting in tree-based density estimation models have been discussed by Anderlini (2016). Other top-down greedy approaches (e.g., Yang and Wong 2014a, b; Li et al. 2016) use discrepancy, negative log likelihood, or MISE (Ooi 2012) as splitting criteria. A distinction between our work and existing work is that we place priors directly on the shape of the trees that we desire using a Bayesian approach. We do not rely on greedy splitting and pruning; the splits are optimized instead.

2.2. Bayesian Tree Models

Bayesian tree models are commonly used for tasks other than density estimation (i.e., classification and regression). Some examples include Bayesian CART (Wu et al. 2007), Bayesian additive regression trees (BARTs) (Chipman et al. 2010), and Bayesian rule lists (Letham et al. 2015, Yang et al. 2017). Bayesian CART and BART use priors that specify the probability that a node is terminal and a uniform probability distribution over the choices for a split. Our priors function differently. Often, we have priors over global properties of the trees, such as the number of total leaves (our method I and method III). Also, we have prior parameters governing the number of branches at a node (our method II), which is different from Bayesian CART or BART where there are only two branches at every node. Our rule list density approach (method III) has a prior on the number of conditions used in each rule, which is similar to Bayesian rule lists but not similar to Bayesian CART or BART, which have only one condition defining each split.

3. Methods

We use a Bayesian approach to achieve sparsity by introducing priors on the shape of the trees. In particular, in method I, we define a prior on the number of leaves in the tree, then calculate the likelihood of the data having been generated by a particular tree, and multiply the prior and the likelihood to create a posterior to be optimized over all trees. In method II, we instead choose a prior over the number of branches for each split in the tree, preferring a small number of branches. In method III, we switch to rule lists, where the prior prefers models with a smaller number of rules and a smaller number of conjunctions per rule.

Before the introduction of the three methods, we first present notation. We will focus on the problem of estimating the unknown distribution f with tree-structured approximations given a set of n data points $X = {x_{i}, \dots, x_{n}}$ drawn i.i.d. from f on $X \subset R^{p}$ .

For the tree-structured estimations, we express the path to a leaf as the set of conditions on each of p features along the path. For instance, for a particular leaf (leaf t in Figure 4), we might see conditions on the first feature (e.g., $x_{.1} \in {4, 5, 6}$ ) and the second feature ( $x_{.2} \in {100, 101}$ ). Thus, the leaf is defined by the set of all feature values that obey these conditions: that is, the leaf could be

x \in {x_{.1} \in {4, 5, 6}, x_{.2} \in {100, 101}, x_{.3}, x_{.4}, \dots, x_{. p} any values} .

**Figure 4. (Color online) Example of Computation of Volume**

This implies that there is no restriction on $x_{.3}, x_{.4}, \dots, x_{. p}$ for observations within the leaf. Notationally, a condition on the jth feature is denoted $x_{. j} \in σ_{j} (l)$ , where $σ_{j} (l)$ is the set of allowed values for feature j along the path to leaf l. If there are no conditions on feature j along the path to l, then $σ_{j} (l)$ includes all possible values for feature j. Thus, leaf l includes feature values x obeying:

x \in {x_{.1} \in σ_{1} (l), x_{.2} \in σ_{2} (l), \dots, x_{. p} \in σ_{p} (l)} .

For categorical data, the volume in a leaf l, which is required for computing the density, is calculated as

V_{l} = \prod_{j = 1}^{p} | σ_{j} (l) | .

(1)

We give example of this volume computation next.

The data are categorical. Possible values for $x_{.1}$ are ${1, 2, 3, 4, 5, 6, 7}$ . Possible values for $x_{.2}$ are ${100, 101, 102, 103}$ . Possible values for $x_{.3}$ are ${10, 11, 12, 13, 14, 15}$ , and those for $x_{.4}$ are ${8, 9, 10}$ . Consider the tree in Figure 4.

We compute the volume for leaf l. Here, $σ_{1} (l) = {4, 5}$ because l requires both $x_{.1} \in {4, 5, 6}$ and $x_{.1} \in {4, 5}$ . $σ_{2} (l) = {100, 101},$ $σ_{3} (l) = {10, 15},$ and $σ_{4} (l) = {8, 9, 10}$ because there is no restriction on $x_{.4}$ . So, $V_{l} = \prod_{j} | σ_{j} (l) | = 2 \cdot 2 \cdot 2 \cdot 3 = 24 .$

Our notation handles only categorical data (for ease of exposition) but can be extended to handle ordinal and continuous data. For ordinal data, the definition is the same as for categorical, but σ_j can (optionally) include only contiguous values (e.g., ${3, 4, 5}$ but not ${3, 4, 6}$ ). For continuous variables, σ_j is the “volume” of the continuous variables: for example, for node condition $x_{. j} \in (0, 0.5), σ_{j} = 0.5 - 0$ .

In the next subsections, we present the three methods: the leaf-sparse modeling approach, the branch-sparse modeling approach, and the density list approach. We define the prior and likelihood for each of the three modeling methods, combining them to get posteriors.

3.1. Method I: Leaf-Sparse Density Trees

Let us define the prior, likelihood, and posterior for method I.

3.1.1. Prior.

For this modeling method, the main prior on tree T is on the number of leaves K_T. This prior we choose to be Poisson with a particular scaling (which will make sense later on), where the Poisson is centered at a user-defined parameter λ, which is the user’s desired number of leaves in the tree prior to seeing data. Notation $N_{K_{T}}$ is the number of trees with K_T leaves. It can be calculated directly given K_T. The prior over trees is

P (K_{T} | λ) = P (K_{T} is the number of leaves in tree T | λ) \propto N_{K_{T}} \cdot Poisson (K_{T}, λ)

= N_{K_{T}} e^{- λ} \frac{λ^{K_{T}}}{K_{T}!} .

Thus, λ allows the user to control the number of leaves in the tree. The number of possible trees is finite; thus, the distribution can be trivially normalized. Among trees with K_T leaves, tree T is chosen uniformly, with probability $1 / N_{K_{T}}$ . This means that the probability to choose a particular tree T is Poisson:

\begin{array}{l} P (T | λ) \propto P (T | K_{T}) P (K_{T} | λ) \propto \frac{1}{N_{K_{T}}} N_{K_{T}} e^{- λ} \frac{λ^{K_{T}}}{K_{T}!} = e^{- λ} \frac{λ^{K_{T}}}{K_{T}!} \\ \propto Poisson (K_{T}, λ) . \end{array}

We place a uniform prior over the probabilities for a data point to land in each of the leaves. To do this, we start from a Dirichlet distribution with equal parameters $α_{1} = \dots = α_{K_{T}} = α \in Z^{+}$ , where $α > 1$ . The hyperparameter α is a pseudocount that is typically chosen to be a small number (one or two) to avoid a zero value for the estimated densities. We denote the vector with K_T equal entries $[α, \dots, α]$ as $α_{K_{T}}$ . We draw multinomial parameters $θ = [θ_{1}, \dots, θ_{K_{T}}]$ from Dir $(α_{K_{T}})$ , which govern the prior on the popularity of each leaf. Thus, the prior on feature values given the tree structure is

p (θ | α_{K_{T}}, T) = \frac{1}{Beta (α_{K_{T}})} \prod_{l = 1}^{K_{T}} θ_{l}^{α - 1} .

Here, $Beta (α_{K_{T}}) = \frac{\prod_{l = 1}^{K_{T}} Γ (α_{K_{T}, l})}{Γ (\sum_{l = 1}^{K_{T}} α_{K_{T}, l})} = \frac{{(Γ (α))}^{K_{T}}}{Γ (K_{T} α)}$ is the multinomial beta function, which is also the normalizing constant for the Dirichlet distribution.

Thus, the first part of our generative modeling method is as follows given hyperparameters λ and α. The number of leaves in T is $K_{T} \propto scaled Poisson (λ) (i . e ., N_{K_{T}} \times Poisson (K_{T}, λ)$ ), the tree shape is T $\propto Uniform over trees with K_{T} leaves$ , and $Prior distribution over leaves : θ \propto Dir (α_{K_{T}}) .$ As usual, the prior is regularization that can be overwhelmed with a large amount of data. However, it takes a substantial amount of data and an enormous number of feature-value pairs to reduce the prior’s influence on the posterior.

3.1.2. Likelihood.

Here, we calculate the likelihood of the data to have arisen from a particular tree. Let n_l denote the number of points captured by the lth leaf, and denote $V_{l}$ as the volume of that leaf, defined in (1). The probability to land at any specific value within leaf l is $\frac{θ_{l}}{V_{l}}$ . The likelihood for the full data set is thus

P (X | θ, T) = \prod_{l = 1}^{K_{T}} {(\frac{θ_{l}}{V_{l}})}^{n_{l}} .

3.1.3. Posterior.

The posterior, which is (by definition) the likelihood times the prior, can be written as follows, where we have substituted the distributions from the prior into the formulas:

\begin{array}{l} P (T | λ, α, X) \propto \int_{θ : simplex} P (K_{T} | λ) \cdot P (T | K_{T}) \cdot P (θ | α_{K_{T}}, T) \cdot P (X | θ, T) d θ \\ i . e ., \int_{θ : simplex} (size) \cdot (tree shape | size) \cdot (leaf popularity) \cdot \\ (data | tree shape, leaf popularity) d θ \\ \propto \int_{θ : simplex} P (T | λ) [\frac{1}{Beta (α_{K_{T}})} (\prod_{l = 1}^{K_{T}} θ_{l}^{α - 1})] [\prod_{l = 1}^{K_{T}} {(\frac{θ_{l}}{V_{l}})}^{n_{l}}] d θ \\ \propto P (T | λ) \frac{1}{Beta (α_{K_{T}})} (\prod_{l = 1}^{K_{T}} {(\frac{1}{V_{l}})}^{n_{l}}) \int_{θ : simplex} \prod_{l = 1}^{K_{T}} θ^{n_{l} + α - 1} d θ \\ \propto P (T | λ) \frac{Beta ([n_{1} + α, \dots, n_{K_{T}} + α])}{Beta (α_{K_{T}})} \prod_{l = 1}^{K_{T}} \frac{1}{V_{l}^{n_{l}}} \\ \propto P (T | λ) \frac{Γ (K_{T} α)}{Γ (n + K_{T} α)} \prod_{l = 1}^{K_{T}} \frac{(n_{l} + α - 1)!}{(α - 1)!} V_{l}^{- n_{l}}, \end{array}

where

P (T | λ)

is simply Poisson(

K_{T}, λ

) as discussed earlier. For numerical stability, we maximize the log posterior, which is equivalent to maximizing the posterior.

As compared with full Bayesian approaches, by maximizing the posterior, we leverage relatively faster computation time and optimize for a single model, which can be important for interpretability. However, in turn, we miss out on uncertainty information, which we could get by modeling the posterior density.

For the purposes of prediction, we are required to estimate the density that is being assigned to leaf l. This is calibrated to the data simply as $\hat{f} = n_{l} / (n V_{l})$ , where n is the total number of training data points and n_l is the number of training data points that reside in leaf l. The formula implicitly states that the density in the leaf is uniformly distributed over the features whose values are undetermined within the leaf (features for which σ_j contains all possible values for feature j). That is, the probability mass function is the same for any point within the leaf.

3.2. Method II: Branch-Sparse Density Trees

In method I, we regularized the number of leaves of the density tree. In method II, we instead regularize the number of branches at each node to simplify the model by avoiding models with too many branches. In other words, method I aims to have a small number of leaves. Method II aims to have a small number of branches.

In method I, a Dirichlet distribution is drawn only over the leaves. In method II, a Dirichlet distribution is drawn at every internal node to determine branching. Similar to the previous method, we choose the tree that optimizes the posterior.

The prior is composed of two pieces: the part that creates the tree structure and the part that determines how data propagate through it.

3.2.1. Tree Structure Prior.

For tree T, we let $B_{T} = {b_{i} | i \in I}$ be a multiset, where each element is the count of branches from a node of the tree. For instance, if in tree T with three nodes, the three nodes have three branches, two branches, and two branches, respectively, then $B_{T} = {3, 2, 2}$ . We let $N_{B_{T}}$ denote the number of trees with the same multiset B_T. Note that B_T is unordered, so {3, 2, 2} is the same multiset as {2, 3, 2} or {2, 2, 3}.

Let I denote the set of internal nodes of tree T, and let L denote the set of leaves. As before, we let $V_{l}$ denote the volume of leaf l.

In the generative model for the branch-sparse density tree, a Poisson distribution with parameter λ is used at each internal node in a top-down fashion to determine the number of branches. Iteratively, for node i, the number of branches, b_i, obeys $b_{i} \sim Poisson (λ),$ where the parameter λ is the desired number of branches (before seeing data). Hence, at any node i, with probability $\exp (- λ) \frac{λ^{b_{i}}}{b_{i}!}$ , there are b_i branches from node i. This implies that with probability $\exp (- λ)$ , the node is a leaf. In summary,

P (Multiset of branches = B | λ) \propto N_{B} [\prod_{i \in I} e^{- λ} \frac{λ^{b_{i}}}{b_{i}!}] [\prod_{l \in L} e^{- λ}] .

Among trees with multiset B, tree T is chosen uniformly, with probability $\frac{1}{N_{B}} .$ This means that the probability to choose a particular tree shape is

\begin{array}{l} P (T | λ) & \propto P (T | B_{T}) P (B_{T} | λ) \propto \frac{1}{N_{B_{T}}} N_{B_{T}} [\prod_{i \in I} e^{- λ} \frac{λ^{b_{i}}}{b_{i}!}] [\prod_{l \in L} e^{- λ}] \\ = [\prod_{i \in I} e^{- λ} \frac{λ^{b_{i}}}{b_{i}!}] [\prod_{l \in L} e^{- λ}] . \end{array}

(2)

3.2.2. Tree Propagation Prior.

After the tree structure is determined, we need a generative process for how the data propagate through each internal node. We denote θ_l as the probability to land in leaf l. We denote ${\tilde{θ}}_{i j}$ as the probability to traverse to node j from internal node i. Notation $θ$ is the vector of leaf probabilities (the θ_l’s), $\tilde{θ}$ is the set of all ${\tilde{θ}}_{i j}$ ’s, and ${\tilde{θ}}_{i}$ is the set of all internal node transition probabilities from node i (the ${\tilde{θ}}_{i j}$ ’s).

We compute $P (\tilde{θ_{i}} | α, T)$ for all internal nodes i of tree T. As before, α is a pseudocount to avoid zero-valued estimated densities. At each internal node, we draw a sample from a Dirichlet distribution with parameter $[α, \dots, α]$ (of size equal to the number of branches b_i of i) to determine the proportion of data, ${\tilde{θ}}_{i, j}$ , that should go along the branch leading to each child node j from the internal parent node i. Thus, $\tilde{θ_{i}} \sim Dir (α)$ for each internal node i: that is,

P ({\tilde{θ}}_{i} | α, T) = \frac{1}{{Beta}_{b_{i}} (α)} \prod_{j \in C_{i}} {\tilde{θ}}_{i j}^{α - 1},

where

{Beta}_{k} (α)

is the normalizing constant for the Dirichlet distribution with parameter

α

and k categories and C_i are the indices of the children of i. Thus,

P (\tilde{θ} | α, T) = \prod_{i} P ({\tilde{θ}}_{i} | α, T) = \prod_{i} \frac{1}{{Beta}_{b_{i}} (α)} \prod_{j \in C_{i}} {\tilde{θ}}_{i j}^{α - 1} .

(3)

Thus, the prior is $P (T | λ) \cdot P (\tilde{θ} | α, T)$ , where $P (T | λ)$ is in (2) and $P (\tilde{θ} | α, T)$ is in (3).

In summary, the prior of method II is as follows given hyperparameters λ and α:

multiset of branches $B_{T} \propto N_{B_{T}} [\prod_{i \in I} e^{- λ} \frac{λ^{b_{i}}}{b_{i}!}] [\prod_{l \in L} e^{- λ}],$
tree shape $T \sim Uniform over trees with branches B_{T},$ and
prior distribution over each branch ${\tilde{θ}}_{i} \sim Dir (α) .$

The likelihood is the same as that for method I.

3.2.3. Posterior.

The posterior is proportional to the prior times the likelihood terms. Here, we are integrating over the ${\tilde{θ}}_{i}$ terms for each of the internal nodes i:

\begin{array}{l} P (T | λ, α, X) \\ \propto \int P (B_{T} | λ) \cdot P (T | B_{T}) \cdot P (\tilde{θ} | α, T) \cdot P (X | \tilde{θ}, T) d \tilde{θ} \\ \propto [\prod_{l \in L} (\frac{e^{- λ}}{V_{l}^{n_{l}}})] (\prod_{i \in I} e^{- λ} \frac{λ^{b_{i}}}{b_{i}!} \frac{1}{{Beta}_{b_{i}} ([α, \dots, α])} \times \int_{{\tilde{θ}}_{i} \in simplex} \prod_{c \in C_{i}} {\tilde{θ}}_{i, c}^{α - 1} {\tilde{θ}}_{i, c}^{n_{c}} d {\tilde{θ}}_{i}) \\ = e^{- λ (| I | + | L |)} (\prod_{i \in I} \frac{λ^{b_{i}}}{b_{i}!} \frac{1}{{Beta}_{b_{i}} ([α, \dots, α])} \times \int_{{\tilde{θ}}_{i} \in simplex} \prod_{c \in C_{i}} {\tilde{θ}}_{i, c}^{n_{c} + α - 1} d {\tilde{θ}}_{i}) \prod_{l \in L} (\frac{1}{V_{l}^{n_{l}}}) \\ = e^{- λ (| I | + | L |)} λ^{\sum_{i \in I} b_{i}} (\prod_{i \in I} \frac{1}{b_{i}!} \frac{{Beta}_{b_{i}} ([α + n_{c_{1}}, \dots, α + n_{c_{b_{i}}}])}{{Beta}_{b_{i}} ([α, \dots, α])}) \times \prod_{l \in L} (\frac{1}{V_{l}^{n_{l}}}) \\ = e^{- λ (| I | + | L |)} λ^{| L | + | I | - 1} (\prod_{i \in I} \frac{1}{b_{i}!} \frac{B_{b_{i}} ([α + n_{c_{1}}, \dots, α + n_{c_{b_{i}}}])}{{Beta}_{b_{i}} ([α, \dots, α])}) \times \prod_{l \in L} (\frac{1}{V_{l}^{n_{l}}}), \end{array}

where

c_{1}, \dots, c_{b_{i}} \in C_{i}

in the second to last expression. We used the equation

\sum_{i \in I} b_{i} = | L | + | I | - 1

for a tree in the last line.

3.2.4. Possible Extension.

We can include an upper layer of the hierarchical Bayesian model to control (regularize) the number of features d (of a total of p dimensions) that are used in the model. This would introduce an extra multiplicative factor within the posterior of $(\begin{matrix} p \\ d \end{matrix}) γ^{d} {(1 - γ)}^{p - d}$ , where γ is a parameter between zero and one and a smaller value favors a simpler model. γ corresponds to the probability that a feature is chosen to be included in the model. For example, the value 0.5 corresponds to the case where the user prefers to have half of the features to be chosen. The posterior would become

\begin{array}{l} (\begin{matrix} p \\ d \end{matrix}) γ^{d} {(1 - γ)}^{p - d} e^{- λ (| I | + | L |)} λ^{| I | + | L | - 1} \\ (\prod_{i \in I} \frac{1}{b_{i}!} \frac{{Beta}_{b_{i}} ([α + n_{c_{1}}, \dots, α + n_{c_{b_{i}}}])}{{Beta}_{b_{i}} ([α, \dots, α])}) \prod_{l \in L} (\frac{1}{V_{l}^{n_{l}}}) . \end{array}

3.3. Method III: Sparse Density Rule List

Rather than producing a general tree, an alternative approach is to produce a rule list, which is a one-sided tree. Rule lists are easier to optimize than trees. Each tree can be expressed as a rule list by creating a rule for each leaf, where the conditions defining the leaf also define the rule. By using lists, we implicitly hypothesize that the full space of trees may not be necessary and that simpler rule lists may suffice.

An example of a sparse density rule list is as follows: if x obeys a₁then density $(x) = f_{1}$ else if x obeys a₂then density $(x) = f_{2}$ $\dots$ else if x obeys a_m then density $(x) = f_{m}$ else density $(x) = f_{0}$ .

Here, as with the trees, the density is the probability mass function, which is constant for the entire portion of the feature space that falls into the leaf.

The antecedents a₁,…,a_m are Boolean assertions that are either true or false for each data point x_i. They are chosen from a large premined collection of possible antecedents called A. We define A to be the set of all possible antecedents of size at most H, where the user chooses H. The size of A is $| A | = \sum_{j = 0}^{H} A_{j},$ where A_j is the number of antecedents of size j:

\begin{array}{l} A_{j} = \sum_{[\begin{matrix} t_{1}, t_{2}, \dots, t_{j} \in {1, \dots, p} \\ s . t . t_{1} > t_{2} \dots > t_{j} \end{matrix}]} \prod_{i = 1}^{j} q_{t_{i}}, \end{array}

where feature i consists of q_i categories.

For example, say the features consist of two, three, four, and five categories, respectively. If H = 2, then the total number of elements of A is $| A | =$ 1 (no feature is chosen) + 2 (because there are two categories for the first feature) + 3 (because there are three categories for the second feature) + 4 (because there are four categories from the third feature) + 4 (because there are five for the fourth feature) + six (possible combinations of feature 1 and feature 2) + $\dots$ + 20 (possible combinations of feature 3 and feature 4).

3.3.1. Generative Process.

We now sketch the generative process for the tree from the observations $X = {x_{i}}$ and antecedents $A = {a_{j}}$ . Prior parameter λ is the user’s preference of the length of the density list (in the absence of data), and η is the user’s preference for the number of conjunctions in each subrule a_j.

Define $a_{< j}$ as the antecedents before j in the rule list if there are any. For example $a_{< 3} = {a_{1}, a_{2}}$ . Similarly, let c_j be the cardinalities of the antecedents before j in the rule list. Let d denote the rule list. Following the exposition of Letham et al. (2015), we use a prior over rule lists to encourage sparsity. The generative process is described in Algorithm 1. It depends on the input prior parameters $λ, η$ , and $α$ , which is a user-chosen vector of size m + 1 (as before, usually all elements in $α$ are the same and indicate pseudocounts).

Algorithm 1

(Density Rule Lists Generation Procedure)

Input: Prior parameters λ and η, pseudocount $α$ , observations $X = {x_{i}}$ , antecedents $A = {a_{j}}$

Output: Density rule list

1: Sample a decision list length $m \sim P (m | A, λ)$
2: for decision list rule $j = 1, \dots, m$ do
3: Sample the cardinality of antecedent a_j in d as $c_{j} \sim P (c_{j} | c_{< j}, A, η)$ .
4: Sample a_j of cardinality c_j from $P (a_{j} | a_{< j}, c_{j}, A)$ .
5: end for
6: for observation $i = 1, \dots, n$ do
7: Find the antecedent a_j in d that is the first that applies to x_i.
8: If no antecedents in d applies, set j = 0.
9: end for
10: Sample parameter $θ \sim$ Dirichlet ( $α$ ) for the probability to be in each of the leaves
11: for each leaf l in the rule list do
12: Compute volume V_l according to (1)
13: Compute density $f_{l} = \frac{θ_{l}}{V_{l}}$
14: end for

For instance, in step 1 of Algorithm 1’s generation process, we might find out that the rule list is of size m = 4. Then, in steps 2–5, we sample the cardinality of each rule, so we may find that the four rules are of cardinality $2, 1, 2, 3$ . Then, in steps 6–9, we sample to determine which rules are actually used; for instance, the first rule might be $a_{1} : {x_{14} = blue & x_{21} = large}$ , the second rule could be $a_{2} : {x_{52} = feathered}$ , the third rule might be $a_{3} : {x_{13} = 10,110 & x_{3} = angled}$ , and the fourth rule might be $a_{4} : {x_{55} = 000 & x_{8} = hypothetical & x_{21} = small}$ . The rest of the feature space (that does not fall into any of these rules) would go to the default rule, again having constant density within the rule.

3.3.2. Prior.

The distribution of m in step 1 is the Poisson distribution truncated at the total number of preselected antecedents:

P (m | A, λ) = \frac{λ^{m} / m!}{\sum_{j = 0}^{| A |} (λ^{j} / j!)}, m = 0, \dots, | A | .

When $| A |$ is huge, we can use the approximation $P (m | A, λ) \approx λ^{m} / m!$ as the denominator would be close to one.

For step 2, we let $R_{j} (c_{1}, \dots, c_{j}, A)$ be the set of antecedent cardinalities that are available after drawing antecedent j, and we let $P (c_{j} | c_{< j}, A, η)$ be a Poisson truncated to remove values for which no rules are available with that cardinality:

P (c_{j} | c_{< j}, A, η) = \frac{(η^{c_{j}} / c_{j}!)}{\sum_{k \in R_{j - 1} (c_{< j}, A)} (η^{k} / k!)}, c_{j} \in R_{j - 1} (c_{< j}, A) .

We use a uniform distribution over antecedents in A of size c_j excluding those in a_j:

P (a_{j} | a_{< j}, c_{j}, A) \propto 1, a_{j} \in {a \in A \ a_{< k} : | a | = c_{j}} .

The sparse prior for the antecedent lists is thus

P (d | A, λ, η) = P (m | A, λ) \cdot \prod_{j = 1}^{m} P (c_{j} | c_{< j}, A, η) \cdot P (a_{j} | a_{< j}, c_{j}, A) .

The prior distribution over the leaves $θ = [θ_{1}, \dots, θ_{m}, θ_{0}]$ is drawn from Dir( $α_{m + 1}$ ):

P (θ | α) = \frac{1}{{Beta}_{m + 1} ([α, \dots, α])} \prod_{l = 0}^{m} θ_{l}^{α - 1} .

It is straightforward to sample an ordered antecedent list d from the prior by following the generative process that we just specified, generating rules from the top down.

3.3.3. Likelihood.

Similar to method I, the probability to land at any specific value within leaf l is $\frac{θ_{l}}{V_{l}}$ . Hence, the likelihood for the full data set is thus $P (X | θ, d) = \prod_{l = 0}^{m} {(\frac{θ_{l}}{V_{l}})}^{n_{l}} .$

3.3.4. Posterior.

The posterior can be written as

\begin{array}{l} P (d | A, λ, η, α, X) \propto \int_{θ \in simplex} P (d | A, λ, η) \cdot P (θ | α) \cdot P (X | θ, d) d θ \\ = P (d | A, λ, η) \int_{θ \in simplex} \frac{1}{{Beta}_{m + 1} ([α, \dots, α])} \times \prod_{l = 0}^{m} θ_{l}^{α - 1} {(\frac{θ_{l}}{V_{l}})}^{n_{l}} d θ \\ = P (d | A, λ, η) \frac{\prod_{l = 0}^{m} Γ (n_{l} + α) V_{l}^{- n_{l}}}{Γ (\sum_{l = 0}^{m} (n_{l} + α))}, \end{array}

where the last equality uses the standard Dirichlet-multinomial distribution derivation.

4. Numerical Methods to Optimize the Objective Function

In the previous section, we presented the posterior functions for three generative modeling methods. Because the search space of our problems is large, we use simulated annealing, a metaheuristic optimization algorithm that allows us to approximate global solutions. More specifically, in this section we describe a simulated annealing scheme that we implemented in order to find the optimal tree that maximizes the posterior for method I and method II as well as discuss method III’s optimization details.

4.1. Simulated Annealing for Tree-Based Methods

A successful simulated annealing scheme requires us to create a useful definition of a neighborhood. We define our neighborhood such that each move explores a neighboring tree where we are able to extend or shrink the tree.

At each iteration, we need to determine which neighboring tree to move to. To decide which neighbor to move to, we fix a parameter $ϵ > 0$ beforehand, where ϵ is small (approximately 0.01 in our experiments). The parameter ϵ is the probability that we will perform a structural change to jump out of a possible local minimum. All other actions are taken with equal probability. Thus, at each time, we generate a number from the uniform distribution on (0, 1). Then, we do the following.

(Shrink at leaf) If the number is smaller than $\frac{1 - ϵ}{4}$ , we select uniformly at random a “parent” node that has leaves as its children and remove its children. This is always possible unless the tree is the root node itself, in which case we cannot remove it and this step is skipped.
(Expand) If the random number is between $\frac{1 - ϵ}{4}$ and $\frac{1 - ϵ}{2}$ , we pick a leaf randomly and a feature randomly. If it is possible to split on that feature, then we create children for that leaf. (If the feature has been used up by the leaf’s ancestors, we cannot split, and we then skip this round.)
(Regroup) If the random number is between $\frac{1 - ϵ}{2}$ and $\frac{3 (1 - ϵ)}{4}$ , we pick a node randomly, delete its descendants, and split the node, creating two child nodes where the splitting is done on subsets of the node’s feature values. Sometimes this is not possible: for example, if we pick a node where all the features have been used up by the node’s ancestors or if the node has only one category. In that case, we skip this step.
(Merge sibling nodes) If the random number is between $\frac{3 (1 - ϵ)}{4}$ and $(1 - ϵ)$ , we choose two nodes that share a common parent, delete all their descendants, and merge the two nodes (e.g., black, white, red, and green become black or white, red, and green).
(Structural change) If the random number is more than $1 - ϵ$ , we perform a structural change operation, where we remove all the children of a randomly chosen node of the tree.

See Algorithm 2 for the pseudocode of the simulated annealing procedure. The last three actions avoid problems with local minima. The algorithms can be warm started using solutions from other algorithms (e.g., DET trees). We found it useful to occasionally reset to the best tree encountered so far or the trivial root node tree.

Algorithm 2

(Simulated Annealing for Tree-Based Methods)

Input: Prior parameters, θ, maximum number of iterations, N, ϵ, cooling schedule Cool(iteration) for the simulated annealing

Output: Optimal density tree

1: Initialize the initial tree T to be a single node, compute the posterior using the objective function and the prior parameters. Set iteration number to be zero.
2: while iteration number $< N$ do
3: Draw a random number r, from Uni(0, 1).
4: if $r < \frac{1 - ϵ}{4}$ then
5: Perform shrink at leaf operation.
6: else if $r < \frac{2 (1 - ϵ)}{4}$ then
7: Perform expand operation.
8: else if $r < \frac{3 (1 - ϵ)}{4}$ then
9: Perform regroup operation.
10: else if $r < 1 - ϵ$ then
11: Perform merge sibling nodes operation.
12: else
13: Perform structural change.
14: end if
15: Compute the objective value for modified tree.
16: if a better objective value is obtained than current best then
17: Update T to be the current tree.
18: end if
19: if the current tree is worse than the current best tree T then
20: With probability defined by the cooling schedule Cool(iteration), update T to be the current tree. This will always be a small probability.
21: end if
22: end while
23: Return T

4.2. Sparse Density Rule List Optimization

To search for optimal density rule lists that fit the data, we use local moves (adding rules, removing rules, and swapping rules) and use the Gelman–Rubin convergence diagnostic applied to the log posterior function.

A technical challenge that we need to address in our problem is the computation of the volume of a leaf. Volume computation is not needed in the construction of a decision list classifier like that of Letham et al. (2015), but it is needed in the computation of density list. There are multiple ways to compute the volume of a leaf of a density rule list.

4.2.1. Approach 1: Analytical Computation.

Use the inclusion-exclusion principle to directly compute the volume of each leaf. Consider computing the volume of the ith leaf in a density rule list. Let $V_{a_{i}}$ denote the volume induced by the rule a_i: that is, the number of points in the domain that satisfy a_i. To belong to that leaf, a data point has to satisfy a_i and not earlier rules $a_{< i}$ . Hence, the volume of the ith leaf is equal to the volume obeying a_i alone minus the volume that has been used by earlier rules. Using notation $a_{k}^{c}$ to denote the complement of condition a_k, we have the following:

\begin{array}{l} V_{i} & = V_{a_{i} \land ⋀_{k = 1}^{i - 1} a_{k}^{c}} (i is in a_{i} and in the complement of all previous rules) \\ = V_{a_{i}} - V_{a_{i} \land (⋁_{k = 1}^{i - 1} a_{k})} \\ = V_{a_{i}} - V_{(⋁_{k = 1}^{i - 1} a_{i} \land a_{k})} \\ = V_{a_{i}} - \sum_{k = 1}^{i - 1} {(- 1)}^{k + 1} \sum_{1 \leq j_{1} \leq \dots j_{k} \leq n} V_{a_{i} \land a_{j_{1}} \dots \land a_{j_{k}}}, \end{array}

(4)

where the last expression is because of the inclusion-exclusion principle, and it only involves the volume resulting from conjunctions. The volume resulting from conjunctions can be easily computed from data. Without loss of generality, suppose we want to compute the volume of

V_{a_{1} \land \dots \land a_{k}}

. For each feature that appears, we examine if there is any contradiction between the rules; for example, if feature 1 is present in both a₁ and a₂ where rule a₁ specifies feature 1 to be zero, whereas a₂ specifies feature 1 to be one, then we have found a contradiction, and the volume of the intersection of a₁ and a₂ should be zero. If there is no contradiction, then the volume is equal to the product of the number of distinct categories of all the features that are not used. If all features are used, then the volume is one. By using the inclusion-exclusion principle, we reduce the problem to just computing a volume of conjunctions as in (4). Note that for this approach, we still need to iterate over all conjunctions for each volume computation, which can be computationally expensive.

4.2.2. Approach 2: Uniform Sampling.

Create uniform data over the whole domain, and count the number of points that satisfy the antecedents. This approach would be expensive when the domain is huge but easy to implement for smaller problems.

4.2.3. Approach 3: MCMC.

Use an Markov chain Monte Carlo (MCMC) sampling approach to sample the whole domain space. This approach is again not practical when the domain size is huge as the number of samples required will increase exponentially because of the curse of dimensionality.

We use the analytical computation approach 1 in our implementation.

Some works (Angelino et al. 2018, Rudin and Ertekin 2018) have achieved provable optimality on minimization of objectives over a set of premined rules, although for supervised classification (not density estimation). They have also noted that randomized methods, such as those of Yang et al. (2017) (or those considered in the present work), tend to produce models that are close to these optimal solutions, leading us to believe that our search methods may actually achieve close-to-optimal solutions fairly often, particularly for problems where there may be a large “Rashomon” set of good models (Fisher et al. 2019, Semenova et al. 2022). None of these earlier works consider density estimation, but nonetheless, we have reason to hypothesize that simulated would also yield near-optimal models for density estimation.

5. Experiments

Our experimental setup is as follows. We considered five methods: the leaf-sparse density tree, the branch-sparse density tree, the sparse density rule list, regular histograms, and DETs (Wu et al. 2018). Note that the implementation of DET is meant for continuous variables, but we use it anyway for comparison. To our knowledge, these methods can serve to represent the full set of logic-based, high-dimensional density estimation methods. To ascertain uncertainty, we split the data in five folds and assessed test log likelihood (i.e., out-of-sample performance) and sparsity of the trees for each method on every fold. A model with fewer bins and higher test likelihood is a better model.

Let us discuss how the baselines were implemented. For the standard high-dimensional histogram baseline, we treated each possible set of feature values (e.g., $x_{.1} = 1, x_{.2} = 0, \dots, x_{.10} = 1$ ) as a separate bin. (We call a set of feature values a configuration; it is a point in our feature space.) Histograms have the disadvantage that they create a large number of bins and thus, may not generalize well to the test set; they are also not interpretable because they cannot be visualized in a tree or list.

DET was designed for continuous data, which meant that the computation of volume needed to be adapted for discrete data; it is the number of configurations in the bin rather than the lengths of each bin multiplied together. Thus, we used our own computations for the density in each leaf following volume computations in (1).

The DET method has two parameters: the minimum allowable support in a leaf and the maximum allowable size of a leaf. We originally planned to use a minimum of zero and a maximum of the size of the full data set, but the algorithm often produced trivial models when we did this (i.e., models with one leaf). Therefore, we tried values ${1, 3, 5}$ for the minimum size of a leaf and ${5, 10, 50, 100, ⌊ \frac{n}{10} ⌋, ⌊ \frac{n}{5} ⌋}$ for the maximum size of a leaf, where n is the number of training data points. We use nested crossvalidation over five folds. For each fold, we optimize parameters for validation log likelihood and report out-of-distribution log likelihood. DET has the disadvantage of being a greedy algorithm, and the available implementation of DET is not designed for categorical data (see Appendix B); thus, DET may not produce trees that are as useful or sparse as those from method I, II, or III.

For the data sets in the following two sections for the leaf-sparse density tree model (method I), the mean of the Poisson prior was chosen from the set ${5, 8, 10}$ using nested crossvalidation. For the branch-sparse density tree model (method II), the parameter to control the number of branches was chosen from the set ${2, 3}$ . The parameter α was set to be two for the experiment. This corresponds to a pseudocount of two data points in each bin (to prevent bins with zero data points). For the sparse density rule list (method III), the parameters $λ, η$ , and α were chosen among $[3, 5, 7], 1$ , and 1, respectively. We provide a summary of parameters and their suggested values in Appendix D.

5.1. Titanic Data Set

As discussed earlier, a sparse density tree or list would help us understand the distribution of people on board the Titanic. The Titanic data set has an observation for each of the 2,201 people aboard the Titanic. There are three features: gender, whether someone is an adult, and the class of the passenger (first class, second class, third class, or crew member) (Figures 5 and 6).

Figure 5. (Color online) Density Rule Lists with Different Parameter λ That Indicates Preferred List Length
*Notes*. λ is set to two for panel (a), four for panel (b), and seven for panel (c). Parameters η and α were chosen as two and one, respectively. These density lists were chosen based on the maximum log likelihood over five repeats. Each arrow represents an “else if” statement (e.g., for panel (b), if the passenger is adult and female, then density is constant with respect to other variables at 0.0491; else if the passenger is a child and male, density is constant at 0.0072, etc.).

**Figure 6. (Color online) Performance Comparison Between Our Methods and Baselines for the Titanic (Panel (a)) and Crime (Panel (b)) Data Sets**

Figure 6 shows the results, both for out-of-sample log likelihood (on the y axis) and sparsity (on the x axis), for each method for each of the five test folds. The histogram method had the most leaves (by design) and thus, tended to overfit. Our methods performed well; arguably, the sparse density rule list method performed slightly better in the likelihood-sparsity trade-off. In general, the results are consistent across folds; the histogram produces too many bins, the sparse density rule list and density tree methods performs well, and DET has worse log likelihood.

Figure 3 shows one of the density models generated by the leaf-sparse density tree method. The reason for the split is clear; there were fewer children than adults, the distributions of the males and females were different (mainly because of the fact that the crew was mostly male), and the number of crew members was very different than the numbers of first, second, and third class passengers.

Figure 5 shows density rule lists for the Titanic data set by choosing parameter λ, which indicates preferred list length, from set {2, 4, 7}; we change the density rule lists as well as its length. Lower values of the parameter lead to shorter density rule lists, whereas larger preferred lengths correspond to longer density rule lists.

5.2. Crime Data Set

The house break-in data used for this experiment are from the Cambridge Police Department, Cambridge, Massachusetts. The motivation is to understand the common types of modus operandi characterizing house break-ins, which is important in crime analysis. The data consist of 3,739 separate house break-ins occurring in Cambridge between 1997 and 2012 inclusive. The six categorical features for the crime data set are as follows: (1) location of entry (“window,” “door,” “wall,” and “basement”); (2) means of entry (“forceful” (cut, broke, cut screen, etc.), “open area,” “picked lock,” “unlocked,” and “other”); (3) whether the resident was inside; (4) whether the premise was judged to be ransacked by the reporting officer; (5) whether the entry happened on a “weekday” or “weekend”; and (6) type of premise. The first category is “residence” (including apartment, residence/unknown, dormitory, single-family house, two-family house, garage (personal), porch, apartment hallway, residence unknown, apartment basement, condominium). The second category is nonmedical, nonreligious “workplace” (commercial unknown, accounting firm, research, school). The third group “medical” consists of halfway houses, nursing homes, medical buildings, and assisted living. The fourth group “parking” consists of parking lots and parking garages, and the fifth group “social” consists of Young Women’s Christian Associations, Young Men’s Christian Associations, and social clubs. The last groups are “storage,” “construction site,” “street,” and “church,” respectively.

The experiments in Figure 6(b) show that our approaches dominate DET for the crime data set and are sparser than DET trees. The standard histogram’s results were not reported because they involve too many bins (1,440) to fit on the figure, and they are thus not competitive.

One of the trees obtained from the leaf-sparse density tree method is in Figure 7. It states that most burglaries happen at residences; the nonresidential density has values less than 2 × 10⁻⁴. Given that a crime scene is a residence, most crimes happened on weekends. For residential crimes, burglary is more likely to happen when the owner is not inside (density 0.0046 if weekday and 0.2 if weekend, the premise is judged to be not ransacked, and there is forceful entry through the door or window). When the premise is judged to be ransacked, the crime is more likely to happen with the door as the location of entry (density 0.0042) compared with wall, window, and basement (density 4.72 × 10⁻⁴). On weekends, for residential and not-ransacked premises, doors and windows are more common locations of entry. If the entry is not forceful, unlocked windows and doors are the most common means of entry (density is 0.0361). If the means of entry is picked lock, the density is 0.0011, and if the area is open, the density is 0.0068.

Figure 7. (Color online) Leaf-Sparse Density Tree Representing the Crime Data Set
*Notes*. Density is constant in each leaf. Different node colors represent different features in the data set. This tree contains 20 leaves. It took around 1.4 seconds to create the tree and around 4 seconds to run the validation process.

Important aspects of the modus operandi are within the leaves of the tree: for instance, that the owner of a residence was not inside, that the house was not ransacked, and that the entry was forceful through the door or window. If this approach would have been performed using a regular histogram, it would require 1,440 different markers (discrete states), whereas the crime tree groups the crimes into just 20 bins.

These types of results can be useful for crime analysts to assess whether a particular modus operandi is unusual. For instance, according to the tree, it is clearly more unusual for the owner to be inside during the break-in, as shown by the smaller density values in the leaves when the owner is inside. Also, according to the densities in the leaves, it is more common for the means of entry to be forceful and for the location of entry to be windows and doors. A density list for these data is presented in Figure 8. The preferred length of the list was chosen from the set {3, 5, 7}.

Figure 8. (Color online) List Representing the Crime Data Set
*Note*. Each arrow represent an “else if” statement.

6. Empirical Performance Analysis

The experiments are designed to provide insight into how the methods operate.

6.1. Run-Time Analysis

We studied the performance of the density estimation methods on data sets of different complexity. We chose 17 data sets, including financial data sets (Bank-Full, Telco Customer Churn, home equity line of credit (HELOC)), recidivism risk score data (COMPAS) (Larson et al. 2016), UCI repository data sets (e.g., Car, Mushroom, U.S. Census data) (Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository, https://archive.ics.uci.edu), and detection labels of COCO-Stuff + thing data. The complexity of the data set, in this case, is defined based on the number of samples and/or the number of feature-value pairs (the sum of categories for each feature). For the considered data sets, the number of samples ranged from 625 to around 2.5 million. The number of feature-value pairs ranged from 9 to around 400, and there were from 3 to 68 features. See Table C.1 in Appendix C for data set statistics and the preprocessing steps taken.

For data sets with fewer than 100,000 samples and fewer than 200 feature-value pairs, the tree-based methods (methods I and II) performed in under a minute, and the rule list method (method III) performed in under seven minutes. The most complicated data set in terms of both the number of samples and feature-value pairs is the U.S. Census data set (∼2.09 million training samples, ∼400 feature-value pairs), which took around 1.75 hours for method I, 2.5 hours for method II, and 8 and $\frac{2}{3}$ hours for method III (note that majority of the time for method III went into data loading and preprocessing). In our implementation, we represent data through bit vectors; thus, an increase in the number of samples causes a relatively smaller increase in the run time compared with the run-time increase caused by a larger number of feature categories. We also found that the leaf-sparse density tree method (method I) performed fastest on average for all data sets considered. In Figure 9, we provide a visualization of the time taken to estimate the density of each data set given its complexity for all three methods. Table C.2 in Appendix C shows more details on the timing. All time measurements are averaged over five repeats.

Figure 9. (Color online) Algorithm Run Times for All Data Sets as a Function of Data Set Complexity (the Log of the Number of Samples and the Number of Feature-Value Pairs) for (a) Leaf-Sparse Density Tree, (b) Branch-Sparse Density Tree, and (c) Sparse Density Rule List Estimation Methods
*Note*. Refer to Table C.2 for more details.

6.2. Recovering a Sparse Tree That Generates Data

This experiment is a test to see whether we can recover a tree that actually generates a data set. Specifically, we generated a data set that arises from a tree with six leaves, involving three features. The data consist of 1,000 data points, where 100 points are tied at value (1, 2, 1), 100 points are at (1, 2, 2), 100 points are at (2, 1, 1), 400 points are at (2, 1, 2), and 300 points are at (2, 2, 2).

We trained the methods on half of the data set and tested on the other half. Figure 10(c) shows the scatterplot of out-of-sample performance and sparsity. This is a case where the DET failed badly to recover the true model. It produced a model that was too sparse, with only three leaves. The leaf-sparse density tree method recovered the full tree. Figure 10(a) is the density tree that generated the data set, and we present the tree that we obtained in Figure 10(b), where it is recovered automatically.

Figure 10. (Color online) The Leaf-Sparse Density Tree Method Recovers a Sparse Tree That Generates Data
*Notes*. (a) Tree diagram for the sparse tree data set. (b) Leaf-sparse density tree model that recovers the true generative structure on half of the data set. (c) Performance vs. sparsity on the sparse tree data set.

7. Consistency

A consistent method has estimates that converge to the real densities as the size of the training set grows. Consistency of conventional histograms is well studied (for example, Abou-Jaoude 1976, Devroye and Györfi 1983). More generally, consistency for general rectangular partitions has been studied by Zhao et al. (1991) and Lugosi and Nobel (1996).

Typical consistency proofs (e.g., Devroye et al. 1996, Ram and Gray 2011) require the leaf diameters to become asymptotically smaller as the size of the data grows. In our case, if the ground truth density is a tree, we do not want our methods to asymptotically produce smaller and smaller bin sizes; we would rather they reproduce the ground truth tree. This means we require a new type of consistency proof.

Definition 1.

Trees have a single root, and there are conditions on each branch. A density value, f_l is associated with each leaf l of the tree.

Definition 2.

Two trees, T₁ and T₂, are equivalent with respect to density f if they assign the same density values to every data point on the domain: that is, $f_{T_{1}} (x) = f_{T_{2}} (x)$ for all x. We denote the class of trees that are equivalent to T as ${[T]}_{f}$ .

Theorem 1.

Let Θ be the set of all trees. Consider these conditions.

$T_{n} \in {arg max}_{T} O b j (T)$ . The objective function can be decomposed into $Obj (T) = \ln q_{n} (T | X) + \ln g_{n} (T | X)$ , where ${arg max}_{T} [\ln q_{n} (T | X) + \ln g_{n} (T | X)] \equiv {arg max}_{T} \ln g_{n} (T | X)$ as $n \to \infty$ .
$\ln g_{n} (T | X)$ converges in probability, for any tree T, to the empirical log likelihood that is obtained by the maximum likelihood principle, ${\hat{l}}_{n} (T | X) = \frac{1}{n} \sum_{i = 1}^{n} \ln {\hat{f}}_{n} (x_{i} | T)$ .
$\sup_{T \in Θ} | {\hat{l}}_{n} (T | X) - l (T) | \overset{P}{\to} 0$ ,
where $l (T) = E_{x} (\ln (f (x | T)))$ .
$T_{MLE}^{*} \in {arg max}_{T} l (T)$ is unique up to equivalence among elements of ${[T_{MLE}^{*}]}_{f}$ .

If these conditions hold, then the trees T_n that we learned, $T_{n} \in {arg max}_{T} Obj (T)$ , obey $T_{n} \in {[T_{MLE}^{*}]}_{f}$ for n > M for some M.

The proof of the result is as follows.

Proof of Theorem 1.

The first condition and the second condition are true any time we use a Bayesian method. They are also true any time we use regularized empirical likelihood, where the regularization term’s effect fades with the number of observations. Note that the third condition is automatically true by the law of large numbers. The last condition is not automatically true and requires regularity conditions for identifiability. The result states that our learned trees are equivalent to maximum likelihood trees when there are enough data.

From definition of T_n, T_n is an optimal value of the log-objective function, and hence, it is also an optimal solution to g_n as n is sufficiently large because of the first condition. We have that

\ln Obj (T_{n} | X) - \ln Obj (T_{MLE}^{*} | X) \geq 0

by definition of T_n as the maximizer of Obj. Because Obj becomes close to g_n, we have that

\ln g_{n} (T_{n} | X) - \ln g_{n} (T_{MLE}^{*}) \geq 0

(5)

as n is sufficiently large.

From condition (2), we know that $\ln g_{n} (T | X) - {\hat{l}}_{n} (T | X) \overset{P}{\to} 0$ , and from condition (3), we have ${\hat{l}}_{n} (T | X) - l (T) \overset{P}{\to} 0 .$ Adding this up using the fact that convergence in probability is preserved under addition, we know that $\ln g_{n} (T | X) \overset{P}{\to} l (T)$ .

Hence, by taking the limit of (5) as n grows, we have that

\lim_{n \to \infty} l (T_{n} | X) \geq l (T_{MLE}^{*}) .

Because $T_{MLE}^{*}$ is optimal for l(T) by definition and by condition (4), we conclude that T_n stays in $[T_{MLE}^{*}]$ when n is sufficiently large. □

8. Conclusion

In this work, we have presented a Bayesian approach to density estimation using sparse piecewise constant estimators for categorical (or binary) data. Our methods have nice properties, including that their prior encourages sparsity, which permits interpretability.

For tree-based methods, the prior is the user’s desired number of leaves or branches in each node in the tree, whereas for the density rule list, the prior regularizes the length of the list. We designed a simulated annealing scheme, which alongside the inclusion-exclusion principle and efficient data representation via bit vectors, allows us to find an optimal solution relatively fast.

Our methods outperform existing baselines. They do not have the pitfalls of other nonparametric density estimation methods like density estimation trees, which are top-down greedy. Further, they are consistent, without needing to asymptotically produce infinitesimally small leaves.

The interpretability of density trees and rule lists allows for easier visualization of the estimated density values, aiding in the understanding of the data distribution, detection of outliers and errors, and model selection, and it could assist with decision making. The approaches presented here have given us insight into real data sets (including the house break-in data set from the Cambridge Police and COCO-Stuff labels) that we could not have obtained reliably in any other way.

Acknowledgments

The authors thank the review and editorial team of INFORMS Journal on Data Science for their valuable feedback on the manuscript. S. T. Goh and L. Semenova contributed equally.

Appendix A. Optimal Density for the Likelihood Function

Denote the pointwise density estimate at x to be ${\hat{f}}_{n, x} = \frac{n_{x}}{n}$ . Denote the density estimate for all points within leaf l similarly as ${\hat{f}}_{n, l} = \frac{\sum_{j \in leaf l} n_{j}}{n V_{l}}$ .

The true density from which the data are assumed to be generated is denoted D. We assume that D arises from a tree over the input space.

Lemma A.1.

Any tree achieving the maximum likelihood on the training data has pointwise density equal to ${\hat{f}}_{n} (x)$ . This means that for any l in the tree and for any x, ${\hat{f}}_{n, l} (x) = {\hat{f}}_{n, x} (x)$ .

Proof.

We would like to show that points should be grouped together if and only if they share the same density. Clearly, if points have the same density, grouping them together will preserve constant density. Thus, the backward implication holds. We have only to show that if points do not share the same density, we should not group them together.

We will show that the pointwise histogram becomes better than using the tree if the tree is not correct. This is a variation of a well-known result that the maximum likelihood is the pointwise maximum likelihood. We will show

\prod_{j \in leaf l} {\hat{f}}_{n, j}^{n_{j}} \geq {\hat{f}}_{n, l}^{\sum_{j \in leaf l} n_{j}} .

By taking logarithms, this reduces to

\begin{array}{l} \sum_{j \in leaf l} n_{j} \log {\hat{f}}_{n, j} & \geq \sum_{j \in leaf l} n_{j} \log {\hat{f}}_{n, l} \\ \sum_{j \in leaf l} n_{j} \log (\frac{n_{j}}{n}) & \geq \sum_{j \in leaf l} n_{j} \log (\frac{\sum_{m \in leaf l} n_{m}}{n V_{l}}) \\ \sum_{j \in leaf l} n_{j} \log n_{j} & \geq \sum_{j \in leaf l} n_{j} \log (\frac{\sum_{m \in leaf l} n_{m}}{V_{l}}) \\ \sum_{j \in leaf l} n_{j} \log (\frac{n_{j}}{\sum_{m \in leaf l} n_{m}}) & \geq \sum_{m \in leaf l} n_{j} \log (\frac{1}{\sum_{m \in leaf l} V_{m}}) \\ \sum_{j \in leaf l} \frac{n_{j}}{\sum_{m \in leaf l} n_{m}} \log (\frac{n_{j}}{\sum_{m \in leaf l} n_{m}}) & \geq \sum_{j \in leaf l} \frac{n_{j}}{\sum_{m \in leaf l} n_{m}} \times \log (\frac{1}{\sum_{m \in leaf l} V_{m}}) . \end{array}

(A.1)

The last equation follows from Gibb’s inequality. Hence, we have proven that the statement is true.

To avoid singularities, we separately consider the case when one of the ${\hat{f}}_{n, j} = 0$ . For a particular value of q, if ${\hat{f}}_{n, q} = 0$ , then n_q = 0 by definition. Hence, if we include a new x within the leaf that has no training examples, we will find that the left-hand side term of (A.1) remains the same but because the volume increases when we add the new point, the quantity on the right decreases. Hence the inequality still holds. □

Appendix B. Discussion on Implementation of DET

Algorithm B.1

(Conversion of Categorical Feature to Real)

Input: categorical feature f_c, $c_{1}, \dots, c_{J}$ —categories of the feature f_c

Output: real-valued feature f_r

1: for every category c_j do
2: compute its frequency p_j in the data set
3: end for
4: sort the categories from the most frequent to the least frequent
5: split interval $[0, 1]$ in J frequency intervals $I_{j} = [a_{j}, b_{j}]$ based on the cumulative probability, meaning that the length of the interval with index j is p_j
6: for every sample i do
7: find its category $c_{j}^{i}$
8: find the corresponding frequency interval $I_{j} = [a_{j}, b_{j}]$
9: chose a value $v \in I_{j}$ by sampling from a truncated Gaussian distribution with $μ = a_{j} + p_{j} / 2$ and $σ = (b_{j} - a_{j}) / 6$
10: assign v to $f_{r}^{i}$
11: end for

DET implemented in Python from MLPACK library (Ram and Gray 2011) is designed for continuous variables. In order to compare our methods with DET, we preprocessed data sets using one-hot encoding and the synthetic data vault algorithm from Patki et al. (2016) (see Algorithm B.1). For the one-hot encoding, we create dummy variables for every category, and then, we called DET on the data set. The result included probability densities that sum to values much larger than one, which is not a valid result. Specifically, the DET code returned density values for a single observation as large as 12.71 for Titanic data (when the sum of density values should be 1). Such values should not be returned if the code was designed for categorical data because density is equal to the probability divided by the volume, and the volume for each categorical bin is at least one; the density value in each bin must be at most one in order to return valid density results. Given such high-density values, the log likelihood was positive: for example, in the range [396, 13,107] for crime data for different hyperparameter settings (the correct values are always negative). Utilizing other preprocessing methods, such as the synthetic data vault algorithm, decreased values of the log likelihood, but the problem of densities larger than one for single observations was still present.

Because of the issues in the density computation (and thus, likelihood computation) discussed, we needed to figure out a way to compare the performance of DET with other methods. In order to compare the performance, we used a one-hot encoding representation of the data and used the DET algorithm to create leaves. Then, we utilized our own objective function to compute the log likelihood from methods I and II. Trees that were returned by DET had splits based on one category in the nodes (because of the nature of encoding and the DET algorithm), which resulted in more leaves on average and lower log likelihood as compared with our optimized methods.

If in the future, one created an implementation of DET that produces valid probability densities for categorical data, our objective functions can be used as criteria to select a suitable density tree. This algorithm would still have the disadvantage of producing greedy trees rather than optimized trees.

Appendix C. Discussion on Run-Time Performance

We measured the performance of density estimation methods on 17 categorical data sets that are described in Table C.1. Continuous features in HELOC and Telco Customer Churn data sets were divided into 10 bins uniformly to create categorical features. We considered labels from the COCO-Stuff + thing data set, and this resulted in three data sets: (1) COCO-Stuff that consists of binary features, where each indicates whether the stuff category is present in the image; (2) COCO-Thing that is built the same way except features are thing categories; and (3) COCO-Stuff + thing, a four-feature data set that utilizes the hierarchical structure of COCO detection labels, where each feature is a hierarchy level (such as “animal,” “dog,” “things,” or “outdoor,” where a “dog” is an “animal,” is “outdoor,” and is in the “thing” data set). We removed data samples with missing values. The train and validation data split for the run-time experiments is fixed and shown in Table C.1.

Table C.1. Data Sets Statistics, Including Details on Data Set Size, Number of Features and Categories, and Preprocessing Notes If Any

Table C.1. Data Sets Statistics, Including Details on Data Set Size, Number of Features and Categories, and Preprocessing Notes If Any

Data set	Total number of points	Number of train points	Number of validation points	Train validation split, %	Number of features	Number of feature-value pairs	Processing notes
Balance	625	531	94	15	4	20
Bank Full	45,211	38,429	6,782	15	12	51
Car	1,728	1,468	260	15	6	21
Chess (King-Rook vs. King-Pawn)	3,196	2,716	480	15	36	73
COCO-Stuff + thing labels hierarchy	1,677,309	1,607,458	69,582	4	4	206	Features are formed from COCO detection labels hierarchy: stuff or thing, indoor or outdoor, supercategory, and category
COCO-Stuff labels	123,280	118,280	5,000	4	91	182	Features are detection label categories
COCO-Thing labels	122,218	117,266	4,952	4	80	160	Features are detection label categories
COMPAS	7,210	6,489	721	10	7	19
Connect 4	67,557	57,423	10,134	15	42	126
Crime	3,739	3,178	561	15	6	24
HELOC	10,459	8,890	1,569	15	23	195	Cut all features in 10 bins
Mushroom	5,644	4,797	847	15	23	100	Dropped entries with missing values
Nursery	12,960	11,016	1,944	15	8	27
Telco Customer Churn	7,032	5,977	1,055	15	19	73	Cut 3 continuous features in 10 bins; dropped entries with missing values
Titanic	2,201	1,761	440	20	3	8
U.S. Census 1990 (1m)	1,150,000	1,000,000	150,000	15	68	396	Considered 1 million samples
U.S. Census 1990	2,458,285	2,089,542	368,743	13	68	396

For each setting of the parameters for each algorithm, we ran the algorithm five times to account for randomness in the optimization. We chose the best parameter values and reported the average run time (over the five repeats) for these best parameters. We also reported the average (over the five repeats) total run time, including the time needed to choose parameter values. For the leaf-sparse density tree model, parameter λ (number of leaves) was chosen from the set $5, 8, 10$ . For the branch-sparse density tree, λ (number of branches) was chosen from $2, 3$ ; for the sparse density rule list, λ (length of the list) was chosen from the set $3, 5, 7$ , and η (number of conjunctions in a rule) was chosen from $1, 2$ . α was fixed to be two for the tree-based methods and one for the density rule list. Run-time results for all 17 data sets are shown in Table C.2. For tree-based methods, the run time for evaluating multiple parameters is approximately three times (for leaf sparse) and two times (for branch sparse) larger than the run time for the methods when we knew the best parameters simply because each run of the algorithm took approximately the same amount of time. For the density rule list, a significant portion of the run time is spent on data and volume preprocessing computations that are executed only once at the beginning. Thus, running the algorithm with multiple (i.e., six) parameters is not six times the run time for running once when knowing the best parameters.

Table C.2. Run-Time Analysis of Three Methods for Different Complexity Data Sets

Table C.2. Run-Time Analysis of Three Methods for Different Complexity Data Sets

Data set	Number of train points	Number of feature-value pairs	Leaf sparse, best, seconds	Branch sparse, best, seconds	Rule list, best, seconds	Leaf sparse, multiple, seconds	Branch sparse, multiple, seconds	Rule list, multiple, seconds
Balance	531	20	0.207	1.036	8.226	0.587	2.166	82.402
Bank Full	38,429	51	19.652	21.115	31.340	61.567	42.971	189.415
Car	1,468	21	0.276	1.324	7.123	0.826	2.610	88.097
Chess (King-Rook vs. King-Pawn)	2,716	73	2.808	3.631	58.269	8.683	7.147	246.194
COCO-Stuff + thing labels hierarchy	1,607,458	206	315.418	339.948	788.144	886.045	632.347	1,175.841
COCO-Stuff labels	118,280	182	119.142	169.533	1,056.583	382.756	316.800	1,999.329
COCO-Things labels	117,266	160	104.491	127.561	545.789	278.445	247.029	1,476.896
COMPAS	6,489	19	1.225	2.171	14.503	3.606	4.345	95.327
Connect 4	57,423	126	53.692	58.286	404.347	143.833	111.995	882.699
Crime	3,178	24	1.393	1.998	24.709	3.990	3.933	120.641
HELOC	8,890	195	19.402	23.349	111.346	62.415	48.889	607.118
Mushroom	4,797	100	5.317	6.362	31.890	16.360	12.685	699.592
Nursery	11,016	27	1.196	4.418	10.667	3.587	8.571	106.414
Telco Customer Churn	5,977	73	6.405	7.402	138.462	17.196	14.860	370.096
Titanic	1,761	8	0.324	0.425	7.323	0.947	0.804	41.436
U.S. Census 1990 (1m)	1,000,000	396	3,580.249	2,813.563	8,382.199	10,602.726	6,299.193	10,145.797
U.S. Census 1990	2,089,542	396	6,216.533	8,965.318	31,199.718	19,469.654	16,118.220	39,081.668

Notes. All time measurements are averaged over five repeats. “Leaf sparse, best,” “Branch sparse, best,” and “Rule list, best” show the average run times in seconds of the methods with best parameters λ and η, where in bold we show better run time. “Leaf sparse, multiple,” “Branch sparse, multiple,” and “Rule list, multiple” show the average run times in seconds of the methods, including the time needed to try multiple parameters.

Run times in Table C.2 are computed by running our methods on the Duke University’s Computer Science Department cluster. On a single processor machine (Intel(R) Core(TM) i7-6700 processor at 3.40 GHz), for the Titanic data set the average run time (over five iterations) for the leaf-based method is 0.276 seconds, the average run time for branch based is 0.416 seconds, and the average run time for density list is 6.928 seconds. For the crime data set, the average run time for leaf-based is 0.1 seconds, the average run time for branch based is 1.4 seconds, and the average run time for density list is 12.28 seconds. For the balance data set, the average run time for leaf-based is 0.202 seconds, the average run time for branch based is 0.99 seconds, and the average run time for density list is 7.544 seconds. For the bank full data set, the average run time for leaf-based is 20.218 seconds, the average run time for branch based is 23.077 seconds, and the average run time for density list is 23.722 seconds. The run times are reported for one average run of the algorithms.

Appendix D. Recommendations on the Choice of Algorithm and Parameters

From the experiments we conducted, we found that the leaf-based density trees were the most useful and intuitive. They also run the fastest (Table C.2) for the vast majority of the data sets we considered.

D.1. Leaf-Based vs. Branch-Based

Leaf-based and branch-based methods are similar in structure but differ in how the shape of the model is controlled by the prior; specifically, the user controls either the number of leaves or the number of branches. This matters when there are many categories per feature; the leaf-based approach may try to put these categories in one node, whereas the branch-based method might keep them in separate nodes. However, it takes longer to run the branch-based method (Table C.2), and its models are typically more complex than the leaf-based method’s models on the same data set (Table D.1).

Table D.1. Description of Priors and the Model Complexities for Models That Maximized Log-Likelihood During the Tuning Procedure Described in Appendix C

Table D.1. Description of Priors and the Model Complexities for Models That Maximized Log-Likelihood During the Tuning Procedure Described in Appendix C

Data set	Number of train points	Number of feature-value pairs	Leaf-sparse		Branch-sparse		Rule list
Data set	Number of train points	Number of feature-value pairs	Prior, λ	Number of leaves in the optimal model	Prior, λ	Number of leaves in the optimal model	Prior, λ	Prior, η	Length of the optimal model
Balance	531	20	5	2	3	40	3	1	3
Bank Full	38,429	51	8	100	2	52	5	2	8
Car	1,468	21	8	2	3	31	3	1	4
Chess (King-Rook vs. King-Pawn)	2,716	73	5	26	3	47	7	2	8
COCO-Stuff + thing labels hierarchy	1,607,459	206	5	202	2	208	3	2	6
COCO-Stuff labels	118,280	182	8	32	2	41	7	2	9
COCO-Thing labels	117,266	160	5	47	3	29	7	1	7
COMPAS	6,489	19	8	22	2	32	3	1	5
Connect 4	57,423	126	10	47	3	42	7	2	7
Crime	3,178	24	5	20	2	40	7	1	7
HELOC	8,890	195	8	84	3	93	7	1	8
Mushroom	4,797	100	10	52	2	73	3	1	6
Nursery	11,016	27	8	2	3	48	3	2	2
Telco Customer Churn	5,977	73	8	41	3	73	7	1	10
Titanic	1,761	8	5	11	2	13	5	1	7
U.S. Census 1990 (1m)	1,000,000	396	5	78	3	82	7	1	7
U.S. Census 1990	2,089,542	396	5	100	2	105	7	1	5

D.2. Trees vs. Lists

When comparing density trees and density rule lists, rule lists are one-sided trees, but they have multiple conditions defining each rule. Density rule lists can be more helpful if the user prefers a very sparse density model or has a smaller data set. Concerning run time, the rule lists were the slowest method per run on average. However, one of the major bottlenecks for rule lists was memory space and time needed to process the data and mine the rules.

D.3. Parameters

Our sparse density lists and trees have priors on the model structure, such as the number of leaves (method I), branches (method II), or length of the list (method III). In Table D.2, we summarize all parameters that one needs to define in order to run our methods. Pseudocounts are typically set to small values in order to avoid zero densities. For all methods, λ regularizes the complexity of the model and reflects the prior belief on how sparse the user expects/would like the model to be. However, the resulting model complexity also depends on the data distribution. To give specific examples of priors and optimal model complexities, we analyzed trees and lists that we computed while evaluating run time in Appendix C. For every data set, we reported the prior value that led to the maximum log-likelihood model and the complexity of this model (see Table D.1). For example, for the COMPAS data set with $\sim 6, 500$ training samples and 19 feature-value pairs, a prior of eight on the number of leaves led to a tree with 22 leaves, a prior of two on the number of branches led to a tree with 32 leaves, and a prior of three on the length of the rule list led to a model of length 5. Although Table D.1 is a post hoc analysis of experiments conducted in Appendix C, it can still serve as a reference point for the prior value and optimal model complexity for different data sets. We also encourage users to perform crossvalidation to choose parameters similar to the experiments we conducted in Section 5 and Appendix C.

Table D.2. Description of the Parameters for Sparse Density Trees and Lists

Table D.2. Description of the Parameters for Sparse Density Trees and Lists

Parameter	Meaning	Recommendations
Leaf-sparse density tree
λ	Desired number of leaves in the tree	8, 10. Experimentally, we found that setting the prior for the number of leaves to 8 achieves good crossvalidation log likelihood.
α	Pseudocount, used to avoid zero values for the estimated densities	1, 2. A small value.
Branch-sparse density tree
λ	Desired number of branches at each internal node of the tree	2, 3. Depends on the number of categories for each feature and how much the user would like to aggregate them. If sparsity is preferred, then 2 or 3 is a good choice. Otherwise, may be set to 4 or 5.
α	Pseudocount, used to avoid zero values for the estimated densities	1, 2. A small value.
Sparse density rule list
λ	Desired length of the rule list	7 or higher (for larger data sets).
η	Desired number of conjunctions in a rule	1, 2, or 3 (for a smaller number of feature/value pairs). For larger data sets, mining rules and storing the data can be memory expensive, so smaller values of this parameter may be preferred.
α	Pseudocount, used to avoid zero values for the estimated densities	1, 2. A small value.

References

Abou-Jaoude S (1976) Conditions nécessaires et suffisantes de convergence l1 en probabilité de l’histogramme pour une densité. Ann. Inst. Henri Poincare Probab. Statist. 12(3):213–231.Google Scholar
Akaike H (1954) An approximation to the density function. Ann. Inst. Statist. Math. 6(2):127–132.Google Scholar
Anderlini L (2015) Density estimation trees in high energy physics. Preprint, submitted February 3, https://arxiv.org/abs/1502.00932.Google Scholar
Anderlini L (2016) Density estimation trees as fast non-parametric modelling tools. J. Physics Conf. Ser. 762(1):012042.Google Scholar
Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2018) Learning certifiably optimal rule lists for categorical data. J. Machine Learn. Res. 18(234):1–78.Google Scholar
Cacoullos T (1966) Estimation of a multivariate density. Ann. Inst. Statist. Math. 18(1):179–189.Google Scholar
Caesar H, Uijlings J, Ferrari V (2018) COCO-Stuff: Thing and stuff classes in context. 2018 IEEE Conf. Comput. Vision Pattern Recognition (CVPR) (IEEE, Piscataway, NJ).Google Scholar
Cattaneo MD, Jansson M, Ma X (2020) Simple local polynomial density estimators. J. Amer. Statist. Assoc. 115(531):1449–1455.Google Scholar
Chakrabarti S, Ester M, Fayyad U, Gehrke J, Han J, Morishita S, Piatetsky-Shapiro G, Wang W (2006) Data mining curriculum: A proposal (version 1.0). Intensive Working Group of ACM SIGKDD Curriculum Committee, vol. 140 (ACM, New York), 1–10.Google Scholar
Chen T, Morris J, Martin E (2006) Probability density estimation via an infinite gaussian mixture model: Application to statistical process monitoring. J. Roy. Statist. Soc. Ser. C Appl. Statist. 55(5):699–715.Google Scholar
Chipman HA, George EI, McCulloch RE (2010) Bart: Bayesian additive regression trees. Ann. Appl. Statist. 4(1):266–298.Google Scholar
Devroye L (1991) Exponential inequalities in nonparametric estimation. Roussas G, ed. Nonparametric Functional Estimation and Related Topics, NATO ASI Series, vol. 335 (Springer, Dordrecht, Netherlands), 31–44.Google Scholar
Devroye L, Györfi L (1983) Distribution-free exponential bound on the l1 error of partitioning estimates of a regression function. Konecny F, Mogyoródi J, Wertz W, eds. Probability and Statistical Decision Theory (D. Reidel, Dordrecht, Netherlands), 67–76.Google Scholar
Devroye L, Györfi L, Lugosi G (1996) A Probabilistic Theory of Pattern Recognition (Springer, Berlin).Google Scholar
Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Machine Learn. Res. 20(177):1–81.Google Scholar
Friedman JH, Stuetzle W, Schroeder A (1984) Projection pursuit density estimation. J. Amer. Statist. Assoc. 79(387):599–608.Google Scholar
Holmström L, Karttunen K, Klemelä J (2015) Estimation of level set trees using adaptive partitions. Comput. Statist. 32(3):1139–1163.Google Scholar
Jebara TS (2008) Bayesian out-trees. Proc. Twenty-Fourth Conf. Uncertainty Artificial Intelligence (UAI) (AUAI Press, Arlington, VA), 315–324.Google Scholar
Larson J, Mattu S, Kirchner L, Angwin J (2016) How we analyzed the COMPAS recidivism algorithm. ProPublica, https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.Google Scholar
Letham B, Rudin C, McCormick TH, Madigan D (2015) Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Statist. 9(3):1350–1371.Google Scholar
Li JQ, Barron AR (2000) Mixture density estimation. Solla SA, Leen TK, Müller K, eds. Adv. Neural Inform. Processing Systems, vol. 12 (MIT Press, Cambridge, MA), 279–285.Google Scholar
Li D, Yang K, Wong WH (2016) Density estimation via discrepancy based adaptive sequential partition. Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 29 (Curran Associates, Inc., Red Hook, NY), 1091–1099.Google Scholar
Liu H, Lafferty J, Wasserman L (2007) Sparse nonparametric density estimation in high dimensions using the rodeo. Meila M, Shen X, eds. Proc. Eleventh Internat. Conf. Artificial Intelligence Statist., vol. 2 (PMLR, New York), 283–290.Google Scholar
Liu H, Xu M, Gu H, Gupta A, Lafferty J, Wasserman L (2011) Forest density estimation. J. Machine Learn. Res. 12(25):907–951.Google Scholar
Liu Q, Xu J, Jiang R, Wong WH (2021) Density estimation using deep generative neural networks. Proc. Natl. Acad. Sci. USA 118(15):e2101344118.Google Scholar
Lu L, Jiang H, Wong WH (2013) Multivariate density estimation by Bayesian sequential partitioning. J. Amer. Statist. Assoc. 108(504):1402–1410.Google Scholar
Lugosi G, Nobel A (1996) Consistency of data-driven histogram methods for density estimation and classification. Ann. Statist. 24(2):687–706.Google Scholar
Luo R, Liu A, Wang Y (2019) Combining smoothing spline with conditional Gaussian graphical model for density and graph estimation. Preprint, submitted March 30, https://arxiv.org/abs/1904.00204.Google Scholar
Mahapatruni RSG, Gray A (2011) Cake: Convex adaptive kernel density estimation. Gordon G, Dunson D, Dudík M, eds. Proc. Fourteenth Internat. Conf. Artificial Intelligence Statist., vol. 15 (PMLR, New York), 498–506.Google Scholar
Müller P, Quintana FA (2004) Nonparametric Bayesian data analysis. Statist. Sci. 19(1):95–110.Google Scholar
Nadaraya ÉA (1970) Remarks on non-parametric estimates for density functions and regression curves. Theory Probab. Its Appl. 15(1):134–137.Google Scholar
Ooi H (2012) Density visualization and mode hunting using trees. J. Comput. Graphical Statist. 11(2):328–347.Google Scholar
Ormoneit D, Tresp V (1996) Improved gaussian mixture density estimates using Bayesian penalty terms and network averaging. Touretzky DS, Mozer MC, Hasselmo ME, eds. Adv. Neural Inform. Processing Systems, vol. 8 (MIT Press, Cambridge, MA), 542–548.Google Scholar
Ormoneit D, Tresp V (1998) Averaging, maximum penalized likelihood and Bayesian estimation for improving gaussian mixture probability density estimates. IEEE Trans. Neural Networks Learn. Systems 9(4):639–650.Google Scholar
Parzen E (1962) On estimation of a probability density function and mode. Ann. Math. Statist. 33(3):1065–1076.Google Scholar
Patki N, Wedge R, Veeramachaneni K (2016) The synthetic data vault. 2016 IEEE Internat. Conf. Data Sci. Adv. Analytics (DSAA) (IEEE, Piscataway, NJ), 399–410.Google Scholar
Ram P, Gray AG (2011) Density estimation trees. Proc. 17th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 627–635.Google Scholar
Rehn P, Ahmadi Z, Kramer S (2018) Forest of normalized trees: Fast and accurate density estimation of streaming data. 2018 IEEE 5th Internat. Conf. Data Sci. Adv. Analytics (DSAA) (IEEE, Piscataway, NJ), 199–208.Google Scholar
Rejtö L, Révész P (1973) Density estimation and pattern classification. Problems Control Inform. Theory 2(1):67–80.Google Scholar
Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27(3):832–837.Google Scholar
Rudin C, Ertekin S (2018) Learning customized and optimized lists of rules with mathematical programming. Math. Programming Comput. 10:659–702.Google Scholar
Sasaki H, Hyvärinen A (2018) Neural-kernelized conditional density estimation. Preprint, submitted June 5, https://arxiv.org/abs/1806.01754.Google Scholar
Scott DW (1979) On optimal and data-based histograms. Biometrika 66(3):605–610.Google Scholar
Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. Proc. 12th EDBT/ICDT (ACM, New York), 311–322.Google Scholar
Semenova L, Rudin C, Parr R (2022) On the existence of simpler machine learning models. Proc. ACM Conf. Fairness Accountability Transparency (ACM FAccT) (ACM, New York), 1827–1858.Google Scholar
Silverman BW (1986) Density Estimation for Statistics and Data Analysis, vol. 26 (CRC Press, Boca Raton, FL).Google Scholar
Varet S, Lacour C, Massart P, Rivoirard V (2023) Numerical performance of penalized comparison with overfitting for multivariate kernel density estimation. ESAIM Probab. Statist. 27:621–667.Google Scholar
Wand MP (1997) Data-based choice of histogram bin width. Amer. Statist. 51(1):59–64.Google Scholar
Wasserman L (2006) All of Nonparametric Statistics (Springer Science & Business Media, New York).Google Scholar
Willett RM, Nowak RD (2007) Minimax optimal level-set estimation. IEEE Trans. Image Processing 16(12):2965–2979.Google Scholar
Wong WH, Ma L (2010) Optional pólya tree and Bayesian inference. Ann. Statist. 38(3):1433–1459.Google Scholar
Wu K, Hou W, Yang H (2018) Density estimation via the random forest method. Comm. Statist. Theory Methods 47(4):877–889.Google Scholar
Wu Y, Tjelmeland H, West M (2007) Bayesian CART: Prior specification and posterior simulation. J. Comput. Graphical Statist. 16(1):44–66.Google Scholar
Yang K, Wong WH (2014a) Density estimation via adaptive partition and discrepancy control. Preprint, submitted April 5, https://arxiv.org/abs/1404.1425.Google Scholar
Yang K, Wong WH (2014b) Discovering and visualizing hierarchy in multivariate data. Preprint, submitted March 18, https://arxiv.org/abs/1403.4370.Google Scholar
Yang H, Rudin C, Seltzer M (2017) Scalable Bayesian rule lists. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 3921–3930.Google Scholar
Zhao LC, Krishnaiah PR, Chen XR (1991) Almost sure $ℓ_{r}$ -norm convergence for data-based histogram density estimates. Theory Probab. Its Appl. 35(2):396–403.Google Scholar
Zhuang X, Huang Y, Palaniappan K, Zhao Y (1996) Gaussian mixture density modeling, decomposition, and applications. IEEE Trans. Image Processing 5(9):1293–1302.Google Scholar

cover image INFORMS Journal on Data Science

Volume 3, Issue 1

April-June 2024

Pages 1-104, C2

Article Information

Supplemental Material

Metrics

Information

Received:January 01, 2021
Accepted:October 25, 2023
Published Online:January 11, 2024

Cite as

Siong Thye Goh, Lesia Semenova, Cynthia Rudin (2024) Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms. INFORMS Journal on Data Science 3(1):28-48.

https://doi.org/10.1287/ijds.2021.0001

Keywords

Acknowledgments

The authors thank the review and editorial team of INFORMS Journal on Data Science for their valuable feedback on the manuscript. S. T. Goh and L. Semenova contributed equally.

PDF download

Available Issues

Available Issues

Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms

Abstract

1. Introduction

2. Related Work

2.1. Nonparametric Density Estimation

2.2. Bayesian Tree Models

3. Methods

3.1. Method I: Leaf-Sparse Density Trees

3.1.1. Prior.

3.1.2. Likelihood.

3.1.3. Posterior.

3.2. Method II: Branch-Sparse Density Trees

3.2.1. Tree Structure Prior.

3.2.2. Tree Propagation Prior.

3.2.3. Posterior.

3.2.4. Possible Extension.

3.3. Method III: Sparse Density Rule List

3.3.1. Generative Process.

3.3.2. Prior.

3.3.3. Likelihood.

3.3.4. Posterior.

4. Numerical Methods to Optimize the Objective Function

4.1. Simulated Annealing for Tree-Based Methods

4.2. Sparse Density Rule List Optimization

4.2.1. Approach 1: Analytical Computation.

4.2.2. Approach 2: Uniform Sampling.

4.2.3. Approach 3: MCMC.

5. Experiments

5.1. Titanic Data Set

5.2. Crime Data Set

6. Empirical Performance Analysis

6.1. Run-Time Analysis

6.2. Recovering a Sparse Tree That Generates Data

7. Consistency

8. Conclusion

Appendix A. Optimal Density for the Likelihood Function

Appendix B. Discussion on Implementation of DET

Appendix C. Discussion on Run-Time Performance

Appendix D. Recommendations on the Choice of Algorithm and Parameters

D.1. Leaf-Based vs. Branch-Based

D.2. Trees vs. Lists

D.3. Parameters

References

Volume 3, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords