Degree Distribution Preserving Network Sampling: The Case of Relational Learning
Abstract
Networks are relevant to many business applications, and the benefits of analyzing networks are recognized by researchers and practitioners alike. When networks become impractically large, analyzing samples drawn from them is the only way to get insights. The need for effective approaches to generate samples from large networks is also well documented as are the difficulties associated with doing so. The distribution associated with the degrees of nodes in a network is known to affect many phenomena of interest directly or indirectly and is recognized as one of the fundamental properties of networks. In this paper, we consider the problem of sampling from an undirected network and preserving its relative degree distribution. We formulate it as an optimization problem that minimizes the Kolmogorov–Smirnov distance between the distributions of the original network and the sample. We identify conditions involving the neighborhoods of nodes that result in a perfect sample. Whereas perfect samples are unlikely to exist in practice, these conditions provide intuition that we leverage to develop effective heuristics. We illustrate the effectiveness of the proposed heuristics through two categories of computational experiments. The first set of experiments shows that the proposed approaches preserve degree distributions better than state-of-the-art approaches available for sampling. In the second set of experiments, we evaluate the benefits of our approach on relational learning–based classification tasks. These experiments show that samples obtained using our approach outperform those derived using existing approaches and that our approach is scalable.
History: Accepted Ram Ramesh, Area Editor for Data Science & Machine Learning.
Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2024.1002) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2024.1002). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.

