Inferring the minimum spanning tree from a sample network
Minimum spanning trees (MSTs) are used in a variety of fields, from computer science to geography. Infectious disease researchers have used them to infer the transmission pathway of certain pathogens. However, these are often the MSTs of sample networks, not population networks, and surprisingly little is known about what can be inferred about a population MST from a sample MST. We prove that if n nodes (the sample) are selected uniformly at random from a complete graph with N nodes and unique edge weights (the population), the probability that an edge is in the population graph's MST given that it is in the sample graph's MST is n/N. We use simulation to investigate this conditional probability for G(N,p) graphs, Barabási-Albert (BA) graphs, graphs whose nodes are distributed in ℝ^2 according to a bivariate standard normal distribution, and an empirical HIV genetic distance network. Broadly, results for the complete, G(N,p), and normal graphs are similar, and results for the BA and empirical HIV graphs are similar. We recommend that researchers use an edge-weighted random walk to sample nodes from the population so that they maximize the probability that an edge is in the population MST given that it is in the sample MST.
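The n/N result for complete graphs can be checked numerically. The sketch below is not the authors' simulation code; it is a minimal illustration that assumes edge weights drawn independently from a uniform distribution (so they are unique with probability 1) and uses a plain Kruskal implementation. The function names `mst_edges` and `estimate_conditional_probability`, and the parameter values in the usage line, are hypothetical choices made for this example.

```python
import itertools
import random


def mst_edges(nodes, weight):
    """Return the MST edge set of the complete graph on `nodes` (Kruskal).

    `weight` maps each edge (u, v) with u < v to its weight.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = set()
    # Sorting the nodes first guarantees every pair is emitted as (u, v), u < v.
    candidates = itertools.combinations(sorted(nodes), 2)
    for u, v in sorted(candidates, key=weight.__getitem__):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.add((u, v))
    return tree


def estimate_conditional_probability(N, n, trials, seed=0):
    """Estimate P(edge in population MST | edge in sample MST).

    Assumption for this sketch: the population is a complete graph on N
    nodes with i.i.d. uniform edge weights, and the sample is n nodes
    drawn uniformly at random without replacement.
    """
    rng = random.Random(seed)
    population = list(range(N))
    hits = total = 0
    for _ in range(trials):
        weight = {e: rng.random() for e in itertools.combinations(population, 2)}
        population_mst = mst_edges(population, weight)
        sample = rng.sample(population, n)
        sample_mst = mst_edges(sample, weight)
        hits += len(sample_mst & population_mst)
        total += len(sample_mst)
    return hits / total


if __name__ == "__main__":
    N, n = 20, 10
    estimate = estimate_conditional_probability(N, n, trials=400)
    print(f"estimate = {estimate:.3f}, n/N = {n / N:.3f}")
```

With enough trials the estimate should settle near n/N. The same loop could be adapted to other population graphs or sampling schemes by changing how `weight` and `sample` are generated, although `mst_edges` as written assumes every pair of nodes is joined by an edge.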