Statistical Models for the Number of Successful Cyber Intrusions
We propose several generalized linear models (GLMs) to predict the number of successful cyber intrusions (or "intrusions") into an organization's computer network, where the rate at which intrusions occur is a function of the following observable characteristics of the organization: (i) domain name server (DNS) traffic classified by top-level domain (TLD); (ii) the number of network security policy violations; and (iii) a set of predictors that we collectively call the "cyber footprint," comprising the number of hosts on the organization's network, the organization's similarity to educational institution behavior (SEIB), and its number of records on scholar.google.com (ROSG). In addition, we evaluate whether the number of intrusions follows a Poisson or a negative binomial (NB) probability distribution. We find that the NB GLM provides the best fit to the observed count data (the number of intrusions per organization) because the NB model allows the variance of the count data to exceed the mean. We also show that restricted, simpler NB regression models that omit selected predictors improve on the goodness-of-fit of the full NB GLM for the observed data. Using model simulations, we identify certain TLDs in the DNS traffic as having a significant impact on the number of intrusions. Finally, the models and regression results show that the number of network security policy violations is consistently predictive of the number of intrusions.
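To illustrate why a distribution that lets the variance exceed the mean can fit count data better than a Poisson, the following minimal sketch compares the two on synthetic counts. It is not the authors' method or data: the counts are simulated, the values `true_mu` and `true_alpha` are hypothetical, no predictors are included (an intercept-only comparison), and the NB dispersion is estimated by a simple method of moments under the common parameterization in which the variance is mu + alpha * mu^2.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def poisson_loglik(counts, mu):
    """Log-likelihood of the counts under Poisson(mu)."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def nb_loglik(counts, mu, alpha):
    """Log-likelihood under an NB with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
        for y in counts
    )


# Hypothetical parameters, chosen only to produce overdispersed synthetic counts.
rng = random.Random(0)
true_mu, true_alpha = 3.0, 1.5

# Gamma-Poisson mixture: each unit gets its own rate, which yields NB counts.
counts = [
    sample_poisson(rng, rng.gammavariate(1.0 / true_alpha, true_alpha * true_mu))
    for _ in range(2000)
]

m = mean(counts)
v = variance(counts)

# Method-of-moments dispersion estimate (an assumption of this sketch, not an MLE).
alpha_hat = max((v - m) / m**2, 1e-8)

print("sample mean:", m)
print("sample variance:", v)
print("Poisson log-likelihood:", poisson_loglik(counts, m))
print("NB log-likelihood:", nb_loglik(counts, m, alpha_hat))
```

When the sample variance is well above the sample mean, the NB log-likelihood is higher than the Poisson one. A regression version would replace the single mean with a per-organization mean driven by the predictors through a log link.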