PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems

by   Khalifeh Aljadda, et al.

In the big data era, scalability has become a crucial requirement for any useful computational model. Probabilistic graphical models are very useful for mining and discovering data insights, but they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly demonstrate this limitation when their data is represented using few random variables while each random variable has a massive set of values. With hierarchical data - data that is arranged in a treelike structure with several levels - one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian networks become infeasible for representing the probability distributions for the following reasons: i) Each level represents a single random variable with hundreds of thousands of values, ii) The number of levels is usually small, so there are also few random variables, and iii) The structure of the network is predefined since the dependency is modeled top-down from each parent to each of its child nodes, so the network would contain a single linear path for the random variables from each parent to each child node. In this paper we present a scalable probabilistic graphical model to overcome these limitations for massive hierarchical data. We believe the proposed model will lead to an easily-scalable, more readable, and expressive implementation for problems that require probabilistic-based solutions for massive amounts of hierarchical data. We successfully applied this model to solve two different challenging probabilistic-based problems on massive hierarchical data sets for different domains, namely, bioinformatics and latent semantic discovery over search logs.


page 1

page 2

page 3

page 4


Mining Massive Hierarchical Data Using a Scalable Probabilistic Graphical Model

Probabilistic Graphical Models (PGM) are very useful in the fields of ma...

Designing neural networks that process mean values of random variables

We introduce a class of neural networks derived from probabilistic model...

Probabilistic graphs using coupled random variables

Neural network design has utilized flexible nonlinear processes which ca...

AMIDST: a Java Toolbox for Scalable Probabilistic Machine Learning

The AMIDST Toolbox is a software for scalable probabilistic machine lear...

Scalable Exact Parent Sets Identification in Bayesian Networks Learning with Apache Spark

In Machine Learning, the parent set identification problem is to find a ...

Learning Functional Dependencies with Sparse Regression

We study the problem of discovering functional dependencies (FD) from a ...

Graphical Modelling in Genetics and Systems Biology

Graphical modelling has a long history in statistics as a tool for the a...

Please sign up or login with your details

Forgot password? Click here to reset