Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data

by Payam Siyari, et al.

Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because an underlying evolutionary process repeatedly creates more complex strings from simpler substrings. We propose a framework, referred to as "Lexis", that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After proving its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
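To make the idea of constructing targets by concatenating intermediate substrings concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation or their greedy algorithm. The dictionary encoding, the node names `T1`, `T2` and `N`, the two example target strings, and the helper functions `expand` and `edge_cost` are all assumptions introduced for this example. It also assumes that every occurrence of a part inside a node counts as one edge.

```python
# Hypothetical toy encoding of a Lexis-DAG (an assumption for illustration):
# each key is a node, each value lists the parts that are concatenated to
# build it. A part that is not itself a key is treated as a source character.

def expand(dag, node):
    """Rebuild the string spelled out by `node` by recursive concatenation."""
    if node not in dag:
        return node  # source character from the alphabet
    return "".join(expand(dag, part) for part in dag[node])


def edge_cost(dag):
    """Count edges, assuming one edge per part occurrence in every node."""
    return sum(len(parts) for parts in dag.values())


# Two example targets, "abcabc" and "abcd", built directly from characters.
flat = {"T1": list("abcabc"), "T2": list("abcd")}

# The same targets, reusing an intermediate node N that spells "abc".
shared = {"N": list("abc"), "T1": ["N", "N"], "T2": ["N", "d"]}

for name, dag in (("flat", flat), ("shared", shared)):
    print(name, expand(dag, "T1"), expand(dag, "T2"), edge_cost(dag))
```

Under these assumptions both encodings spell out the same two targets, but reusing the intermediate node `N` needs fewer edges than building each target character by character. Finding the intermediate nodes that minimize this count is the optimization problem the abstract describes.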



Emergence and Evolution of Hierarchical Structure in Complex Systems

It is well known that many complex systems, both in technology and natur...

Optimal Reference for DNA Synthesis

In the recent years, DNA has emerged as a potentially viable storage tec...

A Data-Structure for Approximate Longest Common Subsequence of A Set of Strings

Given a set of k strings I, their longest common subsequence (LCS) is th...

Evolution of Hierarchical Structure Reuse in iGEM Synthetic DNA Sequences

Many complex systems, both in technology and nature, exhibit hierarchica...

Hierarchy Builder: Organizing Textual Spans into a Hierarchy to Facilitate Navigation

Information extraction systems often produce hundreds to thousands of st...

Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string

Strings are a natural representation of biological data such as DNA, RNA...

Identifying Hierarchical Structure in Sequences: A linear-time algorithm

SEQUITUR is an algorithm that infers a hierarchical structure from a seq...
