Similarity Driven Approximation for Text Analytics

by   Guangyan Hu, et al.

Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of highly compressed vectors representing words and subcollections of documents. Then, at query processing time, EmApprox uses the index to guide sampling of the data set, with the probability of selecting each subcollection of documents being proportional to its similarity to the query as computed using the vector representations. We have implemented a prototype of EmApprox as an extension of the Apache Spark system, and used it to approximate three types of queries: aggregation, information retrieval, and recommendation. Experimental results show that EmApprox's similarity-guided sampling achieves much better accuracy than random sampling. Further, EmApprox can achieve significant speedups if users can tolerate small amounts of inaccuracies. For example, when sampling at 10%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22% for 90% of the queries.


page 1

page 10


Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data

Unstructured data is now commonly queried by using target deep neural ne...

Random Sampling for Group-By Queries

Random sampling has been widely used in approximate query processing on ...

Optimally Summarizing Data by Small Fact Sets for Concise Answers to Voice Queries

Our goal is to find combinations of facts that optimally summarize data ...

PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries

Range aggregate queries find frequent application in data analytics. In ...

To Ship or Not to (Function) Ship (Extended version)

Sampling is often used to reduce query latency for interactive big data ...

Electra: Conditional Generative Model based Predicate-Aware Query Approximation

The goal of Approximate Query Processing (AQP) is to provide very fast b...

MISS: Finding Optimal Sample Sizes for Approximate Analytics

Nowadays, sampling-based Approximate Query Processing (AQP) is widely re...

Please sign up or login with your details

Forgot password? Click here to reset