dna2vec: Consistent vector representations of variable-length k-mers

01/23/2017
by Patrick Ng, et al.

A ubiquitous way to represent a long DNA sequence is to divide it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, every pair of distinct one-hot vectors is equidistant, so the encoding carries no notion of similarity between k-mers. This is particularly problematic when applying the latest machine learning algorithms to problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which is trained using a shallow two-layer neural network. Our experiments provide evidence that summing dna2vec vectors is akin to nucleotide concatenation. We also demonstrate that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec vectors.
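
The two drawbacks of one-hot k-mer encoding mentioned above can be illustrated with a minimal sketch. It is not the dna2vec method itself, only a demonstration of the problem that motivates it. The helper names (`kmers`, `one_hot`, `euclidean`) and the stride-1 sliding window are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def kmers(sequence, k):
    """Split a sequence into overlapping k-mers (assumed stride of 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def one_hot(kmer):
    """One-hot encode a k-mer over the vocabulary of all 4**k k-mers."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=len(kmer))]
    vec = [0] * len(vocab)
    vec[vocab.index(kmer)] = 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


fragments = kmers("ACGTAC", 3)

# Dimensionality grows as 4**k: 64 dimensions already for k = 3.
dim = len(one_hot("ACG"))

# "ACG" and "ACT" differ in one nucleotide; "ACG" and "TTA" differ in all
# three. The one-hot distance is the same in both cases.
d_close = euclidean(one_hot("ACG"), one_hot("ACT"))
d_far = euclidean(one_hot("ACG"), one_hot("TTA"))
```

Under these assumptions, `d_close` and `d_far` are equal, which is why a learned dense embedding, in which distances can reflect sequence similarity, is attractive.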
