Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

by   Ravi Shankar Mishra, et al.

In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to best results, showing 2.7 Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3 best unsupervised embeddings.


LaTeX-Numeric: Language-agnostic Text attribute eXtraction for E-commerce Numeric Attributes

In this paper, we present LaTeX-Numeric - a high-precision fully-automat...

Enhancing the Input Representation: From Complexity to Simplicity

We introduce an efficient algorithm for mining informative combinations ...

Recognizing Variables from their Data via Deep Embeddings of Distributions

A key obstacle in automated analytics and meta-learning is the inability...

Towards Open-World Product Attribute Mining: A Lightly-Supervised Approach

We present a new task setting for attribute mining on e-commerce product...

Attribute Extraction from Product Titles in eCommerce

This paper presents a named entity extraction system for detecting attri...

Heterogeneous Entity Matching with Complex Attribute Associations using BERT and Neural Networks

Across various domains, data from different sources such as Baidu Baike ...

Please sign up or login with your details

Forgot password? Click here to reset