Learning Social Image Embedding with Deep Multimodal Attention Networks

by Feiran Huang et al.

Learning social media data embeddings with deep models has attracted extensive research interest and enabled a wide range of applications, such as link prediction, classification, and cross-modal search. However, for social images, which contain both link information and multimodal content (e.g., text descriptions and visual content), an embedding learnt from the network structure or the data content alone yields a sub-optimal representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal content and link information. Specifically, to effectively capture the correlations between modalities, we propose a multimodal attention network that encodes the fine-grained relations between image regions and textual words. To leverage the network structure for embedding learning, we propose a novel Siamese-Triplet neural network to model the links among images. With this joint deep model, the learnt embedding captures both the multimodal content and the nonlinear network information. Extensive experiments on multi-label classification and cross-modal search demonstrate the effectiveness of our approach: compared to state-of-the-art image embeddings, the proposed DMAN achieves significant improvements on both tasks.
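The abstract names two building blocks: an attention mechanism relating image regions to textual words, and a Siamese-Triplet network that pulls linked images together in the embedding space. A minimal NumPy sketch of both ideas follows; the dimensions, the random linear "encoder", and the variable names are illustrative assumptions, not DMAN's actual (much deeper) architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Multimodal attention: fine-grained word-to-region relations ---
# Toy setup: 5 textual words and 9 image regions in a shared 64-d space.
words = rng.normal(size=(5, 64))
regions = rng.normal(size=(9, 64))

scores = words @ regions.T         # (5, 9) word-region affinity scores
weights = softmax(scores, axis=1)  # each word attends over all regions
attended = weights @ regions       # (5, 64) region summary per word

# --- Siamese-Triplet: linked images (anchor, positive) should embed
# closer together than unlinked ones (anchor, negative). The single
# shared projection W is what makes the network "Siamese".
W = rng.normal(size=(64, 32)) * 0.1

def encode(x):
    return x @ W  # same weights applied to all three inputs

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(encode(anchor) - encode(positive), axis=1)
    d_neg = np.linalg.norm(encode(anchor) - encode(negative), axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

a, p, n = (rng.normal(size=(4, 64)) for _ in range(3))
loss = triplet_loss(a, p, n)
```

Minimizing this hinge-style triplet loss by gradient descent would drive linked pairs to be at least `margin` closer than unlinked pairs, which is the standard way such link information is folded into an embedding objective.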




