CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

by Rabindra Lamsal, et al.

Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models such as BERT and RoBERTa have proven successful across a wide range of NLP tasks, but they are not tailored to crisis-related texts. Likewise, general-purpose sentence encoders are used to generate sentence embeddings without regard for the textual complexities of crisis-related texts. Advances in applications such as text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or in real time. To address these gaps in the crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets on classification tasks, and our best-performing sentence encoder improves on the state of the art by 17.43% in sentence encoding tasks. We also investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released, with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.
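Sentence encoders of this kind map a tweet to a single dense vector that can then be used for classification, semantic search, or clustering. A common way to derive such a vector from a transformer's token embeddings (assumed here for illustration; the paper does not specify that this is the authors' exact pooling strategy) is attention-mask-weighted mean pooling, which averages only the non-padded token positions. A minimal NumPy sketch:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) array of per-token vectors
    attention_mask:   (batch, seq_len) array of 1s (real tokens) and 0s (padding)
    """
    mask = attention_mask[..., None].astype(float)        # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # sum of real tokens
    counts = mask.sum(axis=1).clip(min=1e-9)              # avoid divide-by-zero
    return summed / counts                                # (batch, dim)

# Toy batch of one "sentence": two real tokens, one padded position.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # -> [[2. 3.]]
```

The padded position (the `[100, 100]` vector) does not contaminate the sentence embedding because the mask zeroes it out before summing; only the two real tokens are averaged.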



RoBERTuito: a pre-trained language model for social media text in Spanish

Since BERT appeared, Transformer language models and transfer learning h...

Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity

Amidst the sharp rise in the evaluation of large language models (LLMs) ...

Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts

User-generated texts available on the web and social platforms are often...

Estimating Confidence of Predictions of Individual Classifiers and Their Ensembles for the Genre Classification Task

Genre identification is a subclass of non-topical text classification. T...

Generating Informative Conclusions for Argumentative Texts

The purpose of an argumentative text is to support a certain conclusion....

PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification

Large language models (LLMs) have witnessed a meteoric rise in popularit...

Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media

Stance detection predicts attitudes towards targets in texts and has gai...
