What is the Best Automated Metric for Text to Motion Generation?

by Jordan Voas, et al.

There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since a single description is compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments at the sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric offers substantial benefits over all current alternatives.

