TS-RGBD Dataset: a Novel Dataset for Theatre Scenes Description for People with Visual Impairments

by Leyla Benhamida, et al.

Computer vision has long been used to help visually impaired people navigate their environment and avoid obstacles and falls. However, existing solutions address either indoor or outdoor scenes, restricting the kinds of places visually impaired people can access, including entertainment venues such as theatres. Furthermore, most proposed computer-vision-based methods rely on RGB benchmarks to train their models, resulting in limited performance due to the absence of the depth modality. In this paper, we propose TS-RGBD, a novel RGB-D dataset of theatre scenes with ground-truth human action and dense caption annotations for image captioning and human action recognition. It includes three types of data: RGB, depth, and skeleton sequences, captured with Microsoft Kinect. We evaluate image captioning models as well as skeleton-based human action recognition models on our dataset, in order to extend the range of environments accessible to visually impaired people by detecting human actions and textually describing regions of interest in theatre scenes.
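The abstract describes each sample as combining RGB, depth, and skeleton data with action and caption annotations. As a rough illustration only, the sketch below defines a hypothetical per-sample container; the field shapes assume Microsoft Kinect v2 output (1920x1080 RGB, 512x424 depth, 25 skeleton joints), and the actual TS-RGBD file layout, class names, and labels are not specified in the abstract.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TSRGBDSample:
    """Hypothetical container for one theatre-scene sample.

    Shapes assume Kinect v2: 1920x1080 colour, 512x424 depth,
    25-joint skeletons; these are illustrative assumptions.
    """
    rgb: np.ndarray       # (H, W, 3) uint8 colour frame
    depth: np.ndarray     # (H, W) uint16 depth map, in millimetres
    skeleton: np.ndarray  # (T, 25, 3) joint (x, y, z) positions over T frames
    caption: str          # dense-caption annotation for regions of interest
    action: str           # ground-truth human action label

def make_dummy_sample() -> TSRGBDSample:
    # Synthetic stand-in, useful for wiring up a data-loading pipeline
    # before the real recordings are available.
    return TSRGBDSample(
        rgb=np.zeros((1080, 1920, 3), dtype=np.uint8),
        depth=np.zeros((424, 512), dtype=np.uint16),
        skeleton=np.zeros((30, 25, 3), dtype=np.float32),
        caption="an actor stands at centre stage",
        action="standing",
    )
```

A captioning model would consume `rgb` (optionally fused with `depth`), while a skeleton-based action recognizer would consume only the `skeleton` array.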



