Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams

by Bochen Xie, et al.

Event cameras are neuromorphic vision sensors that represent visual information as sparse, asynchronous event streams. Most state-of-the-art event-based methods project events into dense frames and process them with conventional learning models. However, these approaches sacrifice the sparsity and high temporal resolution of event data, resulting in large model sizes and high computational complexity. To fit the sparse nature of events and fully explore the implicit relationships among them, we develop a novel attention-aware framework named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams. It first converts the event stream into a voxel set and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that extracts discriminative spatiotemporal features. It consists of two components: a multi-scale neighbor embedding layer (MNEL) for local information aggregation and a voxel self-attention layer (VSAL) for global representation modeling. To enable the framework to incorporate long-term temporal structure, we introduce a segmental consensus strategy for modeling motion patterns over a sequence of segmented voxel sets. We evaluate the proposed framework on two event-based tasks: object classification and action recognition. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity. Additionally, we present a new dataset (NeuroHAR), recorded in challenging visual scenarios, to address the lack of real-world event-based datasets for action recognition.
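The first step described above, converting an event stream into a voxel set, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `events_to_voxel_set`, the `(x, y, t, p)` event layout, the fixed grid resolution, the per-polarity event counts used as voxel features, and the rule of keeping the `max_voxels` most populated voxels are all assumptions made for illustration. The abstract does not specify any of these details.

```python
import numpy as np

def events_to_voxel_set(events, sensor_size, grid, max_voxels):
    """Hypothetical sketch: bin events into a sparse set of occupied voxels.

    events      : (N, 4) array of (x, y, t, p), N >= 1 (assumed layout)
    sensor_size : (width, height) of the sensor in pixels
    grid        : (gx, gy, gt) number of voxels along x, y and time
    max_voxels  : keep at most this many voxels (assumed selection rule)
    Returns voxel coordinates (M, 3) and per-polarity counts (M, 2).
    """
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    width, height = sensor_size
    gx, gy, gt = grid

    # Map each event to integer voxel indices, clipping the upper edge.
    t_span = max(t.max() - t.min(), 1e-9)
    ix = np.minimum((x / width * gx).astype(np.int64), gx - 1)
    iy = np.minimum((y / height * gy).astype(np.int64), gy - 1)
    it = np.minimum(((t - t.min()) / t_span * gt).astype(np.int64), gt - 1)
    flat = (it * gy + iy) * gx + ix

    # Count events per voxel, separately for negative (0) and positive (1) polarity.
    pol = (p > 0).astype(np.int64)
    counts = np.zeros((gx * gy * gt, 2), dtype=np.float32)
    np.add.at(counts, (flat, pol), 1.0)

    # Keep only occupied voxels, most populated first, so the output stays sparse.
    total = counts.sum(axis=1)
    occupied = np.flatnonzero(total)
    order = np.argsort(-total[occupied], kind="stable")
    keep = occupied[order[:max_voxels]]

    coords = np.stack([keep % gx, (keep // gx) % gy, keep // (gx * gy)], axis=1)
    return coords, counts[keep]
```

Only occupied voxels are returned, so the output size scales with scene activity rather than with sensor resolution, which is the property the abstract contrasts with dense-frame projections.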



EV-VGCNN: A Voxel Graph CNN for Event-based Object Classification

Event cameras report sparse intensity changes and hold noticeable advant...

Spatiotemporal Filtering for Event-Based Action Recognition

In this paper, we address the challenging problem of action recognition,...

Event Transformer+. A multi-purpose solution for efficient event data processing

Event cameras record sparse illumination changes with high temporal reso...

Invariant feature extraction from event based stimuli

We propose a novel architecture, the event-based GASSOM for learning and...

Multi-axis Attentive Prediction for Sparse Event Data: An Application to Crime Prediction

Spatiotemporal prediction of event data is a challenging task with a lon...

Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification

Recognizing target objects using an event-based camera draws more and mo...

Point-Voxel Absorbing Graph Representation Learning for Event Stream based Recognition

Sampled point and voxel methods are usually employed to downsample the d...
