The research approach combines visual-audio synchronisation with speech processing. By matching stereoscopic image features, a 3D point cloud can be extracted by the image processor. A time-of-flight (ToF) depth camera will be included as a complementary sensor so the system can adapt to different interaction scenarios. The beamformer steers toward the mouth direction and optimises the array pattern for the target coordinates. The main challenge in developing this technology is aligning the beamformer filter coefficients with the image frames to improve voice processing; a compilation algorithm will therefore be developed to achieve real-time visual-audio synchronisation. In the future, this technology could be applied in service robots to enhance their audio processing, enabling the robots to respond better to users' commands.
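As a rough illustration of the beamformer-steering step, the sketch below shows how a mouth position taken from the 3D point cloud could drive a simple delay-and-sum beamformer. This is a minimal example under assumed inputs, not the project's actual algorithm: the microphone geometry, the `steering_delays`/`delay_and_sum` helper names, and the use of FFT-domain fractional delays are all illustrative choices.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air


def steering_delays(mic_positions, source_position):
    """Per-microphone delays (seconds) that align a wavefront from a source.

    mic_positions: (M, 3) array of microphone coordinates.
    source_position: (3,) point, e.g. the mouth location estimated from the
    stereo/ToF point cloud (hypothetical input for this sketch).
    """
    dists = np.linalg.norm(mic_positions - source_position, axis=1)
    # Delay each channel so every arrival lines up with the farthest mic.
    return (dists.max() - dists) / SPEED_OF_SOUND


def delay_and_sum(signals, delays, fs):
    """Delay-and-sum beamformer using FFT-domain fractional delays.

    signals: (M, N) array of microphone frames, fs: sample rate in Hz.
    Returns the (N,) beamformed output steered by `delays`.
    """
    M, N = signals.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)                 # (N//2 + 1,)
    spectra = np.fft.rfft(signals, axis=1)                 # (M, N//2 + 1)
    # Multiplying by exp(-j*2*pi*f*tau) delays each channel by tau seconds.
    phase = np.exp(-2j * np.pi * freqs * delays[:, None])
    aligned = np.fft.irfft(spectra * phase, n=N, axis=1)
    return aligned.mean(axis=0)
```

In a full system, the delays would be recomputed whenever a new image frame updates the mouth position, which is exactly where the coefficient-to-frame alignment problem described above arises.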