The research approach combines visual-audio synchronisation with speech processing. By matching stereoscopic image features, a 3D point cloud can be extracted by the image processor. A time-of-flight (ToF) depth camera will be included as a complementary sensor so the system can adapt to different interaction scenarios. The beamformer steers toward the mouth direction and optimises the array pattern for the target coordinates. The main challenge in developing this technology is aligning the beamformer filter coefficients with the image frames to improve voice processing; a compilation algorithm will therefore be developed to achieve real-time visual-audio synchronisation. In the future, this technology could be applied in service robots to enhance their audio processing, enabling the robots to respond better to users' commands.
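As a rough illustration of the beamformer-steering step, the sketch below shows how a mouth position taken from the 3D point cloud could drive a simple delay-and-sum beamformer. This is a minimal example under assumed inputs, not the project's actual algorithm: the microphone geometry, the `steering_delays`/`delay_and_sum` helper names, and the use of FFT-domain fractional delays are all illustrative choices.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air


def steering_delays(mic_positions, source_position):
    """Per-microphone delays (seconds) that align a wavefront from a source.

    mic_positions: (M, 3) array of microphone coordinates.
    source_position: (3,) point, e.g. the mouth location estimated from the
    stereo/ToF point cloud (hypothetical input for this sketch).
    """
    dists = np.linalg.norm(mic_positions - source_position, axis=1)
    # Delay each channel so every arrival lines up with the farthest mic.
    return (dists.max() - dists) / SPEED_OF_SOUND


def delay_and_sum(signals, delays, fs):
    """Delay-and-sum beamformer using FFT-domain fractional delays.

    signals: (M, N) array of microphone frames, fs: sample rate in Hz.
    Returns the (N,) beamformed output steered by `delays`.
    """
    M, N = signals.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)                 # (N//2 + 1,)
    spectra = np.fft.rfft(signals, axis=1)                 # (M, N//2 + 1)
    # Multiplying by exp(-j*2*pi*f*tau) delays each channel by tau seconds.
    phase = np.exp(-2j * np.pi * freqs * delays[:, None])
    aligned = np.fft.irfft(spectra * phase, n=N, axis=1)
    return aligned.mean(axis=0)
```

In a full system, the delays would be recomputed whenever a new image frame updates the mouth position, which is exactly where the coefficient-to-frame alignment problem described above arises.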