In complex environments, humans can understand the meaning of speech better than AI, because we use not only our ears but our eyes as well.
For example, we see someone's mouth moving and may intuitively know that the sound we hear must be coming from that person.
Meta AI is working on a new AI dialogue system that teaches AI to recognize subtle correlations between what it sees and what it hears in a conversation.
VisualVoice learns in a way similar to how humans master new skills, achieving audio-visual speech separation by learning visual and auditory cues from unlabeled videos.
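To make the idea concrete, here is a minimal sketch of audio-visual speech separation in PyTorch: a mixed-audio spectrogram and a per-frame visual cue (such as a lip-region embedding) are fused to predict a soft mask for the target speaker. The layer sizes, names, and fusion scheme are illustrative assumptions, not VisualVoice's actual architecture.

```python
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Illustrative audio-visual speech separator (not Meta's implementation)."""
    def __init__(self, n_freq_bins=257, visual_dim=512, hidden=256):
        super().__init__()
        # Encode each frame of the mixed-audio magnitude spectrogram.
        self.audio_enc = nn.Linear(n_freq_bins, hidden)
        # Encode a per-frame visual cue for the target speaker.
        self.visual_enc = nn.Linear(visual_dim, hidden)
        # Fuse the two streams over time, then predict a soft time-frequency mask.
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, mix_spec, visual_feats):
        # mix_spec:     (batch, time, n_freq_bins) mixture magnitudes
        # visual_feats: (batch, time, visual_dim)  target speaker's visual cues
        a = self.audio_enc(mix_spec)
        v = self.visual_enc(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)      # values in (0, 1)
        return mask * mix_spec            # estimated target-speaker spectrogram

# Example: recover one speaker's spectrogram from a 100-frame mixture.
model = AVSeparator()
mix = torch.rand(1, 100, 257)
lips = torch.rand(1, 100, 512)
target_spec = model(mix, lips)            # shape (1, 100, 257)
```

The key design choice this sketch illustrates is that the visual stream conditions the mask prediction, so the model separates the voice belonging to the face it is watching rather than an arbitrary speaker in the mixture.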
For machines, this means better perception; for humans, perception improves as well.
Imagine attending group meetings in the metaverse with colleagues from all over the world, then joining smaller breakout meetings as you move through the virtual space, with the reverberation and timbre of the sound in the scene adjusting to match each environment.
In other words, the system can take in audio, video, and text information at the same time and build a richer model of the environment, giving users a truly "wow" sound experience.
Post time: Jul-20-2022