Google keeps moving forward with new technology, and not only in products like Google Glass and Android devices. It is also improving existing products such as Google Voice Search, with better systems for analysing, predicting, and recognising voice commands in a variety of environments.
One such improvement has now landed in Google Voice Search: Connectionist Temporal Classification (CTC) and sequence discriminative training techniques, which extend the recurrent neural networks (RNNs) used for voice recognition. Here’s what Google said about the system it has just started using:
RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud – “museum” – it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
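To make the "feedback loop" idea concrete, here is a minimal sketch of a plain recurrent step in Python. It is a toy, not Google's model: it uses a single scalar hidden state rather than an LSTM with memory cells and gates, and the function names (`rnn_step`, `run_rnn`) and the weights `w_x`, `w_h`, `b` are arbitrary assumptions chosen for illustration. The point is only that the hidden state from the previous time step is fed back in, so the same input can produce a different result depending on what came before it.

```python
import math

def rnn_step(x_t, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """One recurrent step: the new state depends on the input AND the old state.

    The weights are arbitrary illustrative values, not trained parameters.
    """
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the recurrence and return every hidden state."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)  # feedback loop: h is passed back in
        states.append(h)
    return states

# Both sequences end with the same input (0.5), but the context differs.
after_context = run_rnn([1.0, 1.0, 0.5])[-1]
no_context = run_rnn([0.0, 0.0, 0.5])[-1]
print(after_context > no_context)  # True: earlier inputs shifted the final state
```

In the "museum" example, this is roughly why an RNN can treat an /u/ that follows /m/ and /j/ differently from an /u/ heard in isolation. An LSTM adds gating on top of this basic recurrence so that useful context survives over many more time steps.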
Here’s what Google was using after 2012:
In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model… The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example – /m j u z i @ m/ in phonetic notation – it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
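The framing step described in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16 kHz sampling rate is assumed (the post does not give one), the helper names `split_into_frames` and `magnitude_spectrum` are hypothetical, and a naive DFT stands in for whatever frequency analysis a production recognizer really uses. The waveform is a synthetic 1 kHz tone rather than recorded speech.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not stated in the post)
FRAME_MS = 10         # frame length taken from the quoted description

def split_into_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a list of samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2: a simple 'frequency content' vector."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

# One second of a 1 kHz sine tone as a stand-in for a spoken waveform.
waveform = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

frames = split_into_frames(waveform)
spectrum = magnitude_spectrum(frames[0])  # feature vector for the first frame

# Locate the strongest frequency bin and convert it back to Hz.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / len(frames[0])
print(len(frames), len(frames[0]), peak_hz)  # 100 160 1000.0
```

One second of audio becomes 100 frames of 160 samples each, and every frame yields one feature vector. In a recognizer like the one described, it is this sequence of per-frame vectors, not the raw waveform, that is passed to the acoustic model.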