Deng Huizhen and Zhang zhaogong, Heilongjiang University of China, China
Connectionist temporal classification (CTC) has been successfully applied to end-to-end speech recognition tasks, but its main body recurrent neural network makes parallelization very difficult. Since the attention mechanism has shown very good performance on a series of tasks such as machine translation, handwriting synthesis, and image caption generation for loop sequence generators conditioned on input data. This paper applies the attention mechanism to CTC, and proposes a connectionist temporal classification based on the local self-attention mechanism, in which the cyclic neural network module in the traditional CTC model is replaced by the self-attention module. It shows that it is attractive and competitive in end-to-end speech recognition. The proposed mechanism is based on local self-attention, which uses a sliding mechanism to obtain acoustic features locally. This mechanism effectively models long-term scenarios by stacking multiple sliders to obtain a larger receiving field to achieve online decoding. Moreover, the CTC training joint cross-entropy criterion makes the model converge better. We have completed experiments on the AISHELL-1 dataset. The experiments show that the basic model has a lower character error rate than the existing state-of-the-art models, and the model after cross entropy has been further improved.
Connectionist temporal classification, self-attention mechanism, cross entropy, Speech Recognition.