Standard neural network-based speech recognition pipelines (e.g. RNN + CTC) almost always use a language model. Unlike a seq2seq model (or any autoregressive or structured prediction model), a CTC model treats its output timesteps as conditionally independent given the input audio. Hence, an RNN LM, an n-gram LM, or both are typically used when retrieving probable sequences from a CTC model (e.g. with beam search).
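
To make the decoding step concrete, here is a minimal sketch of CTC prefix beam search with an external LM folded in at pruning time (a simple form of shallow fusion). The `lm_score` callable, the `alpha` weight, and the character-level alphabet are illustrative assumptions, not any particular library's API.

```python
import math
from collections import defaultdict

import numpy as np

NEG_INF = -float("inf")


def logsumexp(*args):
    """Stable log-sum-exp over scalar log-probabilities."""
    if all(a == NEG_INF for a in args):
        return NEG_INF
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))


def ctc_beam_search(log_probs, alphabet, lm_score, beam_size=8, alpha=0.5):
    """Prefix beam search over CTC output.

    log_probs: (T, V) per-frame log-probabilities; index 0 is the blank.
    alphabet:  list of symbols, alphabet[i] corresponds to class i + 1.
    lm_score:  callable mapping a prefix (tuple of symbols) to a log-probability.
    alpha:     LM weight used when ranking beam candidates (assumed interface).
    """
    T, V = log_probs.shape

    def score(item):
        # CTC score of the prefix plus weighted LM score, used for pruning.
        prefix, (p_b, p_nb) = item
        return logsumexp(p_b, p_nb) + alpha * lm_score(prefix)

    # Each entry maps prefix -> (log p ending in blank, log p ending in non-blank).
    beam = {(): (0.0, NEG_INF)}

    for t in range(T):
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for c in range(V):
                p = log_probs[t, c]
                if c == 0:  # blank: prefix unchanged, now ends in blank
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                sym = alphabet[c - 1]
                new_prefix = prefix + (sym,)
                nb_b, nb_nb = next_beam[new_prefix]
                if prefix and sym == prefix[-1]:
                    # A repeated symbol only extends the prefix if a blank separates it.
                    next_beam[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # Otherwise it collapses onto the same prefix.
                    sb_b, sb_nb = next_beam[prefix]
                    next_beam[prefix] = (sb_b, logsumexp(sb_nb, p_nb + p))
                else:
                    next_beam[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))

        # Keep only the top `beam_size` prefixes under CTC + LM score.
        beam = dict(sorted(next_beam.items(), key=score, reverse=True)[:beam_size])

    best_prefix, _ = max(beam.items(), key=score)
    return "".join(best_prefix)


# Toy usage: 4 frames, blank + {'a', 'b'}, and a flat (uniform) LM.
logits = np.random.randn(4, 3)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_beam_search(log_probs, alphabet=["a", "b"], lm_score=lambda p: 0.0))
```

In this sketch the LM only influences which prefixes survive pruning; a tighter integration would fold the per-character LM probability into the prefix update itself, but the pruning-time version keeps the example short while still showing why an external LM can reorder CTC hypotheses despite the per-frame independence assumption.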