Language modeling has long been an important research area since Shannon (1951) estimated the information in language with next word prediction.