
    [Roundup] Google's Billion-Word Language Model

    Posted by 我爱机器学习 (52ml.net) on 2016-09-24 15:15:30

    Paper: Exploring the Limits of Language Modeling

    Authors: Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
    GitHub: https://github.com/tensorflow/models/tree/master/lm_1b
    Abstract:

    In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such as character Convolutional Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark. Our best single model significantly improves state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.
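
    The headline numbers in the abstract are corpus perplexities, i.e. the exponential of the average per-token negative log-likelihood on held-out text. A minimal sketch of that computation (the token probabilities below are made-up placeholders, not model output):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average per-token negative log-likelihood."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Made-up natural-log probabilities a model assigned to four tokens of held-out text.
log_probs = [math.log(p) for p in [0.05, 0.20, 0.01, 0.10]]
print(round(perplexity(log_probs), 1))  # ~17.8; lower is better (lm_1b reports 30.0 on the benchmark)
```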


    Commentary


    Dissecting Google’s Billion Word Language Model Part 1: Character Embeddings

    [Figure: lm_1b architecture diagram]

    1. Background – language models
    2. The lm_1b architecture
    3. Char CNN?
    4. Character embeddings?
    5. Vector math
    6. Vector math – for real this time
    7. Making sense of it all
    8. Generalizing over characters?
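
    Sections 3 and 4 of that outline cover how lm_1b builds word representations from characters with a small CNN rather than a plain word-embedding lookup. Below is a minimal sketch of that general idea in tf.keras; the character vocabulary size, embedding dimension, filter widths, filter counts, and activation are illustrative assumptions, not the configuration used in the paper.

```python
import tensorflow as tf

MAX_WORD_LEN = 16   # characters per word, padded/truncated (illustrative)
CHAR_VOCAB = 256    # byte-level character ids (illustrative)
CHAR_DIM = 16       # character embedding size (illustrative)

# Input: one word given as a sequence of character ids.
chars = tf.keras.Input(shape=(MAX_WORD_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(CHAR_VOCAB, CHAR_DIM)(chars)

# Convolutions of several widths over the character sequence, each max-pooled
# over positions, then concatenated into a single word vector.
pooled = []
for width, n_filters in [(2, 32), (3, 64), (4, 128)]:
    conv = tf.keras.layers.Conv1D(n_filters, width, activation="relu")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
word_vector = tf.keras.layers.Concatenate()(pooled)

char_cnn = tf.keras.Model(chars, word_vector)
char_cnn.summary()  # (batch, 16) char ids -> (batch, 224) word embeddings
```

    Because the word vector is computed from characters rather than looked up, spelling-related structure can carry over to unseen words, which is what the "Generalizing over characters?" section of the write-up explores.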

    A Billion Words and The Limits of Language Modeling

    RNN-based models

    1. Captures long-range history instead of making a fixed-order Markov assumption (see the sketch after this list)
    2. Competitive perplexity
    3. Can control/limit the number of parameters in the RNN
    4. Does well on rare words
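
    Point 1 is the core contrast with n-gram models: an n-gram model conditions on a fixed-length suffix of the history, while an RNN folds the entire prefix into its state. A toy illustration (the step function here is a stand-in, not a real RNN cell):

```python
def trigram_context(history):
    """Fixed-order Markov: only the last two tokens condition the next-word distribution."""
    return tuple(history[-2:])

def recurrent_context(history, step, state=0):
    """Recurrent: the state is updated by every token, so no part of the prefix is discarded."""
    for token in history:
        state = step(token, state)
    return state

history = ["the", "cat", "that", "chased", "the", "mouse", "was"]
print(trigram_context(history))                                   # ('mouse', 'was'): earlier words are gone
print(recurrent_context(history, lambda tok, s: hash((tok, s))))  # depends on every word in the prefix
```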

    N-gram models

    1. Super fast to train: training takes at least an order of magnitude fewer hours.
    2. Works well for small quantities of data. With the right smoothing/priors, you can get decent results from a fraction of the data. Say the government comes to you asking for a language model in Dari or Pashto and you have only a few hundred thousand words; you really cannot do better than an n-gram model and a visit to your friendly neighborhood linguistics department. (Those languages are just an example; given the interest, we quite likely have enough data for them by now.)
    3. Still the best option to use in combination: even when you have tonnes of data, you want to interpolate the n-gram model with the LSTM, as that usually gives better results than the LSTM alone, and the gain persists as you add more state to the LSTM (a sketch of the interpolation follows this list). Another case in point is the speech recognition system from MSR that is in the news for the best word error rate: it also uses an RNN interpolated with an n-gram model.
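
    The interpolation in point 3 is a linear mix of the two models' next-word distributions. A minimal sketch (the probabilities and the weight lam below are placeholders; lam is normally tuned on held-out data):

```python
def interpolate(p_ngram, p_lstm, lam=0.5):
    """P(w | h) = lam * P_ngram(w | h) + (1 - lam) * P_lstm(w | h)."""
    return lam * p_ngram + (1.0 - lam) * p_lstm

# Placeholder next-word probabilities for one candidate word under each model.
print(interpolate(p_ngram=0.012, p_lstm=0.020, lam=0.3))  # 0.0176
```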

    The Exploring the Limits paper counters this argument: they suggest n-gram interpolation is not necessary, and that with careful tuning you can find an LSTM+CNN architecture whose results are competitive with an interpolated model. That's just an exercise in exploring the limits of your patience or your GPU infrastructure.

    I'm going to leave you all with a decision tree for what to do when you face building LMs at your startup, where LMs are a means to an end for the problem you are solving.

    [Figure: decision tree for choosing a language-modeling approach]


