Given the neighboring words, guess the pivot word (CBOW); or the reverse: given the pivot, guess its neighbors (skip-gram).
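A minimal sketch of both objectives, assuming word2vec via gensim (gensim >= 4 and the toy corpus are my assumptions, not from these notes):

```python
# sg=0 trains CBOW (context -> pivot), sg=1 trains skip-gram (pivot -> context).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                     # (50,) dense word vector
print(skipgram.wv.most_similar("cat", topn=2))  # nearest neighbors in vector space
```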
Basically the same setup, but a per-document paragraph vector is trained alongside the word vectors and added to the neighboring-word context when predicting the pivot, so each document gets its own embedding.
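A sketch with gensim's Doc2Vec, assuming this describes the PV-DM variant (dm=1), where the paragraph vector joins the context words; corpus and tags are made up:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
    TaggedDocument(words=["the", "dog", "ate", "my", "homework"], tags=[1]),
]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

print(model.dv[0].shape)                        # (50,) paragraph vector for doc 0
print(model.infer_vector(["a", "cat", "ran"]))  # vector for an unseen document
```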
Transformer w/ self-attention; takes the whole document at once. Two pretraining tasks: (1) masked LM: randomly picks ~15% of tokens (mostly replaced by [MASK], some by a random word, a few left as-is) and tries to guess the originals; (2) next-sentence prediction: guess whether the second sentence actually follows the first. VERY resource-hungry.
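A toy sketch of the masked-LM corruption rule only (numbers from the BERT paper; the transformer itself and next-sentence prediction are omitted, and the vocabulary is made up):

```python
# Select ~15% of tokens as prediction targets; of those, 80% become [MASK],
# 10% become a random word, 10% stay unchanged.
import random

VOCAB = ["dog", "cat", "ran", "sat", "mat", "the"]

def mask_tokens(tokens, select_prob=0.15):
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must recover
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(VOCAB)
            # else: keep the original token (model still predicts it here)
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split()))
```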
Context-dependent embeddings from a bidirectional LSTM language model: each word's vector depends on the whole sentence around it.
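A sketch assuming this is ELMo, via the legacy allennlp 0.x ElmoEmbedder (which downloads pretrained biLSTM-LM weights on first use):

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # default pretrained 2-layer bidirectional LSTM LM
vecs = elmo.embed_sentence(["I", "ate", "an", "apple"])
print(vecs.shape)  # (3 layers, 4 tokens, 1024)

# The same word gets a different vector in a different sentence;
# that's the context-dependent part.
```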
Character-level, context-dependent, LSTM-based. Looks cool, but why so few mentions?
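A sketch assuming this refers to Flair-style embeddings; 'news-forward' names one of the flair library's pretrained character-level LSTM language models, downloaded on first use:

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings("news-forward")
sentence = Sentence("I love Berlin")
embedding.embed(sentence)  # vectors computed from characters, in sentence context

for token in sentence:
    print(token.text, token.embedding.shape)
```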