Char/Word/Sent/Doc Embedding Models
Comparison of a few embedding algorithms used in natural language processing (NLP) tasks; minimal training sketches for several of these follow the table.
| Name | vectorizes... | derived from | description | Can be trained with... |
| --- | --- | --- | --- | --- |
| word2vec | word | | Given the neighboring words, guess the pivot word (CBOW); or the other way around (skip-gram). | gensim |
| doc2vec | paragraph | word2vec | Basically word2vec with an extra paragraph vector added to the neighboring words while training. | gensim |
| fastText | word | word2vec | Represents a word as a bag of sub-word n-grams, so it can also embed out-of-vocabulary words. | gensim |
| GloVe | word | | Factorizes a global word-word co-occurrence matrix instead of sliding a local context window. | glove-python |
| BERT | sub-word token | | Transformer w/ self-attention; takes the whole document at once; trained by (1) randomly masking out ~15% of the tokens (a few of them replaced with random words instead) and guessing the originals, and (2) predicting whether one sentence follows another. VERY resource-hungry. | |
| ELMo | word | | Context-dependent word embeddings from a bidirectional LSTM language model. | |
| Flair | char | | Char-level, context-dependent LSTM. Looks cool, but why so few mentions? | |
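A minimal training sketch for the gensim-backed rows (word2vec, doc2vec, fastText), assuming gensim 4.x (older releases spell `vector_size` as `size`); the toy corpus is obviously made up:

```python
from gensim.models import Word2Vec, Doc2Vec, FastText
from gensim.models.doc2vec import TaggedDocument

# Toy corpus; a real one would be thousands of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# word2vec: sg=0 is CBOW (neighbors -> pivot), sg=1 is skip-gram (pivot -> neighbors).
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)
print(w2v.wv["cat"])  # a 100-dim word vector

# doc2vec: each paragraph gets its own tag, and hence its own vector.
docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1)
print(d2v.infer_vector(["a", "cat", "on", "a", "mat"]))  # vector for an unseen paragraph

# fastText: builds word vectors from character n-grams (min_n..max_n),
# so it can embed words it never saw during training.
ft = FastText(sentences, vector_size=100, min_n=3, max_n=6, min_count=1)
print(ft.wv["cats"])  # works even though "cats" is out-of-vocabulary
```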
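For GloVe, a sketch against glove-python as named in the table; the exact argument names (`no_components`, `no_threads`) are my recollection of that package's API, so treat them as assumptions:

```python
from glove import Corpus, Glove

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# Build the global word-word co-occurrence matrix.
corpus = Corpus()
corpus.fit(sentences, window=5)

# Factorize it into word vectors.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=2)
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("cat", number=3))
```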
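The table doesn't name a tool for BERT; one option (my assumption, not from these notes) is Hugging Face's transformers package, where you'd normally load a pre-trained checkpoint instead of pre-training yourself:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# BERT embeds sub-word tokens in context: the two "bank"s get different vectors.
batch = tokenizer("the bank by the river bank", return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

print(out.last_hidden_state.shape)  # (1, num_tokens, 768)
```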
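For ELMo, older AllenNLP releases shipped an `ElmoEmbedder` convenience class; this sketch assumes that legacy API, which newer AllenNLP versions have dropped:

```python
from allennlp.commands.elmo import ElmoEmbedder

# Downloads the pre-trained ELMo weights on first use.
elmo = ElmoEmbedder()

# One vector per token per biLSTM layer: shape (3 layers, 5 tokens, 1024 dims).
vectors = elmo.embed_sentence(["the", "cat", "sat", "on", "mat"])
print(vectors.shape)
```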
There are also many sentence embedding algorithms that are worth looking at: https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a.
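One concrete member of that family (my pick, not from the linked article) is the sentence-transformers package, which wraps pre-trained sentence encoders:

```python
from sentence_transformers import SentenceTransformer

# The model name is just an example; any pre-trained encoder from the hub works.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The cat sat on the mat.", "A feline rested on a rug."])
print(embeddings.shape)  # (2, 384) for this model
```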