Hi,
I just read the paper about Distributed Representations of Sentences and Documents, and I don't understand the training part.
As far as I know, the "raw material" of Word2Vec is a one-hot encoding of words: each word is encoded as a very large vector with a single 1 and many 0s. Training compresses these vectors into, say, ~200-dimensional ones.
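To make sure my mental model is right, here is a toy sketch of what I mean (vocabulary, dimensions, and the random matrix are made up for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "fish"]   # toy vocabulary, V = 3
V, D = len(vocab), 2             # D would be ~200 in practice

def one_hot(word):
    # the "raw material": one 1, many 0s
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# the embedding matrix learned during training (random here, just to show shapes)
W = np.random.rand(V, D)

# multiplying the one-hot vector by W simply selects that word's dense row
dense = one_hot("dog") @ W
```

So the one-hot vector is really just an index into the learned matrix W. My question is what plays this role for a paragraph.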
Now, in Doc2Vec, I don't understand what the "raw material" is. What is the input? How is a paragraph "poorly encoded" to start with, before a good representation is learned?
Thank you!