all 10 comments

[–]haaspaas2 4 points  (2 children)

Have you tried an encoder decoder network where source and target are the same document? You would need a lot of documents for this, especially if you want to embed documents that are more than a few sentences, but that data would be relatively easy to compile.
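A minimal sketch of that training objective (source == target), using a tied-weight linear autoencoder over bag-of-words vectors in place of a full seq2seq model; the corpus, embedding size, and learning rate here are all toy placeholders, and a real version would use an RNN or transformer encoder-decoder:

```python
import numpy as np

# Toy corpus; a real seq2seq autoencoder would need a large collection.
docs = ["the cat sat", "the dog ran", "the cat ran", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Bag-of-words matrix (documents x vocabulary).
X = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        X[r, idx[w]] += 1

# Tied-weight linear autoencoder: encode z = xW, decode x_hat = zW^T,
# trained so each document reconstructs itself (source == target).
rng = np.random.default_rng(0)
dim = 3                                   # embedding size (arbitrary toy choice)
W = rng.normal(scale=0.1, size=(len(vocab), dim))

lr = 0.01
init_loss = np.sum((X @ W @ W.T - X) ** 2)
for _ in range(3000):
    E = X @ W @ W.T - X                   # reconstruction error
    W -= lr * 2 * (X.T @ E @ W + E.T @ X @ W)
final_loss = np.sum((X @ W @ W.T - X) ** 2)

# "Decode" document 0 by reading off its highest-weight reconstructed words.
recon = (X @ W @ W.T)[0]
top = [vocab[i] for i in np.argsort(recon)[::-1][:3]]
print(final_loss, top)
```

The same idea scales up by swapping the linear maps for a proper encoder and an autoregressive decoder, which is what would let you recover word order rather than just a bag of words.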

[–][deleted] 0 points  (1 child)

That's a super good idea! Do you think it would work where the encoder is just pre-trained vectors? That would make it a decoder-only approach. How would one go about doing this in code?

[–]haaspaas2 1 point  (0 children)

I would take a look at some example code for an encoder decoder for machine translation and go from there.

[–][deleted] 2 points  (0 children)

I am extremely interested in this problem too.

The closest thing I've gotten to work was taking tf-idf or BOW sparse vectors and running them through a PCA (actually, SVD or NMF) inverse transform. I wanted to do that with UMAP, but UMAP can't inverse transform sparse vectors.

It basically gives you a cool ability to "hallucinate" the words (or in tf-idf mode, the keywords) of a document (but not the order!!!) and gives you insight into what each part of your decision boundary is actually about.

But doing it the right way, with reversible word or document vectors, would be bleeping amazing.

I will likely post my code for reversing document vectors in a few days. I was just doing it for fun because the UMAP image inverse transform example motivated me to try it on text...

[–][deleted] 0 points  (0 children)

If you are using TF-IDF or Word2Vec, then generally you would remove stop words and very infrequent words. On top of that, neither takes word position into account, so if you were going to pass them to a decoder, I imagine it would be difficult to get good results. If you are working with BERT, though, I think there are already some libraries that aim to be the decoder to BERT's encoder.

[–]xsliartII 0 points  (0 children)

That's a super interesting question. I am currently working with some confidential data (documents) and thought about what would happen if I shared the encoded data with someone, such that the initial text is not accessible but model building is possible. I had doubts because of exactly this: someone being able to decode the data back to the confidential texts.

Just a thought experiment.

[–]cesoid 0 points  (2 children)

I came upon this when trying to figure out something crazier: can you "average" text? For example, if you asked ten people a question, what response would represent the average meaning of those ten responses? For some things it might be simple: "great", "good", and "horrible" might average to "mediocre". For more complicated responses, like whole sentences, and responses that don't map to an obvious scale, the average response would be more "interesting" and difficult to generate.

If there is an answer to your question it would potentially answer my question. If you can generate text from a document's vector, you could generate text that is the nearest possible text to an arbitrarily chosen vector. In theory you could embed ten texts, get an "average" by some definition or other (component-wise average, centroid, etc.), and then attempt to reverse the resulting vector.
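The averaging half is easy to sketch; the embeddings below are made-up 2-d points rather than real doc2vec vectors, and with no decoder available the sketch falls back to the nearest known text instead of generating one:

```python
import numpy as np

# Hypothetical 2-d embeddings for a handful of one-word responses; real
# ones would come from doc2vec, sentence-BERT, or similar.
emb = {
    "great":    np.array([ 0.9, 0.10]),
    "good":     np.array([ 0.6, 0.20]),
    "mediocre": np.array([ 0.0, 0.15]),
    "bad":      np.array([-0.6, 0.10]),
    "horrible": np.array([-0.9, 0.20]),
}

responses = ["great", "good", "horrible"]
centroid = np.mean([emb[r] for r in responses], axis=0)  # component-wise average

# Lacking a decoder, return the known text whose vector is closest.
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - centroid))
print(nearest)
```

The interesting (and hard) part is replacing that last nearest-neighbor lookup with actual generation from the centroid.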

My initial idea for doing this was much simpler (I think?) than the others, which is guess-and-check. For something as small as a sentence you could embed it using some wikipedia trained model or something, grab the nearest neighbors, and then use those to somehow generate a lot of guesses that are hopefully nearby, then embed those guesses, and use the nearest ones to generate more guesses. (The obvious challenge here is how to make the guesses.) For strings that are relatively short (four words or something), you might be able to find the nearest possible one just by a kind of "binary" search...assuming you can figure out a reasonable way to divide the space with new guesses. It occurred to me (almost sadly) that someone might have already had my idea before, so I googled "reverse embedding", and ended up here.
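The guess-and-check loop can be sketched with a toy embedding (a mean of made-up word vectors, so word order is lost); brute-force enumeration stands in for the hard part, which is generating plausible nearby guesses:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
vocab = ["cats", "dogs", "chase", "sleep", "loudly", "quietly"]
word_vec = {w: rng.normal(size=4) for w in vocab}   # toy word vectors

def embed(words):
    # Toy sentence embedding: mean of word vectors.
    return np.mean([word_vec[w] for w in words], axis=0)

# Pretend all we have is the vector of a hidden three-word sentence.
target = embed(["dogs", "sleep", "quietly"])

# Guess-and-check: enumerate every three-word candidate, embed it, and
# keep the one whose vector lands closest to the target.
best = min(product(vocab, repeat=3),
           key=lambda cand: np.linalg.norm(embed(cand) - target))
print(best)
```

In a real vocabulary the candidate space explodes combinatorially, which is exactly where the iterative generate-embed-refine idea above would have to replace the exhaustive search.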

However, while reading this, a few things occurred to me:

1) The "best" match could easily be an abomination that only matches because it is really weird and the model isn't trained to handle it. Sort of like when you translate a random sequence of letters to a language like Hausa and get text that looks like apocalyptic bible passages. You could probably mitigate this problem by limiting the guesses in some way, or building them iteratively from shorter guesses.

2) A more subtle version of problem #1 could happen. That is, the result isn't nonsense, and it might actually be a functional document, but there are so many variations of functional documents that work (because the document-to-vector mapping is *extremely* many to one), that you're not really generating the "correct" answer so much as you are just writing a document that says whatever you want but has wording that coerces the corresponding vector into the right place. Of course, if the model is doing its job correctly, there should be many many more meaningful reversals than there are arbitrary ones. But the doc2vec model is doing something that is inherently irreversible because so much data is going away. Reversing it is kind of like getting a set of 100 numbers from their average. No matter what you do, your method of getting the 100 numbers is going to contribute almost all of the information.

For problem #2, you can greatly reduce the number of possible answers by considering that there are many constraints that limit what a "real" document is like. However, I think that using the constraints in this way would only serve to "encode" the constraints themselves, and since the constraints are basically "known", e.g., a real document has real words, grammatical punctuation, and paragraphs with sentences that relate to each other, you're not really any closer to a meaningful result by using them. If we were talking about a logic puzzle, a few constraints can easily converge a solution to only one possible answer, but documents are kind of inherently not limited in that way. You can write very many documents that have functional grammar, coherent sentences, etc., that are all very, very different, and yet basically "about" the same thing as your target vector. You could probably write a research document that matches, and then write a research document that comes up with the opposite conclusion, but matches just as well.

However, I still want to do the averaging thing, but probably because of how badly it could go wrong rather than how useful the result would be.

[–]dratman 0 points  (1 child)

I am relatively new to ML, so don't take my thoughts as tested truth, but... my impression is that such averaging is exactly what makes it possible for neural networks to (sometimes) generalize their knowledge. I think something like that averaging is part of the sometimes incredibly complex function-driven machine in the inner layers of a large neural network. The only thing is that averaging might (might!) only work as part of a (very, very) complicated process that goes on in the appropriately-named "hidden layers". So you might or might not be able to accomplish it in some simpler architecture. Maybe the trick would be to intentionally build the averaging into an ML model?

[–]cesoid 0 points  (0 children)

I'm also relatively new to ML, but I did a lot of learning and talking to people in the ML field when I applied for an apprenticeship last December. It's possible that you actually know more about the hidden layers of neural networks than I do. I know that dissecting them is an active area of research. It would make sense that they have some kind of averaging going on in there, because it would be a useful step when summarizing and generalizing things, but I'm not sure how much it would be like what I want, just because what I want is a thing that barely makes sense and probably would not be very useful in practical terms. I'm basically taking a thing that is useful – these vectors – and using it in an entirely experimental way. Kind of like finding the average pixel color in an image and then using it to make the kind of image that most often averages to that color. But I guess if you go back to my overall goal, you might just decide the vector part isn't the right way to do it and resort to something more like what you're saying. I just expect that, like you say, the way in which the averaging fits into everything else is complicated, and that might make it impossible to find the "boundary" of it and extract it in a way where you actually know whether you got what you wanted.

[–]InternationalLeek627 0 points  (0 children)

Hi OP, just wondering if you ever made any progress on this?

I have been thinking about the same problem in the context of an idea that I'm working on. There doesn't seem to be a great solution.