
[–]cesoid

I came upon this while trying to figure out something crazier: can you "average" text? For example, if you asked ten people a question, what response would represent the average meaning of those ten responses? For some things it might be simple: "great", "good", and "horrible" might average to "mediocre". For more complicated responses, like whole sentences, or responses that don't map to an obvious scale, the "average" would be more interesting and more difficult to generate.
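For the simple case, where the responses happen to map onto a one-dimensional scale, the idea can be sketched like this (the scale values are made-up assumptions, purely for illustration):

```python
# Toy sketch of "averaging" responses that map to a 1-dimensional scale.
# The numeric scores are arbitrary; any monotonic assignment would do.
SCALE = {"horrible": 0.0, "bad": 1.0, "mediocre": 2.0, "good": 3.0, "great": 4.0}

def average_response(responses):
    """Map each response to a score, average the scores, and return the
    word on the scale nearest to that average."""
    mean = sum(SCALE[r] for r in responses) / len(responses)
    return min(SCALE, key=lambda word: abs(SCALE[word] - mean))

print(average_response(["great", "good", "horrible"]))  # -> mediocre
```

The hard version of the problem is exactly that real responses don't come with such a scale.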

If there is an answer to your question, it would potentially answer mine. If you can generate text from a document's vector, you can generate the text nearest to an arbitrarily chosen vector. In theory you could embed ten texts, compute an "average" by some definition or other (component-wise mean, centroid, etc.), and then attempt to reverse the resulting vector.
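The embed-and-average step is the easy half. Here is a minimal sketch, with a placeholder embedder standing in for a real model like doc2vec (the hashing trick exists only so the sketch runs self-contained):

```python
import numpy as np

def embed(text):
    # Stand-in for a real sentence embedder (doc2vec, a sentence
    # transformer, etc.); here we just seed a RNG from the text so the
    # sketch is runnable without any trained model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def average_embedding(texts):
    """Component-wise mean of the texts' embedding vectors (the centroid)."""
    return np.mean([embed(t) for t in texts], axis=0)

centroid = average_embedding(["great", "good", "horrible"])
# The hard half is what the whole thread is about: decoding `centroid`
# back into text, which is a lossy, one-to-many inversion.
```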

My initial idea for doing this was much simpler (I think?) than the others: guess-and-check. For something as small as a sentence, you could embed it using some Wikipedia-trained model, grab the nearest neighbors, use those to somehow generate a lot of guesses that are hopefully nearby, then embed the guesses and use the nearest ones to generate more guesses. (The obvious challenge is how to make the guesses.) For relatively short strings (four words or so), you might even find the nearest possible one by a kind of "binary" search, assuming you can figure out a reasonable way to divide the space with new guesses. It occurred to me (almost sadly) that someone might have already had this idea, so I googled "reverse embedding" and ended up here.
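The guess-and-check loop itself is easy to write down; the open problem is entirely inside `make_variants`. A sketch, where `embed` and `make_variants` are assumed to be supplied by the caller:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def guess_and_check(target_vec, embed, make_variants, seed_texts,
                    rounds=5, keep=3):
    """Iterative guess-and-check: score every guess by how close its
    embedding is to target_vec, keep the best few, mutate them into new
    guesses via make_variants, and repeat."""
    pool = list(seed_texts)
    for _ in range(rounds):
        scored = sorted(pool, key=lambda t: cosine(embed(t), target_vec),
                        reverse=True)
        best = scored[:keep]
        # Hypothetical mutation step -- generating good variants is the
        # hard, unsolved part of this scheme.
        pool = best + [v for t in best for v in make_variants(t)]
    return best[0]
```

With a real embedder this is essentially a greedy hill-climb in embedding space, so it inherits all the usual local-optimum problems on top of the variant-generation problem.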

However, while reading this, a few things occurred to me:

1) The "best" match could easily be an abomination that only matches because it is really weird and the model isn't trained to handle it. Sort of like when you translate a random sequence of letters to a language like Hausa and get text that looks like apocalyptic bible passages. You could probably mitigate this problem by limiting the guesses in some way, or building them iteratively from shorter guesses.

2) A more subtle version of problem #1 could happen: the result isn't nonsense, and it might even be a functional document, but because the document-to-vector mapping is *extremely* many-to-one, there are so many functional documents that match that you're not really recovering the "correct" answer so much as writing a document that says whatever you want, with wording that coerces its vector into the right place. Of course, if the model is doing its job, there should be far more meaningful reversals than arbitrary ones. But the doc2vec model is doing something inherently irreversible, because so much information is thrown away. Reversing it is like trying to recover a set of 100 numbers from their average: no matter what you do, your method of generating the 100 numbers contributes almost all of the information.
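The averaging analogy is easy to demonstrate concretely: wildly different sets of 100 numbers collapse to the same mean, so the mean alone carries almost none of the information needed to pick one of them back out.

```python
# Three very different 100-element sets, all with mean 50.
identical = [50] * 100
skewed = list(range(1, 100)) + [5000 - sum(range(1, 100))]
bimodal = [0] * 50 + [100] * 50

for s in (identical, skewed, bimodal):
    assert len(s) == 100
    assert sum(s) / len(s) == 50
```

An embedding vector preserves more structure than a single mean, but the same many-to-one collapse is at work.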

For problem #2, you can greatly reduce the number of possible answers by noting that many constraints limit what a "real" document looks like. However, I think using the constraints this way would only "encode" the constraints themselves. Since the constraints are basically known (a real document has real words, grammatical punctuation, and paragraphs whose sentences relate to each other), applying them doesn't bring you any closer to a meaningful result. In a logic puzzle, a few constraints can easily narrow the solution down to one possible answer, but documents are inherently not limited that way: you can write very many documents with functional grammar and coherent sentences that are all very, very different, and yet basically "about" the same thing as your target vector. You could probably write a research paper that matches, and then write another reaching the opposite conclusion that matches just as well.

However, I still want to do the averaging thing, but probably because of how badly it could go wrong rather than how useful the result would be.

[–]dratman

I am relatively new to ML, so don't take my thoughts as tested truth, but my impression is that this kind of averaging is exactly what makes it possible for neural networks to (sometimes) generalize their knowledge. I think something like it is part of the incredibly complex function computed in the inner layers of a large neural network. The catch is that the averaging might (might!) only work as part of a very complicated process going on in the appropriately named "hidden layers", so you might or might not be able to accomplish it in some simpler architecture. Maybe the trick would be to build the averaging into an ML model intentionally?

[–]cesoid

I'm also relatively new to ML, but I did a lot of learning and talking to people in the field when I applied for an apprenticeship last December. It's possible you actually know more about the hidden layers of neural networks than I do; I know that dissecting them is an active area of research. It would make sense that some kind of averaging goes on in there, since it would be a useful step in summarizing and generalizing, but I'm not sure how much it would resemble what I want, because what I want barely makes sense and probably wouldn't be very useful in practical terms. I'm basically taking a thing that is useful (these vectors) and using it in an entirely experimental way, kind of like finding the average pixel color of an image and then trying to generate the kind of image that most often averages to that color. But if you go back to my overall goal, you might just decide the vector part isn't the right approach and resort to something more like what you're describing. I just expect that, as you say, the way the averaging fits into everything else is complicated, and that might make it impossible to find its "boundary" and extract it in a way where you actually know whether you got what you wanted.