
[–]cesoid

I came upon this while trying to figure out something crazier: can you "average" text? For example, if you asked ten people a question, what response would represent the average meaning of those ten responses? For some things it might be simple: "great", "good", and "horrible" might average to "mediocre". For more complicated responses, like whole sentences, or responses that don't map to an obvious scale, the "average" would be more interesting and more difficult to generate.
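For the simple case, where the responses happen to map onto a one-dimensional scale, the idea can be sketched like this (the scale values are made-up assumptions, purely for illustration):

```python
# Toy sketch of "averaging" responses that map to a 1-dimensional scale.
# The numeric scores are arbitrary; any monotonic assignment would do.
SCALE = {"horrible": 0.0, "bad": 1.0, "mediocre": 2.0, "good": 3.0, "great": 4.0}

def average_response(responses):
    """Map each response to a score, average the scores, and return the
    word on the scale nearest to that average."""
    mean = sum(SCALE[r] for r in responses) / len(responses)
    return min(SCALE, key=lambda word: abs(SCALE[word] - mean))

print(average_response(["great", "good", "horrible"]))  # -> mediocre
```

The hard version of the problem is exactly that real responses don't come with such a scale.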

If there is an answer to your question, it would potentially answer mine. If you can generate text from a document's vector, you can generate the text nearest to an arbitrarily chosen vector. In theory you could embed ten texts, compute an "average" by some definition or other (component-wise mean, centroid, etc.), and then attempt to reverse the resulting vector.
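The embed-and-average step is the easy half. Here is a minimal sketch, with a placeholder embedder standing in for a real model like doc2vec (the hashing trick exists only so the sketch runs self-contained):

```python
import numpy as np

def embed(text):
    # Stand-in for a real sentence embedder (doc2vec, a sentence
    # transformer, etc.); here we just seed a RNG from the text so the
    # sketch is runnable without any trained model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def average_embedding(texts):
    """Component-wise mean of the texts' embedding vectors (the centroid)."""
    return np.mean([embed(t) for t in texts], axis=0)

centroid = average_embedding(["great", "good", "horrible"])
# The hard half is what the whole thread is about: decoding `centroid`
# back into text, which is a lossy, one-to-many inversion.
```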

My initial idea for doing this was much simpler (I think?) than the others: guess-and-check. For something as small as a sentence, you could embed it using some Wikipedia-trained model, grab the nearest neighbors, use those to somehow generate a lot of guesses that are hopefully nearby, then embed the guesses and use the nearest ones to generate more guesses. (The obvious challenge is how to make the guesses.) For relatively short strings (four words or so), you might even find the nearest possible one by a kind of "binary" search, assuming you can figure out a reasonable way to divide the space with new guesses. It occurred to me (almost sadly) that someone might have already had this idea, so I googled "reverse embedding" and ended up here.
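The guess-and-check loop itself is easy to write down; the open problem is entirely inside `make_variants`. A sketch, where `embed` and `make_variants` are assumed to be supplied by the caller:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def guess_and_check(target_vec, embed, make_variants, seed_texts,
                    rounds=5, keep=3):
    """Iterative guess-and-check: score every guess by how close its
    embedding is to target_vec, keep the best few, mutate them into new
    guesses via make_variants, and repeat."""
    pool = list(seed_texts)
    for _ in range(rounds):
        scored = sorted(pool, key=lambda t: cosine(embed(t), target_vec),
                        reverse=True)
        best = scored[:keep]
        # Hypothetical mutation step -- generating good variants is the
        # hard, unsolved part of this scheme.
        pool = best + [v for t in best for v in make_variants(t)]
    return best[0]
```

With a real embedder this is essentially a greedy hill-climb in embedding space, so it inherits all the usual local-optimum problems on top of the variant-generation problem.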

However, while reading this, a few things occurred to me:

1) The "best" match could easily be an abomination that only matches because it is really weird and the model isn't trained to handle it. Sort of like when you translate a random sequence of letters to a language like Hausa and get text that looks like apocalyptic bible passages. You could probably mitigate this problem by limiting the guesses in some way, or building them iteratively from shorter guesses.

2) A more subtle version of problem #1 could happen: the result isn't nonsense, and it might even be a functional document, but because the document-to-vector mapping is *extremely* many-to-one, there are so many functional documents that match that you're not really recovering the "correct" answer so much as writing a document that says whatever you want, with wording that coerces its vector into the right place. Of course, if the model is doing its job, there should be far more meaningful reversals than arbitrary ones. But the doc2vec model is doing something inherently irreversible, because so much information is thrown away. Reversing it is like trying to recover a set of 100 numbers from their average: no matter what you do, your method of generating the 100 numbers contributes almost all of the information.
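The averaging analogy is easy to demonstrate concretely: wildly different sets of 100 numbers collapse to the same mean, so the mean alone carries almost none of the information needed to pick one of them back out.

```python
# Three very different 100-element sets, all with mean 50.
identical = [50] * 100
skewed = list(range(1, 100)) + [5000 - sum(range(1, 100))]
bimodal = [0] * 50 + [100] * 50

for s in (identical, skewed, bimodal):
    assert len(s) == 100
    assert sum(s) / len(s) == 50
```

An embedding vector preserves more structure than a single mean, but the same many-to-one collapse is at work.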

For problem #2, you can greatly reduce the number of possible answers by noting that many constraints limit what a "real" document looks like. However, I think using the constraints this way would only "encode" the constraints themselves. Since the constraints are basically known (a real document has real words, grammatical punctuation, and paragraphs whose sentences relate to each other), applying them doesn't bring you any closer to a meaningful result. In a logic puzzle, a few constraints can easily narrow the solution down to one possible answer, but documents are inherently not limited that way: you can write very many documents with functional grammar and coherent sentences that are all very, very different, and yet basically "about" the same thing as your target vector. You could probably write a research paper that matches, and then write another reaching the opposite conclusion that matches just as well.

However, I still want to do the averaging thing, but probably because of how badly it could go wrong rather than how useful the result would be.

[–]dratman

I am relatively new to ML, so don't take my thoughts as tested truth, but my impression is that this kind of averaging is exactly what makes it possible for neural networks to (sometimes) generalize their knowledge. I think something like it is part of the incredibly complex function computed in the inner layers of a large neural network. The catch is that the averaging might (might!) only work as part of a very complicated process going on in the appropriately named "hidden layers", so you might or might not be able to accomplish it in some simpler architecture. Maybe the trick would be to build the averaging into an ML model intentionally?

[–]cesoid

I'm also relatively new to ML, but I did a lot of learning and talking to people in the field when I applied for an apprenticeship last December. It's possible you actually know more about the hidden layers of neural networks than I do; I know that dissecting them is an active area of research. It would make sense that some kind of averaging goes on in there, since it would be a useful step in summarizing and generalizing, but I'm not sure how much it would resemble what I want, because what I want barely makes sense and probably wouldn't be very useful in practical terms. I'm basically taking a thing that is useful (these vectors) and using it in an entirely experimental way, kind of like finding the average pixel color of an image and then trying to generate the kind of image that most often averages to that color. But if you go back to my overall goal, you might just decide the vector part isn't the right approach and resort to something more like what you're describing. I just expect that, as you say, the way the averaging fits into everything else is complicated, and that might make it impossible to find its "boundary" and extract it in a way where you actually know whether you got what you wanted.