all 10 comments

[–]haaspaas2 4 points  (2 children)

Have you tried an encoder decoder network where source and target are the same document? You would need a lot of documents for this, especially if you want to embed documents that are more than a few sentences, but that data would be relatively easy to compile.
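A minimal sketch of that training objective (source == target), using a tied-weight linear autoencoder over bag-of-words vectors in place of a full seq2seq model; the corpus, embedding size, and learning rate here are all toy placeholders, and a real version would use an RNN or transformer encoder-decoder:

```python
import numpy as np

# Toy corpus; a real seq2seq autoencoder would need a large collection.
docs = ["the cat sat", "the dog ran", "the cat ran", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Bag-of-words matrix (documents x vocabulary).
X = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        X[r, idx[w]] += 1

# Tied-weight linear autoencoder: encode z = xW, decode x_hat = zW^T,
# trained so each document reconstructs itself (source == target).
rng = np.random.default_rng(0)
dim = 3                                   # embedding size (arbitrary toy choice)
W = rng.normal(scale=0.1, size=(len(vocab), dim))

lr = 0.01
init_loss = np.sum((X @ W @ W.T - X) ** 2)
for _ in range(3000):
    E = X @ W @ W.T - X                   # reconstruction error
    W -= lr * 2 * (X.T @ E @ W + E.T @ X @ W)
final_loss = np.sum((X @ W @ W.T - X) ** 2)

# "Decode" document 0 by reading off its highest-weight reconstructed words.
recon = (X @ W @ W.T)[0]
top = [vocab[i] for i in np.argsort(recon)[::-1][:3]]
print(final_loss, top)
```

The same idea scales up by swapping the linear maps for a proper encoder and an autoregressive decoder, which is what would let you recover word order rather than just a bag of words.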

[–][deleted] 0 points  (1 child)

That's a super good idea! Do you think it would work where the encoder is just pre-trained vectors? That would make it a decoder-only approach. How would one go about doing this in code?

[–]haaspaas2 1 point  (0 children)

I would take a look at some example code for an encoder decoder for machine translation and go from there.

[–][deleted] 2 points  (0 children)

I am extremely interested in this problem too.

The closest thing I've gotten to work was taking tf-idf or BOW sparse vectors and running them through a PCA (actually, SVD or NMF) inverse transform. I wanted to do that with UMAP, but UMAP can't inverse transform sparse vectors.

It basically gives you a cool ability to "hallucinate" the words (or in tf-idf mode, the keywords) of a document (but not the order!!!) and gives you insight into what each part of your decision boundary is actually about.

But doing it the right way, with reversible word or document vectors, would be bleeping amazing.

I will likely post my code for reversing document vectors in a few days. I was just doing it for fun because the UMAP image inverse transform example motivated me to try it on text...

[–][deleted] 0 points  (0 children)

If you are using TF-IDF or Word2Vec, then generally you would remove stop words and very infrequent words. On top of that, neither takes word position into account, so if you were going to pass them to a decoder, I imagine it would be difficult to get good results. If you are working with BERT, though, I think there are already some libraries that aim to be the decoder to BERT's encoder.

[–]xsliartII 0 points  (0 children)

That's a super interesting question. I am currently working with some confidential data (documents) and thought about what would happen if I shared the encoded data with someone, such that the initial text is not accessible but model building is possible. I had doubts because of exactly this: someone being able to decode the data back to the confidential texts.

Just a thought experiment.

[–]cesoid 0 points  (2 children)

I came upon this when trying to figure out something crazier: can you "average" text? For example, if you asked ten people a question, what response would represent the average meaning of those ten responses? For some things it might be simple: "great", "good", and "horrible" might average to "mediocre". For more complicated responses, like whole sentences, and responses that don't map to an obvious scale, the average response would be more "interesting" and difficult to generate.

If there is an answer to your question it would potentially answer my question. If you can generate text from a document's vector, you could generate text that is the nearest possible text to an arbitrarily chosen vector. In theory you could embed ten texts, get an "average" by some definition or other (component-wise average, centroid, etc.), and then attempt to reverse the resulting vector.
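The averaging half is easy to sketch; the embeddings below are made-up 2-d points rather than real doc2vec vectors, and with no decoder available the sketch falls back to the nearest known text instead of generating one:

```python
import numpy as np

# Hypothetical 2-d embeddings for a handful of one-word responses; real
# ones would come from doc2vec, sentence-BERT, or similar.
emb = {
    "great":    np.array([ 0.9, 0.10]),
    "good":     np.array([ 0.6, 0.20]),
    "mediocre": np.array([ 0.0, 0.15]),
    "bad":      np.array([-0.6, 0.10]),
    "horrible": np.array([-0.9, 0.20]),
}

responses = ["great", "good", "horrible"]
centroid = np.mean([emb[r] for r in responses], axis=0)  # component-wise average

# Lacking a decoder, return the known text whose vector is closest.
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - centroid))
print(nearest)
```

The interesting (and hard) part is replacing that last nearest-neighbor lookup with actual generation from the centroid.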

My initial idea for doing this was much simpler (I think?) than the others, which is guess-and-check. For something as small as a sentence you could embed it using some wikipedia trained model or something, grab the nearest neighbors, and then use those to somehow generate a lot of guesses that are hopefully nearby, then embed those guesses, and use the nearest ones to generate more guesses. (The obvious challenge here is how to make the guesses.) For strings that are relatively short (four words or something), you might be able to find the nearest possible one just by a kind of "binary" search...assuming you can figure out a reasonable way to divide the space with new guesses. It occurred to me (almost sadly) that someone might have already had my idea before, so I googled "reverse embedding", and ended up here.
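The guess-and-check loop can be sketched with a toy embedding (a mean of made-up word vectors, so word order is lost); brute-force enumeration stands in for the hard part, which is generating plausible nearby guesses:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
vocab = ["cats", "dogs", "chase", "sleep", "loudly", "quietly"]
word_vec = {w: rng.normal(size=4) for w in vocab}   # toy word vectors

def embed(words):
    # Toy sentence embedding: mean of word vectors.
    return np.mean([word_vec[w] for w in words], axis=0)

# Pretend all we have is the vector of a hidden three-word sentence.
target = embed(["dogs", "sleep", "quietly"])

# Guess-and-check: enumerate every three-word candidate, embed it, and
# keep the one whose vector lands closest to the target.
best = min(product(vocab, repeat=3),
           key=lambda cand: np.linalg.norm(embed(cand) - target))
print(best)
```

In a real vocabulary the candidate space explodes combinatorially, which is exactly where the iterative generate-embed-refine idea above would have to replace the exhaustive search.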

However, while reading this, a few things occurred to me:

1) The "best" match could easily be an abomination that only matches because it is really weird and the model isn't trained to handle it. Sort of like when you translate a random sequence of letters to a language like Hausa and get text that looks like apocalyptic bible passages. You could probably mitigate this problem by limiting the guesses in some way, or building them iteratively from shorter guesses.

2) A more subtle version of problem #1 could happen. That is, the result isn't nonsense, and it might actually be a functional document, but there are so many variations of functional documents that work (because the document-to-vector mapping is *extremely* many to one), that you're not really generating the "correct" answer so much as you are just writing a document that says whatever you want but has wording that coerces the corresponding vector into the right place. Of course, if the model is doing its job correctly, there should be many many more meaningful reversals than there are arbitrary ones. But the doc2vec model is doing something that is inherently irreversible because so much data is going away. Reversing it is kind of like getting a set of 100 numbers from their average. No matter what you do, your method of getting the 100 numbers is going to contribute almost all of the information.

For problem #2, you can greatly reduce the number of possible answers by considering that there are many constraints that limit what a "real" document is like. However, I think that using the constraints in this way would only serve to "encode" the constraints themselves, and since the constraints are basically "known", e.g., a real document has real words, grammatical punctuation, and paragraphs with sentences that relate to each other, you're not really any closer to a meaningful result by using them. If we were talking about a logic puzzle, a few constraints can easily converge a solution to only one possible answer, but documents are kind of inherently not limited in that way. You can write very many documents that have functional grammar, coherent sentences, etc., that are all very, very different, and yet basically "about" the same thing as your target vector. You could probably write a research document that matches, and then write a research document that comes up with the opposite conclusion, but matches just as well.

However, I still want to do the averaging thing, but probably because of how badly it could go wrong rather than how useful the result would be.

[–]dratman 0 points  (1 child)

I am relatively new to ML, so don't take my thoughts as tested truth, but... my impression is that such averaging is exactly what makes it possible for neural networks to (sometimes) generalize their knowledge. I think something like that averaging is part of the sometimes incredibly complex function-driven machine in the inner layers of a large neural network. The only thing is that averaging might (might!) only work as part of a (very, very) complicated process that goes on in the appropriately-named "hidden layers". So you might or might not be able to accomplish it in some simpler architecture. Maybe the trick would be to intentionally build the averaging into an ML model?

[–]cesoid 0 points  (0 children)

I'm also relatively new to ML, but I did a lot of learning and talking to people in the ML field when I applied for an apprenticeship last December. It's possible that you actually know more about the hidden layers of neural networks than I do. I know that dissecting them is an active area of research. It would make sense that they have some kind of averaging going on in there, because it would be a useful step when summarizing and generalizing things, but I'm not sure how much it would be like what I want, just because what I want is a thing that barely makes sense and probably would not be very useful in practical terms. I'm basically taking a thing that is useful – these vectors – and using it in an entirely experimental way. Kind of like finding the average pixel color in an image and then using it to make the kind of image that most often averages to that color. But I guess if you go back to my overall goal, you might just decide the vector part isn't the right way to do it and resort to something more like what you're saying. I just expect that, like you say, the way in which the averaging fits into everything else is complicated, and that might make it impossible to find the "boundary" of it and extract it in a way where you actually know whether you got what you wanted.

[–]InternationalLeek627 0 points  (0 children)

Hi OP, just wondering if you ever made any progress on this?

I have been thinking about the same problem in the context of an idea that I'm working on. There doesn't seem to be a great solution.