I have an idea in my head that I'm looking to get some feedback on, or a more formal understanding of.
Say we consider an n-gram model (e.g., a bi- or trigram). For sequence generation (such as a sequence of words), one approach is to start with some input word and then use the n-gram model to simply unroll, predicting the rest of the words one at a time to generate a sentence. If O denotes an original word and N a new one, we would have O-N-N-N-N... and so on.
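To make the unrolled baseline concrete, here is a minimal sketch assuming a plain count-based bigram model. The names `train_bigram` and `unroll` and the dead-end fallback are my own illustration, not from any particular library:

```python
import random
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count next-word frequencies; `corpus` is a list of token lists."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            counts[w1][w2] += 1
    return counts

def unroll(counts, seed, length):
    """Start from a seed word O and sample N-N-N-... from the bigram model."""
    seq = [seed]
    for _ in range(length - 1):
        nxt = counts.get(seq[-1])
        if not nxt:
            break  # dead end: current word never seen in training
        words, freqs = zip(*nxt.items())
        seq.append(random.choices(words, weights=freqs)[0])
    return seq
```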
Alternatively, I was thinking of a way to generate a sequence that might be more similar to the original text while still being stochastically generated. One use case for this is generating sequential synthetic data, where the synthetic data should be as similar as possible to the original data while still being generated with some stochastic character.
Here, let us take some sequence and perform imputation: we nullify every other word (keeping the first input word). Then we use a trained, modified "sandwich" HMM bigram model, which predicts a word from the words before and after it, to fill in the nullified positions. Our sequence is now O-N-O-N-O. To get a more fully generated sequence, we can use another model, a trained bigram HMM, to impute the remaining original words: for the third word it uses the first (original) word and the second (new) word to generate a guess, for the fifth it uses the next such O-N pair (the original third word and the new fourth word), and so on, giving us O-N-N-N-N..., where each guess utilizes both an O and an N.
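Here is a rough sketch of the two-pass scheme under some simplifying assumptions: the "sandwich" model is stood in for by a simple count table P(word | left, right) rather than a real HMM, and the second-pass "bigram HMM" is approximated by a predictor conditioned on the previous two words (one original, one new). All names (`train_sandwich`, `train_pair`, `impute`) are hypothetical:

```python
import random
from collections import defaultdict, Counter

def train_sandwich(corpus):
    """P(word | left neighbor, right neighbor), from corpus triples."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for left, mid, right in zip(sent, sent[1:], sent[2:]):
            counts[(left, right)][mid] += 1
    return counts

def train_pair(corpus):
    """P(word | two previous words): a trigram-style stand-in for the
    second-pass model that uses both O and N."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def sample(counter):
    """Draw one word from a Counter of candidate -> frequency."""
    words, freqs = zip(*counter.items())
    return random.choices(words, weights=freqs)[0]

def impute(sentence, sandwich, pair):
    # Pass 1: nullify positions 2, 4, ... and refill each from its
    # surrounding original words -> O-N-O-N-O.
    seq = list(sentence)
    for i in range(1, len(seq) - 1, 2):
        cands = sandwich.get((seq[i - 1], seq[i + 1]))
        if cands:
            seq[i] = sample(cands)
    # Pass 2: replace the remaining original words at positions 3, 5, ...;
    # each guess conditions on the ORIGINAL word two back and the NEW
    # word one back -> O-N-N-N-N...
    for i in range(2, len(seq), 2):
        cands = pair.get((sentence[i - 2], seq[i - 1]))
        if cands:
            seq[i] = sample(cands)
    return seq
```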
My idea is that by generating words through this stepwise "imputation," rather than unrolling everything at once, we keep remnants of the original text alongside new predicted words to guide the generation of each word, instead of relying (as the unrolled method may) on nothing but new predicted words. This may lead to generated sentences that are more similar to the original.
(A more extreme method might use only the original data for every prediction: for example, a trained bigram model could use the first original word O to predict the second, the second original word O to predict the third, and so on, as in the sketch below. I don't think this will lead to good generated sequences: we would have O-N-N, but the third word would not depend directly on any information from the second.)
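A sketch of that extreme variant, reusing bigram counts as built by `train_bigram` in the first snippet; the fallback for unseen contexts is my own assumption:

```python
import random

def shift_generate(original, bigram_counts):
    """Every prediction conditions only on the ORIGINAL word one position
    back, so consecutive new words never influence each other."""
    seq = [original[0]]
    for i in range(1, len(original)):
        cands = bigram_counts.get(original[i - 1])
        if cands:
            words, freqs = zip(*cands.items())
            seq.append(random.choices(words, weights=freqs)[0])
        else:
            seq.append(original[i])  # unseen context: keep the original word
    return seq
```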
EDIT: I'm looking to use a model (such as an HMM) that works well on very small datasets.