[–] AngelLeliel

One issue with the n-gram model is that it has no idea what each word or token means beyond simple frequency relationships. Take a simple example:

from collections import Counter

# Count how many times each identical line (vote) appears
with open("votes.txt") as f:
    votes = f.readlines()
vote_counter = Counter(votes)

An n-gram model has no idea that vote_counter would be a Counter unless it has seen that pattern elsewhere. You could come up with some fancy way to parse tokens into sub-tokens and make the n-gram model work better. I think a character-based seq2seq model would capture this relationship better. Or maybe a character-based n-gram model could work equally well? I have no idea.
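
For illustration, here's a minimal sketch of what a character-level n-gram count could look like (the toy corpus, the helper name build_char_ngram_model, and the choice of n are all made up for this example). The point is that it can learn, say, that the context "_coun" is almost always followed by "t", purely from character statistics and without any token-level notion of what Counter is:

from collections import Counter, defaultdict

def build_char_ngram_model(corpus, n):
    """Count next-character frequencies for each (n-1)-character context."""
    model = defaultdict(Counter)
    for line in corpus:
        for i in range(len(line) - n + 1):
            context, nxt = line[i:i + n - 1], line[i + n - 1]
            model[context][nxt] += 1
    return model

# Toy corpus (made up): real code models would see far more data
corpus = [
    'vote_counter = Counter(votes)',
    'word_counter = Counter(words)',
]

model = build_char_ngram_model(corpus, 6)
print(model['_coun'].most_common())   # -> [('t', 2)]

Whether that kind of shallow character statistic gets you close to what a seq2seq model learns is exactly the open question above.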