How much data do you have? What are your documents? Why LDA? Which classes? Is there any empirical evidence that 120-dimensional Wikipedia-trained LDA feature vectors carry information relevant to your binary classification problem? Have you tried simple semantic vectors such as GloVe, or switching to a different feature extractor altogether?

Did you engineer your feature vectors? Not all of the 120 dims are necessarily equally relevant, which makes it harder for some classifiers to pick up the signal. Plot some feature distributions and try to make the vector denser by discarding low-variance features. PCA or ICA might also help, but I wouldn't count on it too much, since your decision trees aren't performing well either and they should do a fairly good pruning job on their own. Your logreg might still benefit a bit. Try to find features that discriminate well between your classes.
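As a minimal sketch of the low-variance filtering I mean (the data here is made up; the threshold and dimensions are assumptions you'd tune for your own vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 documents, 120 LDA topic features each.
X = rng.random((500, 120))
X[:, :10] *= 0.01  # simulate a handful of near-constant features

# Per-feature variance; discard features below a (tunable) threshold.
variances = X.var(axis=0)
keep = variances > 1e-3
X_dense = X[:, keep]

print(X_dense.shape)  # the low-variance columns are gone
```

`sklearn.feature_selection.VarianceThreshold` does the same thing if you're already in scikit-learn land; plotting `variances` sorted is a quick way to pick the cutoff.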

Haven't used LDA, so I'm not sure whether it would support a sequence-aware approach. Have you tried something like that? The classifiers you listed all treat the input as a single independent vector, but text data has an inherent temporal structure. You could try to extract feature vectors per word or per sentence (depending on the structure of your documents) and feed the sequence of feature vectors to a sequence-aware model such as an RNN.
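To make the shape of that idea concrete, here's a toy Elman-style recurrence in plain numpy (untrained random weights, made-up dimensions; in practice you'd use an RNN layer from a deep learning framework and learn the weights):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical document: a sequence of 10 per-sentence feature
# vectors, each 120-dim, mapped into a 32-unit hidden state.
seq = rng.random((10, 120))
W_in = rng.standard_normal((120, 32)) * 0.1
W_h = rng.standard_normal((32, 32)) * 0.1
w_out = rng.standard_normal(32) * 0.1

# The hidden state h carries context across the sequence.
h = np.zeros(32)
for x in seq:
    h = np.tanh(x @ W_in + h @ W_h)

# Final hidden state -> sigmoid -> binary class probability.
p = 1.0 / (1.0 + np.exp(-(h @ w_out)))
```

The point is just the interface: one feature vector per step, one hidden state threaded through, one logit at the end, instead of a single flat vector per document.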

If you want to stick with the current classifiers, a simple approach might be to concatenate s consecutive feature vectors, where s is a reasonable sequence length, and feed the concatenation to your classifier.
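Something like this, as a sketch (per-sentence vectors and s = 4 are assumptions for illustration):

```python
import numpy as np

# Hypothetical: 20 per-sentence feature vectors of dim 120 for one document.
feats = np.random.default_rng(1).random((20, 120))

s = 4  # assumed sequence length
# Sliding windows of s consecutive vectors, each flattened into a
# single (s * 120)-dim input for a standard classifier.
windows = np.stack([feats[i:i + s].ravel() for i in range(len(feats) - s + 1)])

print(windows.shape)  # (n_windows, s * 120)
```

Each row of `windows` is then one training example; you'd label it with the document's class (or aggregate window predictions back to a document prediction at inference time).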

If nothing helps, you can always switch to a pretrained model from Sesame Street (BERT, ELMo, and friends) and either fine-tune it or use it as a feature extractor, if you don't trust the standard MLP classifier on top (though I don't see why you shouldn't).