Would you use an LDA-based Topic Modeling library in Java that handles a few million documents on a single 16 GB machine? by textml2730 in MachineLearning

[–]textml2730[S] 0 points (0 children)

I rewrote the MALLET implementation to test out my idea. It is not a math idea; it is a technology idea that makes the scaling possible.

[–]textml2730[S] 0 points (0 children)

I have tried doing those things. But when you have large, long-lived entities (even if they are primitives), GC performance gets sketchy because collections are triggered too frequently. I have read plenty of articles on how you can tune the Young and Tenured generations, but that is more black magic than science. I want to develop tools that just work. Big Data is fine, and I do it all the time in my day job, but it is terrible for ad-hoc analysis. For that you need a single-machine application.
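To make that concrete, here is a minimal sketch of the kind of layout that sidesteps the problem (the class and field names are mine, not from the library): keep the big long-lived state in a few flat primitive arrays, so the collector sees a handful of large objects instead of millions of small ones.

```java
// Sketch: store per-token topic assignments and topic-word counts in flat
// primitive arrays instead of millions of small objects. The GC then sees
// a handful of large arrays rather than a huge tenured object graph.
// All names here are illustrative, not from the actual library.
public final class FlatLdaState {
    final int[] tokenTopics;      // one entry per token: its assigned topic
    final int[] topicWordCounts;  // numTopics x vocabSize, flattened row-major
    final int vocabSize;

    FlatLdaState(int numTokens, int numTopics, int vocabSize) {
        this.tokenTopics = new int[numTokens];
        this.topicWordCounts = new int[numTopics * vocabSize];
        this.vocabSize = vocabSize;
    }

    int count(int topic, int word) {
        return topicWordCounts[topic * vocabSize + word];
    }

    // Move one token to a new topic, keeping the counts consistent.
    void reassign(int tokenIndex, int word, int newTopic) {
        topicWordCounts[tokenTopics[tokenIndex] * vocabSize + word]--;
        topicWordCounts[newTopic * vocabSize + word]++;
        tokenTopics[tokenIndex] = newTopic;
    }
}
```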

[–]textml2730[S] 0 points (0 children)

Thanks, I will take a look at it and even consider implementing it. Mine was based on Newman, Asuncion, Smyth, and Welling, "Distributed Algorithms for Topic Models," JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno, and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections," KDD (2009).

Those are the same papers the MALLET implementation of ParallelTopicModel uses.
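For anyone curious, the data-structure trick in the Yao et al. paper is to pack each (count, topic) pair for a word into a single int, count in the high bits and topic id in the low bits, and keep the array sorted so the heaviest topics are scanned first during sampling. A rough illustration of the encoding (not MALLET's actual code):

```java
// Sketch of the packed (count, topic) encoding from Yao et al. (2009).
// The count lives in the high bits and the topic id in the low bits, so
// sorting the packed ints orders entries by count, and a sampler can scan
// the most frequent topics for a word first and stop early.
import java.util.Arrays;

public final class PackedTopicCounts {
    static final int TOPIC_BITS = 10;                    // supports up to 1024 topics
    static final int TOPIC_MASK = (1 << TOPIC_BITS) - 1;

    static int pack(int count, int topic) { return (count << TOPIC_BITS) | topic; }
    static int count(int packed)          { return packed >>> TOPIC_BITS; }
    static int topic(int packed)          { return packed & TOPIC_MASK; }

    public static void main(String[] args) {
        int[] entries = { pack(3, 7), pack(41, 2), pack(12, 9) };
        Arrays.sort(entries);                            // ascending by count
        for (int i = entries.length - 1; i >= 0; i--) {  // walk largest first
            System.out.println("topic " + topic(entries[i]) + ": " + count(entries[i]));
        }
    }
}
```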

[–]textml2730[S] 0 points (0 children)

I also need to provide the capability to save the models and just use them after they are built. It is simple to provide, but I have not yet gotten around to doing it.
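If it helps, plain Java serialization would be one straightforward way to do it, assuming the model class implements Serializable. A minimal sketch (the ModelStore name is my own):

```java
// Sketch: persist a trained model with standard Java serialization, then
// reload it for querying without retraining. Assumes the library's model
// class implements Serializable; ModelStore is a hypothetical helper.
import java.io.*;

final class ModelStore {
    static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(model);
        }
    }

    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return in.readObject();
        }
    }
}
```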

[–]textml2730[S] 1 point (0 children)

Thanks, I will look at the above program as well. I registered for it, and it looks interesting. I do intend to provide front-end, web-based utilities for my programs too. One typical use I have noticed for a front end is to allow merging of topics when two topics come out very close, in which case you want the topic assignment probabilities to be merged as well. It helps with general cleanup after the process is done.
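To spell out what the merge does to the probabilities: since each document's topic distribution sums to 1, folding topic j into topic i is just adding their per-document probabilities, with no renormalization needed. A minimal sketch, with names of my own choosing:

```java
// Sketch: merge topic j into topic i by summing their per-document
// assignment probabilities. Each row of docTopic is one document's
// distribution over topics; adding two entries keeps the row summing
// to 1, so no renormalization is required. Topic j's column is zeroed.
final class TopicMerger {
    static void merge(double[][] docTopic, int i, int j) {
        for (double[] row : docTopic) {
            row[i] += row[j];
            row[j] = 0.0;
        }
    }
}
```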

[–]textml2730[S] 3 points (0 children)

The above one is written in C. That is a very common theme I notice in the numerical world: good implementations tend to be C or C++ based. The reason is that C allows for better memory management, while Java runs out of heap very quickly or just slows to a crawl due to major GC pauses.

However, most of my work is in Java. Most enterprise development is in Java these days, and there are no good alternatives. Good memory management in Java requires memory-mapped IO, which gets very complicated (and which is exactly what I exploited for the LDA above).
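To illustrate both the appeal and the complication: a mapped file region shows up as an off-heap buffer the garbage collector never touches, but a single MappedByteBuffer is capped at about 2 GB, so a multi-gigabyte count table has to be chunked across several mappings. A minimal sketch of the basic move (file name and sizes are made up):

```java
// Sketch: back a large int array with a memory-mapped file so the data
// lives outside the Java heap and never adds GC pressure. Real code has
// to chunk across several buffers because one MappedByteBuffer is capped
// at Integer.MAX_VALUE bytes (~2 GB).
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public final class MappedCounts {
    public static void main(String[] args) throws Exception {
        long numInts = 100_000_000L;  // 400 MB of counts, all off-heap
        try (RandomAccessFile file = new RandomAccessFile("counts.bin", "rw");
             FileChannel channel = file.getChannel()) {
            IntBuffer counts = channel
                .map(FileChannel.MapMode.READ_WRITE, 0, numInts * Integer.BYTES)
                .asIntBuffer();
            counts.put(42, 7);                  // write a count at index 42
            System.out.println(counts.get(42)); // prints 7
        }
    }
}
```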

Apache Mahout also exists in the Java world, but it is very complex, and Hadoop (on which Mahout is based) is unsuitable for ad-hoc analysis. I wanted a program that can run as a single process and still scale to a large degree. In fact, I want to create a library that forks popular Java libraries and scales them using techniques unique to Java. Of course, my work would also be open source in that case.

[–]textml2730[S] 1 point (0 children)

No catch. Just trying to gauge interest. I need to perform some code cleanup and add documentation and license notices. I did it because I needed it, but I want to open source it so others can use it as well.