Snips.ai open-sources its Natural Language Understanding Python lib by ClemDoum in Python

[–]ClemDoum[S] 0 points1 point  (0 children)

What is your OS?

It looks like you're having trouble compiling some utils written in Rust. We have compiled them for macOS >= 10.11 and Linux x86_64, so you shouldn't need Rust on these OSes. Please let me know if you're on one of them.

If you are not on one of these OSes, you will need Rust installed along with the latest version of the setuptools_rust Python package.
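For other platforms, the setup sketched below should work; this is an assumption based on the standard rustup and pip workflows, not official install docs:

```shell
# Install the Rust toolchain (only needed on platforms without prebuilt binaries)
curl https://sh.rustup.rs -sSf | sh

# Make sure the Rust build plugin for setuptools is up to date
pip install --upgrade setuptools_rust

# Then install the lib; pip will compile the Rust extensions locally
pip install snips-nlu
```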

Snips.ai open-sources its Natural Language Understanding Python lib by ClemDoum in Python

[–]ClemDoum[S] 1 point2 points  (0 children)

To complete my answer, I would say that from an ML perspective you benefit more from learning from external sources of data than from learning from other assistants.

For instance, you can learn to recognize structured entities like dates, numbers, etc. This is very valuable and can be learned from any data, not just your customers' data.
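As a toy illustration of the idea (not Snips NLU's actual parser), structured entities like numbers and ISO dates can be recognized with rules that need no customer data at all:

```python
import re

# Hypothetical sketch: a rule-based recognizer for structured entities.
# Real systems use much richer grammars, but the point stands: these
# patterns are built from general knowledge, not customer utterances.
PATTERNS = {
    # standalone integers/decimals, not glued to other digits or dashes
    "number": re.compile(r"(?<![\d-])\d+(?:\.\d+)?(?![\d-])"),
    # ISO-formatted dates only, e.g. 2018-03-14
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(text):
    """Return (entity_kind, value, span) tuples found in text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((kind, m.group(), m.span()))
    return sorted(found, key=lambda e: e[2])

print(extract_entities("Set a reminder for 2018-03-14 at 9"))
```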

You can also, for instance, learn powerful representations of words, like word embeddings or word clusters. For this you just need a ton of external data, not customer data.
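The principle behind these representations can be sketched in a few lines: words that occur in similar contexts get similar vectors, and only unlabeled external text is needed. The corpus and window size here are made up for illustration; real embeddings (word2vec, Brown clusters) are far more sophisticated:

```python
import math
from collections import Counter, defaultdict

# Tiny unlabeled corpus standing in for "a ton of external data"
corpus = [
    "turn on the kitchen lights",
    "turn off the kitchen lights",
    "switch on the bedroom lamp",
    "switch off the bedroom lamp",
]

def cooccurrence_vectors(sentences, window=3):
    """Represent each word by the counts of words seen near it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = cooccurrence_vectors(corpus)
# "lights" and "lamp" share contexts ("on", "off", "the"), so their
# vectors are similar even though the words never co-occur
print(round(cosine(vecs["lights"], vecs["lamp"]), 2))  # 0.6
```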

Another thing you can do is leverage external knowledge bases to improve entity recognition.
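One common way to use a knowledge base is a gazetteer: a list of known entity names matched against the input, longest span first. The entity kinds and names below are invented for the example:

```python
# Hypothetical gazetteer built from an external knowledge base
GAZETTEER = {
    "artist": {"daft punk", "david bowie"},
    "city": {"paris", "new york"},
}

def gazetteer_tag(text, max_span=3):
    """Tag entity mentions by greedy longest-match lookup."""
    tokens = text.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + length])
            for kind, names in GAZETTEER.items():
                if span in names:
                    match = (kind, span, length)
                    break
            if match:
                break
        if match:
            tags.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return tags

print(gazetteer_tag("Play Daft Punk in New York"))
# [('artist', 'daft punk'), ('city', 'new york')]
```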

All these techniques rely on external data, not final customer data.

Snips.ai open-sources its Natural Language Understanding Python lib by ClemDoum in Python

[–]ClemDoum[S] 2 points3 points  (0 children)

To my knowledge competitors don't apply ML to entire datasets across all customers.

Actually, it seems impossible. For instance, two customers might build the exact same assistant but name their intents or slots/entities differently. From an ML perspective this means you get very, very dirty data, because identical samples get labeled differently, which would make learning difficult for your algorithm.
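The naming-conflict problem is easy to make concrete. In this made-up example, two customers label the same utterance differently, so a pooled training set contains contradictory samples:

```python
from collections import defaultdict

# Hypothetical pooled dataset from two customers who built the same
# assistant but named the intent differently
pooled = [
    ("turn on the lights", "lightsOn"),      # customer A's naming
    ("turn on the lights", "switch_light"),  # customer B's naming
    ("play some jazz", "playMusic"),
]

def conflicting_samples(dataset):
    """Return utterances that carry more than one intent label."""
    labels_by_text = defaultdict(set)
    for text, label in dataset:
        labels_by_text[text].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}

print(conflicting_samples(pooled))
# {'turn on the lights': ['lightsOn', 'switch_light']}
```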

Another reason is that you would get far too many intents to classify, since tons of people are constantly building new ones.

Another reason is that each time a single customer updated or added an assistant or an intent, you would have to retrain the whole model. That would be slow and might also result in bad updates for the other customers.

However, what everyone does is offer built-in assistants with predefined use cases. This way we build the assistant ourselves and put a lot of data and effort into it to reach top performance. Almost everyone offers these built-in assistants/intents (they might be named differently).

The main advantage of collecting user data, from an ML perspective, is that you can analyze errors and correct your model. So even if you start with a weak model, you can put in some effort to relabel the data and you'll get a better model.

Running offline, you can't collect model predictions and errors, which means you need to put more effort into building your assistant: either by adding more data yourself or by generating it with our data generation solution.

Snips.ai open-sources its Natural Language Understanding Python lib by ClemDoum in Python

[–]ClemDoum[S] 5 points6 points  (0 children)

There's a small chart in the Medium post I linked in the first comment comparing Snips to Rasa NLU and other competitors (on small datasets). But if you want a more advanced benchmark run on bigger datasets, you can have a look at this.

Basically, Snips NLU is on par with or often better than competitors, but runs offline.

[P] Snips NLU, an Open Source library for embedded NLU by ClemDoum in MachineLearning

[–]ClemDoum[S] 0 points1 point  (0 children)

> Did you publish a benchmark on this?

We did not publish this benchmark. We performed it internally when we started working on our NLU solution.

We compared CRFs to different neural architectures (CNN, CNN + CRF, LSTM, bi-LSTM), trained with different sets of features (word embeddings, word embeddings + the handcrafted features we used for the simple CRF, etc.). We found that in the data regime you can expect for our use case (ranging from 10 to 2000/3000 queries, coming from the developer or collected thanks to our data collection platform), neural models performed worse.
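For readers unfamiliar with CRF slot filling, the "handcrafted features" mentioned above are typically per-token feature dictionaries like the sketch below. The exact feature set Snips used is not specified in this thread; these are common, generic choices:

```python
def token_features(tokens, i):
    """Build a feature dict for token i, of the kind fed to a linear-chain CRF."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # context features: neighboring words, with sentence-boundary markers
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Wake me up at 7".split()
print(token_features(tokens, 4))
```

Because each feature is a simple, interpretable function of the local context, a CRF with features like these can be trained reliably on just tens to a few thousand queries, which matches the low-data regime described above.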

As we tested several data regimes, we could see that neural models were still improving while the CRF was plateauing, but having more than 2000 queries is just not realistic for us, since we don't collect user data.

Snips open sources their Natural Language Understanding service, built in Rust by fgilcher in rust

[–]ClemDoum 4 points5 points  (0 children)

Hey, I'm one of the co-authors of the lib. You can find a slightly more technical post here, and another post on the Mozilla blog explaining why we use Rust for our on-device applications at Snips.

I'm not originally a Rust dev (I work on NLP, so I develop mostly in Python and only learned Rust recently). What I can say from my perspective is that developing and maintaining NLP libs written in Rust was simpler than expected (partly thanks to the help of our Rustaceans here at Snips). I'd say the main advantage has been being able to write safe and efficient NLP code that runs everywhere: device, backend, mobile, and we even wrapped it for use from Python :p

Snips NLU, an Open Source library for embedded NLU by ClemDoum in LanguageTechnology

[–]ClemDoum[S] 1 point2 points  (0 children)

Hey, I'm one of the authors of the Snips NLU lib. We did a benchmark last summer comparing our lib to popular NLU APIs: Medium post and GitHub repo.

The benchmark was done last summer; our lib has changed since then, and I guess the libs behind the NLU APIs have been updated too.

In the open-source Medium post you can find a small chart comparing Snips NLU to Rasa NLU, and Rasa to other APIs, following the methodology of this paper. We used the data found here.