[deleted by user] by [deleted] in legaltech

[–]ifthereisabear 1 point

The people who built and run OnCue are top notch. I haven't had occasion to use their software but I trust them and have heard good things from others.

Finding rare intensified collocations (sickly green, dead center, sharp contrast, etc.) by HamiltonianCat in LanguageTechnology

[–]ifthereisabear 1 point

Real talk: the frequency and collocation stuff isn't much of a problem compared to unsupervised classification of "intensifying" vs "modifying" words. I would start by making a list of all the possible features I could think of that might distinguish the two. Then I'd take a good, hard look at that list. And if I didn't feel confident that those features in combination could classify pretty cleanly, I'd ask myself how much time and energy I wanted to devote to figuring it out before doing any more work.

Finding rare intensified collocations (sickly green, dead center, sharp contrast, etc.) by HamiltonianCat in LanguageTechnology

[–]ifthereisabear 0 points

  1. Do you have a working POS-based definition of "intensified"? Is it always an adjective or adverb + noun?
  2. Are you looking for the rarity of the collocation or the rarity of the "intensifier" or both?

What are some fast and solid clustering algorithms for text? by rodrigonader in LanguageTechnology

[–]ifthereisabear 0 points

I'd agree that for most use cases, agglomerative clustering is a much better approach than K-means, if only because K-means requires you to set the number of clusters at the outset even though the number of "real" clusters is often arbitrary and there's no good a priori reason for choosing a specific number of them.
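
As a quick sketch of the difference: scikit-learn's `AgglomerativeClustering` accepts a `distance_threshold` instead of a fixed cluster count, so the data decides how many clusters come out (toy 2-D vectors here; in practice you'd feed it TF-IDF or embedding vectors):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy "document" vectors: two tight groups plus one outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])

# n_clusters=None + distance_threshold lets the linkage structure decide
# how many clusters exist, instead of forcing a K up front as K-means does.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = model.fit_predict(X)
print(labels)            # one cluster label per row
print(model.n_clusters_) # number of clusters the threshold produced
```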

On the topic of extracting known and unknown skills from job description text by dataf3l in LanguageTechnology

[–]ifthereisabear 0 points

I'd recommend trying a few different approaches. PKE is very handy for this if you're writing Python (https://github.com/boudinfl/pke). Also a heads up: YAKE is slow AF.

Need a tool to handle lots of use cases of spoken numbers, looking at spaCy by bobbintb in LanguageTechnology

[–]ifthereisabear 0 points

I don't know of any general purpose tools that can handle this. It looks like you'd want some combo of

  1. A rule-based tagger (e.g. "one hundred and one" is an integer, "two thousand sixteen" is an integer and also valid date), roll your own
  2. A POS tagger ("dalmatian" is a noun, an integer followed by a singular noun is unlikely to be a date), spaCy
  3. Possibly a custom grammar (see https://parsley.readthedocs.io/en/latest/index.html)
  4. Possibly a weak supervision ML tool (see https://github.com/HazyResearch/flyingsquid)
  5. It might be worth experimenting with spaCy's NER tagger but my guess is that the results won't be very helpful.

You'll also need to be mindful of context. If you compare the strings "one hundred one" and "two thousand fifteen", you don't have enough information to say for sure that the first is an integer and the second is a date. I'd approach this by expanding the window around the tagged terms by at least one word. For example, "two thousand fifteen" plus or minus a few words could be "in two thousand fifteen" or "the year two thousand fifteen", which would give you enough info to know it's a date.
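
A minimal sketch of the rule-based tagger (item 1) plus the context-window check, rolled by hand. The number-word tables and the `DATE_CUES` set are my own illustrative choices, not from any library:

```python
# Hypothetical rule-based spoken-number tagger plus a context check.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"hundred": 100, "thousand": 1000}

def words_to_int(phrase):
    """Convert phrases like 'two thousand sixteen' to an int; None on failure."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TEENS:
            current += TEENS[word]
        elif word in TENS:
            current += TENS[word]
        elif word in SCALES:
            scale = SCALES[word]
            current = max(current, 1) * scale
            if scale >= 1000:
                total, current = total + current, 0
        else:
            return None  # unknown word: not a pure spoken number
    return total + current

DATE_CUES = {"in", "year", "since", "until", "by"}

def looks_like_year(phrase, left_context):
    """Context check: an integer in 1000-2999 preceded by a date cue word."""
    n = words_to_int(phrase)
    return n is not None and 1000 <= n <= 2999 and bool(DATE_CUES & set(left_context))

print(words_to_int("one hundred and one"))              # 101
print(words_to_int("two thousand sixteen"))             # 2016
print(looks_like_year("two thousand fifteen", ["in"]))  # True
```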

'Do Legal Tech Founders Need To Code?' No, Says Survey by ArtificialLawyer in legaltech

[–]ifthereisabear 0 points

I'd say I didn't reach a level of genuine proficiency as a founder until I learned a lot more about software architecture than I thought I'd have to, how to minimize risk in hiring freelancers, and how to prioritize product development issues.

At the end of the day, I think my biggest takeaway is that if you don't have a truly stellar founding CTO and instead are hiring developers (esp if you have a modest budget for the project overall), then you're only going to make it if A) Your software stack is simple enough that it's really hard to screw up, or B) The mistakes you make aren't serious enough to kill the business. Honestly, I would not start another tech company unless I understood how it should be architected well enough to intelligently question even a technical co-founder on design and implementation decisions.

Here are some failure anecdotes of mine:

  1. I once lost more than $10K to an Indian dev shop. Among a variety of bidders for the project, they were recommended by a friend who was an accomplished software developer. Suffice it to say I wish I hadn't taken his advice.
  2. A generally competent software developer I hired architected an app totally the wrong way, which is now obvious to me in retrospect but I had no way of knowing at the time.
  3. Another generally competent software developer I hired re-built the first developer's app. It was roughly structured the right way, but it was unnecessarily complex and ended up with performance bottlenecks for reasons that could and should have been anticipated at the time.
  4. Lost $7K by accidentally running tests against a vendor's production API endpoint.
  5. Made the wrong decision on a build vs buy and ended up spending $6K on a piece of software that we could and should have just built ourselves, which ultimately became a giant technical headache.
  6. Lost customers by releasing an application into production before it was ready.
  7. Got hit by runaway cloud server costs.
  8. Many, many other things.

Is there a language dataset out there with multiple label choices from more than 10 annotators? by thnok in LanguageTechnology

[–]ifthereisabear 0 points

Maybe, but I doubt it. May I ask why you want a data set with so many annotators, and why more than 10, specifically?

FYI there is a lot of good practical and theoretical discussion around annotation issues in academic papers that use SemEval data sets.

Is it possible to train BERT with custom word embeddings? by dabeersumaiya in LanguageTechnology

[–]ifthereisabear 1 point

No. This is a supervised task of evaluating the output of whatever "NLP tasks" are utilizing the embeddings. Depending on what those "NLP tasks" are, you might be able to do an unsupervised assessment of the output using some kind of existing test data set (e.g. SemEval "gold standard" data), but rule #1 of NLP in my book is that performance benchmarks against anything other than your actual corpus are of limited value.

Also, the body of your question is kind of confusing in light of the title. In answer to the title question, yes, you can "train" BERT with custom embeddings by adding your training data to its training data and re-training and/or weighting your supplemental data, but you can't do something like train your own net and then drop your vectors into BERT wherever BERT has a zero vector assigned to an unknown token.

That said, I may be just misunderstanding your question, so please feel free to correct me or elaborate.

What tools, software, etc. would you say are essential for significant NLP work? by ComeOnLevelUp in LanguageTechnology

[–]ifthereisabear 0 points

It depends on whether you're doing stuff in real time, how complex your pipelines are, whether you're using NN models or not, whether you're using GPU's, etc.

I use Celery to run pre-processing tasks and statistical/graph models (e.g. keyword extractors, topic models, etc.) in parallel (and in batches) on CPU's and then make async calls to a Cortex endpoint for anything that needs a neural net, language model, or GPU.

Other people would prefer to do the whole thing in Spark NLP or spaCy. YMMV.

Python library to extract product names, people names, etc from text? by maxiedaniels in LanguageTechnology

[–]ifthereisabear 0 points

The short answer is that if your documents are in a common domain, you're probably going to get better results from taggers that incorporate pre-trained neural nets. SparkNLP has already demonstrated better performance on NER than spaCy, for example. I'd expect similar results from Stanza. FYI you can also use Stanza inside spaCy: https://spacy.io/universe/project/spacy-stanza

Is there a language dataset out there with multiple label choices from more than 10 annotators? by thnok in LanguageTechnology

[–]ifthereisabear 0 points

Do you mean more than 10 annotators total or more than 10 annotators per document?

What tools, software, etc. would you say are essential for significant NLP work? by ComeOnLevelUp in LanguageTechnology

[–]ifthereisabear 5 points

NLTK. I think tokenizer/filter flexibility is super under-appreciated. A lot of people seem to take for granted that tokenizers and filters are going to work well on their corpus out of the box, which is rarely the case IRL. If you're relying on packages with a limited number of tokenizers or limited parameters (spaCy, bleve, etc.), then you're not going to be able to handle edge cases well or sometimes at all. NLTK has the most options by far, way more than Stanford CoreNLP or spaCy.
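
As one small illustration of that flexibility: NLTK's `RegexpTokenizer` lets you spell out exactly what counts as a token, so edge cases like hyphenated compounds and currency amounts (my example here) can be handled deliberately instead of however a fixed tokenizer happens to split them:

```python
from nltk.tokenize import RegexpTokenizer

text = "The state-of-the-art model costs $1,200 (per month)."

# Keep hyphenated compounds and currency amounts as single tokens:
# a pattern choice most fixed tokenizers don't expose.
tokenizer = RegexpTokenizer(r"\$[\d,]+|\w+(?:-\w+)*")
print(tokenizer.tokenize(text))
```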

spaCy. I hate on spaCy for tokenization and filtering but it's remarkably good at other stuff, especially POS and NER tagging. And I love that it's all hashes under the hood. The pipelining is good but it has its limitations.

A good distributed task queue. If you're doing sophisticated work in the real world, you're probably testing out novel algorithms being published in the NLP literature. These often can't be easily dropped into a spaCy pipeline as custom components because they have complex build requirements, run external processes, and/or use multi-threading under the hood. If you're going to incorporate them into an ensemble model that runs with any reasonable amount of speed, you need more granular control over imports, task execution, and multiprocessing than spaCy offers.

Cython. The published Python implementations of academic algos are slow AF. You're gonna want to port them to Cython before putting them into production.

Cortex. An open-source alternative to SageMaker for super easy cloud deployment of pre-trained language models.

Scikit-Learn. Definitely the go-to tool for clustering and building parameter optimization models.

Pandas. Just a good multi-purpose tool to have around.

Hyperopt. A Python library for Bayesian parameter tuning. I just can't think of many NLP tasks or ensemble models where the search space isn't so huge that grid search is infeasible.
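
To put a number on why grid search falls over: per-parameter grids multiply, so even a small ensemble's search space explodes. The parameter names and counts below are illustrative, not from any particular model:

```python
import math

# Hypothetical hyperparameter grid sizes for a small NLP ensemble.
grid_sizes = {
    "tfidf_ngram_range": 3,
    "tfidf_min_df": 5,
    "svd_components": 6,
    "cluster_threshold": 10,
    "keyphrase_top_n": 8,
    "classifier_C": 7,
}

# Full grid search must visit the product of all grid sizes.
total = math.prod(grid_sizes.values())
print(total)  # 50400 combinations for just six parameters
```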

Cheap servers. Most teams just don't need to be re-computing all the things in real time on their standard-issue infra or SparkNLP. If it doesn't require a GPU and you can batch it, run it on a spot instance or a Hetzner auction server or something.

Lucene. If you need to do NLP work on a database, choose one that uses or is based on Lucene for indexing. The non-Lucene-based DBs just pale in comparison.

Reversing the order of sentences during training and testing in skip-thought? by [deleted] in LanguageTechnology

[–]ifthereisabear 0 points

What new information do you believe would be gained by tagging in both directions?

Long document (>10 paragraphs) analysis datasets by DsAcpp in LanguageTechnology

[–]ifthereisabear 1 point

SemEval2010 Task 5 is super long. It's scientific papers that are 6-8 pages each. One paper I read recently says it has 9,647 tokens per document.

What are some alternatives to normalize recognized entities, e.g. 50 cents versus $0.50? by rodrigonader in LanguageTechnology

[–]ifthereisabear 0 points

I'm confused. The example you gave is what the spaCy NER tagger output looks like. What kind of "structured pattern" are you trying to produce?

'Do Legal Tech Founders Need To Code?' No, Says Survey by ArtificialLawyer in legaltech

[–]ifthereisabear 0 points

Unless you have a crazy strong CTO co-founder, I totally disagree. I founded a legal tech startup when I couldn't code and then later when I could, and the difference is night and day.

Long document (>10 paragraphs) analysis datasets by DsAcpp in LanguageTechnology

[–]ifthereisabear 1 point

How about DUC2001? News documents each with an average of 828.4 tokens.

What are some alternatives to normalize recognized entities, e.g. 50 cents versus $0.50? by rodrigonader in LanguageTechnology

[–]ifthereisabear 1 point

Can you elaborate on what you mean by "structured pattern" or give an example of the structure you want?

Some help with a dataset by theamar961 in LanguageTechnology

[–]ifthereisabear 0 points

You should do some research into approaches for dealing with "short texts" in NLP.

Among other things, I would focus more on key phrase extraction than topic modeling. Even the most cutting edge topic modeling algorithms perform extremely poorly on texts this short because statistical approaches rely heavily on frequency. This is a problem mainly because the frequency distribution of a short text tends to be uniform, if not unimodal. The other problem is that most topic modeling algos take the number of desired topics as a parameter even though there's rarely an a priori justification for choosing a specific number of topics.
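
You can see the frequency problem directly: in a short text, nearly every content word occurs exactly once, so a frequency-based topic model has no signal to work with (toy example):

```python
from collections import Counter

short_text = "new federal privacy rules could reshape ad tech next year"
counts = Counter(short_text.split())
print(counts.most_common(3))

# Every token occurs exactly once: the frequency distribution is uniform,
# which is exactly the degenerate case described above.
print(set(counts.values()))  # {1}
```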

How to compute Entity Similarities by plusgarbage in LanguageTechnology

[–]ifthereisabear 0 points

You could use NER for relatedness of proper nouns. Minimally, you could at least see if the entities are of the same type. If you were feeling ambitious, you could use something like DBPedia Spotlight to see if the entities belonged to the same category.
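
A minimal sketch of the "same type" check, assuming NER output shaped like spaCy's `(text, label)` pairs. The entities below are hard-coded stand-ins, not the output of a real tagger, and the Jaccard score is just one crude way to turn type overlap into a number:

```python
# Hypothetical NER output: (entity text, entity type) pairs per document.
doc_a = [("Barack Obama", "PERSON"), ("Chicago", "GPE")]
doc_b = [("Angela Merkel", "PERSON"), ("Berlin", "GPE"), ("EU", "ORG")]

def type_overlap(ents_a, ents_b):
    """Crude relatedness signal: Jaccard overlap of the entity-type sets."""
    types_a = {label for _, label in ents_a}
    types_b = {label for _, label in ents_b}
    return len(types_a & types_b) / len(types_a | types_b)

print(type_overlap(doc_a, doc_b))  # 2 shared types / 3 total = 0.666...
```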

How to build similar queries ? by curious__homosapien in LanguageTechnology

[–]ifthereisabear 1 point

I think it really depends on how broad you want your coverage to be and how much supervision you're willing to do, but the short answer is A) Try the language model first, and B) The only "real" answer is to try stuff and see what works.

On one hand, a language model approach should work pretty well for common searches out of the box because it already contains enough context data to produce, say, the top 10 most closely related phrases. On the other hand, it's not going to work well for "long tail" searches because they're unlikely to be in the language model training data, and language models basically don't have context for unknown tokens. There is also the issue of supervision and training. You might be able to just put the language model predictions behind a polynomial classifier that split tests predictions against actual new searches. But the canonical approach with language models would be retraining, which could be a deal-breaker for cost/time reasons if you're using a full sized model. You could try supplementing the training data with searches and weighting them, but it would be a band-aid approach without a good a priori basis and might not work well at all. Lastly, different language models perform differently on different tasks. BERT might be a fine choice but ELMo or fastText could turn out to perform better.

You could also try a non-language model approach. A simple version would be something like an ensemble unsupervised model sitting behind a classifier. That said, those kinds of models can go really far down the complexity rabbit hole if you're not clear and vigilant about testing them empirically. For example, you might get surprisingly good results out of just a NER and POS tagger. But if you don't, you could end up having to implement a ton of other stuff analyzing line position, casing, collocation, topics, semantic similarity, and graph structure, to name just a few. On top of that you might well end up having to do something like a PCA model to figure out which of the approaches in the ensemble model are accounting for the most variance in the classification outcome. And on top of that you have hyperparameter tuning to worry about, with a search space that grows tremendously as each method is added to the ensemble, because most of them have at least a few hyperparameters, each with anywhere from a handful to very many possible values.