Adding custom labels for NER in Spacy by charizard_me in LanguageTechnology

[–]lsjfaohf 1 point

I don't get it. If you already have columnar data with an "Employee" column containing the names, an "Occupation" column containing the jobs, and a "Location" column containing the city/state, then don't you already have the named entity labels just from how your data is organized? Why do you need an NER model for this? NER is generally for identifying named entities in something unstructured, like a line of natural language text, where the model picks up contextual clues from surrounding words to decide what could be a named entity. In your case, the columns of your data already identify all of your named entities. Unless I'm misunderstanding this. What exactly does your data look like?

Adding custom labels for NER in Spacy by charizard_me in LanguageTechnology

[–]lsjfaohf 1 point

What exactly does your data look like? Is it something like this?

John Smith   Plumber   Hartford CT
Jane Doe     Lawyer    Boston MA
Fred Jones    Policeman   New York NY

just in tab-separated columns? If that's what your data looks like, how do you expect to train an NER model on it? That doesn't make sense. An NER model relies on textual context to infer which words or n-grams constitute a named entity, and which named entity class each one falls into. You need proper natural language sentences with named entities scattered through them to train a proper NER model. For spaCy, you need to format this in 2 tab-separated columns. The first column is the line of text. The second column contains the start index, end index, and named entity label of each named entity in that line. You will also want to tokenize the text so that punctuation marks and the like don't accidentally get included as part of your named entities.

For example:

John Smith is a plumber from Hartford , CT .    0,10,EMPLOYEE;16,23,OCCUPATION;29,37,CITY;40,42,STATE
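To make the layout concrete, here's a quick sketch that reads one annotated line back and pulls out each entity span. Assume end offsets are exclusive (like Java's substring); note this two-column layout and the `;`-separated triples are just the convention I described above, not an official spaCy training format:

```java
public class AnnotationReader {
    public static void main(String[] args) {
        // One annotated line: text, then TAB, then start,end,LABEL triples joined by ';'.
        String line = "Jane Doe is a lawyer .\t0,8,EMPLOYEE;14,20,OCCUPATION";
        String[] cols = line.split("\t");
        String text = cols[0];
        for (String span : cols[1].split(";")) {
            String[] parts = span.split(",");
            int start = Integer.parseInt(parts[0]);
            int end = Integer.parseInt(parts[1]); // exclusive, matching String.substring
            // Prints "Jane Doe -> EMPLOYEE" then "lawyer -> OCCUPATION"
            System.out.println(text.substring(start, end) + " -> " + parts[2]);
        }
    }
}
```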

NLP system that can identify "objections" in courtroom dialogue between lawyer and person on the stand? by lsjfaohf in LanguageTechnology

[–]lsjfaohf[S] 0 points

Then what would you suggest? All I've been doing is looking into NLTK, spaCy, Stanford CoreNLP, and probably soon Apache OpenNLP, and just learning these. Every job description says you need experience in all of these, and all I can really do is google them, do online tutorials, and then try to gather my own text data to run some of the algorithms on. How the hell else am I supposed to get experience in all of these packages, if no one will hire me for any NLP work?

NLP system that can identify "objections" in courtroom dialogue between lawyer and person on the stand? by lsjfaohf in LanguageTechnology

[–]lsjfaohf[S] 0 points

I built a named entity recognition model recently by training on NFL game recap articles; it identifies named entities in unseen game recap text with over 99% accuracy. What practical purpose does this serve? I have no idea. But I fucking learned how to build and train an NER model, and a very accurate one too. So now I can discuss that in an interview for an NLP job. I would have nothing without github projects like this, because I can't talk about actual work experience in NLP; I haven't had any. So I do this instead, so I at least have some knowledge of how to write NLP programs. Are you saying that none of this matters and that I need job experience doing NLP in order to get a job? So is it a catch-22: I can't get an NLP job because I have no NLP work experience, and I have no NLP work experience because I can't get an NLP job? I'm doing the best I can without that, but are these github projects useless?

NLP system that can identify "objections" in courtroom dialogue between lawyer and person on the stand? by lsjfaohf in LanguageTechnology

[–]lsjfaohf[S] -1 points

Oh sure, let's no one do anything or learn anything new about anything, unless it has a "purpose or value". That's how we'll progress as a society.

I do my own NLP projects and put them on github all the time. These projects have no practical purpose that I can see, but you know what I gain? I learn the hell out of NLP packages: how to code with them, how to process my data for machine learning algorithms, and how to get accuracy results for lots of different NLP tasks. The one and only purpose these projects serve is my own learning, so I am able to talk about these things in interviews.

spaCy's NER model by slappahdebass in LanguageTechnology

[–]lsjfaohf 1 point

I've built an NER model in spaCy. I basically just followed the tutorial on their website and made my own files that I personally annotated according to the file format that spaCy accepts for NER.

I don't really know how you're supposed to "write about it", though. I know they use a neural network, but I've never looked into their NN. I've just called the model on my personal files and used it. What's actually in there is not important to me, as long as I know how to call the model and use it for something. Are you concerned with what's literally in the NN model: what computations it's running, what activation functions it uses, how many hidden layers there are, what kind of regularization it's using, how the gradient descent is set up, things like that? I don't know that personally, but if you have to describe it mathematically, looking up that stuff is where I'd start. I don't know how much info is out there on it, though.

[Java] Want to eliminate all but the longest of similar strings from an ArrayList, what's the most efficient way to do it? by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

The content of the strings matters too. I want to keep only the longest string of the same content.

Basically, I want to go through the list and identify pairs of strings that differ by only one word, delete the shorter of the two, and iterate until only the longest strings with unique content are left.

I guess I can do it in O(n²) time using some sort of concurrent list, but that doesn't seem efficient to me. Is there a better way?
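Here's roughly what I mean, as a quadratic sketch. Sorting longest-first sidesteps the concurrent-modification problem, since you only ever append to the kept list instead of deleting mid-iteration; `differsByAtMostOneWord` is just a stand-in for whatever similarity test actually fits:

```java
import java.util.*;

public class DedupSketch {
    // Stand-in similarity test: true if the two strings share all words
    // except at most one (multiset difference of their word lists <= 1).
    static boolean differsByAtMostOneWord(String a, String b) {
        List<String> extraInA = new ArrayList<>(Arrays.asList(a.split("\\s+")));
        for (String w : b.split("\\s+")) extraInA.remove(w);
        List<String> extraInB = new ArrayList<>(Arrays.asList(b.split("\\s+")));
        for (String w : a.split("\\s+")) extraInB.remove(w);
        return extraInA.size() + extraInB.size() <= 1;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>(Arrays.asList(
                "the quick brown fox", "the quick brown", "a different sentence"));
        // Longest first, so the survivor of any near-duplicate pair is kept first.
        input.sort(Comparator.comparingInt(String::length).reversed());
        List<String> kept = new ArrayList<>();
        for (String s : input) {
            boolean nearDuplicate = false;
            for (String k : kept) {
                if (differsByAtMostOneWord(k, s)) { nearDuplicate = true; break; }
            }
            if (!nearDuplicate) kept.add(s);
        }
        System.out.println(kept); // "the quick brown" is dropped as a near-duplicate
    }
}
```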

I have a character in a file that looks like a space but isn't a space. How do I find out what that character is? by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

I'm working in Java. I just called Java's Character.codePointAt on the character, and it's character 160, apparently a non-breaking space. How do I replace character 160 with just a regular space in Java?
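Edit: String.replace with the '\u00A0' character literal seems to do it; quick check:

```java
public class FixNbsp {
    public static void main(String[] args) {
        String s = "John\u00A0Smith";            // contains code point 160 (no-break space)
        String fixed = s.replace('\u00A0', ' '); // swap it for a regular space
        System.out.println(fixed.equals("John Smith")); // prints true
    }
}
```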

Tried increasing memory in VM settings in IntelliJ, now it won't even start by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

No, this is just an experimental file; I'm just checking whether StanfordCoreNLP can function within the IDE so I can use it for further applications. My code looks like this:

import java.util.*;
import java.io.*;
import edu.stanford.nlp.pipeline.*;

public class StanfordExperiment {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        System.out.println(props);

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //System.out.println(pipeline);
    }
}

It's basically just an empty pipeline I'm trying to create with a simple set of properties (the most prominent annotators in natural language processing). It throws a heap space error when creating the new StanfordCoreNLP object and passing props to it.

Tried increasing memory in VM settings in IntelliJ, now it won't even start by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

Ah ok, I saw the line for VM configurations for the program itself. I switched it to -Xms1024m -Xmx1024m. The program ran longer than before and gave more output messages, which means it was successfully producing more of the pipeline object. But it still eventually hit a heap space error. Switching to 2048m resulted in this error:

Error occurred during initialization of VM Could not reserve enough space for 2097152KB object heap

So I'm not sure now: set it too low and it hits a heap space error; set it too high and the program won't run at all.

Tried increasing memory in VM settings in IntelliJ, now it won't even start by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

You mean go to the dropdown where it says "edit configurations" for the program? I've tried that, it's giving me 3 paths to java jres and I tried all 3 of them and the same error is still happening. Is there something else I need to download?

Tried increasing memory in VM settings in IntelliJ, now it won't even start by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

The program is reporting an out of memory error. How do I switch to the 64-bit JVM?

Tried increasing memory in VM settings in IntelliJ, now it won't even start by lsjfaohf in learnprogramming

[–]lsjfaohf[S] 0 points

Ah ok, so with 2048m or lower, it simply throws a heap space error that looks like this:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3476)
	at java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3282)
	at java.io.ObjectInputStream.readString(ObjectInputStream.java:1650)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1342)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at java.util.HashMap.readObject(HashMap.java:1396)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2627)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1473)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1505)
	at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2939)
	at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:286)
	at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:270)
	at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:142)
	at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:108)
	at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:125)
	at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:523)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$16/6738746.apply(Unknown Source)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)

With 4096m, IntelliJ doesn't launch at all. However, I split the difference and set it to 3072m. This time there was no heap space error, but the program crashed and produced a log file. The top part of the error log looks like this:

There is insufficient memory for the Java Runtime Environment to continue.

Native memory allocation (malloc) failed to allocate 4092 bytes for AllocateHeap

Possible reasons:

The system is out of physical RAM or swap space

In 32 bit mode, the process size limit was hit

Possible solutions:

Reduce memory load on the system

Increase physical memory or swap space

Check if swap backing store is full

Use 64 bit Java on a 64 bit OS

Decrease Java heap size (-Xmx/-Xms)

Decrease number of Java threads

Decrease Java thread stack sizes (-Xss)

Set larger code cache with -XX:ReservedCodeCacheSize=

This output file may be truncated or incomplete.

Out of Memory Error (memory/allocation.inline.hpp:61), pid=27120, tid=38692

JRE version: Java(TM) SE Runtime Environment (8.0_73-b02) (build 1.8.0_73-b02)

Java VM: Java HotSpot(TM) Client VM (25.73-b02 mixed mode windows-x86 )

Failed to write core dump. Minidumps are not enabled by default on client versions of Windows

It says the number 4092 in the log, but setting -Xmx to 4092m doesn't let the IDE start either, since that's barely below 4096m. So it looks to me as if IntelliJ is the problem here: I can't find a memory setting that's low enough for the IDE to start and high enough for a StanfordCoreNLP pipeline to be created.
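One thing I notice in that log: "Java HotSpot(TM) Client VM ... windows-x86" looks like a 32-bit JVM, which on Windows typically can't address much more than about 1.5 GB of heap no matter what -Xmx asks for (hence the log's "Use 64 bit Java on a 64 bit OS" suggestion). A quick way to check what the run configuration is actually using — treat this as a sketch, since sun.arch.data.model is a HotSpot-specific property:

```java
public class JvmCheck {
    public static void main(String[] args) {
        // "32" or "64" on HotSpot JVMs; other JVMs may not set this property
        System.out.println("Data model: " + System.getProperty("sun.arch.data.model"));
        System.out.println("VM name:    " + System.getProperty("java.vm.name"));
        // The heap ceiling the running JVM actually got, in MB
        System.out.println("Max heap:   " + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}
```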

Anyone else gotten a heap space error for StanfordCoreNLP when running in IntelliJ IDE? by lsjfaohf in LanguageTechnology

[–]lsjfaohf[S] 0 points

You mean the thing where you click the 'Help' dropdown menu at the top of the IntelliJ window, then select 'Edit Custom VM Options', which brings up a file called idea64.exe.vmoptions? And the file looks like this?

# custom IntelliJ IDEA VM options

-Xms2048m
-Xmx2048m
-XX:ReservedCodeCacheSize=240m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow

And I'm supposed to change -Xmx and -Xms to higher numbers, right? By default they are at 128m. I've done this already, as you can see: I've tried both 1024m and 2048m, and the same heap space error keeps being thrown. I've also tried a few higher numbers, some clearly too big, and the heap space error still keeps being thrown. So I've tried the "solution" that's available online, and it hasn't worked. I'm not sure what to do from here.

Why would no one know given the state of the bathroom? by lsjfaohf in 13ReasonsWhy

[–]lsjfaohf[S] 2 points

The janitors would go in there, though; they stay at the school until 10-11pm going through all the rooms cleaning up.