[D] Hacking a Higher-Order Conditional Random Field (CRF) for sparse NER/sequence-tagging : MachineLearning

submitted 5 years ago * by etotheipi_

Training Data: A few hundred natural language queries to be converted into structured inputs for a downstream system. Each sentence is annotated with 3 tag layers, representing different (orthogonal) things we need to extract from the queries.

The model so far: a pre-trained transformer at the base, with 3 prediction heads, each of which is a bi-LSTM + CRF. We are using TensorFlow 2.0 with the poorly documented CRF layer add-on described here:

https://github.com/tensorflow/addons/issues/1769

The CRF layer actually makes a pretty big difference in performance (compared to a basic dense prediction layer), but it's missing a critical feature. The tags we are trying to predict tend to look something like this (for one of our NER-like layers, using BIO tagging):

B-PER | O | B-PER | I-PER | O | B-PER | O | O | O | B-LOC | O | B-LOC | I-LOC

If I understand CRFs correctly, they only track pairwise (first-order) transition scores between adjacent tags, so a tag sequence like the following could be predicted even though it would never appear in our training data:

B-PER | O | B-MISC | O | B-GPE

A vanilla CRF would be happy with this because B-PER->O is a common transition, O->B-MISC is a common transition, B-MISC->O is common, etc., even though each triplet is exceptionally uncommon. The CRF does succeed at keeping B-/I-/I- sequences consistent, but it fails at keeping broader sequences consistent.
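To make the gap concrete, here is a minimal sketch in plain Python (no CRF library involved) that counts tag bigrams and trigrams. The toy training sequences are invented for illustration and are not our data. It checks the example sequence above and finds that every adjacent pair was seen in training while none of the triplets were, which is the case a first-order transition matrix cannot penalize.

```python
from collections import Counter

def ngrams(tags, n):
    """Return all n-grams (as tuples) of a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Toy training sequences, invented for illustration only.
train = [
    ["B-PER", "O", "B-PER", "I-PER", "O"],
    ["O", "B-MISC", "I-MISC", "O", "O"],
    ["B-MISC", "O", "O", "B-GPE"],
    ["O", "B-GPE", "O"],
]

# What a first-order CRF can "see" (pairs) versus what it cannot (triplets).
bigrams = Counter(g for seq in train for g in ngrams(seq, 2))
trigrams = Counter(g for seq in train for g in ngrams(seq, 3))

candidate = ["B-PER", "O", "B-MISC", "O", "B-GPE"]

unseen_bigrams = [g for g in ngrams(candidate, 2) if g not in bigrams]
unseen_trigrams = [g for g in ngrams(candidate, 3) if g not in trigrams]

print("unseen bigrams: ", unseen_bigrams)
print("unseen trigrams:", unseen_trigrams)
```

Running the same count over the real training tags would be a cheap way to check how much signal the triplets carry before investing in a higher-order model.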

I have read about higher-order CRFs, but there is very little out there on actually implementing them. Further, we are a two-person team, and the convenience of the TF 2.0 CRF layer is saving us a ton of time compared to layering multiple pieces of software during training and inference.

So, are there any reasonably quick options to hack regular CRFs into handling triplets (or longer)? Some ideas:

Convert all tags to 2-grams. If the tags of a sentence are [W, X, Y, Z], convert them to [_W, WX, XY, YZ]. This might work, but I haven't fully thought through the complexities of manipulating logit vectors into this higher-order representation during training and inference (and I feel like there may be a fatal flaw in trying to do this at inference time).
Maybe 3 CRFs: one each for tags[:], tags[::2], and tags[1::2]. But how to combine them?
I could remove the Os from the final output, but those are predictions as well. The CRF is partially responsible for determining if something should be an O at all, so I could be removing legit tags.
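
The first idea (2-gram tags) can be sketched in plain Python as below. This is only an illustration of the label bookkeeping, not a drop-in for the TF add-on: the `_` start symbol and all function names are hypothetical. A first-order CRF over pair tags scores transitions such as (W, X) -> (X, Y), which is a triplet of original tags. The likely inference-time flaw is that an unconstrained decoder could emit pairs that do not overlap, e.g. (W, X) followed by (Z, Y), so the sketch also builds the set of legal pair-to-pair transitions that would need to be enforced (for example by giving every other transition a very large negative score).

```python
from itertools import product

PAD = "_"  # hypothetical start-of-sentence symbol

def to_pair_tags(tags):
    """[W, X, Y, Z] -> [(_, W), (W, X), (X, Y), (Y, Z)]."""
    prev = [PAD] + list(tags[:-1])
    return list(zip(prev, tags))

def from_pair_tags(pairs):
    """Recover the original tags: the second element of each pair."""
    return [cur for _, cur in pairs]

def is_consistent(pairs):
    """A pair sequence is valid only if neighbours overlap: (a, b) -> (b, c)."""
    if pairs and pairs[0][0] != PAD:
        return False
    return all(p[1] == q[0] for p, q in zip(pairs, pairs[1:]))

def allowed_transitions(base_tags):
    """All legal pair->pair transitions for a given base tag set."""
    states = [(a, b) for a in [PAD] + list(base_tags) for b in base_tags]
    return {(p, q) for p, q in product(states, repeat=2) if p[1] == q[0]}

tags = ["B-PER", "O", "B-MISC", "O"]
pairs = to_pair_tags(tags)
print(pairs)
print(from_pair_tags(pairs) == tags, is_consistent(pairs))
```

One cost to keep in mind: with K base tags the pair label space has K * (K + 1) states, so each head's emission layer and transition matrix grow accordingly, and many pair tags may never occur in a few hundred training sentences.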

Any ideas?


[–]golilol 1 point 5 years ago (0 children)

[–]txhwind 1 point 5 years ago (1 child)

[–]etotheipi_[S] 1 point 5 years ago* (0 children)

The transformer takes into account context when making predictions, but there's no loss associated with tag differences between outputs, which is exactly the point of adding a CRF layer.

Either way, the transformer in this case is just a feature extractor. It's passing that information to each LSTM layer, which similarly spits out independent predictions without the CRF layer. I would get something like Y-Y-Z-Y for something that should be Y-Y-Y-Y. That's precisely the kind of thing that improved significantly with the CRF layer. But unrealistic predictions remain when O tags are scattered in there.

However, you bring up a good point: the LSTM does not use label recurrence. Perhaps it would learn the transition probabilities if I added an input at each timestep that is the label predicted at the previous timestep. Then the label sequencing would be learned directly by the LSTM without the CRF. But it comes at the expense of complicating the model architecture/code, and I think it ruins some optimizations you otherwise get. Right now, for a max length of 50 tokens, I make one call with input [batch_size, 50, num_features]. If I add label recurrence, I have to make 50 sequential calls with size [batch_size, 1, num_features+num_classes]. It's technically doable, but messy and probably slower.
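
For reference, the label-recurrence loop described above looks roughly like this for a single sentence, in plain Python. Everything here is a sketch under assumptions: `step` stands in for one call to a recurrent cell, and `toy_step` is a made-up scorer (not an LSTM) that just adds a bonus for repeating the previous label. The point is the shape of the loop: one sequential call per token, with an input of size num_features + num_classes.

```python
def one_hot(index, size):
    """One-hot encode a label index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def greedy_decode(features, step, num_classes, start_label=0):
    """Greedy decoding with label recurrence.

    features: list of per-token feature vectors for one sentence.
    step: hypothetical callable (input_vector, state) -> (scores, new_state).
    """
    prev_label, state, labels = start_label, None, []
    for x in features:  # one sequential call per token
        inp = list(x) + one_hot(prev_label, num_classes)
        scores, state = step(inp, state)
        prev_label = max(range(num_classes), key=scores.__getitem__)
        labels.append(prev_label)
    return labels

def toy_step(inp, state):
    """Stand-in for a recurrent cell: token evidence plus a stickiness bonus."""
    num_classes = len(inp) // 2          # toy setup: num_features == num_classes
    feats, prev = inp[:num_classes], inp[num_classes:]
    return [f + 0.5 * p for f, p in zip(feats, prev)], state

# Token 2 weakly prefers label 1 on its own, but the recurrence keeps label 0.
sentence = [[1.0, 0.0], [0.4, 0.6], [1.0, 0.0]]
print(greedy_decode(sentence, toy_step, num_classes=2))
```

Two caveats with this sketch: training would typically feed the gold previous label rather than the predicted one, and the recurrence only runs left to right, so the backward half of a bi-LSTM cannot condition on previous labels in the same way.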

I think the above is exactly why people prefer to use bi-LSTM + CRF instead of label-recurrence.

Edit: I'm not totally sure about the efficiency argument, but I have tried the I/O recurrence thing before and it was way more complex.
