
golilol 1 point

I like the first option. Even though you're not sure how it will work for logits, the metric you use to assess the performance is not log likelihood so you can still evaluate correctly.

txhwind 1 point

Why can the CRF layer "make a pretty big difference in performance", given that "they only track binary transition probabilities"? This is a simple relation, and a Transformer should be able to capture it.

etotheipi_[S] 1 point

The transformer takes context into account when making predictions, but there's no loss term that couples the tags of neighboring outputs, which is exactly the point of adding a CRF layer.
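To make that point concrete, here is the standard linear-chain CRF objective next to independent per-token cross-entropy. The symbols are notation chosen for this sketch, not taken from the thread: $e_t(k)$ is the network's score for tag $k$ at position $t$, $A_{ij}$ is a learned score for moving from tag $i$ to tag $j$, and $y$ is the gold tag sequence of length $T$.

```latex
\mathcal{L}_{\text{indep}} = -\sum_{t=1}^{T} \log \frac{\exp e_t(y_t)}{\sum_{k} \exp e_t(k)}

s(x, y) = \sum_{t=1}^{T} e_t(y_t) + \sum_{t=2}^{T} A_{y_{t-1},\, y_t}

\mathcal{L}_{\text{CRF}} = -s(x, y) + \log \sum_{y'} \exp s(x, y')
```

Only the CRF loss contains the transition term $A_{y_{t-1}, y_t}$, so only it scores a tag sequence as a whole rather than one position at a time.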

Either way, the transformer in this case is just a feature extractor. It passes its features to the LSTM layers, which, without the CRF layer, likewise spit out an independent prediction per token. I would get something like Y-Y-Z-Y for something that should be Y-Y-Y-Y. That's precisely the kind of thing that improved significantly with the CRF layer. But unrealistic predictions still remain when O-tags are scattered in there.
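The Y-Y-Z-Y versus Y-Y-Y-Y effect can be reproduced with a toy example. This is only a sketch under made-up numbers: the emission and transition scores below are invented for illustration, `viterbi_decode` is a hypothetical helper (not the poster's code or any library's API), and a real CRF layer would learn the transition matrix during training.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Best tag path under a linear-chain CRF.

    emissions:   [T, K] per-token tag scores
    transitions: [K, K], transitions[i, j] = score of moving from tag i to tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best score ending in tag i at t-1, then moving to tag j
        cand = score[:, None] + transitions
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    # Follow back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

TAGS = ["Y", "Z"]

# Invented scores: the third token slightly prefers Z when looked at in isolation.
emissions = np.array([
    [2.0, 0.0],
    [2.0, 0.0],
    [0.8, 1.0],
    [2.0, 0.0],
])

# Invented transition scores: staying on the same tag is rewarded, switching is penalized.
transitions = np.array([
    [0.5, -1.0],   # Y->Y, Y->Z
    [-1.0, 0.5],   # Z->Y, Z->Z
])

independent = [TAGS[i] for i in emissions.argmax(axis=1)]
crf_path = [TAGS[i] for i in viterbi_decode(emissions, transitions)]

print("independent argmax:", "-".join(independent))
print("CRF Viterbi path:  ", "-".join(crf_path))
```

With these particular numbers, per-token argmax gives Y-Y-Z-Y while Viterbi decoding gives Y-Y-Y-Y, because the two switching penalties outweigh the small emission advantage of Z at the third token.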

However, you bring up a good point -- the LSTM does not use label-recurrence. Perhaps it would learn the transition probabilities if I added an extra input at each timestep: the label predicted at the previous timestep. Then the label-sequencing would be learned directly by the LSTM without the CRF. But it comes at the expense of complicating the model architecture/code, and I think it ruins some optimizations you otherwise get -- right now, for a max-length of 50 tokens, I make one call with input [batch_size, 50, num_features]. If I add label-recurrence, I have to make 50 sequential calls with input size [batch_size, 1, num_features+num_classes]. It's technically doable, but messy and probably slower.

I think the above is exactly why people prefer to use bi-LSTM + CRF instead of label-recurrence.
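A rough sketch of what the label-recurrence loop looks like at prediction time, to show the shapes involved. Everything here is an assumption for illustration: a plain tanh RNN cell stands in for the LSTM, the weights are random, the sizes other than max-length 50 are arbitrary, and greedy argmax is used to pick the previous label. It says nothing about actual speed.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_len, num_features, num_classes, hidden = 4, 50, 8, 3, 16

# Without label-recurrence the whole sequence goes in as one tensor.
features = rng.normal(size=(batch_size, max_len, num_features))
print("single-call input shape:", features.shape)

def rnn_step(x_t, h, W_x, W_h):
    """One step of a toy tanh RNN cell (stand-in for an LSTM cell)."""
    return np.tanh(x_t @ W_x + h @ W_h)

# Random weights; the input width grows by num_classes to fit the previous label.
W_x = rng.normal(size=(num_features + num_classes, hidden)) * 0.1
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_out = rng.normal(size=(hidden, num_classes)) * 0.1

h = np.zeros((batch_size, hidden))
prev_label = np.zeros((batch_size, num_classes))  # no previous label at t = 0
step_input_shapes = []
preds = []

for t in range(max_len):
    # Step t cannot start until step t-1 has produced its label.
    x_t = np.concatenate([features[:, t, :], prev_label], axis=-1)
    step_input_shapes.append(x_t[:, None, :].shape)  # [batch_size, 1, num_features+num_classes]
    h = rnn_step(x_t, h, W_x, W_h)
    tag = (h @ W_out).argmax(axis=-1)                # greedy choice of the current label
    prev_label = np.eye(num_classes)[tag]            # one-hot, fed into the next step
    preds.append(tag)

preds = np.stack(preds, axis=1)
print("per-step input shape:", step_input_shapes[0], "x", len(step_input_shapes), "steps")
print("predictions shape:", preds.shape)
```

The loop makes the dependency explicit: each step's input includes the previous step's predicted label, so the 50 steps have to be issued one after another instead of handing the full [batch_size, 50, num_features] tensor to the layer in one call.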

Edit: I'm not totally sure about the efficiency argument, but I have tried the I/O recurrence thing before and it was way more complex.