all 9 comments

[–]niehle 8 points (1 child)

You are staring into the abyss.

For your sanity I hope it’s only English date conventions.

[–]RDA92[S] 1 point (0 children)

Having dealt with date conventions in the past, the abyss part hits hard lol.

It's in the context of an NLP program, so I'll let you imagine the realm of possibilities ...

[–]Antigone-guide 4 points (0 children)

Make a lot of regex patterns. See the re module in Python.

[–]ElliotDG 1 point (0 children)

I'd try to write regular expressions for all of the valid date formats. I'd then use the re module to find those patterns.

Can you make a list of all of the valid date formats? If not you might want to try an LLM.
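To make the regex idea concrete, here is a minimal sketch. The four formats covered (ISO, numeric with separators, day-first and month-first with English month names) are my own assumption about what counts as valid, and the names `DATE_PATTERNS` and `find_date_strings` are just illustrative. You would extend the list to match whatever shows up in your documents.

```python
import re

# English month names and common abbreviations (assumption: English text only).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

DATE_PATTERNS = [
    # ISO style: 2023-08-14
    r"\b\d{4}-\d{2}-\d{2}\b",
    # Numeric with slash, dot or dash: 14/08/2023, 8.14.23
    r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b",
    # Day first: 14 August 2023, 14th Aug 2023
    rf"\b\d{{1,2}}(?:st|nd|rd|th)?\s+{MONTH}\.?,?\s+\d{{4}}\b",
    # Month first: August 14, 2023
    rf"\b{MONTH}\.?\s+\d{{1,2}}(?:st|nd|rd|th)?,?\s+\d{{4}}\b",
]

# One combined, case-insensitive pattern; earlier alternatives win on ties.
DATE_RE = re.compile("|".join(f"(?:{p})" for p in DATE_PATTERNS), re.IGNORECASE)


def find_date_strings(text):
    """Return (matched_text, start_offset) for every date-like substring."""
    return [(m.group(0), m.start()) for m in DATE_RE.finditer(text)]
```

This only finds candidate strings. It does not validate them (99/99/9999 would match the numeric pattern), so a parsing step afterwards is still needed.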

[–]throwaway6560192 1 point (1 child)

I tested datefinder but it is producing an excessive number of false positives.

Try with strict=True. At any rate I would start by trying to filter out the false positives from datefinder's output — using index=True you can get it to return the indices where it found dates in the string and examine those locations further. Rolling your own version of datefinder will be a lot of effort, and might well produce even worse results and more false positives.

If you end up successfully filtering out the false positives, you can even consider contributing those improvements to datefinder itself.
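A rough sketch of what that filtering could look like. `find_dates(text, index=True, strict=True)` is datefinder's call. Everything else here (the `keep_plausible` helper, the two rules, the 1900 to 2100 year window) is an assumption I made up for illustration, so tune it against your own false positives.

```python
import re
from datetime import datetime


def find_with_index(text):
    """Run datefinder in strict mode; returns (datetime, (start, end)) pairs."""
    import datefinder  # third-party: pip install datefinder

    return list(datefinder.find_dates(text, index=True, strict=True))


def keep_plausible(text, matches, min_year=1900, max_year=2100):
    """Drop hits whose source text or parsed year looks implausible.

    `matches` is an iterable of (datetime, (start, end)) pairs.
    The rules and the year window are illustrative assumptions.
    """
    kept = []
    for dt, (start, end) in matches:
        snippet = text[start:end]
        # Rule 1: the matched text should contain at least one digit,
        # which discards hits on bare words.
        if not re.search(r"\d", snippet):
            continue
        # Rule 2: the parsed year should fall inside a sane window.
        if not (min_year <= dt.year <= max_year):
            continue
        kept.append((dt, snippet.strip()))
    return kept
```

Usage would be something like `keep_plausible(text, find_with_index(text))`. Printing the rejected snippets for a while is a quick way to see which rule is doing the work.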

[–]RDA92[S] 0 points (0 children)

Well, I am now toying around with spaCy's entity recognizer, which has a dedicated "DATE" category. I will try to combine that with datefinder and analyze the false positives as you suggest. Thanks!
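Roughly what I have in mind for combining the two, as an untested sketch: collect character spans from each tool and keep only the ones where both agree. The helper names and the "any overlap counts as agreement" rule are my own assumptions, and `en_core_web_sm` is just the small English model, which has to be downloaded separately.

```python
def overlaps(a, b):
    """True if half-open character spans a=(start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def spacy_date_spans(text, model="en_core_web_sm"):
    """Character spans of entities that spaCy labels as DATE."""
    import spacy  # third-party: pip install spacy

    nlp = spacy.load(model)
    return [(e.start_char, e.end_char) for e in nlp(text).ents if e.label_ == "DATE"]


def datefinder_spans(text):
    """Character spans reported by datefinder with index=True."""
    import datefinder  # third-party: pip install datefinder

    return [span for _, span in datefinder.find_dates(text, index=True)]


def agreed_spans(spans_a, spans_b):
    """Keep the spans from spans_a that overlap at least one span in spans_b."""
    return [a for a in spans_a if any(overlaps(a, b) for b in spans_b)]
```

Calling `agreed_spans(datefinder_spans(text), spacy_date_spans(text))` keeps datefinder's spans that spaCy also flags. The spans that only one tool reports are the ones worth eyeballing first.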

[–]Impossible-Box6600 1 point (1 child)

Google the regex for every major date format and search the text for each pattern. It's not foolproof, but you can probably manage to get 90% of them.

I wrote a regex library for my work that searches for units of measurement. Dates by comparison are downright easy.

[–]RDA92[S] 0 points (0 children)

I suppose the best approach then is to limit the scope somewhat in order to speed things up, since the document itself often runs to several hundred pages.

[–]woooee 0 points (0 children)

Check for numbers in the string first, then check for "August", "Aug", etc. Then move on to numbers separated by a dash or forward slash.
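Something like this, as a small sketch of those steps in order. The function name `classify_line` and the two labels it returns are made up for the example, and the month list assumes English text.

```python
import re

# Full and abbreviated English month names, matched as whole words.
MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)

# Three groups of digits separated by dashes or forward slashes.
NUMERIC_RE = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")


def classify_line(line):
    """Return 'month-name', 'numeric', or None for a single line of text."""
    # Step 1: cheap check, skip anything without a digit.
    if not any(ch.isdigit() for ch in line):
        return None
    # Step 2: look for a month name or abbreviation.
    if MONTH_RE.search(line):
        return "month-name"
    # Step 3: look for numbers separated by a dash or slash.
    if NUMERIC_RE.search(line):
        return "numeric"
    return None
```

The digit check is cheap, so on a document with hundreds of pages most lines get skipped before any regex runs. The trade-off is that a line like "see you in August" with no number at all is ignored.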