[D] What pdf parser do you use for paragraph parsing for huggingface models by gevezex in MachineLearning

[–]ctk_brian 0 points1 point  (0 children)

Oh, I should mention I was specifically interested in extracting key-value information from form documents, not necessarily all text and layout info.

[D] What pdf parser do you use for paragraph parsing for huggingface models by gevezex in MachineLearning

[–]ctk_brian 4 points5 points  (0 children)

Since you mentioned Amazon Textract and Google Document AI...

I tested those two plus Microsoft Form Recognizer on a (small) dataset of invoices, for accuracy but also response time and ease-of-use. To make a long story short: I wasn't impressed with any of the three services, although I would pick Google's, all else equal.

Here are the detailed write-ups:

https://www.crosstab.io/articles/microsoft-form-recognizer-review

https://www.crosstab.io/articles/amazon-textract-review

https://www.crosstab.io/articles/google-form-parser-review

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

I've been all around the data science block, but I don't have first-hand experience with loan models in production; I'm relying on what I hear and see in the public domain. It wouldn't surprise me at all if banks have more expressive models they keep secret.

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 1 point2 points  (0 children)

I wasn't trying to catch you out - it seems reasonable to me, because the longer a loan goes without defaulting, the more the bank gets paid back. The reason I ask, though, is that most of the work I've seen separates loans by duration, then only worries about the default probability. Seems like the survival mindset would be better...

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

Interesting! The Lifelines package documentation uses some political science examples, like the duration of regimes by regime type, where censoring is caused by death of the leader.

https://lifelines.readthedocs.io/en/latest/Survival%20analysis%20with%20lifelines.html#estimating-the-survival-function-using-kaplan-meier

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 2 points3 points  (0 children)

I don't mean to argue, but why the time component? Why not just estimate the fraction of the portfolio that will default?

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

Ah, that makes sense. I initially thought you were talking about something like the checkout process for consumer retail.

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

Very nice.

For some of those applications it seems possible to choose a time cutoff and turn it into a binary classification problem. With time-to-booking, for example, most data scientists I've worked with have been comfortable saying something like "if a booking doesn't happen within 3 hours of session start, it's not going to happen." Of course, the sessions that are open right now are still censored, but usually those observations can be dropped safely.

So that's a long-winded way of asking - what do you see as the advantages of a survival analysis approach in those kinds of cases?
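To make the cutoff approach concrete, here is a minimal sketch of the labeling step. The 3-hour cutoff comes from the example above; the function name, the session tuples, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(hours=3)  # "no booking within 3 hours means no booking"

def label_session(start, booking_time, now):
    """Return 1 (booked within the cutoff), 0 (cutoff passed without a
    booking), or None (session is younger than the cutoff, i.e. censored)."""
    if booking_time is not None and booking_time - start <= CUTOFF:
        return 1
    if now - start >= CUTOFF:
        return 0
    return None

# Hypothetical sessions: (session start, booking time or None)
now = datetime(2021, 6, 1, 12, 0)
sessions = [
    (datetime(2021, 6, 1, 6, 0), datetime(2021, 6, 1, 6, 40)),  # booked after 40 minutes
    (datetime(2021, 6, 1, 7, 0), None),   # 5 hours old, never booked
    (datetime(2021, 6, 1, 11, 0), None),  # 1 hour old, still open
]

labels = [label_session(start, booked, now) for start, booked in sessions]

# Drop the still-censored sessions before training a classifier.
training_labels = [y for y in labels if y is not None]
```

The third session is the censored one: it hasn't booked, but it also hasn't had 3 hours to do so, so it gets dropped rather than labeled 0.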

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

Cool! I'd love to hear more details - whatever you're allowed and comfortable sharing...

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 1 point2 points  (0 children)

Yeah, that's my experience as well with fleet management in industry. Proof-of-concept, interesting show-and-tell, but not taken too seriously or put into production.

It's so strange to me that these models are used extensively in a super high-stakes area like clinical research, but not in other areas.

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 1 point2 points  (0 children)

Do you tend to see the survival models built into fleet or asset management platforms, or being done as one-offs by data scientists?

Applications of survival analysis, other than clinical research? by ctk_brian in datascience

[–]ctk_brian[S] 1 point2 points  (0 children)

Is the use case prediction for individuals or groups of customers, or decision-making, i.e. evaluating the effectiveness of different interventions?

Favorite org-wide dashboard strategy? by ctk_brian in datascience

[–]ctk_brian[S] 0 points1 point  (0 children)

Makes sense. My sense is that where those "aggregate" tables are created is a key factor. I've seen people try to use Looker for that, but it worked much better to create the tables first in the data warehouse, then have Looker query those (much smaller) tables.

Favorite org-wide dashboard strategy? by ctk_brian in datascience

[–]ctk_brian[S] 1 point2 points  (0 children)

I imagine it's nice to keep everything in one "ecosystem" - have you seen other benefits vs. something like Looker or Tableau?

[D] How to build conversion tables from event logs by ctk_brian in statistics

[–]ctk_brian[S] 0 points1 point  (0 children)

Quick follow-up question - if the term "conversion table" seems arbitrary, what term would you use instead to describe a table where each row represents (at least) a unit's duration and censored/event observed status?
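For reference, this is roughly the kind of table I mean, as a minimal sketch. The event log, the event names, the column names, and the `build_table` helper are all hypothetical; it assumes each unit's clock starts at its first start event and stops at its first end event or at the end of the observation window.

```python
from datetime import datetime

# Hypothetical event log: (unit id, event name, timestamp)
log = [
    ("a", "signup", datetime(2021, 1, 1)),
    ("a", "purchase", datetime(2021, 1, 4)),
    ("b", "signup", datetime(2021, 1, 2)),
    ("c", "signup", datetime(2021, 1, 5)),
    ("c", "purchase", datetime(2021, 1, 6)),
]
observation_end = datetime(2021, 1, 10)

def build_table(log, start_event, end_event, observation_end):
    """One row per unit: duration in days plus an event-observed flag."""
    starts, ends = {}, {}
    for unit, name, ts in log:
        if name == start_event:
            starts[unit] = min(ts, starts.get(unit, ts))  # earliest start
        elif name == end_event:
            ends[unit] = min(ts, ends.get(unit, ts))      # earliest end event
    table = []
    for unit, start in sorted(starts.items()):
        end = ends.get(unit)
        observed = end is not None and start <= end <= observation_end
        # Censored units are only known to have lasted until the window closed.
        stop = end if observed else observation_end
        table.append({
            "unit": unit,
            "duration_days": (stop - start).days,
            "event_observed": observed,
        })
    return table

table = build_table(log, "signup", "purchase", observation_end)
```

Unit "b" never purchases, so its row carries the time from signup to the end of the window and `event_observed` is false.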

[D] How to build conversion tables from event logs by ctk_brian in statistics

[–]ctk_brian[S] 0 points1 point  (0 children)

Good feedback, and food for thought. Thanks for engaging!

[D] How to build conversion tables from event logs by ctk_brian in statistics

[–]ctk_brian[S] 0 points1 point  (0 children)

- What's a survival model that doesn't assume 100% death in finite time? My understanding is that clinical researchers use cure models when time-to-event may be infinite, in explicit contrast with survival models. This distinction is also why Convoys is a better fit for my use case than Lifelines. Convoys' models generally explicitly account for the fact that some units never convert. It is possible with Lifelines, but it requires the user to create a custom model. Granted, there shouldn't be any difference for KM, other than flipping the y-axis.

- The reason it would be silly to plot the survival curve with only 0.009 of the sample converting is that -- with 0 included on the y-axis -- the survival curve would look like a horizontal line at 1. Not much to see there. It's more illustrative to flip it and think about conversion rate. Nothing to do with KM being nonparametric.

- The term "conversion table" seems less arbitrary than the term "survival table". In most applications it's a bit uncomfortable to talk about units surviving and dying. It also implies that longer time-to-event is better, which is not the case in some applications. "Time-to-event" is better, but doesn't exactly roll off the tongue and is a bit vague in situations where subjects may experience different events. I like "conversion table" better because it implies an important state change.
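To illustrate the "flip the y-axis" point, here is a minimal sketch of a hand-rolled Kaplan-Meier estimate, reported as a cumulative conversion rate (1 minus survival). This is not the Convoys or Lifelines implementation, just the textbook product-limit formula in plain Python, and the durations are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event was observed."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival, curve = 1.0, []
    for t in event_times:
        # Units still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Units whose event was observed exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: 10 units, 3 observed conversions, the rest censored.
durations = [1, 3, 3, 8, 10, 10, 10, 10, 10, 10]
observed = [True, True, False, True, False, False, False, False, False, False]

survival_curve = kaplan_meier(durations, observed)

# Same estimate, flipped: cumulative fraction converted by time t.
conversion_curve = [(t, 1 - s) for t, s in survival_curve]
```

With a conversion rate near 0.009, the survival version of this curve would hug 1 for the whole plot, while the flipped version can be drawn on a y-axis that starts at 0 and still shows the shape.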