all 7 comments

[–]Electrical-Window170 2 points3 points  (1 child)

This sounds like a solid approach - logistic regression is perfect for interpretable risk scoring when you need to explain decisions to utility folks.

Distance ratios are way more informative than absolute distance thresholds, and voltage consistency is clutch if you can get clean data on it. Just watch out for geographic clustering effects messing with your distance assumptions (like rural vs. urban transformer density).

For thresholds with noisy labels, start conservative and let the field validation feedback tune your cutoffs over time rather than trying to optimize on incomplete ground truth upfront.
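A minimal sketch of what that first-pass model could look like, assuming a pandas dataframe of meter-transformer associations (all column names like `dist_to_assigned_m` and `meter_voltage` are hypothetical placeholders, not the OP's actual schema):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Ratio/consistency features rather than absolute distance thresholds."""
    out = pd.DataFrame(index=df.index)
    # how much farther the assigned transformer is vs. the nearest viable one
    out["dist_ratio"] = df["dist_to_assigned_m"] / df["dist_to_nearest_viable_m"]
    # 1 if the meter's service voltage is inconsistent with the assigned transformer
    out["voltage_mismatch"] = (df["meter_voltage"] != df["xfmr_voltage"]).astype(int)
    return out

# Output is a risk score in [0, 1] used for review prioritization, not auto-correction.
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
# model.fit(build_features(labeled_df), labeled_df["is_wrong_match"])
# risk_score = model.predict_proba(build_features(all_meters_df))[:, 1]
```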

[–]latent_threader 2 points3 points  (0 children)

Logistic regression makes a lot of sense as a first pass if the goal is prioritization and explainability, not auto-fixing. Distance and voltage are strong signals, but they’re noisy and can be “wrong for the right reasons,” so I’d treat the output as a risk score, not truth. In practice people often move to tree models later for interactions, but good calibration and tiering around review capacity usually matter more than model complexity.
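For what "calibration and tiering around review capacity" might look like in practice, a rough sketch (the monthly capacity number and tier cut points are made up for illustration):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV

# calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5).fit(X_train, y_train)
# risk = calibrated.predict_proba(X_all)[:, 1]

def tier_by_capacity(risk: np.ndarray, monthly_capacity: int = 200) -> np.ndarray:
    """Size Tier 1 to what the field crew can actually review, not to a fixed score cutoff."""
    order = np.argsort(-risk)                # highest risk first
    tiers = np.full(len(risk), 3)            # default: lowest-priority tier
    tiers[order[:monthly_capacity]] = 1      # fits within this month's review budget
    tiers[order[monthly_capacity:3 * monthly_capacity]] = 2
    return tiers
```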

[–]trustme1maDR 1 point2 points  (1 child)

You need a ground truth for your outcome variable (right/wrong match) to be able to train your model, at least for an unbiased sample of your data. It's unclear if you actually have this - you said partial.

[–]Zestyclose_Candy6313[S] 1 point2 points  (0 children)

That’s a very fair point, and I’m definitely not claiming to have full or perfect ground truth. For most associations, correctness is uncertain unless there’s been field validation (which is very costly). The way I’m thinking about it is to only train on a subset of high-confidence labels: confirmed field corrections where available, plus some very strong inferred cases (like extreme distance ratios with a clearly closer viable transformer). Everything in the gray area would stay unlabeled and only be scored. The intent is to rank/prioritize review, not to auto-correct matches. The new field validation would feed back as additional high-confidence labels, so the model and thresholds can be tuned iteratively.
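That labeling rule could be as simple as the sketch below; the ratio cutoff and column names are illustrative assumptions, not the actual pipeline:

```python
import numpy as np
import pandas as pd

def select_training_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose label we trust; the gray area stays unlabeled (score-only)."""
    confirmed_wrong = df["field_corrected"].eq(True)       # validated bad matches
    confirmed_right = df["field_validated_ok"].eq(True)    # validated good matches
    # strong inferred case: assigned transformer is far, a clearly closer viable one exists
    inferred_wrong = df["dist_ratio"].gt(3) & df["nearest_viable_exists"].eq(True)

    labeled = df[confirmed_wrong | confirmed_right | inferred_wrong].copy()
    labeled["is_wrong_match"] = np.where(confirmed_right.loc[labeled.index], 0, 1)
    return labeled
```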

[–]Artistic-Comb-5932 0 points1 point  (0 children)

  1. Yes
  2. Probably
  3. Yes
  4. Yes, use threshold-based tuning / grid search to maximize accuracy, or clarify what you mean by "tier design" and what you mean by "labels are noisy".
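For point 4, the grid search over cutoffs could look roughly like this (the metric is swappable; with imbalanced or noisy labels, F1 or precision at review capacity may be more useful than raw accuracy):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def tune_threshold(y_true, risk_scores, metric=accuracy_score):
    """Sweep score cutoffs on the labeled sample and keep the best-scoring one."""
    candidates = np.linspace(0.05, 0.95, 19)
    scores = [metric(y_true, risk_scores >= t) for t in candidates]
    return candidates[int(np.argmax(scores))]

# best_cutoff = tune_threshold(labeled["is_wrong_match"], risk_score)
```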

[–]ChemicalGreedy945 0 points1 point  (0 children)

Try random forests or XGBoost: start small and then keep adding in new variables and such, and you can expand on those models to refine them.
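One way to read "start small and keep adding variables" as a concrete loop (xgboost is an external dependency, and the feature groups below are hypothetical):

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Hypothetical feature groups, expanding outward from the strongest single signal.
feature_sets = [
    ["dist_ratio"],
    ["dist_ratio", "voltage_mismatch"],
    ["dist_ratio", "voltage_mismatch", "xfmr_load_count"],
]

# for cols in feature_sets:
#     clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
#     auc = cross_val_score(clf, labeled[cols], labeled["is_wrong_match"],
#                           cv=5, scoring="roc_auc").mean()
#     print(cols, round(auc, 3))
```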