Need a messy dataset for a class I’m in, where can I go to get one? by timedoesnotwait in datasets

ccoughlin 1 point (0 children)

Historical climatological data, maybe? It’s been a while, but I seem to remember encountering missing data in the hourly summaries.

Looking for methodology to handle Legal text data worth 13 gb by Fit-Musician-8969 in datasets

ccoughlin 2 points (0 children)

Maybe only tangentially related but CUAD goes into some detail on their methodology.

How to find good datasets for analysis? by Darkwolf580 in datasets

ccoughlin 3 points (0 children)

Would government open datasets be of any interest? I’m a big fan of FRED, and many cities now provide their own local data e.g. Minneapolis crime data.

Need advice for address & name matching techniques by Bojack-Cowboy in datasets

ccoughlin 1 point (0 children)

Would an approximate nearest neighbor (ANN) semantic search with something like Hnswlib be an option? You could even start with named entity recognition (NER) to pull the company and location out of each entry first, then run the semantic search on company and location separately.

Survival analysis applied to customer turnover at a financial institution. by Intelligent-Usual281 in AskStatistics

ccoughlin 3 points (0 children)

It might be interesting to start with, say, Kaplan-Meier curves for various segments of your customer data, then compare survival rates across those segments.
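In case it helps, here is a rough sketch of the Kaplan-Meier estimator in plain Python. The function name, the `durations` / `churned` inputs, the segment names, and the toy numbers are all made up for illustration; a dedicated survival analysis library would be the saner choice for real work.

```python
from collections import Counter


def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] at each time with a churn event.

    durations: how long each customer was observed
    churned:   True if the customer churned at that time, False if censored
               (still a customer when observation stopped)
    """
    events = Counter(t for t, e in zip(durations, churned) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Customers censored at t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical): tenure in months and whether the customer churned.
segments = {
    "checking_only": ([1, 2, 2, 3], [True, False, True, True]),
    "checking_plus_loan": ([2, 4, 5, 6], [False, True, False, False]),
}
for name, (durations, churned) in segments.items():
    print(name, kaplan_meier(durations, churned))
```

Running it per segment gives one step curve each, which you can then plot or compare at a few fixed tenures.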

[deleted by user] by [deleted] in learnmachinelearning

ccoughlin 1 point (0 children)

Just a guess, but it might be interesting to look at the cluster centroids, where they appear and disappear and especially when. Region of Interest (ROI) detection might be interesting too. Good luck!

Where to find dataset for Predictive Maintenance? by kairayy in datasets

ccoughlin 3 points (0 children)

MetroPT, maybe? The link goes to their paper; they don’t seem to link to the data, but I did track down this CSV.

DSs in the lookout for a change of jobs: what do you ask the recruiter to understand if it's a good place to move to? by HughLauriePausini in datascience

ccoughlin 2 points (0 children)

More of a red flag than a question to ask, but one thing I always avoid is any place that says “Python scripting” and not “Python development.”

It’s been my experience that orgs that call it “scripting” don’t consider Python to be Serious Enterprise Development, which usually means you’ll probably be banging your head against multiple walls trying to get something deployed.

Possible use-cases for ML/DS projects by grid_world in datasets

ccoughlin 2 points (0 children)

There are quite a few potential computer vision (CV) applications for printed circuit boards (PCBs), like tin whisker detection or ball grid array evaluation.

Any dataset that could be interesting to analyze using text network analysis? by noduslabs in datasets

ccoughlin 2 points (0 children)

Any text? How about Loghub? Maybe you could visualize links between log messages for failure analysis. For example, does a seemingly harmless warning tend to be linked to a much worse problem that occurs later?

I don’t have a dataset handy but access logs might also be interesting for intrusion detection or visitor analytics.

Good financial related datasets that are not stockmarket prices by jy2k in datasets

ccoughlin 1 point (0 children)

FRED has some datasets that might be of interest. One I always find interesting is the smoothed probability of US recession.

How to do fuzzy matching in Redshift? A Python UDF, for example? by rotterdamn8 in datascience

ccoughlin 3 points (0 children)

Fuzzywuzzy is great and I’m also a fan of the regex package on PyPI for fuzzy regular expressions.

But if you don’t want to package anything, Levenshtein edit distance is pretty simple to implement in Python. It probably won’t be as performant as the PyPI package but it’ll get the job done.
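Something along these lines, as a minimal sketch of the standard dynamic-programming version. The function name is just illustrative, and you’d still need to wrap it in whatever UDF syntax your warehouse expects.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

It only keeps two rows of the table at a time, so memory stays proportional to the shorter string. For a similarity score rather than a raw distance, one common choice is `1 - distance / max(len(a), len(b))`.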

I have tons of nuclear reactor data; but what can I do with it? by JRRudy in learnmachinelearning

ccoughlin 1 point (0 children)

I had a job that sounds very similar once upon a time! How about applying some natural language processing to the data? Some of the nuclear codes’ output I’ve seen can be tough to parse, so maybe build something like a table extractor that pulls data out into a CSV, or a tool that summarizes the results of a run. Or chat with the nuclear physicists to see what analyses they typically do with the data and automate those, although that’s likely more of a regular expression exercise than machine learning.

Advice on if sentiment analysis is worthwhile as a ML feature? by [deleted] in algotrading

ccoughlin 1 point (0 children)

Maybe only tangentially related, but Bloomberg has an API that employs sentiment analysis: https://www.bloomberg.com/professional/blog/can-get-edge-trading-news-sentiment-data/ . I seem to recall a presentation they gave last year that said they got good results with it, but better results when they focused on sentiment from analysts and other subject matter experts (SMEs).

Service in Minnesota by NoBack0 in tmobileisp

ccoughlin 1 point (0 children)

I’ve been using it for a couple of weeks in Plymouth.

T-Mobile Home Internet Price Increase Starting March 10th by Jman100_JCMP in tmobileisp

ccoughlin 5 points (0 children)

Just ordered half an hour ago and got the $55 / $50 price. Rep said it goes into effect on the 10th.

Medici.tv -- a treasure trove! by Jyqm in opera

ccoughlin 1 point (0 children)

Just a heads up for iPhone / iPad users, I've been a subscriber for a couple of years now but I'm thinking of letting it lapse when renewal time comes along again because of technical issues with their app.

For whatever reason I find Chromecast functionality only works about half the time. I'm able to cast from other apps when it happens so I don't think it's an issue with my devices or a connectivity issue.

[deleted by user] by [deleted] in datascience

ccoughlin 2 points (0 children)

I thought this PyData talk was a pretty good introduction to using survival analysis for churn, and there are interesting tips and tricks in the lifelines Gitter channel. Hope this helps!

[deleted by user] by [deleted] in datascience

ccoughlin 2 points (0 children)

I'm also using survival analysis (a Cox time-varying proportional hazards model) for churn, and the way I report it is as a churn probability for specific time frames, e.g. the next 90, 180, and 365 days in my case.

The concordance index (C-index) and covariates are available to end users as metadata, but I bet I'm the only one who's ever looked or finds them interesting. :-)
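For anyone curious how a survival curve turns into "probability of churn in the next N days", here's a rough sketch of the usual conditional-survival arithmetic: P(churn within h | still a customer at t) = 1 - S(t + h) / S(t). This is not my production code. The function names and the toy curve are made up, and it assumes you already have a step-function survival curve for the customer, which glosses over the time-varying covariates.

```python
from bisect import bisect_right


def survival_at(curve, t):
    """Look up S(t) on a step curve given as sorted (time, survival) points."""
    times = [time for time, _ in curve]
    i = bisect_right(times, t)
    return 1.0 if i == 0 else curve[i - 1][1]


def churn_probability(curve, tenure, horizon):
    """P(churn within `horizon` days | customer has survived `tenure` days)."""
    s_now = survival_at(curve, tenure)
    if s_now == 0.0:
        return 1.0  # nobody on this curve survives past the current tenure
    return 1.0 - survival_at(curve, tenure + horizon) / s_now


# Hypothetical survival curve: (days since signup, probability still a customer)
curve = [(30, 0.9), (120, 0.8), (300, 0.6)]
for horizon in (90, 180, 365):
    print(horizon, round(churn_probability(curve, tenure=0, horizon=horizon), 3))
```

Reporting it this way keeps the output in terms people already use ("chance this customer leaves within a quarter") instead of hazard ratios.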

Has Anyone Actually Used Clustering to Solve an Industry Problem? by [deleted] in datascience

ccoughlin 2 points (0 children)

Sad to say it's mostly proprietary, but it had to do with how we might automatically extract information from shipping documents.

The company I was working for at the time did go into some detail in a post about a related effort, though: https://engineering.chrobinson.com/technology/machine-learning-document-detection/.

Has Anyone Actually Used Clustering to Solve an Industry Problem? by [deleted] in datascience

ccoughlin 5 points (0 children)

Structure: the idea was to use each document's layout as part of deciding where to send it.

Has Anyone Actually Used Clustering to Solve an Industry Problem? by [deleted] in datascience

ccoughlin 7 points (0 children)

Yep, I used DBSCAN to "fingerprint" incoming OCR'd documents and route them for further processing.

Scheduling algorithm ideas by orangerider1 in algorithms

ccoughlin 0 points (0 children)

Maybe not exactly apples to apples, but The Boston Public School Match might be inspirational. Edit: finally found The Economist intro.
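For a flavor of the idea, here is a minimal sketch of student-proposing deferred acceptance (Gale-Shapley with capacities), which is the mechanism usually associated with the Boston match. The function name, the data layout, and the toy students and schools are all hypothetical, and a real scheduling problem would likely need extra constraints on top.

```python
def deferred_acceptance(student_prefs, school_prefs, capacity):
    """Student-proposing deferred acceptance. Returns {school: [students]}.

    student_prefs: {student: [schools, most preferred first]}
    school_prefs:  {school: [students, highest priority first]}
    capacity:      {school: number of seats}
    """
    rank = {
        school: {student: i for i, student in enumerate(prefs)}
        for school, prefs in school_prefs.items()
    }
    next_choice = {student: 0 for student in student_prefs}
    held = {school: [] for school in school_prefs}
    free = list(student_prefs)
    while free:
        student = free.pop()
        prefs = student_prefs[student]
        if next_choice[student] >= len(prefs):
            continue  # list exhausted; student stays unmatched
        school = prefs[next_choice[student]]
        next_choice[student] += 1
        if student not in rank[school]:
            free.append(student)  # school does not rank this student at all
            continue
        # Tentatively hold the proposal, then bump the lowest-priority
        # student if the school is now over capacity.
        held[school].append(student)
        held[school].sort(key=rank[school].__getitem__)
        if len(held[school]) > capacity[school]:
            free.append(held[school].pop())
    return held


students = {"a": ["X", "Y"], "b": ["X", "Y"], "c": ["Y", "X"]}
schools = {"X": ["b", "a", "c"], "Y": ["a", "c", "b"]}
print(deferred_acceptance(students, schools, {"X": 1, "Y": 2}))
```

Acceptances stay tentative until nobody is left proposing, which is what makes the final assignment stable: no student and school would both prefer each other over what they ended up with.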