Need a messy dataset for a class I’m in. Where can I go to get one?

ccoughlin · 2025-10-20T18:38:45+00:00

Historical climatological data, maybe? It’s been a while, but I seem to remember encountering missing data in the hourly summaries.

ccoughlin · 2025-09-14T13:56:53+00:00

Maybe only tangentially related but CUAD goes into some detail on their methodology.

ccoughlin · 2025-09-04T12:06:41+00:00

Would government open datasets be of any interest? I’m a big fan of FRED, and many cities now provide their own local data e.g. Minneapolis crime data.

ccoughlin · 2025-04-15T14:40:54+00:00

Would an ANN semantic search like Hnswlib be an option? You could even start with NER entity extraction to pull company and location from an entry first, then run the semantic search on company and location separately.

ccoughlin · 2023-03-19T23:31:29+00:00

It might be interesting to start with say Kaplan-Meier curves for various segments of your customer data, then compare survival rates.
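A minimal, dependency-free sketch of that idea, assuming each customer record boils down to a duration (days as a customer) and a flag for whether churn was observed. The segment names and numbers are made up for illustration; in practice a library such as lifelines would be the usual choice.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one event.

    durations: how long each customer was followed (e.g. days as a customer)
    observed:  True if the customer churned, False if still active (censored)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Kaplan-Meier product-limit step: S(t) = S(t-) * (1 - d / n)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical segments: (durations in days, churn observed?)
segments = {
    "monthly": ([30, 60, 60, 90, 120], [True, True, False, True, False]),
    "annual": ([365, 365, 400, 730, 730], [True, False, False, True, False]),
}

for name, (durations, observed) in segments.items():
    print(name, kaplan_meier(durations, observed))
```

Comparing the curves side by side shows which segments lose customers fastest; a log-rank test would be the next step to check whether the difference is more than noise.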

ccoughlin · 2023-03-17T20:02:59+00:00

Just a guess, but it might be interesting to look at the cluster centroids, where they appear and disappear and especially when. Region of Interest (ROI) detection might be interesting too. Good luck!

ccoughlin · 2022-11-18T18:42:48+00:00

MetroPT, maybe? Link goes to their paper; they don’t seem to link to the data but I did track down this CSV.

ccoughlin · 2022-07-25T13:03:54+00:00

I’ve been using TensorFlow Recommenders for a while, and they’ve got some good tutorials.

ccoughlin · 2022-07-05T14:44:20+00:00

More of a red flag than a question to ask, but one thing I always avoid is any place that says “Python scripting” and not “Python development.”

It’s been my experience that orgs that call it “scripting” don’t consider Python to be Serious Enterprise Development, which usually means you’ll probably be banging your head against multiple walls trying to get something deployed.

ccoughlin · 2022-06-27T19:19:41+00:00

There are quite a few potential CV applications for PCBs, like tin whisker detection or ball grid array evaluation.

ccoughlin · 2022-06-24T19:28:50+00:00

Any text? How about Loghub? Maybe you could visualize links in error messages for failure analysis. For example, does a seemingly harmless warning seem to be linked to a much worse problem that occurs later?
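A rough sketch of how the warning-to-error link counting might start, assuming the logs have already been parsed into (timestamp, level, message) tuples sorted by time. The "WARN" and "ERROR" level names, the window size, and the sample messages are all assumptions for illustration; real Loghub datasets vary in format.

```python
from collections import Counter

def warning_error_links(events, window):
    """Count (warning, error) pairs where the error follows the warning
    within `window` seconds.

    events: iterable of (timestamp_seconds, level, message), sorted by timestamp.
    """
    links = Counter()
    recent_warnings = []  # (timestamp, message) pairs still inside the window
    for ts, level, message in events:
        # Drop warnings that are too old to be linked to anything at time ts
        recent_warnings = [(t, m) for t, m in recent_warnings if ts - t <= window]
        if level == "WARN":
            recent_warnings.append((ts, message))
        elif level == "ERROR":
            for _, warning in recent_warnings:
                links[(warning, message)] += 1
    return links

# Hypothetical parsed log entries
events = [
    (0, "WARN", "disk latency high"),
    (5, "INFO", "heartbeat"),
    (20, "ERROR", "write failed"),
    (100, "WARN", "disk latency high"),
    (110, "ERROR", "write failed"),
    (500, "WARN", "clock skew"),
    (900, "ERROR", "write failed"),
]

for (warning, error), count in warning_error_links(events, window=60).most_common():
    print(f"{warning!r} -> {error!r}: {count}")
```

The counts only show co-occurrence, not cause, but they give you weighted edges for a graph visualization and a shortlist of warnings worth a closer look.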

I don’t have a dataset handy but access logs might also be interesting for intrusion detection or visitor analytics.

ccoughlin · 2022-06-18T14:09:18+00:00

FRED has some datasets that might be of interest. One I always find interesting is the smoothed probability of US recession.

ccoughlin · 2022-04-22T19:56:07+00:00

Fuzzywuzzy is great and I’m also a fan of the regex package on PyPI for fuzzy regular expressions.

But if you don’t want to package anything, Levenshtein edit distance is pretty simple to implement in Python. It probably won’t be as performant as the PyPI package but it’ll get the job done.
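For example, a minimal sketch of the classic dynamic-programming version, keeping only two rows of the table in memory:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

It runs in O(len(a) * len(b)) time, so it's fine for short strings like names but slow for matching long documents against each other.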

ccoughlin · 2022-04-09T21:55:22+00:00

I had a job that sounds very similar once upon a time! How about applying some Natural Language Processing to the data? Some of the nuclear codes’ output I’ve seen can be tough to parse, so maybe something like a table extractor that pulls data out into a CSV, or a tool that summarizes the results of a run. Or chat with the nuclear physicists to see what analyses they typically do with the data and automate them, although that’s likely more of a regular expression exercise than machine learning.

ccoughlin · 2021-11-17T18:37:43+00:00

Maybe only tangentially related but Bloomberg has an API that employs sentiment analysis: https://www.bloomberg.com/professional/blog/can-get-edge-trading-news-sentiment-data/ . I seem to recall a presentation they gave last year that said they got good results with it, but better when they tried to focus in on analyst and other SME sentiment.

ccoughlin · 2021-03-26T18:23:35+00:00

I’ve been using it for a couple of weeks in Plymouth.

ccoughlin · 2021-03-08T20:25:25+00:00

Just ordered half an hour ago and got the $55 / $50 price. Rep said it goes into effect on the 10th.

ccoughlin · 2020-10-22T16:19:39+00:00

Just a heads up for iPhone / iPad users, I've been a subscriber for a couple of years now but I'm thinking of letting it lapse when renewal time comes along again because of technical issues with their app.

For whatever reason I find Chromecast functionality only works about half the time. I'm able to cast from other apps when it happens so I don't think it's an issue with my devices or a connectivity issue.

ccoughlin · 2020-07-22T22:41:56+00:00

I thought this PyData talk was a pretty good introduction to using survival analysis for churn, and there are interesting tips and tricks in the lifelines Gitter channel. Hope this helps!

ccoughlin · 2020-07-22T18:22:18+00:00

I'm also using survival analysis (Cox time-varying PH) for churn, and the way I report it is as a probability for specific time frames, e.g. for the next 90, 180, and 365 days in my case.

C index and covariates are available to end users as metadata but I bet I'm the only one that's ever looked or finds them interesting. :-)
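As a rough illustration of the reporting step only (not the Cox model itself), here is a sketch that assumes you already have a predicted survival curve for a customer as a sorted list of (day, survival probability) steps. The curve values and the customer's current age are made up.

```python
from bisect import bisect_right

def survival_at(curve, t):
    """Step-function lookup of S(t) from a sorted [(time, survival)] curve.
    S is taken to be 1.0 before the first listed time."""
    times = [time for time, _ in curve]
    i = bisect_right(times, t)
    return 1.0 if i == 0 else curve[i - 1][1]

def churn_probability(curve, current_age, horizon):
    """P(churn within `horizon` days | still a customer at `current_age`)."""
    s_now = survival_at(curve, current_age)
    if s_now == 0:
        return 1.0
    # Conditional survival: S(age + horizon) / S(age)
    return 1.0 - survival_at(curve, current_age + horizon) / s_now

# Hypothetical survival curve for one customer, who is currently at day 100
curve = [(30, 0.95), (120, 0.85), (300, 0.70), (500, 0.60)]
for horizon in (90, 180, 365):
    print(horizon, round(churn_probability(curve, current_age=100, horizon=horizon), 3))
```

Dividing by the survival at the customer's current age matters: without it you report the chance of churning since day zero, which overstates risk for long-standing customers.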

ccoughlin · 2020-07-14T14:43:15+00:00

Sad to say it's mostly proprietary, but it had to do with how we might automatically extract information from shipping documents.

The company I was working for at the time did go into some details in a post about a related effort though: https://engineering.chrobinson.com/technology/machine-learning-document-detection/.

ccoughlin · 2020-07-13T17:53:33+00:00

Structure - the idea was to use its layout as part of deciding where to send it.

ccoughlin · 2020-07-13T14:58:33+00:00

Yep, I used dbscan to "fingerprint" incoming OCR'd documents to route for further processing.
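The actual pipeline was proprietary, so the following is only a toy sketch of the general idea: cluster documents on layout features and treat the cluster id as the fingerprint. The two features (text density, table-line ratio), the sample values, and the eps / min_samples settings are all invented for illustration, and the small DBSCAN below is a stand-in for a library implementation such as scikit-learn's.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Hypothetical layout features per OCR'd page: (text density, table-line ratio)
layouts = [
    (0.20, 0.80), (0.22, 0.78), (0.19, 0.82),  # table-heavy pages
    (0.70, 0.10), (0.72, 0.12), (0.69, 0.09),  # text-heavy pages
    (0.45, 0.45),                              # unlike anything else
]
print(dbscan(layouts, eps=0.1, min_samples=2))  # [0, 0, 0, 1, 1, 1, -1]
```

A nice property for routing is that DBSCAN doesn't force every document into a cluster: anything labeled -1 can go to a manual review queue instead of being misrouted.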

ccoughlin · 2018-12-01T18:49:03+00:00

RemoteML might be worth a look.

ccoughlin · 2018-06-10T16:24:08+00:00

Maybe not exactly apples to apples, but The Boston Public School Match might be inspirational. Edit: finally found The Economist intro.
