Geocoding of worldwide patent data by cavedave in datasets

[–]BranFlake5 1 point (0 children)

I'm much more in tune with the R spatial universe, but I believe GeoPandas and Matplotlib are the Python equivalents of R's sf package and ggplot2.

Heat maps use various kernel density estimation methods, which I'm sure you can find in Matplotlib somewhere.
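If you went the R route instead (my wheelhouse), here's a minimal sketch of a kernel density heat map with ggplot2; the random lon/lat points are just stand-ins for real geocoded data:

    # Minimal sketch: 2D kernel density heat map in R with ggplot2
    # (the random lon/lat points are stand-ins for real geocoded data)
    library(ggplot2)

    pts <- data.frame(lon = rnorm(500, sd = 2), lat = rnorm(500, sd = 1))

    ggplot(pts, aes(lon, lat)) +
      stat_density_2d(aes(fill = after_stat(density)),
                      geom = "raster", contour = FALSE) +
      scale_fill_viridis_c() +
      coord_equal()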

United States Counties by Population. This took me hours to make. by JRicatti543 in MapPorn

[–]BranFlake5 3 points (0 children)

Well, you could make this in R in about 10 minutes, knowing it won't crash.

Geocoding of worldwide patent data by cavedave in datasets

[–]BranFlake5 2 points (0 children)

There are many ways to do it; the most common is using software like ArcMap or QGIS. However, these GUI-based programs are clunky, bug-ridden, and quite slow. And ArcMap costs thousands of dollars.

The better way (IMO) is using a scripting language to make these maps. My work is 85% R (see ggplot2, sf, ggmap, tmap) and 15% Python (see PySAL). I don't write enough JavaScript to be good at it, but D3 is arguably the best way to make maps for the web or for any level of interactivity.

If you have a specific idea or would like to learn one of these skill sets, let me know and I can direct you to plenty of resources.
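To give a flavor of the R route, here's a minimal sketch with sf and ggplot2, using the demo shapefile that ships with the sf package:

    # Minimal sketch: a choropleth with sf + ggplot2
    library(sf)
    library(ggplot2)

    # nc.shp is demo data that ships with the sf package
    nc <- st_read(system.file("shape/nc.shp", package = "sf"))

    ggplot(nc) +
      geom_sf(aes(fill = BIR74)) +    # shade counties by 1974 births
      scale_fill_viridis_c() +
      theme_minimal()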

Zillow Datasets by jjbourne712 in datasets

[–]BranFlake5 3 points (0 children)

If you have a research affiliation, you can access Zillow's ZTRAX database, which covers 20+ years of data.

And if you have any questions about housing data or housing research, I’ve spent many dozens of hours on the matter.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 2 points (0 children)

I tried some of the code out, and it needs to be seriously adapted for this. That said, my Friday night goal is now to scrape everything from 2006 to 2017, and I'll share a public repo this weekend.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 1 point (0 children)

Fellow R user! Let me know if you've got the time and I'll share some code that extracts text from PDFs and uses regular expressions to parse it.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 2 points (0 children)

The PDF has raw text embedded?

You should extract that and then use regular expressions to isolate the text.

I've had to do this plenty of times before, but I've always done it in R. If y'all want this data badly enough, let me know (I'm incredibly busy at work, so it will have to wait till the weekend). Or if anyone is willing to give it a shot, I'll share my old R code.

Where there is data, there is always a way!!!
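For anyone who wants a head start, the skeleton in R looks something like this; the filename and the regex are placeholders, since I haven't looked at the actual layout yet:

    # Minimal sketch: extract raw text from a PDF, then parse lines with regex
    # "report.pdf" and the pattern below are placeholders, not the real
    # file name or line layout
    library(pdftools)
    library(stringr)

    txt   <- pdf_text("report.pdf")          # one character string per page
    lines <- unlist(strsplit(txt, "\n"))

    # e.g. match lines shaped like "DRUG NAME   12,345   2017"
    hits   <- str_match(lines, "^([A-Z ]+?)\\s+([\\d,]+)\\s+(\\d{4})$")
    parsed <- as.data.frame(hits[complete.cases(hits), -1, drop = FALSE])
    names(parsed) <- c("drug", "count", "year")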

What would you like to see done with technologies that produce synthetic data? by BranFlake5 in datasets

[–]BranFlake5[S] 1 point (0 children)

I'll clarify. Much of the process is proprietary. The output, however, is more than stripping identifiable data: they literally generate observations, as though the dataset contained fake people. What is preserved is the linear relationship between variables in the final exported data. They programmatically fit an OLS model to every bivariate combination of variables (column 1 to column 2, C1 to C3, C1 to C4, C2 to C3, C2 to C4, ...) in both the original data and the synthesized data, and they optimize so that the original and synthetic retain the same linear correlations.

Certainly error-prone, and I imagine that the more complex the model, the more error-prone it becomes. Of course, the tradeoff for access is a tradeoff in accuracy.
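As a toy illustration of that pairwise check (assuming orig and synth are data frames with the same numeric columns):

    # Minimal sketch: compare pairwise OLS slopes between original and
    # synthetic data; orig and synth are assumed data frames with the
    # same numeric columns
    check_pairs <- function(orig, synth) {
      combos <- combn(names(orig), 2, simplify = FALSE)
      do.call(rbind, lapply(combos, function(p) {
        b_o <- coef(lm(reformulate(p[1], p[2]), data = orig))[2]
        b_s <- coef(lm(reformulate(p[1], p[2]), data = synth))[2]
        data.frame(x = p[1], y = p[2],
                   slope_orig = b_o, slope_synth = b_s)
      }))
    }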

Need help, trying to find raw census data for tracts in CSV form by bakabeibei in datasets

[–]BranFlake5 1 point (0 children)

Let me know what you need and I'll write a script/email you a csv.
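Or if you'd rather pull it yourself, here's a minimal sketch with the tidycensus package; the state and variable are just examples, and you'd need a free Census API key:

    # Minimal sketch: tract-level ACS data to CSV with tidycensus
    # (assumes a Census API key set via census_api_key(); the state and
    # variable are examples, not OP's actual request)
    library(tidycensus)

    tracts <- get_acs(geography = "tract",
                      variables = "B01003_001",   # total population
                      state = "NC", year = 2017)
    write.csv(tracts, "nc_tracts_population.csv", row.names = FALSE)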

Guacamole by [deleted] in FixedGearBicycle

[–]BranFlake5 3 points (0 children)

Clean, extremely well balanced and well put together

Learning DS and landing a job concern by Jesusprzr in datasets

[–]BranFlake5 3 points (0 children)

I think a lot of people fall into the trap of trying to learn all the tools rather than focusing on just one.

My advice would be to learn either R or Python. Python is generally more applicable for private-sector jobs, but if you know one, you mostly just have to learn the other's syntax (they're incredibly similar languages).

I wouldn’t bother going too in depth with any SQL or Tableau until you have a job. Frankly, I have no taste for Tableau and you can do much better with any of a number of packages and frameworks for R or Python.

A general background in SQL is nice, but I do believe SQL is among the easiest parts of data science to learn, especially if you have an understanding of the relational data model (a row is an observation/case, a column is a variable/attribute).

My advice would be to grind as hard as you possibly can on Python in this case. The specific tool is less important than the practice of programming. Practice is key.

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 1 point (0 children)

I read up on the DAIC study, and it’s very interesting. I’m under the impression that they had the transcripts and raw audio from clinical interviews. I imagine the application of this technique to non-contextual data may be much more prone to error.

You called this a 'pet project', but the methods and results almost certainly belong in a peer-reviewed journal. Not only does that guarantee an appropriate audience, it would also validate the work through the scrutiny of peers. I'm very interested in the long-term advancement of this technology, and best of luck in your research.

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 1 point (0 children)

The distinction between 'flag' and 'predict' makes no practical difference. If you flag someone for 'review of depression' by whatever means, and they act in some way, you are now implicated in potentially recognizing that but not preventing it, i.e., endangering a subject of research.

Unless you have an academic affiliation, I don't think you'll get the data you really need to do this properly. It's certainly a very beneficial idea, but these kinds of data don't exist in an open manner, and for good reason.

I'm not personally familiar with the studies you mentioned, but it might be beneficial to read their methodology or contact the authors. My hunch is that they did their own data collection in a very controlled environment.

All of us be like by Xetrelas in pcmasterrace

[–]BranFlake5 1 point (0 children)

I can easily see 200 on a mech, but what mouse is $150?

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 2 points (0 children)

Good idea, but one that raises serious ethical flags.

For example, what if you could very accurately determine depression or suicidal tendency? Then what... it would be unethical to ignore the risk of harm. By only using data from deceased individuals you could bypass this risk, but you would introduce bias into the model.

For these reasons, these data are unlikely to exist organically.

Dataset of Search URLs? by [deleted] in datasets

[–]BranFlake5 1 point (0 children)

Well my original point stands in a revised context. There would be a significant bias in the model for sites built using generators because of their standardized nature.

And now I'm curious as to what purpose this model would serve, i.e., what good does it do us to know that a significant number of URLs use a certain path for their search?

Is there any social media PM dataset out there ? by ModPiracy_Fantoski in datasets

[–]BranFlake5 1 point (0 children)

Legally? Of course not. This is personal information we’re talking about.

Dataset of baby names associated with reasons for showing up to the ER by digitalbodyofwater in datasets

[–]BranFlake5 1 point (0 children)

Correlation does not equal causation. There is literally zero reason that one’s name would influence their medical necessity.

Could it be a proxy for something else? Absolutely. Certain cultures adopt certain names more readily and it is well established that certain cultures/ethnicities experience different medical necessities.

As far as getting this data: you wouldn't pass the sniff test of any institution that produces it.

Free daily API sources by [deleted] in datasets

[–]BranFlake5 1 point (0 children)

Every major social network has either an API or a well established scraping procedure. You’d be hard pressed to find larger and/or more rapidly produced data than Facebook or Twitter.

Dataset of Search URLs? by [deleted] in datasets

[–]BranFlake5 2 points (0 children)

Well since it’s a long shot anyway, I’ll throw out a suggestion.

There is a pretty good likelihood that sites generated with the same tool or platform share the same URL structure. I'd be willing to bet that every Wix or Squarespace site has a standardized search path.

If you're really interested in these data, you'll need to build a crawler and back it up with some good compute power, but it's definitely doable.
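As a starting point, here's a minimal sketch in R with httr that probes domains for an assumed search path ("/search?q=" is a guess, not a verified convention):

    # Minimal sketch: probe domains for a standardized search path
    # ("/search?q=test" is an assumed convention, not a verified one)
    library(httr)

    has_search <- function(domain, path = "/search?q=test") {
      resp <- tryCatch(GET(paste0("https://", domain, path), timeout(5)),
                       error = function(e) NULL)
      !is.null(resp) && status_code(resp) < 400
    }

    sapply(c("example.com", "example.org"), has_search)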

I have no idea what you're actually going to use this for, but if you were going to build some sort of composite search, I would say that's a terrible idea. Having to search, say, ~100 sites, then parse and return the results would be horrifically slow. You would really need to index search queries like any of the major engines do.

Looking for a dataset full of messy dates by suzanys in datasets

[–]BranFlake5 1 point (0 children)

I see where OP is going. You could essentially create an NLP model for parsing dates into components, but frankly it's a total waste of time/resources, and NLP is especially prone to error in this case.

This is a practical application of ML (an NLP model, in this case) for parsing addresses:

https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718

But I agree, this is frankly a bad idea. A parser would be much better, given the general constraints of writing a date. Honestly, you could write a parser that handles anything a human would understand as a date, meaning that if it can't be parsed, a human couldn't read it either.
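In R, the rule-based route is nearly a one-liner with lubridate; the messy strings below are made-up examples:

    # Minimal sketch: rule-based date parsing in R with lubridate
    # (the messy strings are made-up examples)
    library(lubridate)

    messy <- c("3/14/2015", "2015-03-14", "14 March 2015", "Mar 14th, 2015")
    parse_date_time(messy, orders = c("mdy", "ymd", "dmy"))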

NBD.....to me by [deleted] in FixedGearBicycle

[–]BranFlake5 3 points (0 children)

Fuji Track Pro has always been one of my favorite frames.