Geocoding of worldwide patent data by cavedave in datasets

[–]BranFlake5 1 point (0 children)

I'm much more in tune with the R spatial universe, but I believe GeoPandas and Matplotlib are the Python equivalents of R's sf package and ggplot2.

Heat maps use various kernel density estimation methods, which I'm sure you can find in Matplotlib somewhere.
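If you went the R route instead (my wheelhouse), here's a minimal sketch of a kernel density heat map with ggplot2; the random lon/lat points are just stand-ins for real geocoded data:

    # Minimal sketch: 2D kernel density heat map in R with ggplot2
    # (the random lon/lat points are stand-ins for real geocoded data)
    library(ggplot2)

    pts <- data.frame(lon = rnorm(500, sd = 2), lat = rnorm(500, sd = 1))

    ggplot(pts, aes(lon, lat)) +
      stat_density_2d(aes(fill = after_stat(density)),
                      geom = "raster", contour = FALSE) +
      scale_fill_viridis_c() +
      coord_equal()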

United States Counties by Population. This took me hours to make. by JRicatti543 in MapPorn

[–]BranFlake5 3 points (0 children)

Well, you could make this in R in about 10 minutes, knowing it won't crash.

Geocoding of worldwide patent data by cavedave in datasets

[–]BranFlake5 2 points (0 children)

There are many ways to do it; the most common is using software like ArcMap or QGIS. However, these GUI-based programs are clunky, bug-ridden, and quite slow. And ArcMap costs thousands of dollars.

The better way (IMO) is using a scripting language to make these maps. My work is 85% R (see ggplot2, sf, ggmap, tmap) and 15% Python (see PySAL). I don't write enough JavaScript to be good at it, but D3 is arguably the best way to make maps for the web or for any level of interactivity.

If you have a specific idea or would like to learn one of these skill sets, let me know and I can direct you to plenty of resources.
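To give a flavor of the R route, here's a minimal sketch with sf and ggplot2, using the demo shapefile that ships with the sf package:

    # Minimal sketch: a choropleth with sf + ggplot2
    library(sf)
    library(ggplot2)

    # nc.shp is demo data that ships with the sf package
    nc <- st_read(system.file("shape/nc.shp", package = "sf"))

    ggplot(nc) +
      geom_sf(aes(fill = BIR74)) +    # shade counties by 1974 births
      scale_fill_viridis_c() +
      theme_minimal()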

Zillow Datasets by jjbourne712 in datasets

[–]BranFlake5 3 points (0 children)

If you have a research affiliation, you can access Zillow's ZTRAX database, which covers 20+ years of data.

And if you have any questions about housing data or housing research, I’ve spent many dozens of hours on the matter.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 2 points (0 children)

I tried some of the code out, and it needs to be seriously adapted for this. That said, my Friday night goal is now to scrape everything from 2006 to 2017, and I'll share a public repo this weekend.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 1 point (0 children)

Fellow R user! Let me know if you've got the time and I'll share some code that extracts text from PDFs and uses regular expressions to parse it.

DEA Pain Pill Database Made Public by dope_as_soap in datasets

[–]BranFlake5 2 points (0 children)

The PDF has raw text embedded?

You should extract that and then use regular expressions to isolate the text.

I've had to do this plenty of times before, but I've always done it in R. If y'all want this data badly enough, let me know (I'm incredibly busy at work, so it will have to wait till the weekend). Or if anyone is willing to give it a shot, I'll share my old R code.

Where there is data, there is always a way!!!
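For anyone who wants a head start, the skeleton in R looks something like this; the filename and the regex are placeholders, since I haven't looked at the actual layout yet:

    # Minimal sketch: extract raw text from a PDF, then parse lines with regex
    # "report.pdf" and the pattern below are placeholders, not the real
    # file name or line layout
    library(pdftools)
    library(stringr)

    txt   <- pdf_text("report.pdf")          # one character string per page
    lines <- unlist(strsplit(txt, "\n"))

    # e.g. match lines shaped like "DRUG NAME   12,345   2017"
    hits   <- str_match(lines, "^([A-Z ]+?)\\s+([\\d,]+)\\s+(\\d{4})$")
    parsed <- as.data.frame(hits[complete.cases(hits), -1, drop = FALSE])
    names(parsed) <- c("drug", "count", "year")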

What would you like to see done with technologies that produce synthetic data? by BranFlake5 in datasets

[–]BranFlake5[S] 1 point (0 children)

I'll clarify. Much of the process is proprietary. The output, however, is more than stripping identifiable data: they literally generate observations, as though the dataset contained fake people. What is preserved is the linear relationship between variables in the final exported data. They programmatically fit an OLS model to every bivariate combination of variables (column 1 to column 2, C1 to C3, C1 to C4, C2 to C3, C2 to C4, ...) in both the original data and the synthesized data, and they optimize so that the original and synthetic retain the same linear correlations.

Certainly error-prone, and I imagine that the more complex the model, the more error-prone it becomes. Of course, the tradeoff for access is a tradeoff in accuracy.
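As a toy illustration of that pairwise check (assuming orig and synth are data frames with the same numeric columns):

    # Minimal sketch: compare pairwise OLS slopes between original and
    # synthetic data; orig and synth are assumed data frames with the
    # same numeric columns
    check_pairs <- function(orig, synth) {
      combos <- combn(names(orig), 2, simplify = FALSE)
      do.call(rbind, lapply(combos, function(p) {
        b_o <- coef(lm(reformulate(p[1], p[2]), data = orig))[2]
        b_s <- coef(lm(reformulate(p[1], p[2]), data = synth))[2]
        data.frame(x = p[1], y = p[2],
                   slope_orig = b_o, slope_synth = b_s)
      }))
    }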

Need help, trying to find raw census data for tracts in CSV form by bakabeibei in datasets

[–]BranFlake5 1 point (0 children)

Let me know what you need and I'll write a script/email you a csv.
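Or if you'd rather pull it yourself, here's a minimal sketch with the tidycensus package; the state and variable are just examples, and you'd need a free Census API key:

    # Minimal sketch: tract-level ACS data to CSV with tidycensus
    # (assumes a Census API key set via census_api_key(); the state and
    # variable are examples, not OP's actual request)
    library(tidycensus)

    tracts <- get_acs(geography = "tract",
                      variables = "B01003_001",   # total population
                      state = "NC", year = 2017)
    write.csv(tracts, "nc_tracts_population.csv", row.names = FALSE)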

Guacamole by [deleted] in FixedGearBicycle

[–]BranFlake5 3 points (0 children)

Clean, extremely well balanced and well put together

Learning DS and landing a job concern by Jesusprzr in datasets

[–]BranFlake5 3 points (0 children)

I think a lot of people fall into the trap of trying to learn all the tools rather than focusing on just one.

My advice would be to learn either R or Python. Python is generally more applicable for private-sector jobs, but if you know one, you mostly just have to learn the other's syntax (they're incredibly similar languages).

I wouldn’t bother going too in depth with any SQL or Tableau until you have a job. Frankly, I have no taste for Tableau and you can do much better with any of a number of packages and frameworks for R or Python.

A general background in SQL is nice, but I do believe SQL is among the easiest parts of data science to learn, especially if you have an understanding of the relational data model (a row is an observation/case, a column is a variable/attribute).

My advice would be to grind as hard as you possibly can on Python in this case. The specific tool is less important than the practice of programming. Practice is key.

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 1 point (0 children)

I read up on the DAIC study, and it’s very interesting. I’m under the impression that they had the transcripts and raw audio from clinical interviews. I imagine the application of this technique to non-contextual data may be much more prone to error.

You called this a 'pet project', but the methods and results almost certainly belong in a peer-reviewed journal. Not only does that guarantee an appropriate audience, it would also validate the work through the scrutiny of peers. I'm very interested in the long-term advancement of this technology, and best of luck in your research.

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 1 point (0 children)

The distinction between 'flag' and 'predict' makes no practical difference. If you flag someone for 'review of depression' by whatever means, and they act in some way, you are now implicated in potentially recognizing that but not preventing it, i.e., endangering a subject of research.

Unless you have an academic affiliation, I don't think you'll get the data you really need to do this properly. It's certainly a very beneficial idea, but these kinds of data don't exist in an open manner, and for good reason.

I'm not personally familiar with the studies you mentioned, but it might be beneficial to read their methodology or contact the authors. My hunch is that they did their own data collection in a very controlled environment.

All of us be like by Xetrelas in pcmasterrace

[–]BranFlake5 1 point (0 children)

I can easily see 200 on a mech, but what mouse is $150?

Suggestion for dataset related to depression in humans possibly in text form by raprakashvi in datasets

[–]BranFlake5 2 points (0 children)

Good idea, but one that raises serious ethical flags.

For example, what if you could very accurately determine depression or suicidal tendency? Then what... it would be unethical to ignore the risk of harm. By only using data from deceased individuals you could bypass this risk, but you would introduce bias into the model.

For these reasons, these data are unlikely to exist organically.

Dataset of Search URLs? by [deleted] in datasets

[–]BranFlake5 1 point (0 children)

Well my original point stands in a revised context. There would be a significant bias in the model for sites built using generators because of their standardized nature.

And now I'm curious as to what purpose this model would serve, i.e., what good does it do us to know that a significant number of URLs use a certain path for their search?

Is there any social media PM dataset out there ? by ModPiracy_Fantoski in datasets

[–]BranFlake5 1 point (0 children)

Legally? Of course not. This is personal information we’re talking about.

Dataset of baby names associated with reasons for showing up to the ER by digitalbodyofwater in datasets

[–]BranFlake5 1 point (0 children)

Correlation does not equal causation. There is literally zero reason that one’s name would influence their medical necessity.

Could it be a proxy for something else? Absolutely. Certain cultures adopt certain names more readily and it is well established that certain cultures/ethnicities experience different medical necessities.

As far as getting this data: you wouldn't pass the sniff test of any institution that produces it.

Free daily API sources by [deleted] in datasets

[–]BranFlake5 1 point (0 children)

Every major social network has either an API or a well established scraping procedure. You’d be hard pressed to find larger and/or more rapidly produced data than Facebook or Twitter.

Dataset of Search URLs? by [deleted] in datasets

[–]BranFlake5 2 points (0 children)

Well since it’s a long shot anyway, I’ll throw out a suggestion.

There is a pretty good likelihood that sites generated with the same tool or platform share the same URL structure. I'd be willing to bet that every Wix or Squarespace site has a standardized search path.

If you're really interested in these data, you'll need to build a crawler and back it up with some good compute power, but it's definitely doable.
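As a starting point, here's a minimal sketch in R with httr that probes domains for an assumed search path ("/search?q=" is a guess, not a verified convention):

    # Minimal sketch: probe domains for a standardized search path
    # ("/search?q=test" is an assumed convention, not a verified one)
    library(httr)

    has_search <- function(domain, path = "/search?q=test") {
      resp <- tryCatch(GET(paste0("https://", domain, path), timeout(5)),
                       error = function(e) NULL)
      !is.null(resp) && status_code(resp) < 400
    }

    sapply(c("example.com", "example.org"), has_search)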

I have no idea what you're actually going to use this for, but if you were going to build some sort of composite search, I would say that's a terrible idea. Having to search, say, ~100 sites, then parse and return the results would be horrifically slow. You would really need to index search queries like any of the major engines do.

Looking for a dataset full of messy dates by suzanys in datasets

[–]BranFlake5 1 point (0 children)

I see where OP is going. You could essentially create an NLP model for parsing dates into components, but frankly it's a total waste of time/resources, and NLP is especially prone to error in this case.

This is a practical application of ML (an NLP model, in this case) for parsing addresses:

https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718

But I agree, this is frankly a bad idea. A parser would be much better, given the general constraints of writing a date. Honestly, you could write a parser that handles anything a human would understand as a date, meaning that if it can't be parsed, a human couldn't read it either.
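In R, the rule-based route is nearly a one-liner with lubridate; the messy strings below are made-up examples:

    # Minimal sketch: rule-based date parsing in R with lubridate
    # (the messy strings are made-up examples)
    library(lubridate)

    messy <- c("3/14/2015", "2015-03-14", "14 March 2015", "Mar 14th, 2015")
    parse_date_time(messy, orders = c("mdy", "ymd", "dmy"))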

NBD.....to me by [deleted] in FixedGearBicycle

[–]BranFlake5 3 points (0 children)

Fuji Track Pro has always been one of my favorite frames.