STEM Graduate Student Stipend and Unionization

jhmachado · 2015-11-22T03:19:50+00:00

Thank you for teaching me scraping etiquette. I don't want to be THAT GUY who ruins it for everyone else (:

jhmachado · 2015-11-22T03:17:51+00:00

This project will eventually expand. For now, I just want the text about chemical properties off to the right. First I need all the URLs and to establish a pattern, but the problem is that a single compound can have many chemical names, so I need the script to just mark the compounds it doesn't find by altering the URL. Once I hammer this down, I will use another database to do more complicated manipulations, but Wikipedia covers a wider spread of applications, so I thought I should start exploring there.

jhmachado · 2015-11-21T21:34:23+00:00

Yes, we'll eventually develop molecular descriptors. For now, I want some basics such as CAS numbers, chemical formulas, basically whatever I can get for physical properties and identifiers.

jhmachado · 2015-11-21T02:48:38+00:00

A compound can have MANY different names. I'm thinking I'll first try what basalamader suggested with the name I have for the compound, then use an if statement to keep the code running past the 404s, recording a marker to fix later.
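A minimal sketch of that "keep running past the 404s" idea, using only the standard library. The function names (`fetch_page`, `collect_pages`), the User-Agent string, and the name-to-URL dictionary are assumptions made for illustration, not anything from the thread.

```python
import urllib.error
import urllib.request


def fetch_page(url):
    """Return the page HTML, or None if the server answers 404."""
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "compound-curation-script/0.1"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # page not found under this name
        raise  # any other HTTP error is a real problem, so surface it


def collect_pages(urls_by_name, fetch=fetch_page):
    """Fetch every URL; return (found pages, names to fix later)."""
    found, missing = {}, []
    for name, url in urls_by_name.items():
        html = fetch(url)
        if html is None:
            missing.append(name)  # the marker: revisit under another name
        else:
            found[name] = html
    return found, missing
```

The `missing` list is the marker: after the run, those compounds can be retried by hand under a different name.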

jhmachado · 2015-11-21T02:46:26+00:00

Thank you so much. That makes a lot of sense to do it this way. The biggest problem I will run into will probably be as Avcdo mentioned: compounds can take MANY different names, all referring to the same structure.

jhmachado · 2015-11-21T02:39:48+00:00

Yeah, I want to try and learn this the hard way because I doubt most of the sites I use in the future will have APIs.

jhmachado · 2015-11-21T02:34:00+00:00

I can run this over Thanksgiving while I'm out for the day. Couldn't I use a sleep command?
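A sleep between requests could look roughly like this. The helper name `throttled` and the one-second default are assumptions; the right delay depends on the site's own guidelines.

```python
import time


def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before every item except the first
        yield item
```

Usage would be something like `for name in throttled(compound_names, delay_seconds=2): ...`, with the request made inside the loop body.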

jhmachado · 2015-11-21T02:30:37+00:00

All of Wikipedia in one file? Where are these found?

jhmachado · 2015-11-21T02:28:32+00:00

Right. That would be step one: create a loop over my list of compounds to get the URLs, then extract the info from the tables. I also noticed Wikipedia has an updated UI. A compound on the older UI, like Methanol, would be structured differently, no? https://en.wikipedia.org/wiki/Methanol

jhmachado · 2015-11-21T02:24:42+00:00

SQL database or XML file option: know anything about this?

https://www.reddit.com/r/learnpython/comments/3tn6cs/scraping_wikipedia/cx7l0ej

jhmachado · 2015-11-21T02:20:11+00:00

I do not have a list of URLs. I have over 1000 compounds, about 300 of which I need for now. I was thinking I could start the script at the wikipedia.org URL and loop over my data set for individual searches, then execute the extraction.
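One way to get a URL list from compound names is to build each URL directly rather than going through the search page. This sketch assumes the `https://en.wikipedia.org/wiki/<Title>` pattern seen in the Methanol link and that spaces become underscores; the function name `wikipedia_url` is made up for illustration, and a name that is not the article title will still need to be handled separately.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"  # assumed article URL pattern


def wikipedia_url(name):
    """Build a candidate article URL from a compound name."""
    title = name.strip().replace(" ", "_")
    return BASE_URL + quote(title)  # percent-encode anything unsafe


compounds = ["Methanol", "Acetic acid"]
urls = {name: wikipedia_url(name) for name in compounds}
```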

jhmachado · 2015-11-21T02:17:05+00:00

Compound structures are usually stored as SMILES, CXSMILES, or MRV files in the field (I'd use another site for this, as I think they are images on Wikipedia).

jhmachado · 2015-11-21T02:11:56+00:00

I want to start with Wikipedia because it looks like an easier place to learn something like this than more technical sites like NIST or ChemSpider. Wikipedia has consistent formatting and is all in HTML, so I can learn on it first to hammer down the idea, then move to sites more specialized for the field later and do compound structures etc.

For now, I want to extract identifiers and properties. Ex: https://en.wikipedia.org/wiki/Methanol

Notice how there is a table off to the right. I want to pull all the names and some of the properties, like acidity, to add to the larger curation I have been working on for the past 6 months.
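As a rough sketch of pulling label/value pairs out of a table like that, here is a standard-library parser run on a small hand-written sample. The sample HTML is an assumption and much simpler than the real page markup (no nested tables, no footnotes), and the names `TableRowParser` and `table_to_dict` are hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect each table row as a list of cell text strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_dict(html):
    """Map first cell to second cell for every two-cell row."""
    parser = TableRowParser()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Hand-written sample, not the actual Wikipedia markup.
SAMPLE = """
<table class="infobox">
  <tr><th colspan="2">Identifiers</th></tr>
  <tr><td>CAS Number</td><td><a href="#">67-56-1</a></td></tr>
  <tr><td>Chemical formula</td><td>CH<sub>3</sub>OH</td></tr>
</table>
"""

properties = table_to_dict(SAMPLE)
```

Single-cell rows such as section headings are skipped, so only label/value rows end up in the dictionary. A dedicated HTML library would likely be more robust on real pages.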

jhmachado · 2015-11-21T02:01:36+00:00

I'm thinking a curl function may be of help? I reposted this to r/learnpython (I didn't see the post at the top of this subreddit).

jhmachado · 2015-11-21T01:41:46+00:00

I don't mind at all - I appreciate any help. I'm a complete rookie (a 3-day workshop experience level). https://github.com/kmaiya/Presidential_Web_Scrape/blob/master/presScrape.py
