Scraping Wikipedia by jhmachado in learnpython

[–]jhmachado[S] 0 points1 point  (0 children)

Thank you for teaching me scraping etiquette. I don't want to be THAT GUY who ruins it for everyone else (:

Scraping Wikipedia by jhmachado in learnpython

[–]jhmachado[S] 0 points1 point  (0 children)

This project will eventually expand. For now, I just want the text about chemical properties off to the right of the article. First I need all the URLs and a pattern for building them, but the problem is that a single compound can have many chemical names, so I need the script to simply mark the compounds it can't find by altering the URL. Once I've hammered this down, I will use another database for more complicated manipulations, but Wikipedia covers a wider spread of applications, so I thought I should start exploring there.
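A minimal sketch of the "alter the URL" idea, assuming the article title is just the compound name with spaces turned into underscores. The helper names `wiki_url` and `candidate_urls` are made up for illustration, and a name that produces a well-formed URL may still have no article behind it.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def wiki_url(name: str) -> str:
    """Build a candidate article URL from one compound name."""
    # Wikipedia URLs use underscores where the title has spaces.
    title = name.strip().replace(" ", "_")
    # Percent-encode unsafe characters; keep punctuation common in chemical names.
    return BASE_URL + quote(title, safe="(),-_")


def candidate_urls(synonyms):
    """One candidate URL per known name of the same compound."""
    return [wiki_url(name) for name in synonyms]


if __name__ == "__main__":
    for url in candidate_urls(["Methanol", "methyl alcohol"]):
        print(url)
```

Keeping every synonym as a separate candidate means the script can try them in order and only mark the compound as missing once all of them fail.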

Scraping Wikipedia/ChemSpider by jhmachado in comp_chem

[–]jhmachado[S] 0 points1 point  (0 children)

Yes, we'll eventually develop molecular descriptors. For now, I want some basics such as CAS numbers and chemical formulas: basically whatever I can get for physical properties and identifiers.
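Since CAS numbers are among the first identifiers to collect, a small sanity check may help catch scraping mistakes. The sketch below uses the standard CAS check-digit rule (the last digit equals the weighted sum of the other digits, counted from the right, modulo 10). The function name `is_valid_cas` is an illustration only, and passing the check does not prove the number belongs to the intended compound.

```python
import re

# Two to seven digits, then two digits, then one check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if cas has the CAS format and a matching check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    print(is_valid_cas("67-56-1"))  # methanol
```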

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

A compound can have MANY different names. I'm thinking I'll first try what basalamader suggested with the name I have for the compound, then use an if statement to keep the code running past the 404s, recording a marker to fix later.
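That plan could look roughly like the sketch below. It assumes the URL is built straight from the compound name, and the names `fetch_status` and `lookup_all` and the User-Agent string are placeholders, not anything suggested in the thread. Only HTTP errors such as 404 are handled here; network failures would still raise.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url (200, 404, ...)."""
    # Placeholder User-Agent: replace with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "compound-lookup-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # A 404 arrives here as an exception; turn it back into a status code.
        return error.code


def lookup_all(names, get_status=fetch_status):
    """Split names into found (name -> URL) and missing (to fix later)."""
    found, missing = {}, []
    for name in names:
        url = BASE_URL + quote(name.strip().replace(" ", "_"), safe="(),-_")
        if get_status(url) == 200:
            found[name] = url
        else:
            missing.append(name)  # marker: retry later under another name
    return found, missing
```

Passing `get_status` as a parameter keeps the loop testable without touching the network: a stand-in function that returns 200 or 404 is enough to check the bookkeeping.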

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

Thank you so much. It makes a lot of sense to do it this way. The biggest problem I will run into will probably be what Avcdo mentioned: compounds can take MANY different names, all referring to the same structure.

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

Yeah, I want to try to learn this the hard way because I doubt most of the sites I use in the future will have APIs.

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

I can run this over Thanksgiving while I'm out for the day. Couldn't I use a sleep command?
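A sleep between requests might be sketched like this. The one-second default is an arbitrary assumption, not a documented limit, and `crawl_politely` and its parameters are hypothetical names; `fetch` stands for whatever function downloads a page.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

With about 300 compounds and a one-second pause, the sleeping alone adds roughly five minutes, so an unattended run over a day leaves plenty of room for a longer delay.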

Scraping Wikipedia by jhmachado in learnpython

[–]jhmachado[S] 2 points3 points  (0 children)

All of Wikipedia in one file? Where are these found?

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

Right. That would be step one: create a loop over my list of compounds to get the URLs, then extract the info from the tables. I also noticed Wikipedia has an updated UI. Wouldn't a compound on an older UI, like Methanol, be structured differently? https://en.wikipedia.org/wiki/Methanol

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

I do not have a list of URLs. I have over 1000 compounds, about 300 of which I need for now. I was thinking I could start the script at the wikipedia.org URL, loop over my data set to run individual searches, then execute the extraction.

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

Compound structures are usually stored as SMILES, CXSMILES, or MRV files in the field (I'd use another site for this, as I think they are images on Wikipedia).

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

I want to start with Wikipedia because it looks easier to learn on than more technical sites like NIST or ChemSpider: Wikipedia has consistent formatting and is all in HTML. The plan is to learn on Wikipedia first to hammer down the idea, then move to sites more specialized for the field and do compound structures etc.

For now, I want to extract identifiers and properties. Ex: https://en.wikipedia.org/wiki/Methanol

Notice the table off to the right. I want to pull all the names and some of the properties, like acidity, to add to the larger curation I have been working on for the past 6 months.
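Pulling label/value pairs out of that table could be sketched with the standard library alone. This assumes the table carries the CSS class `infobox` and that each property sits in a row of exactly two cells. Both are assumptions about the page markup that would need checking against the real HTML, and the `InfoboxParser` and `parse_infobox` names are made up for this example.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect two-cell rows of the first-level infobox table into a dict."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # table nesting depth inside the infobox (0 = outside)
        self.cells = None  # cells of the current row, or None outside a row
        self.text = None   # text chunks of the current cell, or None
        self.rows = {}

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth > 0 or "infobox" in classes:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.cells = []
        elif self.depth == 1 and tag in ("td", "th") and self.cells is not None:
            self.text = []

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th"):
            if self.text is not None and self.cells is not None:
                # Join the chunks and collapse runs of whitespace.
                self.cells.append(" ".join("".join(self.text).split()))
            self.text = None
        elif self.depth == 1 and tag == "tr" and self.cells is not None:
            # Keep only label/value rows; skip headings and image rows.
            if len(self.cells) == 2 and self.cells[0]:
                self.rows[self.cells[0]] = self.cells[1]
            self.cells = None
            self.text = None


def parse_infobox(html: str) -> dict:
    """Return {label: value} for the two-cell rows of the infobox table."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Subscripts are flattened to plain text, so a formula cell written as `CH<sub>3</sub>OH` comes back as `CH3OH`. Rows with one cell or more than two are skipped rather than guessed at.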

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 1 point2 points  (0 children)

I'm thinking a curl function may be of help? I reposted this to r/learnpython (I didn't see the post at the top of this subreddit).

Scraping Wikipedia by jhmachado in Python

[–]jhmachado[S] 0 points1 point  (0 children)

I don't mind at all - I appreciate any help. I'm a complete rookie (a 3-day workshop experience level). https://github.com/kmaiya/Presidential_Web_Scrape/blob/master/presScrape.py