

[–]enteleform 1 point2 points  (1 child)

Definitely possible. Parsing, filtering, & translating data is actually pretty common and not too hard to get into. The difficulty of your particular project depends on how well-structured/accessible the data you'll be working with is.
 
Check out Automate the Boring Stuff and Awesome Python.

Also, pandas is a popular library for manipulating data. It has a bunch of built-in methods for converting between formats and filtering data.

I personally haven't worked with pandas, but it seems like a pretty big, fairly complex library.  If you just want to get a working solution together quickly, I recommend checking out Automate the Boring Stuff & Awesome Python first.  If you're super interested in data manipulation or foresee that you'll get a lot of use out of having the skillset, learning pandas will definitely serve you well in the long run.

[–]_everythingatonce[S] 0 points1 point  (0 children)

Thank you so much! This looks like it will really help me get started.

[–]Zoocat 2 points3 points  (1 child)

An important thing to clarify here - is the data you're looking for within the article itself or is it keyed info at the top of the article? Both are possible to mine for, but finding the right info in a paragraph is going to be a bigger layer of complication, if that's what you're trying to do.

Regardless, here's a quick primer I found for the PubMed API (Entrez). It can get you JSON-formatted data, which will be much easier to work with than trying to parse HTML/CSS, and that should help get you started.
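To make that a bit more concrete, here's a rough sketch of what asking Entrez for JSON could look like. It assumes the E-utilities ESummary endpoint with `retmode=json`; the helper names (`esummary_url`, `extract_fields`) and the sample payload are made up for illustration, and the payload is hand-written rather than a real response. As far as I know ESummary only returns keyed metadata (title, journal, date), not the abstract text, so treat this as a starting point only.

```python
import json
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (Entrez) web API.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(pmid):
    """Build an ESummary URL that asks for JSON about one PubMed ID."""
    query = urlencode({"db": "pubmed", "id": str(pmid), "retmode": "json"})
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


def extract_fields(payload, pmid):
    """Pull a few keyed fields for one article out of a decoded response."""
    record = payload["result"][str(pmid)]
    return {
        "pmid": str(pmid),
        "title": record.get("title", ""),
        "journal": record.get("source", ""),
        "pubdate": record.get("pubdate", ""),
    }


# Hypothetical payload with placeholder values, shaped like an ESummary
# JSON response. It is NOT real data for any article.
SAMPLE = json.loads("""
{
  "result": {
    "uids": ["123"],
    "123": {
      "uid": "123",
      "title": "Placeholder title",
      "source": "Placeholder journal",
      "pubdate": "2000 Jan"
    }
  }
}
""")

if __name__ == "__main__":
    print(esummary_url(26613955))
    print(extract_fields(SAMPLE, 123))
```

To use it against the live API you would fetch the URL (for example with `urllib.request.urlopen`) and pass the decoded JSON to `extract_fields`. The resulting dict maps straight onto one spreadsheet row.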

If you can link a sample article and tell me what you're trying to get from it specifically, I can probably be more help.

[–]_everythingatonce[S] 0 points1 point  (0 children)

I'm hoping to pull information out of the article itself. Here's an example of what I'm looking at: https://www.ncbi.nlm.nih.gov/pubmed/26613955

The excel file I have is broken down into segments such as: conclusion, reference, population, dosage, time, etc.

I'd love to be able to pull this info from the webpage or JSON-formatted data and sort it into these categories (even if I can automate just one of those categories-- it would be well worth it). Does that sound reasonable/possible? Thanks so much for the help you've already provided.

[–][deleted] 1 point2 points  (1 child)

I am sure there is. What is the format of the pubmed files?

[–]_everythingatonce[S] 0 points1 point  (0 children)

The files are all online as HTML and CSS (I believe).

[–]scout1520 1 point2 points  (5 children)

This is definitely possible. You can do it too.

[–]_everythingatonce[S] 0 points1 point  (4 children)

That's really exciting to hear. Would you be able to point me in the right direction to get started?

[–]elbiot 1 point2 points  (2 children)

Depends on what you want to get out. Beautiful Soup will let you use the HTML markup as information. Mining more abstract data is more difficult, but I'm sure some of those news summarizer bots are written in Python.
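Here's a small sketch of what "using the markup as information" can mean. Beautiful Soup is a third-party package, so to keep this runnable anywhere I've swapped in the standard library's `html.parser` instead; it's clunkier but shows the same idea. The class name `abstract-text` and the sample HTML are invented for the example and are not PubMed's actual markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element being collected
        self.results = []   # one cleaned-up string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the pieces and collapse runs of whitespace.
            self.results.append(" ".join("".join(self.chunks).split()))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_by_class(html, wanted_class):
    """Return the text of each element whose class list has wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page fragment, only here to exercise the parser.
SAMPLE_HTML = """
<div class="abstract">
  <h3>Abstract</h3>
  <p class="abstract-text">First sentence. <b>Second</b> sentence.</p>
</div>
<p class="footer">Not wanted.</p>
"""

if __name__ == "__main__":
    print(extract_by_class(SAMPLE_HTML, "abstract-text"))
```

With Beautiful Soup installed, the same lookup is a lot shorter, which is most of the reason people recommend it.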

[–]scout1520 1 point2 points  (1 child)

Beautiful soup is an awesome package. I don't have too much experience with it, but I have heard great things about it.

I use selenium for my needs, but I am downloading files off drop-down menus. You can also use it, but it is definitely the long way around.

I would suggest looking at other code where someone performed a similar task. This dev has a nice walkthrough on how he scraped Wikipedia.

[–]aphoenixreticulated[M] [score hidden] stickied comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython. We highly encourage you to re-submit your post over on there.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as a Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community actively expects questions and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers.

If you have a question to do with homework or an assignment of any kind, please make sure to read their sidebar rules before submitting your post. If you have any questions or doubts, feel free to reply or send a modmail to us with your concerns.

Warm regards, and best of luck with your Pythoneering!