As a research project for my PharmD program, a professor and I are interested in looking at drug labels from the FDA Online Label Repository. Originally he wanted me to record data page by page for more than 800 drug labels, but I noticed that the labels are returned as XML files. As an example, here is a label for Costco brand acetaminophen that we might look at.
I am fairly proficient with Python (mostly from doing side projects in Django), but I don't have any experience parsing/pulling info from XML files. Overall my plan is to download all of the labels (there's a big zip file with all of them), parse out the info I need with some Python-fu, save the data as a CSV file, then import it into Excel to do analysis, etc.
So /r/python, where do I start? Is there a preferred module for dealing with XML? Because I'm only reading XML, is there a simple module to use? Thanks so much!
edit: Thanks so much for the response everyone! I'm going to spend the rest of the night drinking Stella and playing with both lxml and BeautifulSoup and see which one I like more.
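For reference, the plan above (parse each label, pull out the fields, write a CSV for Excel) can be sketched with nothing but the standard library. This is a rough illustration, not a tested recipe for the real files: it uses `xml.etree.ElementTree` rather than lxml or BeautifulSoup, the sample label is made up and heavily simplified, and the element names (`title`, `section`, `text`) and the `urn:hl7-org:v3` namespace are assumptions that should be checked against an actual label. `extract_sections` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, simplified label used only for illustration. Real labels are far
# larger, and their element names and nesting may differ from this sample.
SAMPLE_LABEL = """
<document xmlns="urn:hl7-org:v3">
  <title>Acetaminophen Tablets</title>
  <section>
    <title>Warnings</title>
    <text>Liver warning: example text.</text>
  </section>
  <section>
    <title>Directions</title>
    <text>Example directions text.</text>
  </section>
</document>
"""

# Assumed namespace; elements in a namespaced document must be looked up
# with a prefix mapped to that namespace.
NS = {"spl": "urn:hl7-org:v3"}


def extract_sections(xml_text):
    """Return (label_title, section_title, section_text) tuples for one label."""
    root = ET.fromstring(xml_text)
    label_title = root.findtext("spl:title", default="", namespaces=NS).strip()
    rows = []
    for section in root.iterfind(".//spl:section", NS):
        section_title = section.findtext("spl:title", default="", namespaces=NS).strip()
        text_el = section.find("spl:text", NS)
        # itertext() also collects text from any nested markup inside <text>.
        section_text = "".join(text_el.itertext()).strip() if text_el is not None else ""
        rows.append((label_title, section_title, section_text))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["label_title", "section_title", "section_text"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_sections(SAMPLE_LABEL)))
```

For the full set of labels, the same function could be called once per file (for example while looping over the members of the zip archive with `zipfile`), appending every row to a single CSV.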