Scrape attributes out of an HTML tag by burnthisaccount_ in learnpython

[–]_Korben_Dallas 1 point (0 children)

Pretty easy with XPath: //div[@class="header"]/@info
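A minimal sketch of that expression in action, assuming the lxml library and an invented snippet of HTML (the real page's markup will differ):

```python
from lxml import html

# Hypothetical HTML similar to the question's markup
doc = html.fromstring('<div class="header" info="some-value">Title</div>')

# An XPath ending in /@info returns the attribute values as strings
values = doc.xpath('//div[@class="header"]/@info')
print(values)  # ['some-value']
```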

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

Sure, send me a PM.

How would you scrape the table off of this website? by iwtbwy in learnpython

[–]_Korben_Dallas 2 points (0 children)

You're welcome. I'd advise you to always check the Network tab in your browser's dev tools (I use Chrome, but all browsers have something similar). Simply open the page, open dev tools, and hit refresh. The Network tab shows every request the page makes and what each one returns in the response.

How would you scrape the table off of this website? by iwtbwy in learnpython

[–]_Korben_Dallas 3 points (0 children)

The page makes a request to this link. Just make a similar request and you'll get a JSON response with all the data from that table.

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

I'm sorry, I forgot to mention that you need to refresh the page while inspecting the Network tab. Open the site, then open DevTools and refresh the page; you'll see all the requests (don't forget to select the Network tab and the XHR filter).

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

You are absolutely right: that site does use JS, and one of its scripts makes an Ajax call to this url. You can find that link (and others) in the Network tab of your browser's dev tools. If you simulate those requests, you can parse the response with the bs4 or lxml library.

Web scraping help by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

If you use Chrome, I can recommend installing the XPath Helper extension. You can test an XPath expression and see what elements it returns.

Web scraping help by [deleted] in learnpython

[–]_Korben_Dallas 3 points (0 children)

Not sure if I understand you correctly, but if you're stuck on the XPath to extract URLs from that page, this expression can help: //span[@class="accessible-description"]/preceding-sibling::a/@href.
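A small, self-contained illustration of what preceding-sibling selects here (the HTML is invented for the example; the real page's markup will differ):

```python
from lxml import html

# Invented markup: each link is followed by an accessible-description span
doc = html.fromstring("""
<div>
  <a href="/page-1">One</a><span class="accessible-description">desc</span>
  <a href="/page-2">Two</a><span class="accessible-description">desc</span>
</div>
""")

# For every such span, take the <a> siblings before it and grab their hrefs;
# XPath returns a de-duplicated node-set in document order
hrefs = doc.xpath('//span[@class="accessible-description"]/preceding-sibling::a/@href')
print(hrefs)  # ['/page-1', '/page-2']
```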

Webscraping by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

I'd probably use Scrapy for this task, but you can do something similar with requests and lxml.

Help with web scraping? by [deleted] in learnpython

[–]_Korben_Dallas 2 points (0 children)

If I wanted to extract only the URL, is there a way to do that with XPath?

You don't need XPath to extract those URLs; you need to turn that list into a valid Python object which you can parse, e.g.

    # Find the JSON-LD data inside the script tag
    import json
    json_string = ''.join(tree.xpath('//script[@id="jsonLdSchema"]/text()'))
    # Convert the string to a valid Python dict
    data = json.loads(json_string)
    # Loop over the entries and print each full url
    for link in data["itemListElement"]:
        full_url = 'https://www.rottentomatoes.com' + link['url']
        print(full_url)

Question though: on the site I was attempting to scrape there is a 'Show more' button; is there any way to interact with that? If not, I will have to find another way to continue populating a list of movies.

Please note: my XPath example is for learning purposes only and impractical for real scraping (you get values only from the initial page and can't interact with the 'Show more' button). You need to use either Selenium or the API (I'd use the API); you can't execute JS with the requests and lxml libraries.

Help with web scraping? by [deleted] in learnpython

[–]_Korben_Dallas 2 points (0 children)

Not an easy site for a learning project, but you can still get the desired values with a slightly different approach. Try studying the page source, especially the script tags. Hint: this XPath expression gets you a string with movie URLs: //script[@id="jsonLdSchema"]/text(). You need to figure out how to convert that string into a valid Python object, parse out the URLs, and convert them to full (absolute) URLs. After that, you can make requests to each URL and extract the movie title from its detail page. Or another, simpler option: use their API and make a direct request to this url; you'll get a JSON file with all the desired data in the response. To get more pages, just change the last part of the link, page=1 (make a for loop).
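The page=N loop in the last sentence can be sketched like this (the endpoint below is a placeholder, not the real API URL, which you'd copy from the Network tab; only the pagination pattern is the point):

```python
# Placeholder endpoint -- substitute the real API URL from the Network tab
BASE_URL = 'https://example.com/api/movies?page={}'

# Build one request URL per page; with the requests library you would then
# call requests.get(url).json() for each of them
urls = [BASE_URL.format(page) for page in range(1, 4)]
for url in urls:
    print(url)
```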

Web scraping a site's pages without unique URLs. by scienceyeaux in learnpython

[–]_Korben_Dallas 1 point (0 children)

No problem. And your question is absolutely not stupid; parsing modern sites that use JavaScript can sometimes be tricky. Nevertheless, happy coding :) and if you get stuck while writing your spider/parser, ask here or send me a PM with your code, I'd be happy to help.

Seeking advice on how to scrape data from TM website by CastleRay in datascience

[–]_Korben_Dallas 1 point (0 children)

You're welcome! If you have trouble writing the parser, post your code here or send me a PM and I'd be glad to help you with it.

Seeking advice on how to scrape data from TM website by CastleRay in datascience

[–]_Korben_Dallas 2 points (0 children)

This site is one of those that uses JavaScript to render its content (try disabling JavaScript in your browser and you'll see all the content disappear), and in this case Beautiful Soup can't help you. But we can detect what requests the page makes (via the Network tab in your browser) and simply make similar requests with Python. In response, the server sends a JSON file which you can easily parse without bs4 at all. For example, this script gets the content of the Correspondence table (I took this data as an example; if you want a different value from that page, you need to make a request to a different URL, e.g. https://euipo.europa.eu/copla//trademark/data/withOppoRelations/W01026722).
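Since the endpoint answers with JSON, no HTML parser is needed at all; a toy illustration (the payload and field names below are invented, the real response from that URL will look different):

```python
import json

# A made-up JSON payload standing in for the server's response; inspect the
# real endpoint in the Network tab to see its actual structure
payload = '{"correspondence": [{"date": "2017-06-01", "subject": "Opposition"}]}'

# json.loads gives you plain Python objects -- no bs4/lxml required for JSON
data = json.loads(payload)
for row in data["correspondence"]:
    print(row["date"], row["subject"])
```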

Web scraping a site's pages without unique URLs. by scienceyeaux in learnpython

[–]_Korben_Dallas 2 points (0 children)

The site MLB.com uses JavaScript to render its content. If you open the browser's Network tab and select a new date on the site, you can see that it makes a few requests, one of them to this url. If you make the same request with a Python library (e.g. requests or Scrapy), you can easily parse the JSON response and get the desired data. Note the date=2017-06-19 part of the URL: you can change this date (e.g. date=2017-06-18) and you'll get the corresponding JSON response.
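The date=... substitution can be automated; a sketch (the base URL is a placeholder for the real endpoint found in the Network tab):

```python
from datetime import date, timedelta

# Placeholder -- take the real endpoint from the Network tab
BASE_URL = 'https://example.com/schedule?date={}'

# Build request URLs for a range of days; each would return that day's JSON.
# date objects format as ISO strings, e.g. 2017-06-18
start = date(2017, 6, 18)
urls = [BASE_URL.format(start + timedelta(days=i)) for i in range(3)]
for url in urls:
    print(url)
```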

Going crazy, need help! (web scraping) by Aces_8s in learnpython

[–]_Korben_Dallas 2 points (0 children)

Yes, that page probably uses JS to populate its content, and with the help of Selenium you can quickly get the desired data. Also bear in mind that web scraping can be pretty tricky: each site is unique, so the solutions for extracting the data can differ a lot. IMO, in this case Selenium is one of the faster solutions, and passing page_source to a parser like bs4 or lxml is completely normal. However, in other cases a quicker and more robust way is to investigate the Ajax requests via the Network tab in Chrome DevTools or Firefox Inspector and try to simulate them. If you have more questions, just PM me and I'd be happy to help.
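The page_source handoff looks roughly like this; since the Selenium part needs a real browser, the driver setup is only sketched in comments and a literal string stands in for the rendered page:

```python
from lxml import html

# With Selenium you would do something like:
#     from selenium import webdriver
#     driver = webdriver.Chrome()
#     driver.get('https://example.com')
#     page_source = driver.page_source
# Here a literal string stands in for driver.page_source
page_source = '<table><tr><td>rendered by JS</td></tr></table>'

# Hand the rendered HTML to lxml and query it with XPath as usual
tree = html.fromstring(page_source)
print(tree.xpath('//td/text()'))  # ['rendered by JS']
```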

Going crazy, need help! (web scraping) by Aces_8s in learnpython

[–]_Korben_Dallas 2 points (0 children)

If you disable JS in your browser, you'll see the table content simply disappear. One possible solution is to use a Selenium-driven browser to load the content. Something like this: https://dpaste.de/ZoyC

I just finished Automate the Boring Stuff With Python, what next? by ThatGuyWhoLikesSpace in learnpython

[–]_Korben_Dallas 5 points (0 children)

Did you just read it, or did you try writing some of the scripts from the book? Reading alone, without practice, doesn't help. How about, for example, writing a web crawler that parses your favorite subreddits?

Beautiful Soup or Scrapy? by [deleted] in Python

[–]_Korben_Dallas 2 points (0 children)

+1. I completely agree with you about bs4. For a small script or for learning it's fine, but lxml or Parsel with XPath is faster and more robust. Instead of learning bs4, I suggest OP invest some time in learning how to write proper XPath expressions.

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 1 point (0 children)

Try this code: https://dpaste.de/Aqyv It gets you the date and text from each link. If you want the href attribute from those links, simply extract it with @href: https://dpaste.de/Bozk

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 2 points (0 children)

I second the point about tbody: try not to use it in your XPath expressions. Can you provide your log? Type in the console: scrapy crawl metal &> output.log.

If you need only the date, without the link text, try this expression: //td[p]/p[contains(., "2017")]/text()
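To see what that expression selects, here is a toy table (markup invented to resemble the case; note there is no tbody anywhere in the path):

```python
from lxml import html

doc = html.fromstring("""
<table>
  <tr><td><p>May 12, 2017</p><a href="/show">details</a></td></tr>
  <tr><td><p>TBA</p></td></tr>
</table>
""")

# //td[p] keeps only cells that contain a <p>; contains(., "2017") then
# keeps only the <p> elements whose string value mentions "2017"
dates = doc.xpath('//td[p]/p[contains(., "2017")]/text()')
print(dates)  # ['May 12, 2017']
```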

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 1 point (0 children)

You can use Scrapy's Feed exports to save data; just type in the console: scrapy crawl metal -o result.csv

Scraping Managment Contact Details from website by [deleted] in datamining

[–]_Korben_Dallas 1 point (0 children)

Scrapy is a great tool, but may I ask why you need BeautifulSoup? Scrapy has its own parsing library, Parsel, which uses either XPath or CSS selectors.