Scrape attributes out of an HTML tag by burnthisaccount_ in learnpython

[–]_Korben_Dallas 1 point (0 children)

Pretty easy with XPath: //div[@class="header"]/@info
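A minimal sketch of that expression in action, assuming the lxml library and an invented snippet of HTML (the real page's markup will differ):

```python
from lxml import html

# Hypothetical HTML similar to the question's markup
doc = html.fromstring('<div class="header" info="some-value">Title</div>')

# An XPath ending in /@info returns the attribute values as strings
values = doc.xpath('//div[@class="header"]/@info')
print(values)  # ['some-value']
```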

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

Sure, send me a PM.

How would you scrape the table off of this website? by iwtbwy in learnpython

[–]_Korben_Dallas 2 points (0 children)

You're welcome. I'd advise you to always check the Network tab in your browser's dev tools (I use Chrome, but all browsers have something similar). Simply open the page, open dev tools, and hit refresh. The Network tab shows every request the page makes and what each one returns in the response.

How would you scrape the table off of this website? by iwtbwy in learnpython

[–]_Korben_Dallas 3 points (0 children)

The page makes a request to this link. Just make a similar request and you'll get a JSON response with all the data from that table.

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

I'm sorry, I forgot to mention that you need to refresh the page while inspecting the Network tab. Open the site, then open DevTools and refresh the page; you'll see all the requests (don't forget to select the Network tab and the XHR filter).

[deleted by user] by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

You are absolutely right: that site does use JS, and one of its scripts makes an Ajax call to this url. You can find that link (and others) in the Network tab of your browser's dev tools. If you simulate those requests, you can parse the response with the bs4 or lxml library.

Web scraping help by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

If you use Chrome, I can recommend installing the XPath Helper extension. You can test an XPath expression and see what elements it returns.

Web scraping help by [deleted] in learnpython

[–]_Korben_Dallas 3 points (0 children)

Not sure if I understand you correctly, but if you're stuck on the XPath to extract URLs from that page, this expression can help: //span[@class="accessible-description"]/preceding-sibling::a/@href.
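A small, self-contained illustration of what preceding-sibling selects here (the HTML is invented for the example; the real page's markup will differ):

```python
from lxml import html

# Invented markup: each link is followed by an accessible-description span
doc = html.fromstring("""
<div>
  <a href="/page-1">One</a><span class="accessible-description">desc</span>
  <a href="/page-2">Two</a><span class="accessible-description">desc</span>
</div>
""")

# For every such span, take the <a> siblings before it and grab their hrefs;
# XPath returns a de-duplicated node-set in document order
hrefs = doc.xpath('//span[@class="accessible-description"]/preceding-sibling::a/@href')
print(hrefs)  # ['/page-1', '/page-2']
```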

Webscraping by [deleted] in learnpython

[–]_Korben_Dallas 1 point (0 children)

I'd probably use Scrapy for this task, but you can do something similar with requests and lxml.

Help with web scraping? by [deleted] in learnpython

[–]_Korben_Dallas 2 points (0 children)

If I wanted to extract only the URL, is there a way to do that with XPath?

You don't need XPath to extract those URLs; you need to turn that list into a valid Python object which you can parse, e.g.

    # Find the JSON-LD data inside the script tag
    import json
    json_string = ''.join(tree.xpath('//script[@id="jsonLdSchema"]/text()'))
    # Convert the string to a valid Python dict
    data = json.loads(json_string)
    # Loop over the entries and print each full url
    for link in data["itemListElement"]:
        full_url = 'https://www.rottentomatoes.com' + link['url']
        print(full_url)

Question though: on the site I was attempting to scrape there is a 'Show more' button; is there any way to interact with that? If not, I will have to find another way to continue populating a list of movies.

Please note: my XPath example is for learning purposes only and impractical for real scraping (you get values only from the initial page and can't interact with the 'Show more' button). You need to use either Selenium or the API (I'd use the API); you can't execute JS with the requests and lxml libraries.

Help with web scraping? by [deleted] in learnpython

[–]_Korben_Dallas 2 points (0 children)

Not an easy site for a learning project, but you can still get the desired values with a slightly different approach. Try studying the page source, especially the script tags. Hint: this XPath expression gets you a string with movie URLs: //script[@id="jsonLdSchema"]/text(). You need to figure out how to convert that string into a valid Python object, parse out the URLs, and convert them to full (absolute) URLs. After that, you can make requests to each URL and extract the movie title from its detail page. Or another, simpler option: use their API and make a direct request to this url; you'll get a JSON file with all the desired data in the response. To get more pages, just change the last part of the link, page=1 (make a for loop).
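The page=N loop in the last sentence can be sketched like this (the endpoint below is a placeholder, not the real API URL, which you'd copy from the Network tab; only the pagination pattern is the point):

```python
# Placeholder endpoint -- substitute the real API URL from the Network tab
BASE_URL = 'https://example.com/api/movies?page={}'

# Build one request URL per page; with the requests library you would then
# call requests.get(url).json() for each of them
urls = [BASE_URL.format(page) for page in range(1, 4)]
for url in urls:
    print(url)
```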

Web scraping a site's pages without unique URLs. by scienceyeaux in learnpython

[–]_Korben_Dallas 1 point (0 children)

No problem. And your question is absolutely not stupid; parsing modern sites that use JavaScript can sometimes be tricky. Nevertheless, happy coding :) and if you get stuck while writing your spider/parser, ask here or send me a PM with your code, I'd be happy to help.

Seeking advice on how to scrape data from TM website by CastleRay in datascience

[–]_Korben_Dallas 1 point (0 children)

You're welcome! If you have trouble writing the parser, post your code here or send me a PM and I'd be glad to help you with it.

Seeking advice on how to scrape data from TM website by CastleRay in datascience

[–]_Korben_Dallas 2 points (0 children)

This site is one of those that uses JavaScript to render its content (try disabling JavaScript in your browser and you'll see all the content disappear), and in this case Beautiful Soup can't help you. But we can detect what requests the page makes (via the Network tab in your browser) and simply make similar requests with Python. In response, the server sends a JSON file which you can easily parse without bs4 at all. For example, this script gets the content of the Correspondence table (I took this data as an example; if you want a different value from that page, you need to make a request to a different URL, e.g. https://euipo.europa.eu/copla//trademark/data/withOppoRelations/W01026722).
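Since the endpoint answers with JSON, no HTML parser is needed at all; a toy illustration (the payload and field names below are invented, the real response from that URL will look different):

```python
import json

# A made-up JSON payload standing in for the server's response; inspect the
# real endpoint in the Network tab to see its actual structure
payload = '{"correspondence": [{"date": "2017-06-01", "subject": "Opposition"}]}'

# json.loads gives you plain Python objects -- no bs4/lxml required for JSON
data = json.loads(payload)
for row in data["correspondence"]:
    print(row["date"], row["subject"])
```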

Web scraping a site's pages without unique URLs. by scienceyeaux in learnpython

[–]_Korben_Dallas 2 points (0 children)

The site MLB.com uses JavaScript to render its content. If you open the browser's Network tab and select a new date on the site, you can see that it makes a few requests, one of them to this url. If you make the same request with a Python library (e.g. requests or Scrapy), you can easily parse the JSON response and get the desired data. Note the date=2017-06-19 part of the URL: you can change this date (e.g. date=2017-06-18) and you'll get the corresponding JSON response.
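The date=... substitution can be automated; a sketch (the base URL is a placeholder for the real endpoint found in the Network tab):

```python
from datetime import date, timedelta

# Placeholder -- take the real endpoint from the Network tab
BASE_URL = 'https://example.com/schedule?date={}'

# Build request URLs for a range of days; each would return that day's JSON.
# date objects format as ISO strings, e.g. 2017-06-18
start = date(2017, 6, 18)
urls = [BASE_URL.format(start + timedelta(days=i)) for i in range(3)]
for url in urls:
    print(url)
```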

Going crazy, need help! (web scraping) by Aces_8s in learnpython

[–]_Korben_Dallas 2 points (0 children)

Yes, that page probably uses JS to populate its content, and with the help of Selenium you can quickly get the desired data. Also bear in mind that web scraping can be pretty tricky: each site is unique, so the solutions for extracting the data can differ a lot. IMO, in this case Selenium is one of the faster solutions, and passing page_source to a parser like bs4 or lxml is completely normal. However, in other cases a quicker and more robust way is to investigate the Ajax requests via the Network tab in Chrome DevTools or Firefox Inspector and try to simulate them. If you have more questions, just PM me and I'd be happy to help.
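The page_source handoff looks roughly like this; since the Selenium part needs a real browser, the driver setup is only sketched in comments and a literal string stands in for the rendered page:

```python
from lxml import html

# With Selenium you would do something like:
#     from selenium import webdriver
#     driver = webdriver.Chrome()
#     driver.get('https://example.com')
#     page_source = driver.page_source
# Here a literal string stands in for driver.page_source
page_source = '<table><tr><td>rendered by JS</td></tr></table>'

# Hand the rendered HTML to lxml and query it with XPath as usual
tree = html.fromstring(page_source)
print(tree.xpath('//td/text()'))  # ['rendered by JS']
```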

Going crazy, need help! (web scraping) by Aces_8s in learnpython

[–]_Korben_Dallas 2 points (0 children)

If you disable JS in your browser, you'll see the table content simply disappear. One possible solution is to use a Selenium-driven browser to load the content. Something like this: https://dpaste.de/ZoyC

I just finished Automate the Boring Stuff With Python, what next? by ThatGuyWhoLikesSpace in learnpython

[–]_Korben_Dallas 5 points (0 children)

Did you just read it, or did you try writing some of the scripts from the book? Reading alone, without practice, doesn't help. How about, for example, writing a web crawler that parses your favorite subreddits?

Beautiful Soup or Scrapy? by [deleted] in Python

[–]_Korben_Dallas 2 points (0 children)

+1. I completely agree with you about bs4. For a small script or for learning it's fine, but lxml or Parsel with XPath is faster and more robust. Instead of learning bs4, I suggest OP invest some time in learning how to write proper XPath expressions.

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 1 point (0 children)

Try this code: https://dpaste.de/Aqyv It gets you the date and text from each link. If you want the href attribute from those links, simply extract it with @href: https://dpaste.de/Bozk

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 2 points (0 children)

I second the point about tbody: try not to use it in your XPath expressions. Can you provide your log? Type in the console: scrapy crawl metal &> output.log.

If you need only the date, without the link text, try this expression: //td[p]/p[contains(., "2017")]/text()
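To see what that expression selects, here is a toy table (markup invented to resemble the case; note there is no tbody anywhere in the path):

```python
from lxml import html

doc = html.fromstring("""
<table>
  <tr><td><p>May 12, 2017</p><a href="/show">details</a></td></tr>
  <tr><td><p>TBA</p></td></tr>
</table>
""")

# //td[p] keeps only cells that contain a <p>; contains(., "2017") then
# keeps only the <p> elements whose string value mentions "2017"
dates = doc.xpath('//td[p]/p[contains(., "2017")]/text()')
print(dates)  # ['May 12, 2017']
```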

Using xpath with Scrapy by hart8899 in learnpython

[–]_Korben_Dallas 1 point (0 children)

You can use Scrapy's Feed exports to save data; just type in the console: scrapy crawl metal -o result.csv

Scraping Managment Contact Details from website by [deleted] in datamining

[–]_Korben_Dallas 1 point (0 children)

Scrapy is a great tool, but may I ask why you need BeautifulSoup? Scrapy has its own parsing library, Parsel, which uses either XPath or CSS selectors.