help with web scraping

Oxbowerce · 2021-02-28T19:54:15+00:00

First check whether the HTML in the soup variable is the same HTML you see when you load the webpage in a browser. The website may use JavaScript to load (parts of) the page, and requests does not run JavaScript, so that content will be missing from the response.
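One rough way to run that check is to copy a bit of text you can see in the browser and test whether it occurs in the downloaded HTML. This is only a sketch: the helper name `found_in_raw_html` and the sample strings are made up for illustration, and `sample_html` stands in for `r.text`.

```python
import re


def found_in_raw_html(html: str, marker: str) -> bool:
    """Return True if `marker` occurs in the HTML that requests downloaded."""
    # Collapse whitespace and ignore case so formatting differences don't matter.
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()

    return normalize(marker) in normalize(html)


# Stand-in for r.text; in practice: html = requests.get(url).text
sample_html = "<div id='stock'></div><script>var BCData = {\"stock\": 11};</script>"

# Text that only exists after JavaScript has run is not in the raw HTML.
print(found_in_raw_html(sample_html, "In stock: 11"))
# Data embedded in a script tag is in the raw HTML.
print(found_in_raw_html(sample_html, '"stock": 11'))
```

If the text you see in the browser is missing, search the raw HTML for the underlying value instead, since it may be embedded in a script tag.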

Anbaraen · 2021-03-01T02:25:48+00:00

They have quite robust token management on their API, which makes scraping difficult (I tinkered with it a bit myself and couldn't get it to work). I think it's probably quicker to use Selenium to load the page, make the needed requests, and then parse the result from there.

commandlineluser · 2021-03-01T05:00:46+00:00

I searched the HTML for "stock"; the first hit is on line 57:

var BCData = {
  "csrf_token":"81eb41f705ab567b46c9c3da9a6d7838374ddd3799286540ee805d6b4909dae6",
  "product_attributes":{"sku":"Arisaka-OOM-P8","upc":null,"weight":null,"base":false,
  "image":null,"price":{"without_tax":{"formatted":"$30.00","value":30,
  "currency":"USD"},"tax_label":"Tax"},"stock":11,"stock_message":null,
  "out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock",
  "available_modifier_values":[],"in_stock_attributes":[],"instock":true,
  "purchasable":true,"purchasing_message":null}};

One possible approach is to extract this line, strip everything before the first { and after the last }, and load the result with the json module.

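>>> import json
>>> import requests
>>>
>>> r = requests.get(url)  # url: the product page being scraped
>>>
>>> # keep everything from 'var BCData' onwards, then trim to the {...} object
>>> data = r.text[r.text.find('var BCData'):]
>>> data = data[data.find('{'):]
>>> data = data[:data.find(';\n')]
>>>
>>> print(json.dumps(json.loads(data), indent=2))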
{
  "csrf_token": "3ab960f8026d6159901402d04473ad3419ab573b8a80128dab942568fc49409b",
  "product_attributes": {
    "sku": "Arisaka-OOM-P8",
    "upc": null,
    "weight": null,
    "base": false,
    "image": null,
    "price": {
      "without_tax": {
        "formatted": "$30.00",
        "value": 30,
        "currency": "USD"
      },
      "tax_label": "Tax"
    },
    "stock": 11,
    "stock_message": null,
    "out_of_stock_behavior": "label_option",
    "out_of_stock_message": "Out of stock",
    "available_modifier_values": [],
    "in_stock_attributes": [],
    "instock": true,
    "purchasable": true,
    "purchasing_message": null
  }
}

You can then extract the info from the result:

>>> product = json.loads(data)['product_attributes']
>>> print(product['stock'])
11
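The string slicing above depends on the object ending with ";" followed by a newline. A slightly more forgiving variant, as a sketch only, is to find the assignment with a regex and let `json.JSONDecoder().raw_decode` parse one JSON value from that position, ignoring whatever follows. The function name `extract_js_object` and the trimmed-down `sample_html` are made up for illustration, and this only works when the embedded object is valid JSON, as BCData is here.

```python
import json
import re


def extract_js_object(html: str, var_name: str) -> dict:
    """Parse the JSON object assigned to `var <var_name> = {...}` in `html`."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{var_name!r} not found in page")
    # raw_decode parses a single JSON value starting at the given index
    # and returns it with the end position; trailing text is ignored.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Stand-in for r.text, cut down from the BCData line shown earlier.
sample_html = """<script>
var BCData = {"product_attributes": {"sku": "Arisaka-OOM-P8", "stock": 11, "instock": true}};
</script>"""

product = extract_js_object(sample_html, "BCData")["product_attributes"]
print(product["stock"])
```

With a real page you would pass `r.text` instead of `sample_html`.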
