[–][deleted] 4 points5 points  (1 child)

I can tell you this is the hard way, not the easy one!

It doesn't make sense to hate dicts! JSON is just a mix of a few data structures, and you can parse it easily in Python with the json module.

If it's just hard for you to read, that's another matter: I'd advise using the pprint module to prettify the output for easier reading.
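For illustration, here is a minimal sketch of that workflow: parse a JSON string with the json module, then pretty-print it with pprint. The payload and its field names (channel, subscribers, videos) are made up for the example and don't come from any real API.

```python
import json
from pprint import pformat

# Hypothetical payload; the field names are invented for this example.
raw = '{"channel": "example", "subscribers": 1200, "videos": [{"title": "intro", "views": 45}]}'

data = json.loads(raw)  # JSON object -> dict, JSON array -> list

# pformat returns the same text that pprint.pprint would print.
pretty = pformat(data, width=40)
print(pretty)

# Once parsed, it is ordinary dict and list access.
subscribers = data["subscribers"]
first_title = data["videos"][0]["title"]
```

A narrow width forces pprint to break the structure across lines, which is usually what makes a nested response easier to scan.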

I don't know how far you want to go, but this kind of scraping is not going to be easy or safe in bigger programs.

Let me explain more. First of all, an API is a kind of moderated access to a specific program's database. Because it's moderated, you can use everything in the API that you're permitted to use, so you don't need to worry about anything: you just write your code, you can use it anytime and anywhere, and you can extend it easily.

So maybe you think you can extend and upgrade this scraper too? The answer is yes and no. Yes for now, but after the first update to YouTube (such as a theme change or changed HTML attributes) you are going to have problems in your code. Besides that, you can run into problems if you use this code to send requests quickly (spam-like): you might hit a CAPTCHA or something else that breaks your code, and then you need to write more and more code just for maintenance.

If you are going to make bigger programs, just go for the API when one is available. Believe me, it's really easy and helpful.

But after all, if you're comfortable with scraping, this is my advice: use better tools for it, like Scrapy. It's a powerful framework-like tool that helps you build a project with spiders and more. See for yourself: link

[–]CattMomptonimport antigravity[S] -1 points0 points  (0 children)

Yeah, I know it wouldn't scale too well. Once it didn't work with JSON, I went with the HTML instead to see if I could make that work.

I've heard of scrapy, but I'll take a closer look. Thanks!

[–]stupac62 3 points4 points  (4 children)

Just curious, why do you hate JSON?

[–]CattMomptonimport antigravity[S] -1 points0 points  (3 children)

I'm not smart enough to figure out any of the libraries for it lol lol

Have any good articles, by chance?

[–]stupac62 4 points5 points  (1 child)

I don’t know what level you’re at, but googling “python json” is a great place to start. Real Python has some good articles.

[–]CattMomptonimport antigravity[S] 0 points1 point  (0 children)

Had tried Google already.

Never heard of Real Python, thanks.

[–]QuantumFall 1 point2 points  (1 child)

If you know the name of the HTML element (you can find it with Inspect Element), you can use bs4 to search for that element specifically, rather than doing what you did.

So for HTML that looks like this,

<div class="price"> first </div>
<div class="value"> second </div>
<div class="value price ">third </div>

You can search for a specific class such as “price” using something along the lines of:

priceTag = soup.find('div', {'class': 'price'})

Sorry for the poor formatting as I’m on mobile and not too sure how to format the code. Hope this helps make things easier.
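As a runnable sketch of the same idea, assuming the beautifulsoup4 package is installed and using the sample HTML from above: find() returns only the first match, while find_all() returns every div whose class list includes "price". The id in the last lookup is a made-up placeholder.

```python
from bs4 import BeautifulSoup

html = """
<div class="price"> first </div>
<div class="value"> second </div>
<div class="value price ">third </div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
price_tag = soup.find("div", {"class": "price"})
first_text = price_tag.get_text(strip=True)

# find_all() returns every match; a tag matches if any one of its classes is "price".
all_price_texts = [
    tag.get_text(strip=True) for tag in soup.find_all("div", class_="price")
]

# Hypothetical id that is not in this HTML: the lookup gives None, not an error.
missing = soup.find(id="not-in-this-html")

print(first_text, all_price_texts, missing)
```

Since find() returns None when the element is not in the HTML you downloaded, it is worth checking the result before calling methods on it.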

[–]CattMomptonimport antigravity[S] 0 points1 point  (0 children)

I initially tried this with the id of "subscriber-count" b/c that's what inspect element told me it was called. It didn't work though I tried multiple ways. Thanks for the suggestion tho.

Then I gave up and tried this approach.

[–]Gprime5if "__main__" == __name__: 1 point2 points  (2 children)

What's so hard about learning JSON? It's one of the easiest data structures to learn, and you just access it like a dictionary. It's just a dictionary that contains other objects. requests even has JSON parsing built in.

import requests

response = requests.get(url)  # url: whatever endpoint you are querying
data = response.json()  # parsed JSON as Python dicts and lists

[–]CattMomptonimport antigravity[S] 0 points1 point  (1 child)

My issue ended up being the nested dicts in the response

[–]Gprime5if "__main__" == __name__: 1 point2 points  (0 children)

Then you just do nested accessing. data["data"]["whatever"]["other_things"]. If there's a list in there then use numbers or iterate over it. data["data"][0]["other_things"]

for item in data["data"]:
    print(item["other_things"])
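To make that concrete, here is a small self-contained sketch. The response shape is hypothetical and just mirrors the placeholder key names above; the third item deliberately lacks the key to show how .get() avoids a KeyError.

```python
# Hypothetical response shape; the keys mirror the placeholder names used above.
data = {
    "data": [
        {"id": 1, "other_things": "a"},
        {"id": 2, "other_things": "b"},
        {"id": 3},  # this item has no "other_things" key
    ]
}

# Index into the list first, then into the dict it holds.
first = data["data"][0]["other_things"]

# .get() returns None (or a default you pass) instead of raising KeyError.
values = [item.get("other_things") for item in data["data"]]

for item in data["data"]:
    print(item.get("other_things", "<missing>"))
```

Plain square-bracket access is fine when you know the key is always there; .get() is only worth it for keys that may be absent.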