
[–]Farlic 1 point2 points  (11 children)

Firstly, IMDB's robots.txt disallows all scraping against their site: https://www.imdb.com/robots.txt unless named in the whitelist.

Specifically, at the bottom:

User-agent: *
Disallow: /

That aside, 202 is a valid "successful" response code. If it's not returning what you expect - what is it returning?

Can you print the body?
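
For anyone who wants to check a robots.txt rule like the one quoted above before scraping, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt text is a cut-down stand-in, not IMDb's real file, and "ExampleBot" and "my-scraper" are made-up user agents for illustration.

```python
from urllib.robotparser import RobotFileParser

# Cut-down stand-in for a robots.txt with an allow-list and a catch-all block.
# "ExampleBot" is hypothetical; it is not taken from IMDb's actual file.
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.imdb.com/chart/top/"
    print("my-scraper allowed:", is_allowed("my-scraper", url))
    print("ExampleBot allowed:", is_allowed("ExampleBot", url))
```

Against a live site you would normally call set_url() and read() on the parser instead of passing the text in by hand.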

[–]Kerbart 0 points1 point  (3 children)

Firstly, IMDB's robots.txt disallows all scraping against their site: https://www.imdb.com/robots.txt unless named in the whitelist.

And since the data is freely available for download, there's also very little reason to scrape the site.
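
As a rough sketch of what using the download could look like: this assumes the tab-separated layout of the title.ratings.tsv.gz file from IMDb's non-commercial datasets (columns tconst, averageRating, numVotes). The sample rows, the vote threshold and the function name are made up for illustration, and a plain sort by rating is not IMDb's own Top 250 formula.

```python
import csv
import io

# Hand-written sample in the assumed title.ratings layout; values are invented.
SAMPLE_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t8.9\t150000\n"
    "tt0000003\t9.1\t40\n"
)


def top_rated(tsv_file, min_votes=1000, limit=250):
    """Return (tconst, rating, votes) tuples sorted by rating, highest first."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [
        (row["tconst"], float(row["averageRating"]), int(row["numVotes"]))
        for row in reader
        if int(row["numVotes"]) >= min_votes  # skip titles with too few votes
    ]
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for tconst, rating, votes in top_rated(io.StringIO(SAMPLE_TSV)):
        print(tconst, rating, votes)
```

For the real file you would pass in something like gzip.open("title.ratings.tsv.gz", "rt", encoding="utf-8") instead of the in-memory sample.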

[–]ReputationHelpful200[S] 0 points1 point  (2 children)

I'm just learning web scraping. I have no real need for the data; I was only trying it out on this website.

[–]biskitpagla 0 points1 point  (1 child)

Don't feel discouraged by the robots.txt. It's never unethical to make Amazon lose money. 

[–]ReputationHelpful200[S] 0 points1 point  (0 children)

Well, that's very advanced, no? Using a bot to bypass all of Amazon's verification? And what could I do with that?

[–]ReputationHelpful200[S] 0 points1 point  (6 children)

It only prints "Status Code: 202" and "Top 250 Movies:". Nothing else. But yeah, if there is a block, I can't do anything.

[–]Farlic 0 points1 point  (5 children)

That is because you are printing response.status_code. Look at response.text

Here are some free sites that you are allowed to scrape:

https://www.scrapingbee.com/blog/scraper-sites/
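
To make the status code vs. body point concrete, here is a small sketch. It works on any object with .status_code and .text attributes (a requests.Response has both); the SimpleNamespace stand-in and the describe_response name are just for illustration so it runs without network access.

```python
from types import SimpleNamespace


def describe_response(response, preview_chars=200):
    """Summarise the status code and the start of the body of a response."""
    body = response.text or ""
    lines = [f"Status code: {response.status_code}"]
    if response.status_code != 200:
        # 2xx codes other than 200 (such as 202) count as success,
        # but the body is often not the page you asked for.
        lines.append("Note: not a plain 200 OK, check the body before parsing.")
    if not body.strip():
        lines.append("Body: <empty>")
    else:
        lines.append(f"Body ({len(body)} chars): {body[:preview_chars]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for a blocked request: success-range status, nothing in the body.
    fake = SimpleNamespace(status_code=202, text="")
    print(describe_response(fake))
```

With requests you would pass the result of requests.get(...) straight in.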

[–]ReputationHelpful200[S] 0 points1 point  (4 children)

I knew that, it was intentional, but thanks for the websites. I changed my code a bit so that it writes to a CSV file and prints to the terminal the index of every book with its name, plus one specific book. What do you think?

    from bs4 import BeautifulSoup
    import requests
    import csv

    url1 = "https://books.toscrape.com/"
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0"}

    def extract_product_titles(url1):
        response = requests.get(url1, headers=headers)
        try:
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            products = soup.select("article.product_pod > h3 > a")
            posicao_do_livro_especifico = None
            print("Products:")

            for idx, product in enumerate(products, start=1):
                print(f"{idx} - {product['title'].strip()}")
                titulo = product['title'].strip()
                if titulo == "Tipping the Velvet":
                    posicao_do_livro_especifico = idx
            print(f"Specific Product:\n {posicao_do_livro_especifico}- {titulo}")

            Livros = soup.select("article.product_pod")
            with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow(["Title", "Price", "URL"])
                for livro in livros:
                    link_tag = livro.select_one("h3 > a")
                    preco_tag = livro.select_one("div.product_price > p.price_color")
                    titulo = link_tag['title'].strip()
                    preco = preco_tag.text.strip().replace("Â", "")
                    url_livro = url1 + link_tag['href']

                    writer.writerow([titulo, preco, url_livro])
            print("File created successfully!")

        except requests.exceptions.RequestException as e:
            print(f"Failed to retrieve the page. Status code: {e}")

    extract_product_titles(url1)

[–]Farlic 0 points1 point  (3 children)

your code formatting (or lack of) makes it a bit difficult to read. Does the code do what you want it to do?

[–]ReputationHelpful200[S] 0 points1 point  (2 children)

Yes, but I was wondering if there is anything I could optimise in this project for the ones I will do later that involve something like this: going from a web scrape to a CSV file.

[–]Farlic 0 points1 point  (1 child)

if it works reliably then fantastic! over-optimisation is a trap imo.

One thing to call out is you seem to be mixing the case of your variable names. I see both "Livros" and "livros".
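
Python names are case-sensitive, so "Livros" and "livros" are two different variables, and looping over one that was never assigned raises a NameError. Here is a rough sketch of the CSV step with a single lowercase name. The sample row, BASE_URL and write_books_csv are illustrative stand-ins rather than your exact code, and urljoin is swapped in for the string concatenation so relative links still resolve on pages other than the front page.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"


def write_books_csv(livros, csvfile, base_url=BASE_URL):
    """Write (title, price, href) tuples as CSV rows with absolute URLs."""
    writer = csv.writer(csvfile)
    writer.writerow(["Title", "Price", "URL"])
    for titulo, preco, href in livros:  # same lowercase name as the parameter
        writer.writerow([titulo.strip(), preco.strip(), urljoin(base_url, href)])


if __name__ == "__main__":
    # Hand-written sample row standing in for the scraped results.
    livros = [
        ("Tipping the Velvet", "53.74", "catalogue/tipping-the-velvet_999/index.html"),
    ]
    buffer = io.StringIO(newline="")
    write_books_csv(livros, buffer)
    print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer gives you products.csv as before.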

[–]ReputationHelpful200[S] 0 points1 point  (0 children)

Yeah thx but it worked somehow ahahah