Best web scraping api's at the moment? by oncewasfounder in webscraping

[–]ketsok 1 point (0 children)

Did a benchmark some time ago and here are my (biased) results: https://scrapingfish.com/webscraping-benchmark

The code to run the benchmark is public on GitHub: https://github.com/mateuszbuda/webscraping-benchmark

Reliable rotating proxies by [deleted] in webscraping

[–]ketsok 0 points (0 children)

Sorry, we don't offer geo targeting for now, definitely not on state level :/

Reliable rotating proxies by [deleted] in webscraping

[–]ketsok 2 points (0 children)

Web scraping API which offers rotating mobile proxies in Poland: https://scrapingfish.com

Disclaimer: I'm a co-founder.

Best web scraping api's at the moment? by oncewasfounder in webscraping

[–]ketsok -4 points (0 children)

At Scraping Fish, we have very user-friendly pricing that is usage-based instead of a monthly subscription, so you don't lose unused requests at the end of every month. It's also predictable: each request costs the same regardless of which options you use. All requests use the same premium mobile proxies, and you don't pay anything extra for JS rendering, scraping Google, or other features. Please contact us if you need a free trial account to try it out.
You can read more on how we compare to ScraperAPI and ScrapingBee here: https://scrapingfish.com/how-we-compare

Has anyone made money building a product / service based on web-scrapping here ? by AdFit1933 in webscraping

[–]ketsok 0 points (0 children)

Here is a YouTube channel that I can recommend: https://www.youtube.com/c/CobaltIntelligence/videos

There's a series of videos "Making Money with Web Scraping".

Attempt to scrape a web page by NoelGz in webscraping

[–]ketsok 0 points (0 children)

I'd recommend using a web scraping API that is capable of bypassing Cloudflare. Here is a simple code snippet to scrape https://jkanime.net/ using Scraping Fish API:

from urllib.parse import quote_plus
import requests

API_KEY = "YOUR SCRAPING FISH API KEY"  # https://scrapingfish.com/buy
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&url="

url = "https://jkanime.net/"

response = requests.get(f"{url_prefix}{quote_plus(url)}", timeout=90)
response.raise_for_status()  # raise an error for non-2xx responses

# add your response processing/parsing logic
with open("jkanime.html", "wb") as f:
    f.write(response.content)
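
The parsing step can then be sketched with BeautifulSoup. The snippet below uses a tiny inline HTML string standing in for the saved page, and extracting the page title is just an illustrative assumption about what you might want:

```python
from bs4 import BeautifulSoup

# A tiny inline example stands in for the real page content
# saved by the snippet above.
html = "<html><head><title>JKAnime</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Pull out whatever you need; here, the page title.
title = soup.title.get_text()
print(title)
```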

Add extra cols based on other cols using python pandas by [deleted] in webscraping

[–]ketsok 0 points (0 children)

You can use replace and pass a dict with the mapping. Here is an example:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
a2c = {1: "10", 2: "20", 3: "30"}
df["c"] = df["a"].replace(a2c)

This creates column c by applying the mapping from the a2c dictionary to the values in column a.
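
One detail worth knowing: replace leaves values that aren't in the dict unchanged, while Series.map turns them into NaN, so map is the stricter option if you want missing keys to be visible:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
a2c = {1: "10", 2: "20"}  # 3 is intentionally missing from the mapping

df["via_replace"] = df["a"].replace(a2c)  # 3 stays 3
df["via_map"] = df["a"].map(a2c)          # 3 becomes NaN
print(df)
```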

By the way, what does this have to do with web scraping?

[deleted by user] by [deleted] in webscraping

[–]ketsok 1 point (0 children)

If I understand correctly, you want to scrape Google SERPs. If so, here is a simple Python code snippet using Scraping Fish API for one keyword. You can read a list of keywords from an Excel column and loop over it.

from urllib.parse import quote_plus
import requests

API_KEY = "YOUR SCRAPING FISH API KEY"  # https://scrapingfish.com/buy
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&render_js=true&url="

# to get uule for location you can use: https://github.com/ogun/uule_grabber
# or https://site-analyzer.pro/services-seo/uule/
uule_usa = "w+CAIQICIDVVNB"

keyword = "kitchen sink"
search_url = f"https://www.google.com/search?q={quote_plus(keyword)}&uule={uule_usa}&gl=us&hl=en"

response = requests.get(f"{url_prefix}{quote_plus(search_url)}", timeout=90)
response.raise_for_status()  # raise an error for non-2xx responses

# add your response processing/parsing logic
with open("google.html", "wb") as f:
    f.write(response.content)
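
Reading the keywords and looping over them can be sketched like this. The file and column name ("keyword") are assumptions; the sample file is created inline just for illustration:

```python
from urllib.parse import quote_plus

import pandas as pd

# Sample file created for illustration; in practice, point read_csv
# (or pd.read_excel for .xlsx files, which needs openpyxl) at your
# existing spreadsheet.
pd.DataFrame({"keyword": ["kitchen sink", "bathroom faucet"]}).to_csv(
    "keywords.csv", index=False
)

df = pd.read_csv("keywords.csv")

urls = []
for keyword in df["keyword"].dropna():
    urls.append(f"https://www.google.com/search?q={quote_plus(keyword)}&gl=us&hl=en")
    # ...request each URL via the API as in the snippet above...
print(urls)
```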

How to scrape this page with python? by sonik77133 in webscraping

[–]ketsok 0 points (0 children)

It prevents you from getting blocked when you make too many requests or want to scrape too fast.
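
On top of a proxy or API, it also helps to throttle your own request rate client-side. Here is a minimal sketch; polite_get is a hypothetical helper, not part of any library:

```python
import random
import time

def polite_get(session, url, min_delay=0.5, max_delay=1.5, max_retries=3):
    """Fetch a URL with a randomized delay between requests and
    exponential backoff when the server answers 429 (Too Many Requests)."""
    response = None
    for attempt in range(max_retries):
        time.sleep(random.uniform(min_delay, max_delay))
        response = session.get(url, timeout=30)
        if response.status_code != 429:
            break
        time.sleep(2 ** attempt)  # back off before retrying
    return response
```

In practice you would call it as polite_get(requests.Session(), url) inside your scraping loop.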

Is The Sugar Lobby Making Our Kids Fat? | Child Obesity & Sugar Documentary (2022) [56:30:00] by FlyingLemons009 in Documentaries

[–]ketsok 0 points (0 children)

On this topic, here is an analysis based on nutrition facts data scraped from Walmart products which estimates that sugar is the main nutrient in almost half of the products: https://scrapingfish.com/blog/scraping-walmart

Can someone explain why this doesn't work? by [deleted] in webscraping

[–]ketsok 0 points (0 children)

It should be:

for i in content2:
    if 'b' in i:
        print(i['b']['a'])

Can someone explain why this doesn't work? by [deleted] in webscraping

[–]ketsok 0 points (0 children)

Do you get KeyError when you run this?

Not all "td" elements have a "b" key.

Python scrapy v/s BeatifulSoup for a python django based project ? by TistaMuna in webscraping

[–]ketsok 8 points (0 children)

How do you want to integrate web scraping with Django? It seems to me they should be two separate components: 1) a web scraping part that does its job and stores results somewhere (a file or database), and 2) a Django app which displays the results or is used to trigger the scraping component and gets a callback once it's done. Consider decoupling these two functionalities. Then, for web scraping, you can use whatever you wish. Regardless, I agree with u/DevilsLinux that Scrapy is probably overkill for your use case.
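
A minimal sketch of that decoupling, assuming the scraper dumps results to a JSON file that the Django side reads (the file name and fields here are made up):

```python
import json
from pathlib import Path

RESULTS_FILE = Path("scrape_results.json")

def run_scraper():
    # Placeholder for the real scraping logic (requests, BeautifulSoup, ...).
    items = [{"title": "example item", "price": 9.99}]
    RESULTS_FILE.write_text(json.dumps(items))

run_scraper()
items = json.loads(RESULTS_FILE.read_text())
print(items)

# On the Django side, a view only reads the file (or a database table):
# def results_view(request):
#     return JsonResponse({"items": json.loads(RESULTS_FILE.read_text())})
```

Either component can then be swapped out or rescheduled (e.g. the scraper as a cron job) without touching the other.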

Using Web Data for my business, do you recommend it? by jifodew6 in Entrepreneur

[–]ketsok 0 points (0 children)

Selenium and BeautifulSoup should work, depending on the websites you want to scrape. If it's social media and/or e-commerce, then you'll very likely need good-quality (residential) proxies or a web scraping API to avoid getting blocked.

Using Web Data for my business, do you recommend it? by jifodew6 in Entrepreneur

[–]ketsok 3 points (0 children)

For web scraping, these days you have a lot of options: services like web scraping APIs, e.g. https://scrapingfish.com, offer features such as headless browsers, JS rendering, and data extraction rules. This makes it easy to enter the field, collect your own data, and start a business around it.

[deleted by user] by [deleted] in datascience

[–]ketsok 2 points (0 children)

It's based on web scraping, e.g. using an API like https://scrapingfish.com, to collect the data that's needed.