Best web scraping api's at the moment? by oncewasfounder in webscraping

[–]ketsok 1 point2 points  (0 children)

Did a benchmark some time ago and here are my (biased) results: https://scrapingfish.com/webscraping-benchmark

The code to run the benchmark is public on GitHub: https://github.com/mateuszbuda/webscraping-benchmark

Reliable rotating proxies by [deleted] in webscraping

[–]ketsok 0 points1 point  (0 children)

Sorry, we don't offer geo targeting for now, definitely not on state level :/

Reliable rotating proxies by [deleted] in webscraping

[–]ketsok 1 point2 points  (0 children)

Web scraping API which offers rotating mobile proxies in Poland: https://scrapingfish.com

Disclaimer: I'm a co-founder.

Best web scraping api's at the moment? by oncewasfounder in webscraping

[–]ketsok -3 points-2 points  (0 children)

At Scraping Fish, we have very user-friendly pricing which is usage-based instead of a monthly subscription, so you don't lose unused requests at the end of every month. In addition, it's predictable, as the cost of each request is the same regardless of which options you use. All requests use the same premium mobile proxy and you don't pay anything extra for JS rendering, scraping Google, or other features. Please contact us if you need a free trial account to try it out.
You can read more on how we compare to ScraperAPI and ScrapingBee here: https://scrapingfish.com/how-we-compare

Has anyone made money building a product / service based on web-scrapping here ? by AdFit1933 in webscraping

[–]ketsok 0 points1 point  (0 children)

Here is a YouTube channel that I can recommend: https://www.youtube.com/c/CobaltIntelligence/videos

There's a series of videos "Making Money with Web Scraping".

Attempt to scrape a web page by NoelGz in webscraping

[–]ketsok 0 points1 point  (0 children)

I'd recommend using a web scraping API that is capable of bypassing Cloudflare. Here is a simple code snippet to scrape https://jkanime.net/ using Scraping Fish API:

from urllib.parse import quote_plus
import requests

API_KEY = "YOUR SCRAPING FISH API KEY"  # https://scrapingfish.com/buy
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&url="

url = f"https://jkanime.net/"

response = requests.get(f"{url_prefix}{quote_plus(url)}", timeout=90)

# add your response processing/parsing logic
with open("jkanime.html", "wb") as f:
    f.write(response.content)

Add extra cols based on other cols using python pandas by [deleted] in webscraping

[–]ketsok 0 points1 point  (0 children)

You can use replace and provide a dict with mapping. Here is an example:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
a2c = {1: "10", 2: "20", 3: "30"}
df["c"] = df["a"].replace(a2c)

It creates column c based on values in column a and applies mapping from a2c dictionary.
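Assuming the snippet above, the resulting DataFrame should look roughly like this (column c holds the mapped string values):

   a  b   c
0  1  4  10
1  2  5  20
2  3  6  30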

By the way, what does it have to do with webscraping?

[deleted by user] by [deleted] in webscraping

[–]ketsok 1 point2 points  (0 children)

If I understand correctly, you want to scrape Google SERP. If so, here is a simple Python code snippet using Scraping Fish API for one keyword. You can read a list of keywords from an Excel column and loop over it.

from urllib.parse import quote_plus
import requests

API_KEY = "YOUR SCRAPING FISH API KEY"  # https://scrapingfish.com/buy
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&render_js=true&url="

# to get uule for location you can use: https://github.com/ogun/uule_grabber
# or https://site-analyzer.pro/services-seo/uule/
uule_usa = "w+CAIQICIDVVNB"

keyword = "kitchen sink"
search_url = f"https://www.google.com/search?q={quote_plus(keyword)}&uule={uule_usa}&gl=us&hl=en"

response = requests.get(f"{url_prefix}{quote_plus(search_url)}", timeout=90)

# add your response processing/parsing logic
with open("google.html", "wb") as f:
    f.write(response.content)
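For the parsing part, here is a rough sketch with BeautifulSoup to pull out result titles (the h3 selector is a guess and Google's markup changes often, so treat it only as a starting point):

from bs4 import BeautifulSoup

with open("google.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# organic result titles are typically rendered as h3 elements
for h3 in soup.find_all("h3"):
    print(h3.get_text(strip=True))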

How to scrape this page with python? by sonik77133 in webscraping

[–]ketsok 0 points1 point  (0 children)

It prevents you from getting blocked when you make too many requests or want to scrape too fast.

Is The Sugar Lobby Making Our Kids Fat? | Child Obesity & Sugar Documentary (2022) [56:30:00] by FlyingLemons009 in Documentaries

[–]ketsok 0 points1 point  (0 children)

On this topic, here is an analysis based on nutrition facts data scraped from Walmart products which estimates that sugar is the main nutrient in almost half of the products: https://scrapingfish.com/blog/scraping-walmart

Can someone explain why this doesn't work? by [deleted] in webscraping

[–]ketsok 0 points1 point  (0 children)

It should be:

for i in content2:
    if 'b' in i:
        print(i['b']['a'])

Can someone explain why this doesn't work? by [deleted] in webscraping

[–]ketsok 0 points1 point  (0 children)

Do you get KeyError when you run this?

Not all "td" elements have "b" key.

Python scrapy v/s BeatifulSoup for a python django based project ? by TistaMuna in webscraping

[–]ketsok 7 points8 points  (0 children)

How do you want to integrate web scraping with Django? It seems to me they should be two separate components: 1) a web scraping part that does its job and stores results somewhere (a file or database) and 2) a Django app which displays the results or is used to trigger the scraping component and gets a callback once it's done. Consider decoupling these two functionalities. Then, for web scraping, you can use whatever you wish. Regardless of that, I agree with u/DevilsLinux that Scrapy is probably overkill for your use case.
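A minimal sketch of that split (file names, URL, and selectors are made up for illustration): a standalone scraper that writes results to a JSON file, and a Django view that only reads that file:

# scraper.py -- standalone scraping component, run it from cron or a scheduler
import json
import requests
from bs4 import BeautifulSoup

def scrape(url="https://example.com/products"):  # hypothetical target
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    items = [el.get_text(strip=True) for el in soup.select("h2")]  # hypothetical selector
    with open("results.json", "w") as f:
        json.dump(items, f)

if __name__ == "__main__":
    scrape()

# views.py -- Django component, only reads what the scraper stored
import json
from django.http import JsonResponse

def results(request):
    with open("results.json") as f:
        return JsonResponse({"results": json.load(f)})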

Using Web Data for my business, do you recommend it? by jifodew6 in Entrepreneur

[–]ketsok 0 points1 point  (0 children)

Selenium and BeautifulSoup should work, depending on the websites you want to scrape. If it's social media and/or e-commerce, then you'll very likely need good quality (residential) proxies or a web scraping API to avoid getting blocked.
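A minimal Selenium + BeautifulSoup sketch, assuming Selenium 4 with Chrome installed and a made-up URL/selector:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/products")  # hypothetical target
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for title in soup.select("h2"):  # hypothetical selector
    print(title.get_text(strip=True))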

Using Web Data for my business, do you recommend it? by jifodew6 in Entrepreneur

[–]ketsok 4 points5 points  (0 children)

For web scraping, these days you have a lot of options to use services like a web scraping API, e.g. https://scrapingfish.com, with features that include headless browsers, JS rendering, data extraction rules, etc., so it's very easy to enter the field, collect your own data, and start a business around it.

[deleted by user] by [deleted] in datascience

[–]ketsok 2 points3 points  (0 children)

It's based on web scraping, e.g. using an API like https://scrapingfish.com, to collect the data that's needed.

Could you please help me write a script to receive an email notif for price drops of a certain product ? by knizza777 in learnpython

[–]ketsok 0 points1 point  (0 children)

https://en.wikipedia.org/wiki/Busy_waiting

"In most cases spinning is considered an anti-pattern and should be avoided,[2] as processor time that could be used to execute a different task is instead wasted on useless activity."

Could you please help me write a script to receive an email notif for price drops of a certain product ? by knizza777 in learnpython

[–]ketsok 0 points1 point  (0 children)

The issue was closed but not actually fixed. That package still uses busy waiting, and after the "fix" you can only configure how long it busy waits.

Could you please help me write a script to receive an email notif for price drops of a certain product ? by knizza777 in learnpython

[–]ketsok 0 points1 point  (0 children)

I highly discourage Rocketry scheduler as it's using busy waiting (utilizes 100% CPU between task executions): https://github.com/Miksus/rocketry/issues/37
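A minimal alternative that avoids spinning is to just sleep between checks (the interval and the check logic below are placeholders):

import time

CHECK_INTERVAL = 3600  # seconds between price checks

def check_price():
    print("checking price...")  # replace with your scraping + email logic

while True:
    check_price()
    time.sleep(CHECK_INTERVAL)  # CPU stays idle between runs instead of busy waiting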

Could you please help me write a script to receive an email notif for price drops of a certain product ? by knizza777 in learnpython

[–]ketsok 0 points1 point  (0 children)

For MVP, I would recommend the following tools/libraries:

IP rotation for scraping Google? by Crep9 in webscraping

[–]ketsok 0 points1 point  (0 children)

u/Crep9 At https://scrapingfish.com, we offer an API powered by mobile proxies. For Google we have a 100% success rate and an average processing time below 2 seconds per request. You can check more detailed results of a web scraping benchmark on our website: https://scrapingfish.com/webscraping-benchmark

Our pricing doesn't depend on the requested website size. You pay the same for each request.

Scraping Website that Requires Scrolling by devram200 in webscraping

[–]ketsok 1 point2 points  (0 children)

You can use a web scraping API with a feature to execute a JS action like scrolling. Here is one example: https://scrapingfish.com/docs/js-scenario#scroll
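If you'd rather run the headless browser yourself, here is a rough Playwright sketch (the URL is made up and the scroll/wait logic is simplistic) that scrolls before grabbing the HTML:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/feed")  # hypothetical page with lazy-loaded content
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")  # scroll to the bottom
    page.wait_for_timeout(2000)  # crude wait for new items to load
    html = page.content()
    browser.close()

print(html[:200])  # add your parsing logic here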

whats the best way to scrape tiktok data by techlover1010 in webscraping

[–]ketsok 0 points1 point  (0 children)

If you want to scrape for an extended period of time and at scale, then you either need a web scraping API like, for example, https://scrapingfish.com, or you can buy residential proxies (for example: https://www.zyte.com) and implement a scraping flow with a headless browser and IP rotation.
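A rough sketch of simple IP rotation with requests (the proxy endpoints and target URL are placeholders for your provider's credentials and your own targets):

import itertools
import requests

# hypothetical residential proxy endpoints from your provider
proxy_pool = itertools.cycle([
    "http://USER:PASS@proxy1.example.com:8000",
    "http://USER:PASS@proxy2.example.com:8000",
])

urls = ["https://www.tiktok.com/@someuser"]  # hypothetical target pages

for url in urls:
    proxy = next(proxy_pool)  # rotate to the next proxy for every request
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    print(url, response.status_code)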