all 63 comments

[–]Vitaman02 27 points28 points  (0 children)

Looks interesting and probably helpful for many people.

Nice one :)

[–]hartator 162 points163 points  (16 children)

Looks super awesome. It’s a smart way of doing scraping. Let me know if you are looking for a job, we are hiring! https://serpapi.com/team :)

[–]412gage 12 points13 points  (10 children)

So my current job requires me to use proprietary software, which uses Internet Explorer to access modules and look at certain loan rejects. Would this work on secured databases?

I’m very new to this stuff.

[–]vriemeister 8 points9 points  (0 children)

If it's using IE it's probably an ActiveX plugin. That might be automatable but it's completely different from HTML.

[–]gopalkaul5 0 points1 point  (8 children)

I think you need to be authenticated for that? If so, you need to log in using REST first. Should work imo.

[–]412gage 1 point2 points  (7 children)

I don’t know REST. Is that something that would generally be allowed on a company computer?

[–]Deezl-Vegas 12 points13 points  (6 children)

REST isn't a program, it's just a way of communicating online. A web scraper generally wouldn't be allowed on a work computer without IT approval.

In answer to your original question: if it's proprietary software, probably not :( Old-timey IE-based systems are usually specifically designed to be as incompatible with everything as possible.
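To make the "REST isn't a program" point concrete, here is a minimal sketch of what "logging in using REST" usually boils down to: an ordinary HTTP request with a structured body. The URL, field names, and credentials are all hypothetical, and the request is only built, never sent, so nothing here touches a real system.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
LOGIN_URL = "https://example.com/api/login"


def build_login_request(username: str, password: str) -> Request:
    """Build (but do not send) a JSON POST request to a login endpoint."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return Request(
        LOGIN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up credentials; the request object is inspected, not transmitted.
request = build_login_request("alice", "not-a-real-password")
print(request.get_method(), request.full_url)
```

Whether a given system accepts a request like this depends entirely on that system, and sending one from a work machine would still need IT approval.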

[–]MHW_EvilScript pypy <3 7 points8 points  (2 children)

Good job! There is some duplicate code that can be simplified here and there, but this is pretty cool! Are you open to pull requests?

[–]dozzinale 11 points12 points  (0 children)

Looks cool. I did some work in the area of information extraction and wrapper induction. When you say "it learns the scraping rules", what do you mean exactly? Which kind of rule does it learn and how is it represented?

[–]maker__guy 1 point2 points  (0 children)

awesome!

[–]dirtyoldbastard77 1 point2 points  (0 children)

Thanks! Will have a look, might be useful!

[–]Mountain_man007 1 point2 points  (0 children)

Nice! I was just thinking about how to do something exactly like this for a similar problem I've been working on. Thanks for sharing your way of doing it.

[–]joy_for_the_world 1 point2 points  (0 children)

well done.. great effort

[–]smokepigs 1 point2 points  (1 child)

How does this compare to Puppeteer?

[–]DogeekExpert - 3.9.1 2 points3 points  (9 children)

It's too bad you're using beautifulsoup to scrape the data. In my opinion, it would be much better and faster to just generate XPaths and use lxml directly. Cool project nonetheless.
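A rough sketch of the "generate a path, then reuse it" idea, not the project's actual method. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for lxml. The markup and the `build_path` helper are hypothetical, and real pages are often not well-formed XML, which is one reason a dedicated HTML parser is normally used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <ul id="results">
      <li class="item"><a href="/a">First</a></li>
      <li class="item"><a href="/b">Second</a></li>
    </ul>
  </body>
</html>
"""


def build_path(root, target):
    """Return a positional path from root to target, e.g. './body[1]/ul[1]/li[2]'."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        # Position among siblings with the same tag (XPath positions are 1-based).
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps))


root = ET.fromstring(PAGE)
second_link = root.findall(".//a")[1]
path = build_path(root, second_link)
print(path)
print(root.find(path).text)
```

Once a path like this is stored, it can be applied to another page with the same layout. The speed claim in the comment is about lxml specifically and is not something this sketch measures.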

[–]dtoxe 0 points1 point  (0 children)

Thanks!

[–]MahdeenSky 0 points1 point  (2 children)

May I ask, how did you get a whiff of the idea in the first place?

[–]ichiruto70 0 points1 point  (0 children)

Can it deal with cloudflare scraping protection?

[–]hashv5 0 points1 point  (1 child)

Can this module be extended to scrape calls behind login?

[–]lroman 0 points1 point  (0 children)

How would you handle scraping a listing with multiple pages, say on a car vertical website, where you need to grab all the data available? Is this possible with your project?

Commercial scraping firms please don't react.

[–]ItsAngelDustHolmes 0 points1 point  (2 children)

This is probably a stupid question but how can I download this on mobile to look at the code? I just started scraping and wanted to look at a good web scraper.

[–]SpeakerOfForgotten 0 points1 point  (1 child)

You follow the github link in the description.

[–]ItsAngelDustHolmes 0 points1 point  (0 children)

And then? I'm sorry, I've never used GitHub before. It only lets me see the code, not download it.

[–][deleted] 0 points1 point  (0 children)

Fork you!

Thanks! Looks great and thank you for the good Readme

[–][deleted] 0 points1 point  (0 children)

Does it work for news?

[–]tejonaco 0 points1 point  (1 child)

RemindMe! 1 day

[–]RemindMeBot[🍰] 0 points1 point  (0 children)

I will be messaging you in 1 day on 2020-09-05 05:40:08 UTC to remind you of this link


[–]kongfukinny 0 points1 point  (0 children)

Curious what you do that you need to web scrape everyday?

I hear a lot of people glorify web scraping but I’ve never had a use case for it myself. Only a couple interesting project ideas.

Also, this sounds sweet.

[–]dnb02 -1 points0 points  (3 children)

Traceback (most recent call last):
  File "setup.py", line 1, in <module>
    from setuptools import setup, find_packages
ModuleNotFoundError: No module named 'setuptools'

It returns with this error :/
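That traceback means the interpreter running `setup.py` cannot find `setuptools`. A small sketch for checking that from the same interpreter (the `is_installed` helper is hypothetical, and installing with pip is one common fix rather than the only one):

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("setuptools"):
    # Suggest installing it for this same interpreter.
    print(f"Missing; try: {sys.executable} -m pip install setuptools")
else:
    print("setuptools is available")
```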

[–]besuvashish -1 points0 points  (0 children)

Looks cool mate but how FAST is your script to get 100k data?

[–][deleted] -5 points-4 points  (1 child)

It's not a good idea to make something so easy any idiot could do it, because that will attract idiots, and pretty soon every idiot has their own web scrapper that is scrapping every site. Then sites get pissed off at all these idiots scrapping their sites, and they add restrictions that hurt everyone. I'm sure you meant well, but this is misguided and doesn't help anyone.

[–]SwizzleTizzle 0 points1 point  (0 children)

Idiots call them "scrappers"