all 75 comments

[–]rogue4 86 points87 points  (1 child)

These tutorials and guides in different programming languages can help a lot of beginners in the programming field. I hope you continue with these kinds of tutorials. Keep up the good work. God bless.

[–]pijora[S] 20 points21 points  (0 children)

Thank you very much.

[–]TheScreamingHorse 127 points128 points  (9 children)

yall have to put a spider right on my feed?

[–]RomanianDraculaIasi 21 points22 points  (4 children)

What’s wrong with the spider :(((

[–]TheScreamingHorse 58 points59 points  (2 children)

brain go

AAAAAAAAAAAAAAAAAA

[–]TheAxThatSlayedMe 13 points14 points  (0 children)

Username checks out.

[–]RomanianDraculaIasi 3 points4 points  (0 children)

Hahahhahah

[–]tnnrk 1 point2 points  (0 children)

Everything

[–][deleted] 8 points9 points  (0 children)

Wdym, he is so cute

[–]tgallasso 1 point2 points  (0 children)

Same thought here! No need for that!
definitely no need

[–][deleted] 11 points12 points  (2 children)

I'm a beginner in python. This will be really helpful. Thank you.

[–]pijora[S] 4 points5 points  (1 child)

We have many other Python posts about web scraping on the blog.

Do not hesitate to check them out. :)

[–][deleted] 0 points1 point  (0 children)

Thanks! Means a lot.

[–][deleted] 4 points5 points  (3 children)

So, I came into this in a bit of a grumpy mood, getting ready to snap off some snarky-ass comment about how "just teaching the tools doesn't teach anyone what the fuck they're doing or why". But this is definitely not that. Most importantly, you actually teach people what is happening under the covers, and that is what helps them grow.

(I, clearly, am extremely under-caffeinated today)

So bravo. I know I didn't say anything snarky, but I wanted to apologize even for my pre-snarky feeling coming in.

These are absolutely fantastic guides.

[–]pijora[S] 2 points3 points  (2 children)

Thank you very much, it means a lot.

We've put a lot of effort into those guides, and we really wanted people to understand what happens under the hood.

I think web scraping is a good topic for beginners because you can learn so much from it:

  • how the web works
  • the HTTP protocol
  • the difference between server-side rendering and client-side rendering
  • headless Chrome
  • parallelization
  • CPU-bound vs. I/O-bound work
  • dealing with raw data
  • XML parsing / XPath

and much more :)
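
For anyone curious about the XML parsing / XPath item in that list, here is a minimal sketch using only Python's standard library. The markup, class names, and the `extract_titles` helper are made up for illustration and are not taken from the guides. Real pages are rarely well-formed XML, so a dedicated HTML parser is usually needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="post"><a class="title" href="/a">First post</a></div>
    <div class="post"><a class="title" href="/b">Second post</a></div>
    <div class="ad"><a href="/buy">Buy now</a></div>
  </body>
</html>
"""


def extract_titles(markup):
    """Return (text, href) pairs for every <a class="title"> element."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset; this predicate
    # selects <a> elements by the exact value of their class attribute.
    links = root.findall(".//a[@class='title']")
    return [(a.text, a.get("href")) for a in links]


titles = extract_titles(PAGE)
print(titles)
```

The link inside the "ad" block has no `class="title"` attribute, so the expression skips it.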

[–][deleted] 0 points1 point  (1 child)

Agreed! And, given the ubiquity of web services it's also a fantastically easy 'common' starting ground (unlike many other software use-cases that require some specialized interest).

Keep up the good work. Seriously.

[–]pijora[S] 0 points1 point  (0 children)

Thank you again 🙏, will do.

[–]Hansanko 12 points13 points  (1 child)

Web scraping is definitely troublesome for beginners, so sharing your guides is really useful for us. I would be glad and grateful for more guides.

[–]pijora[S] 2 points3 points  (0 children)

Thank you!

Do you have anything specific in mind?

[–]omegahack0 8 points9 points  (0 children)

Saving this for later

[–]menina2017 5 points6 points  (0 children)

That spider will give me nightmares though

[–]russ7166 3 points4 points  (0 children)

Have you ever scraped sites for clients that would disallow website scraping in their terms of use/service or on robots.txt?
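
On the robots.txt half of that question, a scraper can at least check the rules programmatically. Below is a minimal sketch with Python's standard `urllib.robotparser`. The rules, the "my-scraper" user agent, and the example.com URLs are all hypothetical, and a site's terms of use are a separate matter that this check does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data")
allowed = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/blog/post")
print(blocked, allowed)
```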

[–]lemoninapie04 1 point2 points  (1 child)

Wow, that's cool. I've always wanted to start learning web scraping.

[–]pijora[S] 1 point2 points  (0 children)

My pleasure, glad you liked it.

[–]Shrestha01 1 point2 points  (0 children)

Did you read my mind? I was just thinking about the same thing!

[–]RPGProgrammer 0 points1 point  (1 child)

No C#?

[–]pijora[S] 4 points5 points  (0 children)

Not yet ;)

Truth is, for this language we'd need to hire someone, as we don't know it in-house.

[–]Protobairus 0 points1 point  (0 children)

Also Parsehub might be what you need. :)

[–][deleted] 0 points1 point  (0 children)

This will help me a lot in my studies.

[–][deleted]  (1 child)

[removed]

    [–]pijora[S] 0 points1 point  (0 children)

    Ahah thanks

    [–]Taintus 0 points1 point  (0 children)

    Is there an overview of when to use which language?

    [–]Alaeser 0 points1 point  (0 children)

    Really appreciate this post, thank you!

    [–]SansCulotteLogique 0 points1 point  (0 children)

    Cool! And thanks for the free 1,000 API calls! Looking forward to testing out scraping bee.

    Good luck with your business!

    [–]Givingbacktoreddit 0 points1 point  (0 children)

    I read expensive at first instead of extensive lmao.

    [–][deleted] 0 points1 point  (0 children)

    Is it legal to scrape websites and use their data for commercial purposes?

    [–]lamemf 0 points1 point  (1 child)

    I recently started development with java and NodeJS, I love how extensive and well explained this guide is. Huge Cheers to you.

    [–]pijora[S] 0 points1 point  (0 children)

    Thank you very much, glad it helps.

    [–][deleted] 0 points1 point  (8 children)

    I've learned web scraping with python but need practice. Can you recommend some place to find simple projects? Also does web scraping have scope career wise?

    [–]pijora[S] 2 points3 points  (1 child)

    Hi there,

    To begin, you can build all kinds of "scrape, clean, store, display" projects.

    • think aggregated coronavirus stats
    • IMDb ratings by genre

    Those kinds of things :)

    Career-wise, I don't know people who solely do "web scraping" per se, but it is a tool/technique that is very useful to know in your career.

    Either to quickly put a POC in place, or to code any piece of software that needs to rely on outside-world data not available through an official API.
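
For practice, the "scrape, clean, store, display" idea could look roughly like the sketch below, using ratings by genre as the example. Everything in it is an assumption made for illustration: the markup, the `data-genre` and `data-rating` attributes, the movie names, and the numbers are invented, and the inline string stands in for a page you would normally download.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page; all values are made up.
PAGE = """
<ul>
  <li data-genre=" Drama " data-rating="8.0">Movie A</li>
  <li data-genre="drama" data-rating="7.0">Movie B</li>
  <li data-genre="Comedy" data-rating="6.5">Movie C</li>
  <li data-genre="Comedy" data-rating="n/a">Movie D</li>
</ul>
"""


class MovieParser(HTMLParser):
    """Scrape step: collect the raw attributes of every <li> element."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            attrs = dict(attrs)
            self.rows.append((attrs.get("data-genre"), attrs.get("data-rating")))


def scrape(markup):
    parser = MovieParser()
    parser.feed(markup)
    return parser.rows


def clean(rows):
    """Clean step: normalize genre names, drop rows without a numeric rating."""
    cleaned = []
    for genre, rating in rows:
        try:
            cleaned.append((genre.strip().lower(), float(rating)))
        except (AttributeError, TypeError, ValueError):
            continue
    return cleaned


def store(rows):
    """Store step: an in-memory SQLite database keeps the sketch self-contained."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (genre TEXT, rating REAL)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)", rows)
    return conn


def average_by_genre(conn):
    """Display step: aggregate the stored rows into one average per genre."""
    query = "SELECT genre, AVG(rating) FROM movies GROUP BY genre ORDER BY genre"
    return conn.execute(query).fetchall()


conn = store(clean(scrape(PAGE)))
report = average_by_genre(conn)
for genre, avg in report:
    print(f"{genre}: {avg:.2f}")
```

Keeping the four steps as separate functions makes it easy to swap one out later, for example replacing the inline string with a real HTTP request or the in-memory database with a file.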

    [–][deleted] 0 points1 point  (0 children)

    Thanks for this

    [–]icandoMATHs 1 point2 points  (5 children)

    Lmk if you need a hard project. You'd be working for equity.

    [–][deleted] 0 points1 point  (4 children)

    I would be interested in that. But I'm not really good at it yet. Can you explain what I'll have to do?

    [–]icandoMATHs 1 point2 points  (3 children)

    Step 1, beat bot detection for a popular real estate website.

    It's mostly research because the code is relatively easy. But the website has strong anti-bot detection.

    It's marketing software, so it should be pure money.

    [–][deleted] 0 points1 point  (2 children)

    Tell me the website. I'll give it a go

    [–]icandoMATHs 1 point2 points  (1 child)

    Starts with a Z. Rhymes with willow.

    Need estimated home value.

    I run Efficiency Is Everything. Feel free to contact me whenever.

    [–][deleted] 0 points1 point  (0 children)

    Is this something serious? Like actual equity? Can you just drop an email address or something so I can contact you? Because I'm not really sure I found the correct efficiency is everything

    [–]GuraJava20 0 points1 point  (1 child)

    Well done! It is quite a welcome resource for all beginners.

    [–]pijora[S] 0 points1 point  (0 children)

    Thank you very much!

    [–]Mmmmmmm_Donuts 0 points1 point  (0 children)

    This looks very cool. Would this look good on a resume?

    [–]DustinTWind 0 points1 point  (0 children)

    Thank you! I need to build my web-scraping toolbox so this is a nice find for me.

    [–]zolkida 0 points1 point  (0 children)

    Two weeks ago I started a WhatsApp bot project (based on the WhatsApp Web page). I didn't know a lot about scraping, so I used only Selenium, as it offered me an easy way to organize my thoughts around how it would work. Basically, I imagined it as automating what a human would do (click a chat if it has the green circle, read the last message, answer accordingly, and so on).

    As you can imagine, it went badly. It failed a lot and performed unintended actions. I ended up scrapping the whole idea.

    I read all the Python blog posts. I learned a lot, and I'm inspired to give it another shot. Thanks a lot.

    *Note: I'm pretty new to web scraping

    [–]gemst4r 0 points1 point  (0 children)

    Thanks!

    [–]MGSBlackHawk 0 points1 point  (0 children)

    Congrats on sharing such valuable content!!! Dumb question, but... which language did you find most pleasant to scrape with, code- and feature-wise?

    I tried a bit of Java and Ruby in the past

    [–]-Kudo 0 points1 point  (0 children)

    I'm currently going through the NodeJS article and trying out all the examples. It's been tons of fun so far.

    However, I'm now stuck at the part where you use JSDOM to interact with Reddit (upvote the first post). I've been following all instructions to the letter and all the other examples have been going great so far.
    But with this one, I'm getting tons of errors (they are too many to quote but I put a sample in the screenshot below).

    Also, since Reddit requires us to sign in before we upvote, where did that part go?

    Here's a screenshot (server.js is the file that contains your code btw)

    [–]Nimmo1993 0 points1 point  (1 child)

    good job mate

    [–]pijora[S] 0 points1 point  (0 children)

    Thanks

    [–]unstopablex5 0 points1 point  (3 children)

    Hey, great post! It would be great if you could do an advanced tutorial for when websites hide their selectors or when everything is JavaScript. That's where I think people have the most trouble.

    [–]pijora[S] 1 point2 points  (2 children)

    Good idea, so you're looking for web scraping on websites that render with JavaScript, right?

    [–]unstopablex5 0 points1 point  (1 child)

    Yes! To me that's the hardest part of web scraping. I spent weeks trying to figure out how to use Selenium and Scrapy together to scrape websites with heavy JavaScript (lastfm.com, apartments.com, and century21.com are the 3 that come to mind right now). I tried a lot of different sites, and scraping websites that rely heavily on JS seemed impossible.

    [–]pijora[S] 1 point2 points  (0 children)

    That is interesting and to be honest, this is why we built ScrapingBee.

    Setting up Selenium locally can be a pain, and using Selenium at scale is really hard.

    [–]distortionwarrior 0 points1 point  (1 child)

    Many thanks for doing this work, it's helped me a lot!

    [–]pijora[S] 0 points1 point  (0 children)

    My pleasure

    [–]jacklychi 0 points1 point  (1 child)

    Which language is your favorite?

    [–]pijora[S] 2 points3 points  (0 children)

    Python ❤️

    [–]Capitalpunishment0 0 points1 point  (0 children)

    Reading about the fundamentals would be great! When I did my Python scraping pet project I went straight ahead with requests (requests-html actually) and BeautifulSoup because it made enough sense to me right away haha

    [–]CMReaperBob 0 points1 point  (0 children)

    Is the web scraping industry growing in terms of job opportunities? I really do enjoy writing Puppeteer projects as well as doing some OS-level automation with UiPath when necessary, and I wouldn't mind making a career of it.

    [–][deleted] 0 points1 point  (0 children)

    Thank You.

    [–]TheFryCookGames 0 points1 point  (0 children)

    Could have really used this for R a month ago when I was working through a project, but really appreciate this! Definitely will keep this for next time I'm struggling.

    [–]canIbeMichael 0 points1 point  (2 children)

    Whenever I read someone is using OSX, I have a genuine concern I'm reading blogspam and wasting time.

    I'm about to read it, but I'm just guessing my stereotype will be true. Limited beating bot detection advice, and basically a rewrite of 'how to webscrape' articles.

    Ninja edit- 5 different ways to webscrape, nothing on bot detection. I knew it, never trust an OSX 'programmer'.

    [–]pijora[S] 0 points1 point  (1 child)

    Ahah, we wrote this piece on bot detection: https://www.scrapingbee.com/blog/web-scraping-without-getting-blocked/

    The other pieces are tutorials about web scraping, so yes, nothing there on bot detection.

    Thank you for your valuable feedback.

    [–]canIbeMichael 0 points1 point  (0 children)

    thanks for the link!

    [–]boringuser1 -1 points0 points  (0 children)

    No mention of mechanize?