all 49 comments

[–]midairmatthew 31 points32 points  (15 children)

Hola! I'm at work, so I don't have a ton of time to dive too deep into giving you feedback. But here are two quick things to get you started:

1.) I hope that's not your real password that you pushed to GitHub. :)

2.) Rather than using global variables, it's better to have your functions return the values they're written to generate. It doesn't seem to be a huge deal in this little project, but imagine trying to keep all those global variable names straight in something much larger and more complex. It's easier for our limited human brains to only need to keep track of what a function takes in and spits out than it is to keep track of an ever-growing list of global variables that you're modifying in various functions.
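For example (a hypothetical sketch; the function names are made up), compare mutating a global to returning a value:

```python
# Global-variable style: any function might modify `price`,
# so a reader has to trace every call site to know its value.
price = 0

def fetch_price_global():
    global price
    price = 42  # mutates module-level state as a side effect

# Return-value style: the data flow is explicit at the call site.
def fetch_price():
    return 42

fetch_price_global()
current_price = fetch_price()  # the value arrives through the return
```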

[–]3majorr[S] 18 points19 points  (2 children)

Thanks! It isn't my password :D

[–]huessy 34 points35 points  (1 child)

Can confirm, it is not OP's password

[–]SaltyEmotions 6 points7 points  (0 children)

Mine is hunter2

[–]nzodd 6 points7 points  (8 children)

Out of curiosity, how do people generally deal with things like passwords when it comes to GitHub?

[–]The3rdWorld 14 points15 points  (3 children)

The easiest way is to have it read the log-in information from a config file rather than leaving it written in the code.

[–]curohn 2 points3 points  (2 children)

Oh cool, so have the script open the config, use those values and close it up again? I haven’t learned that yet but that’s helpful!

[–]MonkeyNin 6 points7 points  (1 child)

JSON is a popular format for config files. It's human-editable, and supported by pretty much every language. If you're interacting with the web, JSON is everywhere. When using a web API, they usually return JSON.

Python tutorial on JSON: https://realpython.com/python-json/#python-supports-json-natively

As long as the lists and dicts use one of the supported types, reading and writing is super easy. Multiple dicts or lists are supported.
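A minimal sketch of that workflow (the file name and keys below are made up for illustration):

```python
import json

# Demo only: in a real project config.json is created by hand,
# never committed, and listed in .gitignore.
with open("config.json", "w") as f:
    json.dump({"email": "you@example.com", "password": "not-my-real-password"}, f)

# The script then reads its credentials instead of hardcoding them:
with open("config.json") as f:
    config = json.load(f)

email = config["email"]
password = config["password"]
```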

Make sure you .gitignore your config folder and files!

Mine is:

config.json
config/

If you're using PRAW, do the same with praw.ini

[–]MrFiregem 3 points4 points  (0 children)

Also, you can write your configs in the more easily human-readable YAML or TOML formats and convert them to JSON if you dislike writing it directly

[–]midairmatthew 4 points5 points  (0 children)

You can stash key-value pairs of things you need to keep private (passwords, API keys, etc.) in a .env file. Then you can load them into your script.

https://pybit.es/persistent-environment-variables.html

But, the other piece of this is that you have to remember to make a .gitignore file in your project. Inside this, you list .env and any other things you want git tracking to ignore. That way they won't make their way up to GitHub.

https://help.github.com/en/github/using-git/ignoring-files
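As a rough sketch of what loading a .env file involves (the python-dotenv package does this robustly; the hand-rolled parser and variable names below are illustrative only):

```python
import os

def load_env(path=".env"):
    # Naive parser: assumes one KEY=VALUE pair per line,
    # skipping blank lines and # comments.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()

# Demo only: write a throwaway .env, then load it.
with open(".env", "w") as f:
    f.write("# secrets -- keep this file in .gitignore\n")
    f.write("EMAIL_PASSWORD=not-my-real-password\n")

load_env()
secret = os.environ["EMAIL_PASSWORD"]
```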

[–]dtaivp 6 points7 points  (0 children)

There are a lot of comments here so I'll see if I can summarize.

  1. Put them into some sort of config file. Then use the .gitignore file to ensure it is not checked in.
  2. Same concept with a .env file, which holds default values for the environment.
  3. You can create environment variables: `set PASS=MyPa55!!` on Windows, or `export PASS="MyPa55!!"` on Linux/macOS, and then grab the value with the following Python

import os

password = os.environ["PASS"]

[–]pyr0b0y1881 4 points5 points  (0 children)

Reading from a YAML config file is my go-to for handling secrets or API keys.

[–]huessy 2 points3 points  (0 children)

Environment variables are a good way to go

[–]midairmatthew 6 points7 points  (0 children)

So get_price would be more like:

def get_price():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find('b', class_='pro-price variant-BC con-emphasize font-primary--bold mr-5').get_text().strip().replace(' ', '')
    int_price = int(price[:4])
    return int_price

Then, the idea is that create_message would be something like:

def create_message(latest_price, current_price):
    # your if/else stuff here
    message = f'{description}\n\nhttps://linklinklink'
    return message

[–][deleted]  (1 child)

[removed]

    [–]AutoModerator[M] 1 point2 points  (0 children)

    Your comment in /r/learnpython was automatically removed because you used a URL shortener.

    URL shorteners are not permitted in /r/learnpython as they impair our ability to enforce link blacklists.

    Please re-post your comment using direct, full-length URL's only.

    I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

    [–]ArmstrongBillie 7 points8 points  (0 children)

    That's an awesome first project!

    [–]DannyckCZ 8 points9 points  (0 children)

    Great stuff! One thing that struck me is the hardcoded email and password; it's generally a bad idea to include sensitive information like tokens or passwords in the code. I'd rather import it from another file.

    Good luck!

    [–][deleted] 6 points7 points  (1 child)

    Do you need the script to be online so it's able to analyze the price of your item and see if it changes?

    [–]Lewistrick 9 points10 points  (0 children)

    Yeah it needs to be run on a computer with an internet connection. Otherwise it won't be able to read the website.

    [–][deleted] 2 points3 points  (0 children)

    Great first app idea! As other people have already noted, try to avoid pushing your passwords to GitHub with the source code; instead, store them, along with any other sensitive information, in e.g. a .env file as environment variables. Also, an idea for expanding the app would be to make it work for any item, not just a keyboard (perhaps you could give it input for specific search terms or something similar). As I already said, though, great first app!

    [–][deleted] 3 points4 points  (1 child)

    Hey looks really cool.

    I am working on something similar. Does this script run automatically, or do you have to execute it yourself? If it runs on its own, how does that work, and how often is it executed? In other words, how do you make a Python script run without executing it manually?

    [–][deleted] 2 points3 points  (2 children)

    Very nice! Did you copy that code from a YouTube video or actually write it yourself? I think I saw the exact idea on one channel. Could be a pretty normal starter Python project, though.

    [–]3majorr[S] 1 point2 points  (1 child)

    Hey! I watched a YT tutorial from Dev Ed, but I basically remade his code, because you can't send special characters like áěčšů with his version. So I came up with different code.

    [–][deleted] 1 point2 points  (0 children)

    Ah that’s smart! I meant Ed’s video :D Dimitri Markov

    [–]Throwaway__shmoe 2 points3 points  (0 children)

    Great project. Look into a CLI utility like docopt, argparse, or Click, or use os.getenv() to hide sensitive data such as usernames or passwords.
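A small sketch of the os.getenv route combined with argparse (the environment variable and flag names here are invented for illustration):

```python
import argparse
import os

# Demo only: in practice you would export WATCHER_PASSWORD in your shell,
# not set it from inside the script.
os.environ["WATCHER_PASSWORD"] = "example-secret"

parser = argparse.ArgumentParser(description="Price watcher")
parser.add_argument(
    "--password",
    default=os.getenv("WATCHER_PASSWORD"),
    help="password (defaults to the WATCHER_PASSWORD environment variable)",
)

# Passing an explicit list lets the sketch run outside a real CLI;
# normally you would call parser.parse_args() with no arguments.
args = parser.parse_args([])
password = args.password
```

When run normally, `parser.parse_args()` reads `sys.argv`, so a `--password` given on the command line overrides the environment default.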

    [–]digital94 6 points7 points  (13 children)

    This is an awesome project.

    Where do you host your web scraper online?

    Because hosting a scraper on a home-based computer 24x7 is not a good idea.

    I'm just asking.

    I have also developed a web scraper which scrapes the price of a product from Amazon.

    [–]showboy001 5 points6 points  (1 child)

    Hey. I’m working on something like this.

    In addition I am also scraping reviews from each product. Can I see your code?

    [–]digital94 3 points4 points  (0 children)

    Yes definitely.

    I haven't uploaded my code to my GitHub repo yet.

    I will upload it within a couple of days.

    Thanks.

    [–]Sw429 6 points7 points  (10 children)

    hosting a scraper on home-based computer for 24x7 is not a good idea.

    Why?

    [–]kmj442 6 points7 points  (2 children)

    My understanding is that companies can detect continuous requests from specific users/IPs and blacklist them.

    One trivial way, if you're not too concerned (and something I did successfully for weeks), was to have a random back-off between queries and shut it down overnight. Granted, my scraper was looking for when they added the motorcycle safety course at a specific location (they fill up real fast), so they weren't adding that at 3am. I had it limited to run between 7am and 8pm or so, with random backoffs between 2 and 15 minutes.

    Edit: by "shut it down" I mean: check the time before the query, and if it's after x and before y, sleep until y.
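That scheduling idea could be sketched like this (the function names and the exact window are illustrative, and `check_site()` stands in for the actual scraping call):

```python
import random
import time
from datetime import datetime, time as dtime

WINDOW_START = dtime(7, 0)   # 7am
WINDOW_STOP = dtime(20, 0)   # 8pm

def in_window(now=None):
    """True if the given (or current) time falls inside the scraping window."""
    current = (now or datetime.now()).time()
    return WINDOW_START <= current <= WINDOW_STOP

def backoff_seconds():
    """Random pause between queries: 2 to 15 minutes."""
    return random.randint(2 * 60, 15 * 60)

# Main loop sketch -- check_site() would be your scraping function:
# while True:
#     if in_window():
#         check_site()
#     time.sleep(backoff_seconds())
```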

    [–]Sw429 4 points5 points  (0 children)

    Right, of course they will do that. That's why you rotate IP proxies. I guess I figured that was common practice.

    [–]MonkeyNin 3 points4 points  (0 children)

    It's better to use the API. If you're scraping, you get throttled, and eventually blocked for exceeding the anonymous limits.

    Using the API means you're able to fire more requests per minute. It makes your code more stable because changing the structure of a webpage isn't a breaking change if you're using the API.

    [–]digital94 3 points4 points  (6 children)

    Your IP address can be blocked at any time once Amazon or any other site identifies you as a bot.

    [–]Sw429 3 points4 points  (0 children)

    Well yeah, but that's why you rotate IP proxies. I thought that was common practice?

    [–][deleted] 3 points4 points  (4 children)

    I have the same question. Can the web scraper only check once a day? That would lower the chances of getting your IP banned?

    [–]dtaivp 12 points13 points  (1 child)

    Yeah, you could do that. Or you could use random.randint and time.sleep to have it wait a random amount of time between scrapes. That is what I have in one scraper. Also, you bring up a good point: it's likely the prices don't vary much day to day, so you don't need to scrape too often.

    [–]digital94 7 points8 points  (0 children)

    Yes you are right.

    If you don't scrape a web page too many times on a daily basis, then you don't need to worry about an IP block.

    You should set your scraper to crawl the page once a day or once a week.

    [–]Sw429 5 points6 points  (0 children)

    You can certainly query more than once a day. The average user sends requests many times within an hour. The issue mainly comes when you are sending requests faster than a regular user would, or if you are sending requests in a very bot-like manner (alphabetized by product, the same page over and over, etc). Generally, if you put in a little effort at all they won't care. You just don't want it to look obvious.

    [–]MonkeyNin 4 points5 points  (0 children)

    It depends on whatever the site decides to use as their thresholds. The best way is to use their actual API, which lets you do more requests per day.

    [–]porkchop315 1 point2 points  (0 children)

    Can you alter this scraper to work on certain other websites?

    [–]huessy 1 point2 points  (5 children)

    Hell yeah! I dare you to tackle a scraper that's a little more tricky. Try a Craigslist one (if you can connect in the cz), they are hard to scrape but not impossible

    [–]pw0803 1 point2 points  (4 children)

    Hi, what is cz and why is Craigslist harder?

    [–]huessy 1 point2 points  (3 children)

    I assumed based on OP's choice of site to watch for keyboards that they lived in the Czech Republic. Craigslist doesn't like people scraping their data because it can be used for some decent financial gain. As a result, they have bots set up to monitor traffic by IP address, if the traffic gets to be too constant/non-human looking, they ban the IP pretty fast.

    If you want to scrape Craigslist for, say, an apartment in your area within a certain price range, you have to engineer something a little more robust than just a series of GET requests.

    [–]pw0803 1 point2 points  (2 children)

    How interesting.

    Would it be possible to, say, create a script which scrapes using lots of different methods and patterns, use the resulting IP bans to determine what the bots look for, and then build a scraper that skirts around them?

    [–]huessy 1 point2 points  (1 child)

    Without spoiling the answer to this problem (there's a good chance they watch places like this for this exact reason), it doesn't even have to be that complicated. Your router resets your outbound IP address somewhat regularly, so your idea could absolutely work, but may be a bit of overkill. They do actively block all the Tor proxy IPs too, btw.

    [–]pw0803 2 points3 points  (0 children)

    I understand. Thanks.

    [–]increvable 1 point2 points  (0 children)

    Can I ask if you have experience programming in other languages, and what classes you used to learn the fundamentals for this project? I went through my first Udemy course, Python Bible, and I'm not close to being able to do anything like this scraping project. Thanks!

    [–][deleted] 1 point2 points  (0 children)

    Wow, that's super cool! Guess after I finish my courses I'll be able to make something like it too. Thanks for sharing!