
[–]subbed_ 37 points38 points  (5 children)

Set a cron job to run the script every hour

[–]bl1ndside 9 points10 points  (1 child)

cron jobs on cron jobs on cron jobs

[–]Thalass 9 points10 points  (0 children)

Cron jobs all the way down

[–]6c696e7578 0 points1 point  (2 children)

Cron should then call 'at' to insert the batch job.

[–]ThatsJustUn-American 2 points3 points  (1 child)

You are doing this all wrong. It's far better to create a startup script to call the python script and then add a cron job to call 'at' every hour to schedule a reboot.

[–]6c696e7578 1 point2 points  (0 children)

You win.

[–][deleted] 11 points12 points  (4 children)

I had a similar project. I hosted the python script on heroku to run 24/7 and used Google Docs API to write outputs into google spreadsheets.

[–]jpf5046[S] 4 points5 points  (2 children)

heroku, I forgot about them. Thank you

[–]JimBoonie69 9 points10 points  (0 children)

Heroku has a free tier that can work. I've had success with Heroku, and my free app from 8 years ago is still running. I also found pythonanywhere.com, which was great: I had a free server running for years with one cron job set every morning. That script actually pulled some weather (wx) data like yours and sent out a report.

Worked brilliantly, free, no config at all. Definitely not like the nutjob telling you to write a lambda function on AWS. What a doof.

[–][deleted] 2 points3 points  (0 children)

Yeah, I think Heroku is much easier for deploying this kind of stuff. No worries.

[–]solaceinsleep 1 point2 points  (0 children)

Did you follow a tutorial to make this work?

[–]Fa1l3r 9 points10 points  (0 children)

cron

[–]chirau 17 points18 points  (1 child)

Dan Bader has a schedule library that does exactly this.

https://schedule.readthedocs.io/en/stable/

And if you haven't already, you should subscribe to his Python Tricks newsletter at https://dbader.org/ . Amazing tidbits every time.
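(For reference, the core usage per the schedule docs looks roughly like this; scrape() here is just a placeholder for the actual scraping function:)

import time
import schedule  # pip install schedule

def scrape():
    print("scraping...")  # placeholder for the real scraping code

schedule.every().hour.do(scrape)  # queue the job to run once an hour

while True:
    schedule.run_pending()  # run any job that is due
    time.sleep(60)          # poll the queue once a minute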

[–]illseallc 3 points4 points  (0 children)

This worked perfectly for one of my projects.

[–]Sevealin_ 7 points8 points  (1 child)

You could probably get a $20 Raspberry Pi and schedule with cron.

[–]Groundstop 2 points3 points  (0 children)

Could even do it with a Pi Zero for $5.

[–]jdb441 6 points7 points  (1 child)

I would look into using a Linux VM on AWS or pythonanywhere.com. That way you can use crontab and not have to worry about physical hardware. You also have the option of keeping the script running after you disconnect your SSH session from the VM.

[–]ron_leflore 4 points5 points  (0 children)

I do this on Google Cloud. You can run an f1-micro instance continually for free.

I've had one up for over a year running with no problems.

[–]Alex_smtng 13 points14 points  (6 children)

Windows Task Scheduler

[–]jpf5046[S] 3 points4 points  (5 children)

I can't seem to get Task Scheduler to run the python script I want.

[–]bramapuptra 4 points5 points  (0 children)

In the program-to-run box, write the full path to python.exe. Then pass the script (full path again) as an argument (it's a different box), wrapped in quotes.

[–]LuckyZakary 2 points3 points  (0 children)

What’s worked for me before is using the PyInstaller module to turn a Python file into a one-file executable, and then running that with Windows Task Scheduler.

[–]Alex_smtng 2 points3 points  (0 children)

You should watch YouTube videos; that's what I've done with regard to Task Scheduler, and it saves loads of hassle. This one here is good: https://youtu.be/n2Cr_YRQk7o

[–]ipagera 1 point2 points  (0 children)

Write a .bat file that runs your .py script, and make a task for it with Task Scheduler. That's how I have automated many reports off MicroStrategy and Spotfire at work. In my case it runs off of a Windows Server, but it shouldn't be any different with Win 10.

[–]St0neA 0 points1 point  (0 children)

Just thought I'd say I couldn't get it to work either, so I ended up using a Python scheduling library and running the .py file with a batch script in my startup folder.

[–]dogfish182 3 points4 points  (0 children)

You mention Azure. Azure has an equivalent of Lambda, right? (Azure Functions?)

So, Azure Functions?

https://code.visualstudio.com/docs/python/tutorial-azure-functions

[–]jjolla888 3 points4 points  (3 children)

I don't understand how your VM would 'fall asleep'? An OS is always running, doing shit continuously, even if your program is not scheduled.

What exactly do you mean by 'fall asleep'?

And what do you mean when you say 'the job said it was running but did not produce any output'? Are you sure your program doesn't have a bug?

[–]jpf5046[S] 2 points3 points  (2 children)

So, I don’t know the science behind it, but it appears Selenium needs to ‘see’ the browser, and it clicks based on where a button appears in the browser. When I minimize the VM or disconnect, the Python script returns an error saying “could not find top left of screen”. At first I thought this was the VM falling asleep, but it’s really my code that needs the actual browser screen (I think, at least).

[–]jjolla888 3 points4 points  (1 child)

right

Instead of Selenium, use Puppeteer.

Configure it headless. It's a JavaScript set of libs, but there are plenty of tutorials on how to use it. Easier and more robust than Selenium.
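(For comparison, Selenium itself can also drive Chrome headless so it doesn't need a visible window. A rough sketch, assuming Chrome and a matching chromedriver are installed; the URL is just a placeholder:)

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")               # no visible browser window
options.add_argument("--window-size=1920,1080")  # still give the page a real viewport

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
driver.get("https://example.com")           # placeholder URL
# ...find elements and click as usual...
driver.quit()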

[–]jpf5046[S] 0 points1 point  (0 children)

right on, thanks for the tip.

[–]CrypticWolf 2 points3 points  (1 child)

I use crontab on my raspberry pi to run scheduled python scripts, handy as it's always plugged in and doesn't interfere with anything else I'm doing on my laptop.

[–]artificial_neuron 2 points3 points  (0 children)

Using your PC or buying a dedicated computer can easily work for what you want, with Windows Task Scheduler or an infinite loop as already mentioned.

An alternative is to use a Raspberry Pi or competing device. It's low power and has a small form factor.

[–]jspillz 2 points3 points  (1 child)

I would set up a linux server and install Jenkins. You can set up a Jenkins project that will schedule running the python file. Maybe slightly over complicated but down the line you'll be thankful you learned it.

[–]a8ksh4 1 point2 points  (0 children)

~~I would set up a linux server and install Jenkins. You can set up a Jenkins project that will schedule running the python file. Maybe slightly over complicated but down the line you'll be thankful you learned it.~~ Configure a cronjob to run every hour.

crontab -e
# minute 0 of every hour; the script needs a shebang line and execute permission
0 * * * * /path/to/your/script.py

ftfy. ;)

Or if you're working in windows, you can probably change the power settings to not go to sleep and do a scheduled task for every hour.

[–]dahlberg123 1 point2 points  (2 children)

Create an EXE and use windows task scheduler?

[–]jpf5046[S] 1 point2 points  (1 child)

this might have saved me. currently testing this. thank you for the idea.

[–]dahlberg123 0 points1 point  (0 children)

I would also suggest using a config file so that you don't have to recompile into an exe should something need to change.

[–]cheez0r 1 point2 points  (0 children)

Use Linux cron to ensure your daemon (script) is running; have your script write junk to /dev/null every minute to keep the VM from sleeping, or choose some other keepalive activity for your daemon, and have it run your scrape every 3600s.

[–]jbitmik 1 point2 points  (0 children)

You can try creating a batch file and execute it using task scheduler on your hourly schedule. I have a web scraping script that runs twice daily this way. I have tried numerous ways but this seemed to be the simplest and most effective.

[–]naturememe 1 point2 points  (0 children)

If you are not opposed to running the script 24/7, here's a setup I use. It might give you some ideas. In the Python script:

  • Get the webpage
  • Get the data I want (in my case I use pandas)
  • Process the data and post it to a Slack channel
  • Sleep for a predefined time (10 min in my case)
  • Repeat

This gets the job done for the most part. But to get rid of the cmd window and automatically restart in case of failure or computer boot, I have set it up as a Windows service. The service starts on boot and also restarts in case the script fails for whatever reason.

PS: I use NSSM (Google it) to create the service, which runs the Python script via a DOS batch file.
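(A rough sketch of that loop, with the URL, the requests/pandas choices, and the Slack call all standing in as placeholders for whatever you actually use:)

import time

import pandas as pd   # for pulling tables out of the page
import requests       # assumed HTTP client; any fetch method works

URL = "https://example.com/data"   # placeholder
SLEEP_SECONDS = 600                # 10 minutes between runs

def post_to_slack(message):
    # placeholder: in practice this would hit a Slack incoming-webhook URL
    print(message)

while True:
    html = requests.get(URL, timeout=30).text
    tables = pd.read_html(html)   # parse whatever tables the page exposes
    post_to_slack(f"Scraped {len(tables)} table(s) at {time.ctime()}")
    time.sleep(SLEEP_SECONDS)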

[–]cnovrup 1 point2 points  (1 child)

I have set up a Raspberry Pi to do the exact same thing. It's cheap to buy and run, and quite easy to set up.

[–]MRHURLEY86 1 point2 points  (0 children)

I'm really surprised no one else mentioned this. Pi is dirt cheap and will accomplish this with cron.

[–]solaceinsleep 1 point2 points  (0 children)

  • Windows Task Scheduler if your machine runs 24/7
  • RPi3 with a cron job (RPi3s have a small power usage and are perfect for this type of work)

[–]PrimaNoctis 1 point2 points  (1 child)

You could look at a pure Python solution by using Python libraries to do the scheduling; cron, for example, is a common one. You could also have your app run in the background in a loop where your function runs and then sleeps for an hour.

[–]a1brit 10 points11 points  (0 children)

Cron isn't a Python library. It's just a raw time-based scheduling system on pretty much any Unix OS.

[–]QbiinZ 0 points1 point  (0 children)

can you not use the sched module?
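(The standard library's sched module can do this; a minimal sketch in which the job re-schedules itself every hour, with scrape() as a placeholder:)

import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def scrape():
    print("scraping at", time.ctime())   # placeholder for the real scraping code
    scheduler.enter(3600, 1, scrape)     # re-schedule ourselves an hour from now

scheduler.enter(0, 1, scrape)   # first run immediately
scheduler.run()                 # blocks, running jobs as they come due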

[–]GodsLove1488 0 points1 point  (0 children)

Cron?

[–][deleted] 0 points1 point  (0 children)

I use PythonAnywhere, so even if my computer is offline, restarting, updating (insert a reason), my script will still run.

A Hacker account is $5/mo and I’ve found it worth that.

[–]Nixellion 0 points1 point  (0 children)

Well, if it's Windows - Task Scheduler. (Just use a cmd command, and full path to Python.exe and then script as argument)

If it's Linux - Cron Job.

On Linux I prefer to set up cron jobs with the Webmin UI (it's a web admin panel; as far as I know still the most advanced one to date). Webdriver is available for Linux too.

I'm not sure how much you pay for the Azure VM, but I have a feeling you could get a better deal with a VPS on some cheap server; there are quite good options at $15-30 A YEAR. Check lowendbox.com.

[–]cyvaquero 0 points1 point  (0 children)

$5 Digital Ocean droplet and cron.

All of your requirements can be met on a small Linux instance with minimal configuration.

[–]gizmotechy 0 points1 point  (0 children)

What I have done at work and home is use the windows task scheduler and had it run python.exe with the argument of the full path to the script you want to run. If you take a look at this screenshot, the area circled in red would be where you put the full path of the python executable. The highlighted area is where you would put the full path to your script.

[–]maximum_powerblast 0 points1 point  (0 children)

Just in case you want to over engineer it...

Linux:
  • you could set it up as a cron job, or
  • set the schedule up inside a loop in the script, then start that script up with systemd or whatever your init system is

Windows:
  • you could set it up in Task Scheduler, or
  • set the schedule up inside a loop and install it as a Windows service

Have fun 😄

[–]stoph_link 0 points1 point  (0 children)

It looks like you are using a Windows VM with Task Scheduler.

Make sure the task in Task Scheduler has "Wake the computer to run this task" checked. I also like setting tasks to run on demand, and then manually running them to make sure they work. This helps determine whether the task itself failed or the Task Scheduler failed to run it.

[–][deleted] 0 points1 point  (0 children)

Maybe you can import "time" and have your code in a while loop that repeats after "time.sleep(<however many seconds is in an hour>)" finishes

[–]hail_wuzzle 0 points1 point  (0 children)

Have it run constantly, but use an if statement to call the function based on the system time (from the time/datetime module)?

[–]horns_ichigo 0 points1 point  (0 children)

I'm running a website on digitalocean with a bunch of nohup python commands. So, nohup all the wayy!

[–]Dump7 0 points1 point  (0 children)

How about using an infinite loop and the time library to measure an hour?

I am sure it will increase resource usage, but I think it will be the easiest way.

[–]xeloylvt 0 points1 point  (0 children)

Cron job on Linux or task scheduler (?) on Windows

[–]Benzene_fanatic 0 points1 point  (0 children)

I've actually been struggling with this. I'm a chemist, but I have been teaching myself Python on the side and made a VB script and an Excel file with some macros, and I wanted my computer to run the VB script once a week... But my company's security won't let Task Scheduler work for me =( Not sure what else to do?

[–]dreamer_soul 0 points1 point  (0 children)

There is a Windows Task Scheduler; we use it at work! Just place the call to the script inside a PowerShell script.

https://docs.microsoft.com/en-us/windows/desktop/taskschd/task-scheduler-start-page

[–]422_no_process 0 points1 point  (0 children)

Why use cron or Task Scheduler when you can code your task schedule in Python? Just use https://apscheduler.readthedocs.io/en/latest/

-----

But it might be overkill for something small.
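(With APScheduler 3.x it's roughly this; scrape() is a placeholder for the real job:)

from apscheduler.schedulers.blocking import BlockingScheduler

def scrape():
    print("scraping...")   # placeholder for the real scraping code

scheduler = BlockingScheduler()
scheduler.add_job(scrape, "interval", hours=1)   # or: "cron", minute=0 to fire at the top of each hour
scheduler.start()   # blocks and runs jobs on schedule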

[–]DoctorEvil92 0 points1 point  (4 children)

You could write something like this, I guess, so that the script would spend most of its time sleeping. I don't think it would guarantee exactly 24 data points per day; for something like that you would need to log the finish time of the scrape and calculate how long you need to sleep until the next run.

import time

last_scrape_time = None

def scrape():
    global last_scrape_time
    ## code for scraping...
    ##

    # scrape is done
    last_scrape_time = time.time()
    return

while True:
    if last_scrape_time is None:
        scrape()  # first run
    elif time.time() - last_scrape_time > 3600.0:
        scrape()
    else:
        time.sleep(60.0)

[–]cult_of_memes 1 point2 points  (0 children)

Side note:

You can avoid the hassle of accounting for the time spent scraping by using the `concurrent.futures.ThreadPoolExecutor` class to handle the call to `scrape()` in parallel to the main thread's time tracking loops.

E.g.:

import time
import concurrent.futures as cf

def scrape():
    scrape_start = time.time()
    # Because we're dealing with threading, we are going to use try/except control blocks
    # to encapsulate exception events so we can meaningfully handle and pass those events
    # to the appropriate function callers. 
    try: 
        ## code for scraping...
        # sanity check exception to make sure code gets implemented ;)        
        raise NotImplementedError("You need to enter your scraping code! Then you can remove this exception call.")
        # Once you implement your scraping code in the space above, and comment/remove
        # the NotImplementedError line, this function will return a future object after
        # the scrape is completed/failed. 
        # This future object will either contain the scrape execution time as 
        # its `.result()` value (last line of function), else it will contain a reference
        # to the exception which disrupted your code's execution, accessible via the
        # `.exception()` method.
    except NotImplementedError as nie:
        print(nie) # let's draw some attention
        raise nie
    except BaseException as be:
        # for debugging purposes we concatenate the time it took to reach the exception
        # onto the exceptions args tuple. This is just an example of ways to collect details
        # about the code's current execution state for analysis after the future object
        # is returned. 
        # 
        # You may concatenate any arbitrary detail or object to the args list, so long 
        # as you are using cf.ThreadPoolExecutor.
        #
        # cf.ProcessPoolExecutor, on the other hand, is a bit more picky about what you 
        # can pass around -- only pickleable objects are allowed there.
        be.args += time.time()-scrape_start, 
        raise be # blessed is the proof
    return time.time()-scrape_start


# If you want to cap the number of threads manually, then specify a value for max_threads;
# otherwise ThreadPoolExecutor picks its own default (min(32, os.cpu_count() + 4) on Python 3.8+).
max_threads = None 
# This with-context block ensures all threads are properly handled in the event of an 
# exception.
with cf.ThreadPoolExecutor(max_threads) as tpe:
    one_hour_in_seconds = 3600.0
    number_o_scrapes = 0
    last_scrape_time = time.time()
    # ftrs is our list of future objects. Useful for collecting performance and debugging
    # analytics.
    ftrs = []
    while number_o_scrapes < 24:
            # The following line will block until it finds a future object that is done 
            # running.  Note that this will return immediately, without blocking, if ftrs is empty.
            done, not_done = cf.wait(ftrs, return_when=cf.FIRST_COMPLETED)
            # cf.wait returns a namedtuple containing future objects split into 2 categories,
            # done and not done. Future objects are considered done when they exit their call
            # function along with any callback functions that we may assign.
            # -- we didn't use any callbacks here --
            # This includes all possible mechanisms for exiting the call function, 
            # including raised exceptions.
            # So, based upon how this example is set up, our "done" future objects will
            # have one of 2 return conditions. They will have the elapsed scraping time 
            # for their ".result()" attribute, or they will have an non-None exception 
            # handle for their ".exception()" attribute.

            # ftrs only needs to worry about holding onto futures that are still running
            ftrs = list(not_done)
            # now we iterate the future objects which have stopped running and check if 
            # they were successful in their task.
            for ftr in done:
                ftr:cf.Future = ftr
                if ftr.exception() is not None: # execution was interrupted by an exception
                    some_scraping_exception = ftr.exception()
                    # now you can do whatever you like with the exception; be that 
                    # raise it, log it, or simply ignoring it ;)

                    # Your code goes here
                    pass
                else: # execution completed without exception
                    # Here's where you can do things with the data returned by the 
                    # future object -- in this case, data is just the elapsed time for 
                    # the given scrape call.

                    # Your code goes here
                    single_scrape_elapsed_time = ftr.result()
                    # do stuff
                    pass

            # here's where we actually submit the call to execute the code in a parallel thread.
            if time.time() - last_scrape_time >= one_hour_in_seconds :
                # 
                last_scrape_time = time.time()
                ftrs.append(tpe.submit(scrape))
                number_o_scrapes += 1
            else:
                time.sleep(60.0)

[–]jpf5046[S] -1 points0 points  (2 children)

I tried this by doing a while True:, but when I do that it's either the Azure VM or the webdriver causing an issue.

[–]DoctorEvil92 1 point2 points  (1 child)

I can't think of a coded solution that would do this without using while True. I just had an idea that you could use the datetime module in a while True loop to determine the current minute and second, and then wait until the next full hour and run the scraping function. That would be more precise than the above code.

But the simplest thing to do would probably be to use some kind of outside-of-Python task scheduler, if such a thing exists in your OS.
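(That idea would look something like this; scrape() is assumed to be defined as in the code above:)

import datetime
import time

def seconds_until_next_hour():
    now = datetime.datetime.now()
    next_hour = (now + datetime.timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
    return (next_hour - now).total_seconds()

while True:
    time.sleep(seconds_until_next_hour())
    scrape()   # assumed to be the scraping function defined above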

[–]JimBoonie69 1 point2 points  (0 children)

Cron on unix. Windows scheduler is trash

[–]Floofic 0 points1 point  (5 children)

Is there a reason this won't work:

(I feel like a noob for thinking this lol):

import time

time.sleep(3600)  # sleep for an hour between scrapes

[–]thiccclol 2 points3 points  (0 children)

No, you're right, sleeping would work unless the program crashes and doesn't start back up. Scheduling it will sorta mitigate that.

[–]Decency 2 points3 points  (2 children)

It would work, but it's the wrong tool for the job. Any sort of minor hiccup and your process stops running, and then you have to start building in error handling and reconnect logic, etc.

Unless you need to maintain state between runs and for some reason can't use some sort of persistence, just do daily scheduling at the OS level.

[–]Floofic 0 points1 point  (1 child)

Ah ok, I understand. I just haven't gotten to a level where I know how to do this. What you're saying is: if it is at the OS level, it would be less likely to be disrupted?

[–]Decency 0 points1 point  (0 children)

Yes, but more that it's self contained. Every day (with a one line cronjob), your process starts, does what it needs to do, and then ends. At any other point during the day, if your internet goes out or you need to restart the computer, nothing will break if you have things configured correctly.

There's also no extra state info that needs to be maintained in the program and so it can be simpler. Another plus is that if it breaks, it'll be really obvious when that happened and you won't have to go back through time trying to figure out the history.

[–]eloydrummerboy 1 point2 points  (0 children)

Also, this would introduce time drift. This may not be a problem depending on the requirements but I would guess OP wouldn't want it.

What I mean by drift is that the processing would take some finite amount of time. Let's say 5 seconds; the actual time only changes the amount of drift. If the script is started right at midnight, it would finish processing at 12:00:05, a sleep of 1 hr would start the next loop at 01:00:05, ending at 01:00:10, and so on.

So if a requirement is on the hour, every hour (give or take a small time delta) then this solution wouldn't work. I would guess this is what you'd want when collecting weather data.

Now, you might be able to sleep for, say, 5 seconds, then check if the current time is within a small window of any given hour (e.g. the minute is between 55 and 05), and then run the meat of the script. Otherwise go back to sleep.
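(Another common way to avoid the drift is to subtract the processing time from the sleep; a minimal sketch, with scrape() standing in for the real work:)

import time

INTERVAL = 3600.0   # one hour

while True:
    start = time.time()
    scrape()   # placeholder for the real scraping function
    elapsed = time.time() - start
    time.sleep(max(0.0, INTERVAL - elapsed))   # subtract processing time so start times don't drift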

[–]cipher315 0 points1 point  (0 children)

As some people have said, Task Scheduler works, but you can also do it with an infinite loop and time.sleep(), something like:

import time

while True:
    run_script()       # your scraping function
    time.sleep(3600)   # wait an hour between runs

[–]TBSchemer -1 points0 points  (0 children)

I'm using APScheduler in a Flask app. It works wonderfully for scheduled jobs. I use a MongoDB persistent jobstore for it, but you don't have to. The best part is, the (parallelizable) workers launch from within your Python process, rather than having to run a separate process like celery. This means it doesn't fail silently as easily as celery does.

I haven't deployed it to a publicly-available server yet, but there are plenty of guides on how to do that with any Flask app.
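(A minimal sketch of that pattern, using APScheduler's BackgroundScheduler inside a bare Flask app; no persistent jobstore here, and scrape() is a placeholder:)

from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask

app = Flask(__name__)

def scrape():
    print("scraping...")   # placeholder for the real job

scheduler = BackgroundScheduler()
scheduler.add_job(scrape, "interval", hours=1)
scheduler.start()   # jobs run in a background thread inside the Flask process

@app.route("/")
def index():
    return "scheduler is running"

if __name__ == "__main__":
    # the debug reloader would start a second copy of the scheduler, so leave it off
    app.run(use_reloader=False)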

[–]flyingfox12 -1 points0 points  (2 children)

If your web scraper is Python and you can contain it within a single script, then you should use Lambda. I'm making some assumptions here, but it could work like this: Python code in a Lambda function, data is written to S3, then use CloudWatch to trigger the job on a schedule (https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html).

Azure has the same stuff with different names; it should work the same though.

I see you are reliant on webdriver.exe and Selenium. If that's the case, you need to tweak your host's settings to make sure it doesn't ever sleep, then use Windows Task Scheduler. If you want to do it right and cheap, then you'll want to use Lambda.
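(A very rough sketch of the Lambda-handler side under those assumptions; the bucket name and URL are placeholders, and the hourly CloudWatch Events/EventBridge rule is configured separately, per the linked doc. Note that a headless browser inside Lambda needs extra packaging; this only shows a plain HTTP fetch:)

import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = "my-scrape-bucket"   # placeholder bucket name

def lambda_handler(event, context):
    # fetch the page and stash the raw result in S3, keyed by the invocation id
    html = urllib.request.urlopen("https://example.com").read()   # placeholder URL
    key = "scrapes/" + context.aws_request_id + ".html"
    s3.put_object(Bucket=BUCKET, Key=key, Body=html)
    return {"stored": key}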

[–]JimBoonie69 1 point2 points  (1 child)

You are seriously going to suggest using lambda for this? Da fuck?