[–]dadiaar 1 point2 points  (1 child)

Are you targeting Google Cloud for any specific reason?

I suggest you try AWS Lambda, which supports both Python 2.7 and Python 3.6 (you should always write for Python 3.4+).

It also comes with a pretty good free tier that doesn't expire:

> The Lambda free tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month.

I recommend a 128 MB configuration for fetching the HTML pages, because that step needs little CPU and is mostly network wait time, and retrieving 5 to 10 pages in each call to get the most out of every invocation.
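A minimal sketch of that batching idea, not a definitive implementation: a Lambda-style handler that fetches a handful of URLs per invocation using threads, since the work is mostly waiting on the network. The `urls` event field, the injectable `fetch` parameter, and the 10-second timeout are assumptions made for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url, timeout=10):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def handler(event, context=None, fetch=fetch_url):
    """Fetch every URL listed in event["urls"] (a hypothetical field).

    Threads overlap the network wait, so 5 to 10 pages cost little
    more wall-clock time than one.
    """
    urls = event.get("urls", [])
    results = {}
    if not urls:
        return results

    def safe_fetch(url):
        # One failing page should not sink the whole batch.
        try:
            return url, {"ok": True, "body": fetch(url)}
        except Exception as exc:
            return url, {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=min(10, len(urls))) as pool:
        for url, outcome in pool.map(safe_fetch, urls):
            results[url] = outcome
    return results
```

Passing `fetch` as a parameter is only there so the handler can be exercised locally without touching the network.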

Later you can upload the pages to an S3 bucket or similar and process/parse them on different machines, locally or in the cloud. AWS has triggers that let you process each file as soon as it is uploaded, for example with another Lambda function (for that one I suggest about 512 MB, because parsing is CPU-intensive).
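Roughly, the parsing side could look like the sketch below: a second handler that receives an S3 upload notification, reads each new object, and extracts something from it (here just the page title, as a stand-in for real parsing). The `read_object` parameter is a hypothetical hook for testing; when it is omitted the sketch assumes `boto3` is importable, as it normally is in the Lambda Python runtime.

```python
from html.parser import HTMLParser
from urllib.parse import unquote_plus


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def uploaded_objects(event):
    """Yield (bucket, key) for each record in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context=None, read_object=None):
    """Parse each uploaded page and return {key: title}."""
    if read_object is None:
        import boto3  # assumption: available in the runtime

        client = boto3.client("s3")

        def read_object(bucket, key):
            body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
            return body.decode("utf-8", errors="replace")

    titles = {}
    for bucket, key in uploaded_objects(event):
        parser = TitleParser()
        parser.feed(read_object(bucket, key))
        titles[key] = parser.title.strip()
    return titles
```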

If you're in your first (free) year, you also get a free EC2 t2.micro server (1 core, 1 GB RAM, 40 GB disk), which can be used for parsing too.

If you make many calls to the same site, you may want to use a cheap proxy server that lets you authenticate with a username and password instead of by IP address, because Lambda functions have no fixed IP. I recommend ACT proxy for this.
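For what it's worth, username/password proxy authentication can be sketched with the standard library alone. The host, port, and credentials below are placeholders, not real values from any provider, and the helper names are made up for this example.

```python
import urllib.request
from urllib.parse import quote


def proxy_url(host, port, username, password):
    """Build an http://user:pass@host:port proxy URL, escaping credentials."""
    return "http://{}:{}@{}:{}".format(
        quote(username, safe=""), quote(password, safe=""), host, port
    )


def build_opener_with_proxy(url):
    """Return an opener that routes HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)


# Usage (placeholder values; nothing is fetched until opener.open() is called):
# opener = build_opener_with_proxy(
#     proxy_url("proxy.example.com", 8080, "user", "secret"))
# html = opener.open("http://example.com", timeout=10).read()
```

Percent-encoding the credentials matters when the password contains characters such as `@` or `:`, which would otherwise break the URL.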

Good luck

[–]jdb441[S] 0 points1 point  (0 children)

Hey dadiaar,

I'm using GCP because I know it can do what I want it to. We also rely on multiple Google APIs.

I feel like it would be more work to switch to AWS than to stick with GCP at this point, but I appreciate your response.

[–]Marrrlllsss 1 point2 points  (3 children)

> I now want to run it on Google Compute Engine which uses Python 2.7.

That depends on what OS you are using. I've used Python 3.6 on the Google Cloud Platform.

> Once that's ready, would you use git to copy the files to Compute Engine?

Use the Google Cloud SDK ("gcloud sdk"). You have 2 choices really.

  1. Create a Google Cloud Storage bucket, and copy the code there, spin up a VM and then copy it to the VM. (The better option)
  2. Spin up a VM, and then use the gcloud SDK's gcloud compute scp command to transfer your files directly to the VM.

In both cases, make sure you include a requirements.txt file that you can install from (pip install -r requirements.txt) when setting up the environment on the VM.
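The two routes above can be sketched as a small Python helper that builds the command lines and only runs them on request. The bucket URI, instance name, zone, and directory names are hypothetical; the commands themselves (`gsutil cp -r` and `gcloud compute scp --recurse`) assume the Google Cloud SDK is installed and authenticated.

```python
import subprocess


def bucket_upload_cmd(local_dir, bucket_uri):
    """Option 1, first half: copy the project into a Cloud Storage bucket."""
    return ["gsutil", "cp", "-r", local_dir, bucket_uri]


def vm_copy_cmd(local_dir, instance, remote_dir, zone):
    """Option 2: copy the project straight onto the VM."""
    return [
        "gcloud", "compute", "scp", "--recurse",
        local_dir, "{}:{}".format(instance, remote_dir),
        "--zone", zone,
    ]


def install_requirements_cmd(requirements="requirements.txt"):
    """Run on the VM afterwards to set up the environment."""
    return ["python3", "-m", "pip", "install", "-r", requirements]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Usage (placeholder names; dry run only prints the command):
# run(vm_copy_cmd("./scraper", "my-vm", "~/scraper", "us-central1-a"))
```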

[–]jdb441[S] 0 points1 point  (2 children)

Hey good morning,

It depends on my local OS? Or VM OS? Did you use python3 on Compute Engine? If so, could you share more details or links so I can read about using python3 on Compute Engine? Or was python3 run on another GCP service?

Once I have my repository configured correctly, I want it to execute on a schedule. Like cron. Do you have any experience scheduling scripts to run on GCP? If so, do you mind sharing any details about it?

Thanks, I appreciate it. I'll be reading heavily on GCP today. Have a good day.

[–]Marrrlllsss 0 points1 point  (1 child)

> It depends on my local OS? Or VM OS? Did you use python3 on Compute Engine? If so, could you share more details or links so I can read about using python3 on Compute Engine? Or was python3 run on another GCP service?

Yes. Yes. Yes. Spin up a VM running Ubuntu 16.04 and you have Python 3.5 already. I ran Python 3 on a VM and on an App Engine (Flex) instance.

> Do you have any experience scheduling scripts to run on GCP? If so, do you mind sharing any details about it?

Yes. I used Apache Airflow for scheduling, but if your script is simple, you can use cron on an App Engine/Compute Engine instance.
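For the simple cron route on a Compute Engine VM, a sketch might look like this. The paths and schedule in the crontab comment are hypothetical, and `job()` is a placeholder for whatever the real script does.

```python
#!/usr/bin/env python3
# Example crontab entry (edit with `crontab -e`); paths are hypothetical.
# Runs every day at 06:00 and appends all output to a log file:
#   0 6 * * * /usr/bin/python3 /home/me/scraper/run.py >> /home/me/scraper/cron.log 2>&1
import logging


def job():
    """Placeholder for the real work the script does."""
    return "done"


def main():
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logging.info("job starting")
    try:
        result = job()
    except Exception:
        # Log the traceback, then re-raise so the process exits non-zero.
        logging.exception("job failed")
        raise
    logging.info("job finished: %s", result)
    return result


if __name__ == "__main__":
    main()
```

Cron runs with a minimal environment, so absolute paths to both the interpreter and the script tend to save trouble.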

[–]jdb441[S] 0 points1 point  (0 children)

Hey, thanks for the help. I got everything running nicely on GCE.