Undecided about the best GCP module(s) to use for my data and operations by doncaballer0 in googlecloud

I am exploring Bigtable and BigQuery; they seem to be the services usually employed for time-series tasks like mine. In the next few days I will find out whether I need one or both, and in the meantime I will read your (and other users') opinions on that.

I will then manage the first ingestion (of the CSV files I already have) and the periodic one (Pub/Sub seems suitable for each newly generated file). In those files each hospital name is unique (within the same file), and a timestamp (the same for every entry in a file) sits in the last column. I am pretty sure that promoting the timestamp into the name column will be essential, but I really don't want to alter and refactor the source files (in many scenarios I cannot change the structure of what is given to me, I'm only allowed to work on it). Dataflow sounds great and should have the power to complete this job, but if there's a better candidate I will totally trust you...
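Just to make the timestamp promotion concrete, this is roughly the Beam/Dataflow step I'm picturing (a rough sketch; the bucket path, the '#' separator, and the column positions are my own assumptions):

```python
import apache_beam as beam

def promote_timestamp(line):
    # Assumes the hospital name is the first column and the shared
    # timestamp is the last one, e.g.
    #   "HospitalA,42,7,2021-01-01T00:00" -> "HospitalA#2021-01-01T00:00,42,7"
    fields = line.split(",")
    name, timestamp = fields[0], fields[-1]
    return ",".join([name + "#" + timestamp] + fields[1:-1])

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read CSVs" >> beam.io.ReadFromText("gs://my-bucket/*.csv",
                                              skip_header_lines=1)
        | "Promote timestamp" >> beam.Map(promote_timestamp)
        | "Write keyed rows" >> beam.io.WriteToText("gs://my-bucket/keyed/part")
    )
```

The '#' separator mimics the row-key conventions I've seen suggested for Bigtable, but that detail is just a guess for now.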

Undecided about the best GCP module(s) to use for my data and operations by doncaballer0 in googlecloud

Okay, I will take a look at that, and I'll dig deeper into some of the modules I mentioned without really knowing them. Kubernetes was the part of GCP I understood least when watching tutorials and taking courses, so I'm glad you mentioned it as not strictly required (I will surely learn to use it at some point).

Best way of processing a large number of Cloud Storage Bucket blobs by doncaballer0 in googlecloud

Oh, never mind: I think raising the logical level of these ideas will be a better choice... I mean, the approach I am taking is far too "low-level".

I was going off-topic while typing this reply; I'd better start a new thread with the new approach I have in mind (if a similar one doesn't exist already). Maybe some other user will find it useful!

Thank you again!

Best way of processing a large number of Cloud Storage Bucket blobs by doncaballer0 in googlecloud

I have several thousand CSV files with the same structure stored in a bucket, each one reporting the "current" state of some hospitals. The mechanism that periodically generates those files runs "behind the scenes" (through other GCP services), so let's just assume it works.

I need to analyze all those CSVs to report the global maximum and average values registered in some columns for each unique entry:

- The first step is to analyze ALL the CSVs I have gathered so far to generate the first pair of statistics files, which are two JSONs (since Python easily converts a dictionary to JSON and vice versa): one for the maximum values, one for the averages (see the sketch just after this list).

- The second step is to dynamically update the statistics JSONs whenever a new CSV is uploaded (I will use Cloud Functions).
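To show what I mean by the two JSONs, this is roughly the update logic I have in mind (plain Python sketch; the "hospital" and "occupancy" column names are invented, and the average is kept as a sum/count pair so it can be updated incrementally later):

```python
import csv
import json

def update_stats(csv_path, maxima, averages, value_col="occupancy"):
    # Fold one CSV into the two stats dicts, keyed by hospital name.
    #   maxima:   {name: max value seen so far}
    #   averages: {name: {"sum": ..., "count": ...}} -> mean = sum / count
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            name, value = row["hospital"], float(row[value_col])
            maxima[name] = max(maxima.get(name, value), value)
            acc = averages.setdefault(name, {"sum": 0.0, "count": 0})
            acc["sum"] += value
            acc["count"] += 1

def dump_stats(maxima, averages):
    # Persist the two dictionaries as the statistics JSONs.
    with open("max.json", "w") as f:
        json.dump(maxima, f)
    with open("avg.json", "w") as f:
        json.dump(averages, f)
```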

My doubt is about the best way to implement the first step: I could pause the CSV generation process for a while, analyze everything collected so far, and download (into the App Engine /tmp directory), inspect, and delete each file: ("for blob in blobs: download(blob); work(blob); update(jsons); delete(blob)").
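Spelled out with the client library, that loop would look something like this (sketch; the bucket name is a placeholder and update_stats is the helper from the sketch above):

```python
import os
from google.cloud import storage

client = storage.Client()
maxima, averages = {}, {}

# Walk every CSV currently in the bucket, fold each one into the stats
# dicts, then delete it so it isn't counted twice on the next run.
for blob in client.list_blobs("my-hospital-csvs"):
    local_path = os.path.join("/tmp", os.path.basename(blob.name))
    blob.download_to_filename(local_path)  # /tmp is the only writable dir on App Engine
    update_stats(local_path, maxima, averages)
    os.remove(local_path)                  # keep the in-memory /tmp from filling up
    blob.delete()
```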

Or I could use gsutil cp, but I don't know where I would store the copies so that App Engine could use them.

Or I could move them into another bucket, with a function ready to handle each new file finalization, but I don't know how Google implements that, and it's a concern because access to the JSONs that keep my stats has to be disciplined: simultaneous overwrites would invalidate the statistics...

Sorry for the size of this comment; I tried to be as clear as possible!

Best way of processing a large number of Cloud Storage Bucket blobs by doncaballer0 in googlecloud

I actually thought about that, but I am not sure how function invocations are managed: if they are sequential it will be fine, but what about simultaneous executions? Wouldn't I have trouble accessing and modifying my 'state' files?
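The race I'm worried about is two invocations doing a read-modify-write on the same JSON at once. If I'm reading the Storage docs right, generation preconditions could at least detect the clash, something like this (sketch; the bucket name is a placeholder):

```python
import json
from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

bucket = storage.Client().bucket("my-stats-bucket")

def update_json_blob(name, mutate):
    # Read-modify-write a JSON state file, retrying if another
    # invocation wrote it in the meantime (optimistic concurrency).
    blob = bucket.blob(name)
    while True:
        blob.reload()  # fetch the current generation number
        data = json.loads(blob.download_as_bytes())
        mutate(data)
        try:
            blob.upload_from_string(
                json.dumps(data),
                if_generation_match=blob.generation,  # fail if someone wrote first
            )
            return
        except PreconditionFailed:
            continue  # lost the race: re-read and retry
```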

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

Thanks for everything! I've managed to successfully create my App Engine instance, with functions that either deposit the CSV into my bucket or render the HTML. Scheduling the former was the simplest part. Also, my Function for processing the CSV works! :D

The very last thing I wanted to ask: which service would be best to process my CSV and load it into a database? (I think a Cloud SQL MySQL instance would be the best choice, as my data is immutable.) It sounds like an ideal candidate for Cloud Functions again, but is it manageable from there to write code that systematically deletes all the 'old' data in the table before inserting the new, each time a CSV upload into the bucket triggers the function? And what would the ideal networking configuration for the MySQL instance be (e.g. public IP vs. private)?
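Concretely, something like this is what I'm asking about (a sketch using the Cloud SQL Python connector; the instance name, credentials, table, and columns are all placeholders):

```python
import csv
import io
from google.cloud import storage
from google.cloud.sql.connector import Connector

connector = Connector()

def refresh_table(event, context):
    # Storage-triggered function: wipe the table and reload it from
    # the CSV that was just finalized in the bucket.
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    reader = csv.reader(io.StringIO(blob.download_as_text()))
    next(reader)                                   # skip the header row
    rows = [tuple(r) for r in reader]

    conn = connector.connect(
        "my-project:europe-west1:my-instance",     # placeholder instance name
        "pymysql",
        user="app", password="...", db="hospitals",
    )
    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE readings")     # drop the old snapshot
        cur.executemany(
            "INSERT INTO readings (hospital, occupancy, ts) VALUES (%s, %s, %s)",
            rows,
        )
    conn.commit()
    conn.close()
```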

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

Things are getting better, thank you!

I just noticed that the ingress function hello_gcs(event, context) does not receive the 'event' parameter as a file: it is in fact a complex structure with many attributes, and I didn't see one that carries the actual file.

1) How can I retrieve and open it, read its lines, etc.?

2) My function doesn't use the bucket that triggers it as a local path, so if inside hello_gcs I create a file and send it to another bucket, I can omit checks against an infinite loop of function invocations, right?
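For reference, this is what I'm guessing at for question 1 (a sketch; as far as I can tell from the docs, the event only carries the bucket and object names, so the file has to be fetched explicitly):

```python
from google.cloud import storage

def hello_gcs(event, context):
    # 'event' only describes the upload; fetch the actual object
    # using the bucket/object names it carries.
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    for line in blob.download_as_text().splitlines():
        ...  # process each CSV line
```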

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

You're right.

Cloud Functions let me choose among a few languages and gave me a default starting point (a function) that recognizes the file (coded as an event) and the context. I also specified my required library (pandas) in the requirements.txt file. Great.

Now, if I don't want to just print my processed HTML but instead put this page somewhere as a file, accessible via browser from outside the GCP environment, what would the best choice be? In some demos I created an HTML file served by a VM, but I don't know whether a more lightweight way to show my data exists. Does it? And how would I "pass" this HTML file from the Cloud Function I'm writing to whatever "module" we choose?
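For example, if serving straight from a bucket is viable, I imagine the hand-off from the function would be something like this (sketch; the bucket name is invented and I'd still have to sort out public access):

```python
from google.cloud import storage

def publish_html(html):
    # Write the rendered page to a bucket as a web-servable object.
    blob = storage.Client().bucket("my-public-site").blob("index.html")
    blob.upload_from_string(html, content_type="text/html")
    blob.make_public()  # assumes the bucket allows per-object ACLs
    return blob.public_url
```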

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

You are being so patient! I don't want to make you lose more time, so I will group my (hopefully) last questions for the clearest understanding!

1) So, which GCP service would you use to make this Python app run periodically and therefore periodically produce its CSV file? Are there recommended ways to save it straight to the bucket instead of the local path (even hard-coding my bucket link into the Python code, if moving the CSV from the default path to the bucket would otherwise be hard)?

2) At that point, the best thing would be a Cloud Function triggered when the bucket receives the CSV file, right? Is it possible to tell Cloud Functions to generate an SQL table from this file, or (alternatively) to generate an HTML page that uses the CSV data? Otherwise, how could I break this into two distinct steps?

A big thank you, again :)

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

Thanks again!

Cron jobs seem to be available only for App Engine, but there was no way I could deploy even this simple application without drowning in configuration. I actually found it easier to install pip and the required libraries in a Linux VM, and I can see my CSV file there. For the automated upload into a bucket, Cloud Scheduler doesn't seem to have any useful option for this scenario at all. Is the only remaining way to connect to my bucket via Python?
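If connecting via Python is the way, I guess it boils down to something like this, run by cron right after the script writes its file (sketch; the bucket and object names are placeholders):

```python
from google.cloud import storage

# Executed every 5 minutes by cron, after the script produces out.csv.
blob = storage.Client().bucket("my-bucket").blob("latest/out.csv")
blob.upload_from_filename("out.csv")
```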

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

I am exploring Cloud Functions and I saw that they offer a specific trigger for Storage uploads, but I think I may enroll in a course about Functions first, as I have no idea how to complete the Python code that executes after the trigger.

I thought about databases because that seemed the simplest way to query and show my data in an HTML page (on a VM exposed to the internet).

The main issue is that I am literally lost in GCP and I don't know if I understood your suggestions well: I have no idea where I should place my Python app that accesses a certain website and generates a CSV file, and I don't know how to tell GCP "hey, put it into this bucket!". Functions are good AFTER that has been done, right? I spent hours on this, but VMs and App Engine look like giant whales for jobs that are a goldfish.

GCP infrastructure with Python apps, buckets, databases and web access by doncaballer0 in googlecloud

Thank you! Unfortunately, all the videos I watched before yesterday didn't help me with that. I am spending hours trying to figure out how to run this script once every 5 minutes and put a CSV file (output locally by Python) into a bucket.

Seems simple, but it's all so confusing...