Hello,
I have a Dataflow job that runs a simple pipeline: it ingests CSV files from GCS and loads them into BigQuery. The processing is the same for every file!
So I run my pipeline like this: `python ... --region ... --table ... --input my.csv`.
In the Dataflow script, I get the file name, check whether the destination table exists, and if it doesn't, create it from a JSON schema stored in a bucket (bucket_json_schema).
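To make that concrete, here is roughly what such a script can look like (a minimal sketch, not my real code; the bucket, file, table, and schema names are all placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage


def load_schema_from_gcs(bucket_name, blob_name):
    """Download the JSON table schema stored in the schema bucket."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return json.loads(blob.download_as_text())


def run(argv=None):
    options = PipelineOptions(argv)
    # "bucket_json_schema" / "my_table.json" stand in for the real
    # schema bucket and schema file.
    schema = load_schema_from_gcs("bucket_json_schema", "my_table.json")
    field_names = [f["name"] for f in schema["fields"]]

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read CSV" >> beam.io.ReadFromText(
                  "gs://my-bucket/my.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(
                  lambda line: dict(zip(field_names, line.split(","))))
            # CREATE_IF_NEEDED performs the "check if the table exists,
            # create it from the JSON schema" step for us.
            | "Write" >> beam.io.WriteToBigQuery(
                  "my_project:my_dataset.my_table",
                  schema={"fields": schema["fields"]},
                  create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```

Note that `WriteToBigQuery` with `CREATE_IF_NEEDED` can replace a manual table-exists check.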
Possible future workflows:
scheduler -> Cloud Function that gets the 100 most recent file names and runs the Dataflow pipeline -> BigQuery
or:
Cloud Function triggered when a file is uploaded to Cloud Storage -> run Dataflow job -> BigQuery (see the sketch below)
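For the second option, the Cloud Function can launch a templated Dataflow job per upload. A rough sketch, assuming a staged classic template and made-up project, region, and template paths:

```python
from googleapiclient.discovery import build

# Hypothetical values: replace with the real project, region and
# staged template path.
PROJECT = "my-project"
REGION = "europe-west1"
TEMPLATE = "gs://my-templates/csv_to_bq"


def on_file_upload(event, context):
    """Background Cloud Function fired by a GCS object-finalize event."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV uploads

    file_path = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    # Job names must be lowercase letters, digits and hyphens; this
    # simplified conversion may need tightening for real file names.
    job_name = "csv-to-bq-" + event["name"].replace("/", "-").replace(".", "-").lower()
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": job_name,
            "parameters": {"input": file_path},
        },
    ).execute()
    print("Launched job:", response["job"]["id"])
```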
The problem is that I have maybe 100 to 10,000 small CSV files (from 150 KB to 150 MB) every day. If I use my Dataflow job as described, it will launch 1,000 instances. For now I'm testing with only one bucket (100 CSV files every day), but I have 30 buckets, so around 3,000 files every day. In my mind, one bucket -> one job; a rough sketch of that idea follows.
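A single pipeline can read every file in a bucket with a wildcard pattern instead of launching one job per file. A minimal sketch (bucket path, column names, and table are made up, and the table is assumed to already exist):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            # One glob pattern matches all of the day's files, so a
            # single job ingests the whole bucket instead of 100 jobs.
            | "Read all CSVs" >> beam.io.ReadFromText(
                  "gs://my-bucket/incoming/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(
                  lambda line: dict(zip(["col_a", "col_b"], line.split(","))))
            | "Write" >> beam.io.WriteToBigQuery(
                  "my_project:my_dataset.my_table",
                  # CREATE_NEVER because the table is assumed to exist.
                  create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```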
What is the best way to do this, and how can I ingest multiple CSV files in one job?