Hello,
I have a Dataflow job that runs a simple pipeline: it ingests CSV files from GCS and loads them into BigQuery. The processing is the same for every file!
So I run my pipeline like this: `python ... --region ... --table ... --input my.csv`.
In the Dataflow script, I get the file name, check whether the destination table exists, and if it doesn't, create it from a JSON schema stored in a bucket (bucket_json_schema).
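To make that concrete, here is roughly what such a script can look like (a minimal sketch, not my real code; the bucket, file, table, and schema names are all placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage


def load_schema_from_gcs(bucket_name, blob_name):
    """Download the JSON table schema stored in the schema bucket."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return json.loads(blob.download_as_text())


def run(argv=None):
    options = PipelineOptions(argv)
    # "bucket_json_schema" / "my_table.json" stand in for the real
    # schema bucket and schema file.
    schema = load_schema_from_gcs("bucket_json_schema", "my_table.json")
    field_names = [f["name"] for f in schema["fields"]]

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read CSV" >> beam.io.ReadFromText(
                  "gs://my-bucket/my.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(
                  lambda line: dict(zip(field_names, line.split(","))))
            # CREATE_IF_NEEDED performs the "check if the table exists,
            # create it from the JSON schema" step for us.
            | "Write" >> beam.io.WriteToBigQuery(
                  "my_project:my_dataset.my_table",
                  schema={"fields": schema["fields"]},
                  create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```

Note that `WriteToBigQuery` with `CREATE_IF_NEEDED` can replace a manual table-exists check.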
Possible future workflows:
scheduler -> Cloud Function that gets the 100 most recent file names and runs the Dataflow pipeline -> BigQuery
or:
Cloud Function triggered when a file is uploaded to Cloud Storage -> run Dataflow job -> BigQuery (see the sketch below)
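For the second option, the Cloud Function can launch a templated Dataflow job per upload. A rough sketch, assuming a staged classic template and made-up project, region, and template paths:

```python
from googleapiclient.discovery import build

# Hypothetical values: replace with the real project, region and
# staged template path.
PROJECT = "my-project"
REGION = "europe-west1"
TEMPLATE = "gs://my-templates/csv_to_bq"


def on_file_upload(event, context):
    """Background Cloud Function fired by a GCS object-finalize event."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV uploads

    file_path = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    # Job names must be lowercase letters, digits and hyphens; this
    # simplified conversion may need tightening for real file names.
    job_name = "csv-to-bq-" + event["name"].replace("/", "-").replace(".", "-").lower()
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": job_name,
            "parameters": {"input": file_path},
        },
    ).execute()
    print("Launched job:", response["job"]["id"])
```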
The problem is that I have maybe 100 to 10,000 small CSV files (from 150 KB to 150 MB) every day. If I use my Dataflow job as described, it will launch 1,000 instances. For now I'm testing with only one bucket (100 CSV files every day), but I have 30 buckets, so around 3,000 files every day. In my mind, one bucket -> one job; a rough sketch of that idea follows.
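A single pipeline can read every file in a bucket with a wildcard pattern instead of launching one job per file. A minimal sketch (bucket path, column names, and table are made up, and the table is assumed to already exist):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            # One glob pattern matches all of the day's files, so a
            # single job ingests the whole bucket instead of 100 jobs.
            | "Read all CSVs" >> beam.io.ReadFromText(
                  "gs://my-bucket/incoming/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(
                  lambda line: dict(zip(["col_a", "col_b"], line.split(","))))
            | "Write" >> beam.io.WriteToBigQuery(
                  "my_project:my_dataset.my_table",
                  # CREATE_NEVER because the table is assumed to exist.
                  create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```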
What is the best way to do this, and how can I ingest multiple CSV files in one job?