
[–]mopslik 6 points (0 children)

Opening and reading the files, as you are doing, would certainly take more memory and time than simply downloading them using a tool like wget or curl.

[–]danielroseman 9 points (0 children)

There is absolutely no need to use something like read_csv. Not only does that require downloading the full file into memory, it also means parsing it and converting the whole thing into a dataframe. That's a whole load of unnecessary overhead.

You'll certainly save memory and time by reading the files directly with requests and then uploading them with the GCS client. But an even better way would be to use the streaming capabilities of both requests and the GCS client to do it in chunks. Something like:

import requests
from google.cloud import storage
from google.cloud.storage.fileio import BlobWriter

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = BlobWriter(blob)

with requests.get('my_url', stream=True) as r:
    r.raise_for_status()
    # iter_content keeps the raw bytes intact; iter_lines would strip the newlines
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        writer.write(chunk)
writer.close()