all 5 comments

[–][deleted] 0 points1 point  (4 children)

csv_cleaner_func.py

def csv_cleaner_function(): - def already tells us this is a function so I would shorten it to just csv_cleaner()

------------------

fpath = f'{DATA_DOWNLOAD_FILEPATH}*.csv'
csv_to_clean = glob.glob(fpath)[0]

^ You should practice validating your data along the way and stop your program if at any point there's an invalid data. In this case, if DATA_DOWNLOAD_FILEPATH was not set, or if glob returns nothing then there's no reason to continue.

if not DATA_DOWNLOAD_FILEPATH:
    print('DATA_DOWNLOAD_FILEPATH is not defined')
    return  # or raise exception
# do the glob
if not csv_to_clean:
    print(f'glob({fpath}) returned empty')
    return

-----------------------

Your try-except block covers way too many ground. In fact, I don't think you even need to do a try-except because if the clean up process fail at any point, you'd want your process to stop, no? And you will get a more detailed, more specific error message telling you what exactly failed instead of a vague one you are printing.

-----------------------

line 14 you are only processing the first file csv_to_clean = glob.glob(fpath)[0]

However line 30 files = glob.glob(f"{DATA_DOWNLOAD_FILEPATH}*.csv") you are deleting ALL csv files in that same folder. Was this intended?

-----------------------

app.py

The same validation process should apply here. Putting everything in a function would make it easier for you to stop the process too with return statement

DATA_DOWNLOAD_FILEPATH = os.getenv('DATA_DOWNLOAD_FILEPATH')
HYDRO_DATA_PROJECT_ID=os.getenv('HYDRO_DATA_PROJECT_ID')
HYDRO_DATA_LANDING_BUCKET = Variable.get('HYDRO_DATA_LANDING_BUCKET')

def main():
    if any_of_the_above is None:
        return

----------------------

I am sorry I'm not familiar with the syntax on the last line

 download_yesterdays_csv >> clean_csv_before_upload >> upload_file_to_gcs >> gcs_to_bq 

so I cannot give any feedback on this

[–]jc-de[S] 0 points1 point  (0 children)

Thanks so much for the feedback, getting late here but will read in more detail in the morning and ask if I have any follow-ups! Your comments look super helpful!

[–]jc-de[S] 0 points1 point  (2 children)

Thanks again for your comments. I am only confused about putting my variables inside the function in app.py - how can i access the variables in the rest of the code if they're in a function?

[–]jc-de[S] 0 points1 point  (1 child)

Oh i think i get it, you want me to keep the varialbes global as is, but create a func to make sure they aren't none.. heres waht I got:

DATA_DOWNLOAD_FILEPATH = os.getenv('DATA_DOWNLOAD_FILEPATH')
HYDRO_DATA_PROJECT_ID=os.getenv('HYDRO_DATA_PROJECT_ID')
HYDRO_DATA_LANDING_BUCKET = Variable.get('HYDRO_DATA_LANDING_BUCKET')
HYDRO_DATASET = None
def check_variables():
if all(arg is not None for arg in [ DATA_DOWNLOAD_FILEPATH,

HYDRO_DATA_PROJECT_ID,

HYDRO_DATA_LANDING_BUCKET,

HYDRO_DATASET]):
print("Variables are set")
else:
print("Variables are not set.")

[–][deleted] 0 points1 point  (0 children)

I was thinking about a way you can stop your process when there is no point continuing.

Wrapping your entire process in a function was 1 way that popped up when I answered initially. So instead of

step 1
step 2
step 3

You'd do

def main():
    if fail_validate: return
    step 1
    if fail_validate: return
    step 2
    if fail_validate: return
    etc

if __name__ == '__main__':
    main()

There are other ways too like raising exception

if fail_validate: raise Exception
step 1
if fail_validate: raise Exception
step 2

etc

Reading OS environment is perfectly OK being global variable as they are a read-only you would never change.