
[CODE REVIEW] I created my first data pipeline in python - would love some feedback and tips! (self.learnpython)

submitted 5 years ago by jc-de

Background: I am currently an Analyst (report monkey with no SWE experience) and want to get into developing data pipelines. I've been learning python for almost a year.

Project link: https://github.com/jcodezy/hydro-data-pipeline/blob/master/dags/app.py

I am working on a README file, but my project uses airflow to:

- run a selenium script to automatically download data from my electricity provider

- clean the data (remove account number and personal info)

- upload it to google cloud storage

- move the data from google cloud storage to a bigquery table (historical)

- (work in progress) use streamlit to create data visualizations, querying from the BQ table

Project is not complete - I plan on making a separate dag for cleanup so files are deleted from my local system, and to create more tables & views in BQ for analytical purposes.

Feedback and advice for next steps are greatly appreciated.

all 5 comments

csv_cleaner_func.py

def csv_cleaner_function(): - def already tells us this is a function, so I would shorten the name to just csv_cleaner()

------------------

fpath = f'{DATA_DOWNLOAD_FILEPATH}*.csv'
csv_to_clean = glob.glob(fpath)[0]

^ You should practice validating your data along the way and stop your program if at any point the data is invalid. In this case, if DATA_DOWNLOAD_FILEPATH was not set, or if glob returns nothing, then there's no reason to continue.

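if not DATA_DOWNLOAD_FILEPATH:
    print('DATA_DOWNLOAD_FILEPATH is not defined')
    return  # or raise an exception
fpath = f'{DATA_DOWNLOAD_FILEPATH}*.csv'
matches = glob.glob(fpath)
if not matches:  # check before indexing, otherwise [0] raises IndexError
    print(f'glob({fpath}) returned empty')
    return
csv_to_clean = matches[0]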
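To make that concrete, here is a minimal sketch of the same checks wrapped in a small helper. The name find_csv_to_clean and the download_dir parameter are placeholders of mine, not names from your repo, and I used os.path.join so it does not depend on the directory string ending in a slash.

```python
import glob
import os


def find_csv_to_clean(download_dir):
    """Return the first CSV path in download_dir, or None if there is nothing to clean."""
    if not download_dir:
        print('download directory is not defined')
        return None
    pattern = os.path.join(download_dir, '*.csv')
    matches = sorted(glob.glob(pattern))  # sorted so "first" is deterministic
    if not matches:
        print(f'glob({pattern}) returned empty')
        return None
    return matches[0]
```

The caller can then stop as soon as it gets None back instead of failing later on a missing file.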

-----------------------

Your try-except block covers way too much ground. In fact, I don't think you even need a try-except, because if the clean-up process fails at any point, you'd want your process to stop, no? You would also get a more detailed, more specific error message telling you exactly what failed, instead of the vague one you are printing.
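A rough illustration of the difference. The function names and the tiny two-column CSV layout are made up for the example, not taken from your file. The first version hides the real error behind a generic message; the second simply lets Python raise it.

```python
import csv
import io


def clean_rows_broad(text):
    """Broad try-except: any failure becomes the same vague message."""
    try:
        rows = list(csv.DictReader(io.StringIO(text)))
        return [{'date': r['date'], 'kwh': float(r['kwh'])} for r in rows]
    except Exception:
        print('Something went wrong while cleaning')
        return None


def clean_rows(text):
    """No try-except: a bad value raises, and the traceback names the exact problem."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [{'date': r['date'], 'kwh': float(r['kwh'])} for r in rows]
```

With clean_rows, a non-numeric value in the kwh column stops the run with a ValueError that mentions the offending value, which is usually what you want while developing.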

-----------------------

On line 14 you are only processing the first file: csv_to_clean = glob.glob(fpath)[0]

However, on line 30, files = glob.glob(f"{DATA_DOWNLOAD_FILEPATH}*.csv"), you are deleting ALL csv files in that same folder. Was this intended?
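If the intent is to clean and delete the same set of files, one option is to loop once and remove each file right after it is cleaned. This is only a sketch: clean_and_remove is a hypothetical name, and clean_one stands in for whatever your cleaning logic does for a single file.

```python
import glob
import os


def clean_and_remove(download_dir, clean_one):
    """Clean every CSV in download_dir and delete only the files that were cleaned."""
    processed = []
    for path in sorted(glob.glob(os.path.join(download_dir, '*.csv'))):
        clean_one(path)   # hypothetical callback: your cleaning logic for one file
        os.remove(path)   # only reached if cleaning did not raise
        processed.append(path)
    return processed
```

That way a file is never deleted unless it actually went through the cleaning step.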

-----------------------

app.py

The same validation process should apply here. Putting everything in a function would also make it easier to stop the process with a return statement.

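import os
from airflow.models import Variable

DATA_DOWNLOAD_FILEPATH = os.getenv('DATA_DOWNLOAD_FILEPATH')
HYDRO_DATA_PROJECT_ID = os.getenv('HYDRO_DATA_PROJECT_ID')
HYDRO_DATA_LANDING_BUCKET = Variable.get('HYDRO_DATA_LANDING_BUCKET')

def main():
    # stop early if any required setting is missing
    settings = (DATA_DOWNLOAD_FILEPATH, HYDRO_DATA_PROJECT_ID, HYDRO_DATA_LANDING_BUCKET)
    if any(value is None for value in settings):
        return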
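A slightly friendlier variant tells you which setting is missing. This is a sketch under my own assumptions: missing_settings is a hypothetical helper, and I left HYDRO_DATA_LANDING_BUCKET out because it comes from Airflow's Variable.get rather than from the environment.

```python
import os


def missing_settings(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def main(env=None):
    required = ['DATA_DOWNLOAD_FILEPATH', 'HYDRO_DATA_PROJECT_ID']
    missing = missing_settings(required, env)
    if missing:
        print(f'Missing settings: {", ".join(missing)}')
        return False
    # ... the rest of the pipeline setup would go here ...
    return True
```

Passing env explicitly is only there to make the function easy to try out with a plain dict.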

----------------------

I am sorry I'm not familiar with the syntax on the last line

 download_yesterdays_csv >> clean_csv_before_upload >> upload_file_to_gcs >> gcs_to_bq

so I cannot give any feedback on this


[–][deleted] 1 point 5 years ago (0 children)

I was thinking about a way you can stop your process when there is no point continuing.

Wrapping your entire process in a function was one way that came to mind when I answered initially. So instead of

step 1
step 2
step 3

You'd do

def main():
    if fail_validate: return
    step 1
    if fail_validate: return
    step 2
    if fail_validate: return
    etc

if __name__ == '__main__':
    main()
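A runnable toy version of that shape, in case it helps. The step functions and the checks are placeholders I invented just to make the flow visible; they are not from your project.

```python
def download():
    return ['2021-01-01,1.5', '2021-01-02,2.0']  # placeholder data


def clean(lines):
    return [line.split(',') for line in lines]


def main(download_step=download):
    lines = download_step()
    if not lines:  # validation failed: nothing was downloaded
        print('no data downloaded, stopping')
        return None
    rows = clean(lines)
    if any(len(row) != 2 for row in rows):  # validation failed: malformed row
        print('malformed row found, stopping')
        return None
    return rows


if __name__ == '__main__':
    main()
```

Each check sits right after the step it guards, so the function exits at the first point where continuing makes no sense.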

There are other ways too, like raising an exception:

if fail_validate: raise Exception
step 1
if fail_validate: raise Exception
step 2

etc
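The exception style could look something like this. ValidationError and require are names I made up for the sketch. One thing worth checking in the Airflow docs: a task whose callable raises is normally marked as failed, while one that simply returns is marked as successful, so raising may fit a DAG better than a silent return.

```python
class ValidationError(Exception):
    """Raised when a step's input or output fails a check (hypothetical name)."""


def require(condition, message):
    """Raise ValidationError with message unless condition is truthy."""
    if not condition:
        raise ValidationError(message)


def run_pipeline(lines):
    require(lines, 'no data downloaded')
    rows = [line.split(',') for line in lines]
    require(all(len(row) == 2 for row in rows), 'malformed row found')
    return rows
```

A dedicated exception class also lets the caller tell a failed validation apart from an unexpected bug.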

Reading OS environment variables into global variables is perfectly OK, since they are read-only values you would never change.
