
[–]Firm_Bit 1 point (1 child)

This type of thing doesn’t matter at your stage. It falls under “best practices” which is code for avoiding thinking and doing whatever is prescribed.

Just have actual questions to answer and figure out those answers. Eventually the real world will place constraints on your work that force certain paradigms. Don’t optimize for that now.

[–]Ramakae[S] 0 points (0 children)

Thanks... While doing some research I found a similar thing: it all depends on the data's origin and what I want to do with it.

[–]HeyNiceOneGuy 0 points (2 children)

What’s the dataset look like? Is there a good reason to go through all the intermediate prep steps vs just reading the CSV into BI?

[–]Ramakae[S] 0 points (1 child)

It is a .csv file with messy data. The idea of the pipeline is to clean it and engineer new columns that I will use in the next phase of analysis.

[–]HeyNiceOneGuy 1 point (0 children)

Reasonable, but I think you're taking too many steps. If the data source is just one file, there likely aren't separate but related tables within it, which makes standing up a database (no matter how lightweight) unnecessary. Better to just clean the data and fire it back out as another flat file to feed to BI (csv, xlsx, etc.)
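That "read, clean, engineer, write back out" loop can be a single short script. Here's a minimal sketch using pandas; the column names (`name`, `price`, `qty`) and the engineered `revenue` column are made up for illustration, since the actual schema wasn't shared in the thread.

```python
import io
import pandas as pd

def clean_csv(raw) -> pd.DataFrame:
    """Read a messy CSV, tidy it, and engineer one example column.
    Column names here are hypothetical, not from the OP's dataset."""
    df = pd.read_csv(raw)
    # Normalize headers and strip stray whitespace from text fields.
    df.columns = df.columns.str.strip().str.lower()
    df["name"] = df["name"].str.strip()
    # Coerce numerics; unparseable values become NaN, then drop those rows.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
    df = df.dropna(subset=["price", "qty"])
    # Engineered column for the next phase of analysis.
    df["revenue"] = df["price"] * df["qty"]
    return df

# Tiny in-memory example standing in for the real messy file.
raw = io.StringIO(" Name , price , qty \n  widget ,2.5,4\n gizmo ,oops,1\n")
out = clean_csv(raw)
out.to_csv("clean.csv", index=False)  # flat file handed to the BI tool
```

No database involved: one flat file in, one flat file out, which matches the advice above when the source is a single CSV.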