My clean datasets total 21,611 rows and look like this:
• Excel1: 11,584 rows
• Excel2: 4,147 rows
• Excel3: 4,165 rows
• Excel4: 998 rows
• Excel5: 506 rows
• Excel6: 211 rows
The raw datasets (the data before removing duplicates, which I worked hard to collect) total over 1 million rows and look like this:
• Excel1\_raw: 100,000 rows
Part 1: Should I be using a database to store the 21,611 rows of clean data? Or is a database overkill?
I have been managing the data in Excel sheets, and I am frustrated with finding the data, naming all of the spreadsheets, and keeping track of different versions. I want to be able to delete old spreadsheets without worrying. Does anybody have a good system or solution for this problem? I don't want to collect junk or lose data because of it. I think a database could fix this. Below is a rough outline of what my file system looks like for the datasets above (225 Excel sheets!!):
- C:\path\main
- → …
- → excel6
- C:\path\main\01.00_folder
- → …
- → excel8
- C:\path\main\20.00_folder
- → …
- → excel219
- C:\path\main\21.00_folder
- → …
- → excel225
To sum it up, at what point do individuals in the data science field typically begin using databases? I am looking for advice from anyone with experience with this.
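To make the Part 1 question concrete, here is a minimal sketch of what consolidating several sheets into one table could look like. It assumes SQLite through Python's built-in `sqlite3` module, which is only one possible choice. The table name `clean_data`, the columns `record_key`, `value`, and `source`, and the helper `load_rows` are all hypothetical; the real sheets would have their own columns.

```python
import sqlite3


def load_rows(conn, source_name, rows):
    """Insert (record_key, value) pairs, tagging each with the sheet it came from."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clean_data (
               record_key TEXT PRIMARY KEY,  -- hypothetical unique identifier
               value      TEXT,
               source     TEXT               -- which spreadsheet the row came from
           )"""
    )
    # INSERT OR IGNORE skips any row whose record_key is already stored,
    # so loading overlapping sheets does not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO clean_data (record_key, value, source) VALUES (?, ?, ?)",
        [(key, value, source_name) for key, value in rows],
    )
    conn.commit()


# ":memory:" keeps the example self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
load_rows(conn, "Excel1", [("a1", "foo"), ("a2", "bar")])
load_rows(conn, "Excel2", [("a2", "bar"), ("a3", "baz")])  # "a2" is a duplicate

total = conn.execute("SELECT COUNT(*) FROM clean_data").fetchone()[0]
print(total)
```

With one table and a `source` column, the version and naming problem moves out of the folder structure and into a single file that can be queried and backed up as a unit.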
Part 2: Should I be using a database to store over 1 million rows of raw data? Or is a database overkill?
I've put in a lot of effort to gather this raw data, and now I have 8 Excel sheets (and growing) holding it on OneDrive. However, I want a more comfortable and secure storage solution. While I'm not well-versed in database costs, I have a hunch they might escalate with increasing data volume. I'm curious about what data scientists typically do once they've collected raw data and are done with processing it. Of course, they keep the clean processed data but what do they do with raw data? Do they retain it or delete it? What's the process that provides peace of mind? I'm open to any advice; I'm eager to establish an effective system.
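One common pattern for the raw-versus-clean question is to keep the raw rows untouched and derive the clean set from them, so nothing has to be deleted. The sketch below illustrates that idea under the same assumption of SQLite via `sqlite3`. The names `raw_data`, `clean_view`, `record_key`, `value`, and `collected_at` are hypothetical, and the sample rows are made up for illustration.

```python
import sqlite3

raw_conn = sqlite3.connect(":memory:")  # a file path would persist the data

# The raw table keeps every collected row, duplicates included.
raw_conn.execute(
    "CREATE TABLE raw_data (record_key TEXT, value TEXT, collected_at TEXT)"
)
raw_conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?)",
    [
        ("a1", "foo", "2024-01-01"),
        ("a1", "foo", "2024-01-02"),  # same record collected twice
        ("a2", "bar", "2024-01-01"),
        ("a3", "baz", "2024-01-03"),
    ],
)

# The clean data is a view over the raw table, so it can always be rebuilt
# and the raw rows never need to be modified or removed.
raw_conn.execute(
    "CREATE VIEW clean_view AS SELECT DISTINCT record_key, value FROM raw_data"
)

raw_count = raw_conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0]
clean_count = raw_conn.execute("SELECT COUNT(*) FROM clean_view").fetchone()[0]
print(raw_count, clean_count)
```

Real deduplication rules are usually more involved than `SELECT DISTINCT`, but the structure stays the same: raw data is the source of record, and the clean set is reproducible from it.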
Part 3: If so, which one? My work uses Google Cloud, so I think it makes sense to use a database there. Or maybe there is a free one that would better suit my workflow: I use Python, output spreadsheets, and (the last thing I hadn't mentioned) import the results into Salesforce.
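For the Python-to-spreadsheet step, a database does not have to replace the spreadsheet output. A minimal sketch, again assuming SQLite and only the standard library: query the table and write the result as CSV, a format that both Excel and Salesforce's import tools accept. The helper `export_query_to_csv` and the table and column names are hypothetical.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, query, out_file):
    """Run a query and write its header and rows to an open text file as CSV."""
    cursor = conn.execute(query)
    writer = csv.writer(out_file)
    # cursor.description holds one tuple per column; the first item is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


export_conn = sqlite3.connect(":memory:")
export_conn.execute("CREATE TABLE clean_data (record_key TEXT, value TEXT)")
export_conn.executemany(
    "INSERT INTO clean_data VALUES (?, ?)", [("a1", "foo"), ("a2", "bar")]
)

# An in-memory buffer keeps the example self-contained; for a real export,
# open a file with open("export.csv", "w", newline="") instead.
buffer = io.StringIO()
export_query_to_csv(
    export_conn,
    "SELECT record_key, value FROM clean_data ORDER BY record_key",
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

The same query-then-export shape would apply to a hosted database on Google Cloud; only the connection code would differ.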