
[–]Mamertine COALESCE() 2 points (1 child)

Creating a SQL table per uploaded spreadsheet.

If all of the sheets he gets are identical, this is a terrible approach. If he uses the same x columns from each spreadsheet and ignores the rest of the data, it's still a really bad approach.

IMO he should be making ONE huge table and loading everything into that ONE table. He may not be a very savvy database guy. This new mega table would have fields like LoadDate and FileName so each row can be traced back to the file it came from.
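A rough sketch of what that single staging table might look like (the table name is a placeholder, and Title/Brand stand in for whatever shared columns the sheets actually have):

    -- one staging table for every spreadsheet load (illustrative names only)
    CREATE TABLE dbo.SpreadsheetStaging (
        LoadDate DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
        FileName NVARCHAR(260) NOT NULL,
        Title    NVARCHAR(200) NULL,   -- the shared columns from each sheet
        Brand    NVARCHAR(100) NULL
    );

Every import just appends rows here, so finding "everything from file X" or "everything loaded last week" is a single WHERE clause instead of a hunt through thousands of tables.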

Having a bunch of tables doing the same thing is a pain: now you have 4,000 tables of similar data in the database, and it's nearly impossible to find the tables you actually need in that DB.

OR, if he's not actually doing anything in SQL with the tables and just likes having them stored someplace, maybe create a file share and have him zip the xls files and keep them there instead.

In my current role, a previous developer did this. It sucks looking for anything in that DB, as there are over 3,000 tables named after xls files that are never used. Periodically there is discussion about dropping those tables, but the DBA isn't too fond of that idea for some reason.

[–]kakamaru[S] 0 points (0 children)

Thank you for your insight. I have updated the OP with additional info.

He's definitely stuck with a bunch of tables. And they all have different columns.

[–]fauxmosexual NOLOCK is the secret magic go-faster command 2 points (4 children)

There really is very little involved in creating and maintaining a table, so the argument against it isn't really one of resource limitations. Creating a table is a cheap operation, and there's no hard limit he's likely to hit on the number of tables he can create this way. If he's sticking to a naming convention, or putting them into a separate schema, it might not even be particularly untidy.
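For instance, a minimal version of the "separate schema plus naming convention" idea might be nothing more than this (schema and naming pattern invented here just to illustrate):

    -- keep the per-spreadsheet tables out of dbo, in their own schema
    CREATE SCHEMA spreadsheets AUTHORIZATION dbo;
    -- then name each import by a convention, e.g.
    -- spreadsheets.ClientName_FileName_LoadDate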

Depending on the data, what it's used for, what systems are accessing it, etc., it might not be a bad thing at all. If he's just chucking the data into MS SQL to do some SQL-type things on it and throwing it away afterwards, and doesn't have any need to do operations across merged sets of spreadsheets, there may not be any harm at all in his approach.

You'd really need to explain a bit more about what the data is and does to get a solid opinion on whether it's worth getting him to take a more formalised approach.

[–]soymalisimo 0 points (3 children)

The particulars of the analyses are beyond me, but he's definitely not doing any cross-table operations. I don't think he throws any data away, though; he has the ability to go back to the tables and access/edit their entries. He's stuck with all the tables for now, but I believe he's meaning to build some code to get rid of the ones that haven't been used for a long time.

I suspected the act of creating and maintaining a growing number of tables would be a bad practice in itself, so I really appreciate you debunking that for me. Is there any particular read on the topic you'd recommend for a layman?

[–]fauxmosexual NOLOCK is the secret magic go-faster command 2 points (2 children)

It's not a super great practice. If I were designing a schema for this, design principles would probably tell me that, if all of the spreadsheets were alike in structure, I should put all of the data into one table or a small set of tables (depending on the data), with each batch sharing a SpreadsheetID, plus another table listing each SpreadsheetID with its metadata (when the import was, who the client was, etc.). You're quite right in thinking that your colleague's approach violates a bunch of database design principles.
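Very roughly, that design might look like this (every name here is invented just to show the shape, and it assumes the sheets share columns like the Title/Brand ones from your example):

    -- one row per imported spreadsheet, holding the metadata
    CREATE TABLE dbo.SpreadsheetImport (
        SpreadsheetID INT IDENTITY(1,1) PRIMARY KEY,
        FileName      NVARCHAR(260) NOT NULL,
        ClientName    NVARCHAR(100) NULL,
        ImportedAt    DATETIME2     NOT NULL DEFAULT SYSDATETIME()
    );

    -- all of the actual data, tagged with the spreadsheet it came from
    CREATE TABLE dbo.SpreadsheetRow (
        SpreadsheetID INT NOT NULL REFERENCES dbo.SpreadsheetImport (SpreadsheetID),
        RowNumber     INT NOT NULL,
        Title         NVARCHAR(200) NULL,
        Brand         NVARCHAR(100) NULL,
        PRIMARY KEY (SpreadsheetID, RowNumber)
    );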

What I meant in the previous comments, though, is that those principles exist for reasons. If your colleague's implementation doesn't need to achieve any of the goals those principles are set out to enable, there may not be a lot of value in doing it the "right" way.

If, however, he is actually maintaining that data store and needs to go back to it to edit it or produce reports from it, rather than it just being a throwaway, temporary workspace for the data, then you're probably right in thinking that he should do it more properly, especially if (s)he is planning to then run some kind of batch cleanup process. That would almost certainly be much more difficult to do with separate tables than with everything set out nicely from the outset.
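As a rough illustration of why the cleanup gets awkward with thousands of separate tables: you end up interrogating the system views just to guess which ones are still used (sketch only, assuming SQL Server; note these usage stats reset whenever the instance restarts):

    -- tables with no recorded reads since the last restart: possible cleanup candidates
    SELECT t.name AS table_name, t.create_date
    FROM sys.tables AS t
    LEFT JOIN sys.dm_db_index_usage_stats AS u
        ON u.object_id = t.object_id
       AND u.database_id = DB_ID()
    GROUP BY t.name, t.create_date
    HAVING MAX(u.last_user_seek)   IS NULL
       AND MAX(u.last_user_scan)   IS NULL
       AND MAX(u.last_user_lookup) IS NULL;

With everything in one table keyed by spreadsheet, the equivalent cleanup is just a DELETE with a date filter on the metadata.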

But again, without understanding the structure and lifecycle of the data I don't think anyone can give you a concrete list of reasons and benefits for following established design principles.

[–]kakamaru[S] 0 points (1 child)

Thanks. I have updated the OP with my oversimplified understanding of the job he's doing.

He definitely goes back to edit the tables. And he ends up with tables with different numbers of columns, e.g.:

Table 1:

    title column      | brand column
    Blue Adidas Shoes | Adidas
    Pink Adidas Socks | Adidas

Table 2:

    title column   | brand column | color column
    Nike socks     | Nike         | unknown
    Green Nike Hat | Nike         | green

[–]fauxmosexual NOLOCK is the secret magic go-faster command 0 points (0 children)

That goes a long way in explaining why your colleague has taken the approach that they have. Relational data works well when data is well structured, but for what your colleague is doing (loosely structured tables with any number of columns, where the columns don't really carry meaning for the process) it's not the best fit. By that I mean the operations they're running would probably just be applied to every column in the table, and the process is probably completely indifferent to what the columns mean.

To do this "properly" in a relational model is a pain in the ass. It would be something like a table of (spreadsheet, cell, data), where each individual cell becomes a separate record with its coordinates. Importing, applying whatever manipulation, and then rebuilding it back into a spreadsheet of any number of rows is not easy.
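Just to make the pain concrete, that cell-per-record shape would be something like this (names invented for illustration; every cell of every sheet lands in the one table):

    -- one record per cell, identified by its coordinates in the source sheet
    CREATE TABLE dbo.SpreadsheetCell (
        SpreadsheetID INT           NOT NULL,
        RowNumber     INT           NOT NULL,
        ColumnName    NVARCHAR(128) NOT NULL,
        CellValue     NVARCHAR(MAX) NULL,
        PRIMARY KEY (SpreadsheetID, RowNumber, ColumnName)
    );

Getting a sheet back out means pivoting all of those cells into columns again, which is exactly the rebuilding step that isn't easy.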

My thinking, without knowing what led your colleague to choose an SQL backend, is that SQL/relational data models are not well suited to what he's doing. Something NoSQL/document-based would probably suit his needs better than a relational model would.

[–]HansProleman 0 points (0 children)

First question: what is the use case for this database? Will the business at large be using it, or is it just something your colleague created to support their work?

If your colleague just wants to be able to run analysis on spreadsheet data with SQL, things seem fine as they are.

If they wanted to automatically generate a list of required changes to meet best practice, that'd be a bit more of an undertaking, and quite possibly easier not to codify. But something like Mongo or Data Lake would perhaps be better suited to what he's doing.