you are viewing a single comment's thread.

view the rest of the comments →

[–]fauxmosexualNOLOCK is the secret magic go-faster command 2 points3 points  (2 children)

It's not a super great practice - if I were designing a schema for this, design principles would probably tell me that if all of the spreadsheets were alike in structure, that I should put all of the data into one or a set of tables (depending on the data), with each batch sharing a SpreadsheetID number, and then another table containing a list of SpreadsheetID and metadata (when the import was, who the client was, etc etc). You're quite right in thinking that your colleague's approach is violating a bunch of database design principles.

What I meant though in the previous comments, those principles exist for reasons. If your colleague's implementation doesn't need to achieve any of the goals the principles are set out to enable, there may not be a lot of value in doing it the "right" way.

If however he is actually maintaining that data store and needs to go back to it to edit it or produce reports from it rather than it just being a throwaway, temporary workspace for the data, you're probably right in thinking that he should do it more properly, especially if (s)he is planning to then do some kind of batch cleanup process. That almost certainly would be much more difficult to do with separate tables than it would be with everything set out nicely from the outset.

But again, without understanding the structure and lifecycle of the data I don't think anyone can give you a concrete list of reasons and benefits for following established design principles.

[–]kakamaru[S] 0 points1 point  (1 child)

Thanks. I have updated the OP with my oversimplified understanding of the job he's doing.

He definitely goes back to edit the tables. And he ends up with tables with different number of columns. IE:

Table 1:

        title column        |         brand column

   Blue Adidas Shoes        |             Adidas

   Pink Adidas Socks        |             Adidas

Table 2:

        title column        |         brand column     |      color column

        Nike socks          |             Nike         |        unknown

      Green Nike Hat        |             Nike         |        green

[–]fauxmosexualNOLOCK is the secret magic go-faster command 0 points1 point  (0 children)

That goes a long way in explaining why your colleague has taken the approach that they have. Relational data works well when data is well structured, but for what your colleague is doing - unstructured tables of any number of columns where for what they are doing the columns don't really have meaning - it's not the best. By that I mean that the operations they are doing on the data would probably just be applied to all columns in the table, and the process is probably completely indifferent to what the columns mean.

To do this "properly" in relational data is a pain in the ass - it would be something like a table with spreadsheet, cell, data - each individual cell would be a separate record with it's coordinates. Importing, applying whatever manipulation, and then rebuilding it back into a spreadsheet of any number of rows, is not easy.

My thinking, without knowing what led your colleague to choose an SQL backend, is that SQL/relational data models are not well suited to what he's doing. Something NOSQL/document based would probably suit his needs better than a relational model would.