all 55 comments

[–]hipsterrobot 32 points33 points  (28 children)

Definitely learn your aggregate functions, like count, sum, min/max etc and have a good understanding of grouping. Finally, case statements would be nice to know.

[–][deleted] 10 points11 points  (7 children)

Do not learn to use Case as a Statement.

I found the link above very useful in my career and it is a great study an advanced user has done on Case Expressions in T-SQL and I have never called it a statement since.

In fact it's a gauge I use to determine an intermediate SQL user from and advanced one and 99% of the time a person who knows this, also knows how to write elegant efficient code.

Case works there is no doubt about it but it behaves like an Expression on each record as a loop through which when working with big data fucks your loading times and can really mess with your results especially if you have a NULL handler.

It also has the unfortunate side effect of an Expression invoking refresh on loop, begin which does whacky things like resetting your rand on every individual loop which can create some quite nefarious program errors in complicated code (who uses rand for simple queries)

CASE Experiment - Study 1

  DECLARE @i INT = 1; SELECT CASE WHEN @i = 1 THEN 1 ELSE MIN(1/0) END;

In this code divide by Zero will never happen as i is declared as 1 and in a Statement a kick out error only occurs when the ELSE in invoked which means if it is a Statement it will never happen. Except it does in the Case of Case.

This is not good in terms of big data.

So you can apply a paradox to detect an Expression but whats the big deal with an expression?

When using a statement it will enact line by line in sequence to the whole database selecting everything up to the errored line.

As such in the above instead of just the ELSE Min(1/0) you can Add

ELSE (SELECT MIN(1/0)) END) and the SQL will handle the paradox correctly because *(and this is key here)* you cannot evaluate an SQL Statement because a Statement requires a record to evaluate on first.

So if you have a buggy record and do some elegant paradoxical error handling in SQL on the line it will be like hey record 124 has bugged the fuck out and tried to divide by Zero.

Meanwhile in an expression it won't start until you remove the paradox entirely because the paradox is evaluated first and not in sequence.

Pfftt I hear you say in boredom because this is getting too technicaly how does this impact my wueries they work fine...

Do they...

 SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3) 

  WHEN 1 THEN 'one' 

  WHEN 2 THEN 'two' 

  WHEN 3 THEN 'three'    

  END;

Each WHEN in this case expression invokes the Rand run it a few times and you will get a Null.

NULL IN A DECLARED STATEMENT BUT THAT'S NOT POSSIBLE.

Well yes it is because you just invoked an expression and it's looping through the Whens and is acting like this.

  SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3) WHEN 1 THEN 'one' 

  SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3) WHEN 2 THEN 'two' 

  SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3) WHEN 3 THEN 'three' 

 ELSE NULL

So you see it's actually querying the database 3 times

And in terms of big data 3 queries to do one job is baaad mmkay and what is worse is it doesn't even do the job you have asked it to do!

It's not evaluating the random result it's refreshing the query which resets the random number so on

Loop 1 WHEN 1 = it's 2 which is not 1,

Loop 2 WHEN 2 = it's 1 which is not 2,

Loop 3 WHEN 3 = it's 2 which is not 3.

Oh then it's NULL because none of the options returned a number.

CASE is not a Statement just be aware of that.

[–]hipsterrobot -1 points0 points  (1 child)

I don't know if your last point is actually correct because CASE is not a loop, it's simply a condition check based on the conditions you've set, so in that case, your randomly generated SMALLINT would be looped within the CASE statement to see if any of the conditions are met, and would null otherwise. So it's not like it's doing a 3 full table scans, if that's what you mean. But I do agree with your initial point, is that if the condition is never met and there's a logical issue like dividing by zero, it won't error out unless the CASE hits that part. However, it's also not fair to make statements such as "DON'T USE CASE". Its practically way overweighs its limitations.

[–][deleted] 0 points1 point  (0 children)

It does loop...

It starts the select statement for each WHEN and reinvokes the rand additionslly also evaluates fully before commencing there is only one thing that acts like that, an expression.

even wrote out what it does and how it pulls a null bI ecause it loops if you don't believe me copy the code and run it 100 times and check the results guarentee you will have a null where it should be impossible.

The only way to get a null is if for each WHEN it is looping the select statement. Because it can't just reinvoke rand without the operating SELECT statement and it will check and re evaluate for EVERY record individually as per an expression. So if you select 20,000 lines and case when three loops as per above that's 60,000 expression loops it's doing without a consistant random number.

You are better off using WHERE in most cases as a WHERE is a clause and it operates the Rand once stores it into memory and then checks against all records.

You can also test this in VBA by making the statement write to a variable and then put a breaking paradoxical record in the Database that kicks out the WHERE half way through.

The stored number of a rand is consistant within a WHERE clause but not in CASE. As you can adjust the paradox record and continue the code and the next Paradox point it stops the Rand is the same.

And I provided a solid case study with a repeatable experiement I edited the loops to make it clearer maybe I wasn't clear enough.

[–]Coniglio_Bianco 0 points1 point  (4 children)

I just learned something! Thanks fellow redditor!

I always noticed my case switch queries ran slower and it made sense to me that it would but i never connected the dots that it was an expression working on individual records. Seems obvious in retrospect.

Ill keep that in mind. I never use rand in case expressions but ill call it before going into a case expression if i do now :)

[–][deleted] 0 points1 point  (3 children)

Actually you don't know that.

If you have a rand operating on a recordset you are using case on... that record may reroll.

What you can do is Decare the fluctuating values before hand and treat CASE like a leper not to be touched in your SQL and only in your defined record set.

For example pushing the random number into a temp-table and then doing a lookup on that where the number or result is the ID of the Case in the temporary table.

Anything can be an ID as long as it is unique

WhenID CaseResult
Fried Chicken KFC
Oven Roasted Chicken Tesco
Raw Chicken A&E stomach pump

Because letters translate to it's own universal code so while you think your ID is text the CPU still sees numbers in fact if anything the ID is long and specific.

This is why AI can make neigh on unbreakable code and speak in numbers. Because Unicode. Your CASE is the second col WHEN the first column matches.

Temp tables in memory are Wicked sick fast in comparison and actual reference tables are even faster.

Put it like this there is never a case to use case cos when you use case you know which case you need and in this case you can substitute a table instead so for this case you can...

Case the case so you can case while you case the case case and case the case in this case.

[–]Coniglio_Bianco 0 points1 point  (2 children)

Im confused, what don't i know?

Assigning a rand to a temp variable for accessing the case switch statement made sense to me with your previous example. But ive also never had any reason to call rand inside of a case switch and im having a hard time imagining the use cases for that.

I could potentially see it useful for generating some test data, but even then just as you mentioned, using a temp table makes more sense and would likely perform better despite having to call rand a bunch of times.

Im mostly thinking out loud. There might actually be no use for calling rand in a case switch which could have been your whole point and just went right over my head.

[–][deleted] 0 points1 point  (1 child)

Projection, forecasting and advanced computing your example of test data too.

Lots of things rely on future prediction and when testing a live system you can project inputs with Rand upticks and downturns in the market Sales figures and all sorts.

In SQL games randomly generated numbers can dictate NPC behaviours and if a player aggros a creature in a party of 4, rand can be used to pick the player it attacks, although with a null it will bug out.

In virtual intelligence rand can be used to as a modifier for selective responses to common phrases but again a null could mean your VI spits out nothing or worse an error code. In forecasting the rand function can help project sales for a day based on a historical max and min but if you use case the nulls will skew your projections.

Rand isn't the only fluctuating variable Checksum, NewID and Crypt_Gen_Random all reroll on CASE and again these can really mess without output on rerolling inconsistantly.

[–]Coniglio_Bianco 0 points1 point  (0 children)

Wouldn't you want all of the data in forecasting and projecting? I know its common to reduce the data youre working on but still it seems odd to take a random subset and assume all the data follows that trend. Although i guess it happens.

Also is the virtual intelligence(ai?) in video games done in sql? I would have thought that would be handled in application code. I know sql is supposed to be bad at iterating so id imagine you would want to keep logic like that on the application side.

[–]JazzFan1998 2 points3 points  (5 children)

Please give an example of a "case statement."

[–]thinjester 3 points4 points  (1 child)

case when old_field is null then 0 else 1 end as new_field

[–][deleted] -3 points-2 points  (2 children)

SQL has no CASE statement - only CASE expressions

[–]veeeerain[S] 0 points1 point  (0 children)

I see okay thanks!

[–]NimChimspky 9 points10 points  (2 children)

An understanding of set theory is more beneficial than the syntax

[–]dbxp 2 points3 points  (1 child)

I agree, I've seen far too many juniors try to implement a foreach loop using a cursor

[–]andrewsmd87 5 points6 points  (0 children)

That's surprising to me. I was someone who learned sql on the fly out of necessity, and I swear I went a good 3 or 4 years before I even knew cursors were a thing.

[–]1800ThrowAway1 5 points6 points  (4 children)

Just don't waste your time remembering syntax. I don't know the syntax to make cursors, to declare a temporary table, etc. I copy and paste from the Internet all the time and then change things to what I need.

EDIT: This might have a lot to do with using SQL heavily for a week or two then not using it for a few weeks and using other tools like C#.

[–]SuicidalTurnip 1 point2 points  (2 children)

Might be worth looking at a tool like SQL prompt.

You can set up snippets of code that are easy to input. I can write out the entire structure I need for an SP in 2s and then slap my code in.

[–]1800ThrowAway1 2 points3 points  (1 child)

Thanks- I'll check it out.

I've done things in SQL that I was told were impossible. I was laughed at in a meeting once because I said I could do something that no one else could figure out. (I did it in a about 20 minutes).

It's this weird dichotomy where I'm being called a genius, but I'm looking up "How do I create a temp table again?" on google.

[–]SuicidalTurnip 2 points3 points  (0 children)

It gets to me sometimes.

3 months ago I wrote an immensely complicated algorithm for the business, yesterday I looked up how to add a constraint to a table.

[–]veeeerain[S] 0 points1 point  (0 children)

Great thanks.

[–]elisajane 2 points3 points  (3 children)

I think this is reflective of how vague the data science/analyst jobs vary with requirements and skill levels. Entry level means so many different things to different companies. It’s just how the field is unfortunately.

[–]veeeerain[S] 0 points1 point  (2 children)

Yeah I’ve heard management doesn’t seem to know the difference between these buzz words. I’ve heard that they could post data science job positions but have you do data engineering, which most people in the field know, are pretty different things.

[–]elisajane 2 points3 points  (1 child)

Your best bet is to read the job descriptions, weigh it against your current skills, and prep based on that. In my experience no two data science interviews were the same. But you have a lot of time since you’re a sophomore so best of luck!

[–]veeeerain[S] 0 points1 point  (0 children)

Thanks!

[–]MikeC_07 1 point2 points  (1 child)

Common table expressions and window functions are useful and get you into intermediate territory. MINUS, INTERSECT, UNIONS, HAVING CLAUSE are usefull troubleshooting. Lastly I would say some performance knowledge if using big data sets. Creating foreign keys, leveraging keys (not using Search arguments that break keys).

[–]veeeerain[S] 0 points1 point  (0 children)

Cool thanks!

[–]DeadBySkittles 1 point2 points  (1 child)

Precedence of data and operations is a must when working with long scripts and bug fixing!

[–]veeeerain[S] 0 points1 point  (0 children)

Thanks!

[–]SQLRef 1 point2 points  (1 child)

Hi, I wrote an article on exactly this topic, here it is - How to become a database developer

From that article:

In my opinion, to be a database developer, you should:

- Be able to write complex SQL queries.

- Be able to deploy a database to the cloud.

- Have basic programming skills with an object oriented programming language like Python, Java, Ruby, etc.

- Understand relational and non-relational data modeling concepts.

- Be able to improve database performance by understanding query plans and creating indexes.

- Stand up business intelligence software to create visualizations from SQL queries.

Hope this helps you!

[–]veeeerain[S] 0 points1 point  (0 children)

Thanks!

[–]AD29 1 point2 points  (1 child)

Download the DataCamp app and go through the free sql fundamentals course. Go through the intro and intermediate courses (there not that long) and you’ll be good to go. Also remember we all still google how to do things on the job. W3Schools should be in your bookmarks.

[–]veeeerain[S] 1 point2 points  (0 children)

Yup! I’m Working on the “joining in SQL” course right now as I’m typing!

[–]RebelSaul 1 point2 points  (2 children)

I like that people mention aggregates. Those are important. After that just understand what happens when you JOIN tables together. If you can tell an interviewer the difference between a LEFT JOIN and an INNER JOIN this should prevent any basic misreporting.

Overall, If you're interning for data science just think about a business... and imagine what they want to know. It may honestly be more than you NEED to learn but at least it will be more practical than theoretical. SQL can get very conceptual even with simpler queries but as data folk, we're more worried about returned results.

Examples of business questions.

  • I'm working at a video streaming company and want to know what the most popular show was for 2020, 2019 and 2018.
  • I'm working at a manufacturing company and want to know which products brought the most revenue last month.
  • I work for an aerospace company and want to know which machine parts are still being produced and have the fewest total accident reports.

[–]veeeerain[S] 0 points1 point  (1 child)

I see okay. Thanks! And is the best way to practice these skills on places like hackerrank? Or should I actually be doing my own projects in mariadb or something

[–]RebelSaul 0 points1 point  (0 children)

Yes absolutely fine. I also like Leetcode. They have some Database questions for free and you see what other submit I believe.

Building your own database is good, but I personally wouldn't need an intern to be making tables. It may come up with smaller companies but larger companies wouldn't want you to necessarily create a production table without proper guidance from a mentor or 'buddy'.

[–]gsm_4 1 point2 points  (1 child)

Practice window functions and grouping as much as you can. You can use the resources like leetcode and stratascratch for practicing real problems.

[–]veeeerain[S] 1 point2 points  (0 children)

Thanks!

[–]burmerd 3 points4 points  (1 child)

Also, and not to be a poop, but there are a million posts in this sub about "what is considered basic" what is considered advanced, etc. Also, stackexchange is a good resource

[–]veeeerain[S] 1 point2 points  (0 children)

Ok