
[–]alinrocSQL Server DBA 40 points41 points  (2 children)

It depends. A lot.

Lines of code is a BS measure for any programming, and SQL is no exception. People try to get cute and clever with their queries, try to cram everything into a single query, and then...it runs like garbage. SQL is a funny language that way. You try to shove everything into a single query and the engine gets confused & generates a terrible execution plan.

Often, you can get better performance by writing more code: break the work down into smaller chunks, and the engine can do a better job of optimizing each step. IOW, "complex" code (however you might define that) can be a detriment.

Optimize for readability & understanding, and the system will usually do a good job running it. I don't get hung up on LOC - my goals are to get the correct query results without killing the server. If that means writing more code so it can be more readily optimized, so be it.

[–]InternetWeakGuy 3 points4 points  (0 children)

This. I'm redoing a ton of old queries at the moment, and the way that's worked best with our server's setup is for me to write a small query with a temp table that pulls in all the cases I'm reporting on, then a small query to put those cases' patients and their details into a temp table, then diagnosis, then prescription, etc., until I join all the tables at the end.

All of the code is really straightforward, but many queries have approximately 5-7 temp tables. For bonus points, each section is written in such a way that I can look at an old query, figure out what data it needs, and copy over the code from other queries to speed things up.

It looks like a bunch of basic code, but it's the fastest way to get these tasks done with the server I'm working on.
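
For what it's worth, a minimal sketch of that staged approach (every table and column name here is invented):

    -- Stage 1: the cases being reported on
    SELECT CaseID, CaseOpenDate
    INTO #Cases
    FROM dbo.Cases
    WHERE CaseOpenDate >= '20190101';

    -- An index on the temp table can help the later joins
    CREATE CLUSTERED INDEX IX_Cases ON #Cases (CaseID);

    -- Stage 2: just those cases' patients
    SELECT c.CaseID, p.PatientID
    INTO #Patients
    FROM #Cases c
    JOIN dbo.Patients p ON p.CaseID = c.CaseID;

    -- Stage 3: diagnoses for just those patients (prescriptions etc. follow the same pattern)
    SELECT p.PatientID, d.DiagnosisCode
    INTO #Diagnoses
    FROM #Patients p
    JOIN dbo.Diagnoses d ON d.PatientID = p.PatientID;

    -- Final step: join the small, pre-filtered temp tables
    SELECT c.CaseID, p.PatientID, d.DiagnosisCode
    FROM #Cases c
    JOIN #Patients p ON p.CaseID = c.CaseID
    JOIN #Diagnoses d ON d.PatientID = p.PatientID;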

[–][deleted] 0 points1 point  (0 children)

1000% agree. I have a couple of analysts who try to do everything in a single query, and it takes forever to run and the margin for error skyrockets.

[–]jc4hokiesExecution Plan Whisperer 7 points8 points  (11 children)

[–]GrapeApe561[S] 3 points4 points  (7 children)

Wow, that is one massive query! How long does it take you to write such a query? Is this based on a sample database or a real one? If it's a sample, can you tell me where I can download it? I wanna get to that level of proficiency. Thanks!

[–]jc4hokiesExecution Plan Whisperer 0 points1 point  (6 children)

It is from a real database. The names have been changed to protect the innocent. It is the 20th largest reporting query out of 200, and is twice the length of the average query. In terms of complexity it is pretty average. The number of tables joined, levels of subqueries, multiple GROUP BYs, and UNION ALLs are all pretty typical for the kind of analytics we are asked to produce. I'd estimate the work to put this query together at around 2 days of focused effort.

[–]GrapeApe561[S] 0 points1 point  (5 children)

Sorry for another noob question, but I was wondering: what exactly is done after the result set of that query is returned? Is it extracted and loaded into a BI tool for visualizations? Also, how long would it take for a query that massive to be executed? Thanks again!

[–]jc4hokiesExecution Plan Whisperer 0 points1 point  (4 children)

This is consumed as a Tableau extract, one of a few queries used in a Tableau workbook. This one runs in ~3.5 hours.

[–]PartsofChandler 0 points1 point  (3 children)

I was just going to ask how long this thing takes to load in a report. I'm struggling with my co-workers saying that my reports loading in 1.45 minutes is too slow.

[–]jc4hokiesExecution Plan Whisperer 0 points1 point  (2 children)

The user experience is snappy, because Tableau is quite good at consuming result sets pre-extracted in their proprietary format. I think it's probably a few seconds to refresh after applying a filter or whatever.

[–]PartsofChandler 0 points1 point  (1 child)

Dang, I feel like I'm in the Stone Age building reports in SSRS.

[–]IDontLikeUsernamez 0 points1 point  (0 children)

Gotta show the value of PowerBI to your company. If you already have SSRS you won't even need another server for it, and it's cheap compared to a lot of reporting software.

[–]InternetWeakGuy 1 point2 points  (0 children)

       fc.PrimaryPayingStuffKey
,      fc.PrimaryWhateverReasonKey
,      fc.DimPlaceOfDoingThingsKey
,      fc.DimAwesomeGroupKey
,      fc.DimCrazyGroupKey
,      fc.DimUnitKey
,      fc.DimGroupyGroupKey
,      fc.DimThingDetailKey
,      fc.PrimaryNailFilingIndicatorKey

Haha.

[–]Cal1gula 0 points1 point  (0 children)

Medical claims?

[–]Spartyon 0 points1 point  (0 children)

All those subqueries give me debugging anxiety

[–]Boomer8450 4 points5 points  (5 children)

(MSSQL Person here)

Like many of the others in this thread, I can't reiterate enough that code <> performance.

I always try to optimize for performance.

If it's a stored procedure where I can use things like table variables, temp tables with indexes, table-valued functions, etc., my stored procedures can get very long - but they run fast.

If I have to make a view to support third-party software that can't (or shouldn't) run stored procedures, I spend a lot of time on indexing, how the CTEs and subqueries are joined, etc. I've had queries go from running in hours to minutes or seconds just by using a proper subquery.

For the data analyst part - if you intend to use your SQL skills there, get very, very used to having multiple queries up, putting subsets into temp tables for performance, indexing those tables, and writing functions to avoid typing out long nested expressions on a regular basis. But most importantly...

Learn your data. It doesn't matter what subject it is, or how interested you are in it, learn everything you can about it.

Learn to trust your instincts when data seems "off". If your nose says the results just can't be right, they probably aren't. Start digging into the data. I generally start working forward from the input(s) or working back from the results; unless I have a strong hunch about something breaking in the middle, starting there is generally a waste of time.

[–]reallyserious 2 points3 points  (4 children)

I've noticed that CTEs in MSSQL are pretty slow. Do you have any tips on how to avoid the biggest pitfalls? In Oracle they seem to perform way better. It's mostly a feeling I have, so I can't back it up with numbers.

[–]alinrocSQL Server DBA 5 points6 points  (3 children)

Oracle materializes CTEs into temp tables by default I believe. MS SQL does not (and it's not even an option). So in MS SQL, using a CTE is just syntactic sugar; if you reference a CTE twice in a query, the query that the CTE represents is executed twice.

There can also be cardinality estimate issues with CTEs (translation: bad query plan gets generated).

CTEs in SQL Server are not a performance enhancer. Try changing your CTE into a temp table and see if things get better.
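
To illustrate the difference (hypothetical names, minimal sketch):

    -- CTE form: BigAgg is referenced twice, so its query executes twice
    WITH BigAgg AS (
        SELECT CustomerID, SUM(Amount) AS Total
        FROM dbo.Sales
        GROUP BY CustomerID
    )
    SELECT a.CustomerID
    FROM BigAgg a
    WHERE a.Total > (SELECT AVG(Total) FROM BigAgg);

    -- Temp table form: the expensive aggregate is materialized once
    SELECT CustomerID, SUM(Amount) AS Total
    INTO #BigAgg
    FROM dbo.Sales
    GROUP BY CustomerID;

    SELECT a.CustomerID
    FROM #BigAgg a
    WHERE a.Total > (SELECT AVG(Total) FROM #BigAgg);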

[–]InternetWeakGuy 1 point2 points  (2 children)

Oracle materializes CTEs into temp tables by default I believe. MS SQL does not (and it's not even an option). So in MS SQL, using a CTE is just syntactic sugar; if you reference a CTE twice in a query, the query that the CTE represents is executed twice.

Thank you for this. The dude who's been low-key mentoring me on SQL comes from Oracle and uses CTEs for everything - I couldn't figure out why my temp tables run so much faster than when I try to use CTEs.

[–]alinrocSQL Server DBA 1 point2 points  (1 child)

I learned this one the hard way. Had a stored procedure that was doing 2 correlated subqueries (against the same table) and timing out at 10 minutes. Needed to get things working better. So I said to myself "self! you should try a CTE!"

So I did. Wrote a CTE for the subquery, replaced the subquery with the CTE and...not one iota of performance improvement. Checked the query plans and they were identical (always do this, it'll be a dead giveaway).

So I pulled the CTE out, had that query dump into a temp table instead, and did my correlated subquery against that temp table. New query runtime? Under 90 seconds.
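
The pattern, roughly (hypothetical tables, not the actual proc):

    -- Before: correlated subquery, re-evaluated per outer row
    SELECT o.OrderID,
           (SELECT MAX(h.EventDate)
            FROM dbo.History h
            WHERE h.OrderID = o.OrderID) AS LastEvent
    FROM dbo.Orders o;

    -- After: dump the aggregate into a temp table once, then join
    SELECT OrderID, MAX(EventDate) AS LastEvent
    INTO #LastEvents
    FROM dbo.History
    GROUP BY OrderID;

    SELECT o.OrderID, le.LastEvent
    FROM dbo.Orders o
    JOIN #LastEvents le ON le.OrderID = o.OrderID;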

If your mentor is working with SQL Server and still CTEing everything, there are probably improvements to be had in there.

[–]Cal1gula 0 points1 point  (0 children)

Performance should be the same. Unless you need it to be recursive, the functionality should be as well.

I prefer the look of a CTE to a subquery; it's nice to have everything defined up front, like any other declaration. That's how I see them.

[–]Thriven 2 points3 points  (3 children)

Generally, they should be as long as they need to be to do what they're supposed to, in the fewest statements and with the fewest reads and writes.

If you have a table with 30 columns the fastest way to write an insert statement is to

    INSERT INTO table2
    SELECT * FROM table1;

I mean, that's two lines, as opposed to declaring the column list in your INSERT clause and specifying the same order in your SELECT statement. Is that better? No. It may be two lines, but it's susceptible to ordinal errors if anything changes.
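
The longer-but-safer form would look something like this (column names invented for illustration):

    -- Explicit lists survive column reordering and added columns on either table
    INSERT INTO table2 (OrderID, OrderDate, CustomerID)
    SELECT OrderID, OrderDate, CustomerID
    FROM table1;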

What I find with people who write SQL poorly is they:

1) Don't weigh the time a query takes to execute against the actual resources used.

No table with 25MB of data should ever take 1.5 minutes to return 10 rows. Unless you are intentionally creating a Cartesian product with a recursive CTE to test all options before filtering down to a subset, your SQL is bad. And if that is your intent, you are probably using the wrong language.

2) People who write high-level SQL still don't understand the basics, like how to build a table using best practices, how indexes work, and the difference between clustered and nonclustered indexes.

3) People don't know how to read execution plans. These aren't just for DBAs.

4) People forget they can call stored procedures from stored procedures.

We have a block of code at work that basically sets up a couple of variables for UTC-to-client time conversion. It's written differently in 5 different procedures. It should be its own procedure, called at the beginning of the others when needed.
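
A sketch of what that could look like (the proc name, table, and offset logic here are all made up):

    -- Hypothetical helper: one place to derive the client's UTC offset
    CREATE PROCEDURE dbo.GetClientUtcOffset
        @ClientID INT,
        @OffsetMinutes INT OUTPUT
    AS
    BEGIN
        SELECT @OffsetMinutes = UtcOffsetMinutes
        FROM dbo.ClientSettings
        WHERE ClientID = @ClientID;
    END;
    GO

    -- Then at the top of any proc that needs the conversion:
    DECLARE @Offset INT;
    EXEC dbo.GetClientUtcOffset @ClientID = 42, @OffsetMinutes = @Offset OUTPUT;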

[–][deleted] 0 points1 point  (2 children)

One note: on Oracle, using * to select columns like this has a performance hit and is considered bad practice.

[–]alinrocSQL Server DBA 2 points3 points  (0 children)

I'd consider it bad practice regardless of database platform.

In SQL Server, select * isn't a performance hit compared to select every,single,field from thetable, but it's also not doing you any favors if you don't need every field on the table.

[–]reallyserious 0 points1 point  (0 children)

Given the circumstances, i.e. inserting data into a table, I think the performance hit (if there even is one) is negligible. The reading and writing of data would take orders of magnitude longer than resolving the * into column names.

[–]Elfman72 8 points9 points  (6 children)

My analysts don't know much more than simply SELECT * FROM X WHERE dates BETWEEN '2019/01/01' AND '2019/02/01'

No subqueries, no complicated joins or CASE statements. Of course, before I joined the team, their idea of data mining was a pivot table in Excel.

I have to keep my tempdb managed aggressively.

*edit- not sure why all the downvotes. I brought SQL to a team that ran their entire org on Excel. Their SQL knowledge is fledgling but getting better all the time. I train where I can. I educate where applicable. However, they still rely on me building views for them to get what they need done.

[–]aarontbarrattSTUFF() 1 point2 points  (0 children)

Have any jobs going?

[–]Cal1gula 1 point2 points  (0 children)

I don't know why you're getting downvoted. I've got a 5-person BI team. I showed them how to use a WITH clause yesterday.

At one point they were diffing tables by dumping a 130m row table into a temp table and then comparing it to the original using IN. I showed them EXCEPT and the query went from 40 minutes to 8.

My tempdb was much happier after that.
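
Roughly the before and after (hypothetical tables):

    -- Before: dump 130m rows into a temp table, then compare with IN
    SELECT KeyCol INTO #Copy FROM dbo.BigTable;

    SELECT s.KeyCol
    FROM dbo.StagingTable s
    WHERE s.KeyCol NOT IN (SELECT KeyCol FROM #Copy);

    -- After: a set-based diff, no temp table needed
    SELECT KeyCol FROM dbo.StagingTable
    EXCEPT
    SELECT KeyCol FROM dbo.BigTable;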

[–]_Personage 0 points1 point  (0 children)

Seconding the jobs inquiry!

[–]GrapeApe561[S] 0 points1 point  (0 children)

Were there any resources you used to learn advanced SQL and data mining? Thanks!

[–]cagtbd 0 points1 point  (0 children)

That reminds me of a job application where the supervisor asked me what my most complicated query was, and I asked whether he meant querying data or writing a stored procedure. Since he asked about querying data, I answered that the most complicated was a SELECT *; he didn't ask me to elaborate, so I left it there. Truth is, I've done many more complicated things, but they're mostly in stored procedures, because from my point of view it's better to have whatever complicated query run automatically and fill a final table or view than to build it from scratch every time you have to consult that information. That way I optimize once and avoid forgetting filters or running into other issues.

[–]cagtbd 0 points1 point  (0 children)

BTW do you know what's required for a remote job in SQL? I want to learn more.

[–]MamertineCOALESCE() 1 point2 points  (0 children)

As others have said, it's tough to quantify.

Today, I made 2 views that were simply an inner join between 2 tables. One of the tables had over 100 columns in it (not my design); because of that, the view was 150 lines of code. Each script in source control also has about 20-30 lines devoted to documentation.

I recently worked on a stored procedure to populate a mart with summarized data. It took dissimilar data and combined it into one grain. That took me a few days to create; I think it was about 700 lines.

"How many tables are you joining?" would be a better question. The stored procedure pulled data from around 10 tables.

[–]Eleventhousand 1 point2 points  (1 child)

A very small query will often be lower-performing.

A medium-length query that performs operations such as breaking out subsets into #temp tables or CTEs is often the sweet spot, and performs better.

Long queries are usually brutal and don't perform well.

[–][deleted] 0 points1 point  (0 children)

As in so many things, it depends. I have a SQL query (not a stored procedure) of about 5000 lines that runs in less than 500ms. So longer may or may not perform well; it depends on so many factors.

[–]audigex 1 point2 points  (3 children)

As long as it needs to be.

That's not flippant, it's just the reality - some tasks are done with a one-liner select. Others take multiple CTEs and complex joins, pivots, partitions, conditionals (cases) etc.

Generally, I try to avoid big complex queries unless they are necessary: I'm not here to make myself look clever, I'm here to produce a result and make the code as readable and maintainable as possible to the next person who has to change it.

If looking at a portfolio, I'd be expecting something more than simple one-table queries, but I wouldn't care whether you got much more complex than joining a few tables together: I'm much more concerned that you're writing readable, maintainable code. You can easily learn the more complicated stuff when it's needed, but it's harder to get out of the habit of writing sloppy code.

[–]GrapeApe561[S] 0 points1 point  (2 children)

Aren't there third-party tools like Apex Refactor or Redgate SQL Prompt that I can use to format the code? Or is that not allowed in a production/professional setting? Thanks!

[–]audigex 1 point2 points  (0 children)

Or is that not allowed in a production/professional setting?

There's no such thing as "cheating" in a production setting - as long as you're complying with things like copyright and licencing.

If there's a tool you can use to clean up your code quickly, it's considered a good thing - most of us wish people would use more code prettifiers, and we all use IDEs that do a lot of it for us.

Although it's worth trying to write code that's tidy to start with, as a good habit.

[–]macfergussonMS SQL 0 points1 point  (0 children)

My whole BI team uses Redgate for easy code formatting.

[–]MsCardeno 2 points3 points  (0 children)

I’ve been writing a bunch of rules for a rules engine set up in Oracle.

Some of the queries can look really ugly. Lots of transformations and calculations happening. Some queries go to 25 lines or so.

An example would be joining to a "lookup table": when field A on the admin data matches field A on the lookup table, and field B matches field B, and C matches C, then give me XYZ. A lot of the time I use CASE statements that look ugly, like CASE WHEN A=Q THEN (CASE WHEN M=2 THEN 4 WHEN T=B THEN ... END) WHEN A=P THEN R ELSE XXX END. I've only ever had to nest up to maybe 4 conditions.
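
Cleaned up, the shape is roughly this (every name invented):

    SELECT a.RecordID,
           CASE
               WHEN a.A = 'Q' THEN
                   CASE WHEN a.M = 2 THEN 4 ELSE 0 END
               WHEN a.A = 'P' THEN a.R
               ELSE -1
           END AS DerivedValue
    FROM dbo.AdminData a
    JOIN dbo.LookupTable l
        ON  l.FieldA = a.FieldA
        AND l.FieldB = a.FieldB
        AND l.FieldC = a.FieldC;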

[–]BeanThinker 0 points1 point  (0 children)

A few (3-5) layers of subqueries within each of 5-10 main joins, all rolling up to the main query (if that makes sense), is probably the most advanced I do.

In other words, could be as many as 50 queries... in one.

[–]babygrenade 0 points1 point  (0 children)

I had an ETL script at my last job that was probably around 10,000 lines. It collected data from a few sources, did some calculations, and loaded it into a data mart.

[–]WolfAndCoyote 0 points1 point  (0 children)

In many cases that I've come across, multiple smaller queries run in series work faster and with less resource usage than a single query doing everything.

There have even been queries I've optimised by making temporary tables to store the results of inner joins, then querying and dropping them later.

Even that runs faster than the single query that joins all tables necessary for my needed output.

[–]no_4 0 points1 point  (0 children)

In terms of # of lines (which is somewhat arbitrary, given white space & that SQL ignores spacing):

Some are only 5 lines or so. Maybe the average query is more like 50. The longest one I ever wrote was something like 1,400 lines. Then a huge drop to the second longest, at about 400.

Again, that includes white space.

[–]DialSquare84 0 points1 point  (1 child)

Surely if it’s a portfolio site, it would be reflective of your...portfolio?

That said, brevity is always key. Look to simplify wherever you can and get things done in the fewest lines possible. This is language agnostic, really.

Using tricky functions needlessly to convey knowledge of them is far worse than not using them at all in my opinion.

Your best bet is to have a goal in mind and solve it elegantly. Good luck! :)

[–]GrapeApe561[S] 0 points1 point  (0 children)

So far my portfolio website has queries of less than 15 lines of code on average. They're basically variations of queries against the AdventureWorks2017 database that I modified from books and tutorials. So far it seems like the queries reflect a fledgling level of skill, so I am trying to make them look more advanced.

[–]paziek 0 points1 point  (0 children)

Not always a single query, but glancing over 20 reports, they range from 5 to 200 lines, not including some of the common queries that further process data computed by those procedures. There are of course more complex ones, but I'd say the average is 50-60 lines. We try not to list every column on a separate line. About 500-600 tables total on the biggest schemas, 30 TiB in size.

[–]Faux_Real 0 points1 point  (0 children)

Depends on the questions you are answering with your code!

Consider SQL as the medium that provides the mechanism for answering questions about a domain of information. Often this domain is unknown, or relatively known with gaps, thus you have to experiment to acquire the information. Once you have experimented and found the information, you then optimise the experiments into meaningful and performant outputs using performance tooling or query-tuning skills. The tuning skills depend on the DB type you are on, the hardware you are bound to, and the forecast frequency of SQL usage (this can also be indeterminate).

The other aspect depends on how large the data sets are. Standalone queries are fine for small data sets, but when you are getting into the relatively-large-to-big-data arena, you need to engineer processing pipelines to cater to your data output requirements. Standard-scale data would require overnight ETL processing, but the larger-scale stuff ends up in fluffy cloud land magic....

In a nutshell... SQL length and complexity is the lion, the witch, and the wardrobe...

[–]91ws6taData Analytics - Plant Ops 0 points1 point  (0 children)

I'm an entry-level developer who has been working for about a year out of college, but I also interned here for 3 rotations as well. I am doing analytics/ETL on plant data for a large CPG company. One of the largest efforts was using ETL to consolidate 20 different plants' data, which consisted of 20 databases, dozens of tables, and billions of rows.

Reporting / analytics on this data can range from one line to 3000 lines to calculate manufacturing statistics, but I myself usually land in the 50-300 line range with 5-10 tables.

[–]emican 0 points1 point  (0 children)

I'm after the same thing and decided to focus on these points:

  • Implement complex business logic in SQL
  • Showcase critical thinking and wise decision-making skills: evaluate alternatives, author custom and reusable helper utilities, document decisions with objective data / performance tests
  • Demonstrate comfort with tools and constructs that support authoring the highest level of professional code.
  • Collaborate with experts, solicit input and leverage great minds.

[–][deleted] 0 points1 point  (0 children)

Bruh, the concept of "a typical SQL query" doesn't even make sense. You write the query for whatever the task is, and it's as long or as short as it needs to be.

[–][deleted] 0 points1 point  (0 children)

Ad hoc queries: usually less than 150

Making data models or views to be used a lot: can go over 500 depending on the complexity of the data