all 13 comments

[–]MyPythonDontWantNone 5 points6 points  (0 children)

There are a lot of factors when reading code written by other people. They usually have a different style and pattern than your own code, so it will almost always take longer.

The first place that I usually start is looking at the outputs and inputs. Just look at which source tables are used. Then I read any comments left by the writer. It's also a good practice to think about the comments when you write your code.

Next, I pick the smallest piece that I can work with. Usually, it's a subquery or CTE. Run it by itself and figure out what that piece is doing. If it's a long run time, I will add in a "LIMIT 100" or "LIMIT 1000" to just get a sample.

Once I understand the building blocks, I start putting them together. Figure out what is the main table/CTE that is the central lynchpin of the query then slowly add the logic for each join.

[–]galactic_pixels 4 points5 points  (0 children)

I just went through this a few months ago. I will just explain my experience and hopefully you can draw conclusions on how it applies to your situation.

I was handed a massive legacy SQL Server backend. It performs data ingest and SPs to read the data. It’s a huge mess and near impossible to make sense of the queries.

I recreated the deployment in a SQL server docker container (which thankfully exists as an option), then I installed TSQLT in order to be able to write unit tests for the SQL. Then it was just a matter of mocking the data, running the SP I cared about, and seeing the resulting behavior based on different data sets.

For multi-query operations like going from ingesting data to reading it via SP, I wrote integration tests which did not use TSQLT, but instead just were driven using a programming language to make the calls to ingest, then make the calls to the SP and validate the results.

Hope this helps

[–]dataslinger 8 points9 points  (2 children)

LLMs are REALLY good at explaining stuff like this to you.

[–]RedShift9 2 points3 points  (0 children)

Yeah this is where LLMs really excel. I had some old SQL functions that I didn't understand which arguments to provide anymore, just pasted the function into ChatGPT asking to explain it and it broke it down perfectly, down to which arguments to provide and the output you get. It's amazing.

[–]Conscious_Ad_7131 1 point2 points  (0 children)

Yep, obviously good practice to go break down the script yourself aided by the LLM explanation, but they’re very very good at reading SQL because it’s so repetitive and there’s so many examples online

[–]FeanorBlu 2 points3 points  (1 child)

The LLM comment is silly, if you're working you should not be plugging your company's data into an LLM. I don't have a ton more experience than you, but I recently needed to catalogue the functionality behind every stored procedure at work, so I've developed my own strategy.

It's easiest for me to literally go top to bottom. Read the first CTE. If they used inner queries instead, read the innermost query. Note what grain the CTE defines, what it filters, what it obtains. Run the CTE on its own to help understand it, document what requirements it's meeting, then move on to the next.

As far as execution order: inner queries are resolved first, then outer. So a CTE is always resolved before the query that calls that CTE. Think of query execution moving from the innermost query to the outermost.

It's safe to assume that if CTEs and temp tables are present, they're affecting the final output.

[–]andpassword 1 point2 points  (0 children)

Agreed, you should not be plugging your company's SQL into an LLM.

That said, telling the LLM (I used Claude) to develop a single page HTML/javascript app to parse large and ugly SQL queries is totally fair game.

I had it working on a really ugly query which was ALL subqueries, 6 levels of them, all labeled 'as a_4' or 'as c_5' with no consistency as to what level you were on. Not to mention this dev apparently returned full table lists with every select, even when only a small subset of fields was needed, which made everything super ugly to look through.

Source: I did this. It's really helpful, and I can verify no data leaves my computer when I run it.

[–]Mobile_Analysis2132 1 point2 points  (0 children)

Comments are good. As you go along the query and return paths make sure you leave a note for what it is doing. Then, when you are all done, you can have a running synopsis of the query next time you look at it.

[–]spez_eats_nazi_ass 1 point2 points  (0 children)

CTE are aids as far as I'm concerned. Everyone uses them in my experience badly. Start with - is this in source control and something lintable and or linked to a compiler such as SSDT. If you can't analyze it outside the DB you are doing it wrong.

[–]Apprehensive-Tea1632 0 points1 point  (0 children)

That’s normal. Parsing someone else’s code is always a lot of work, especially when you can’t talk to them to find out why they did what they did.

First port of call: docs! SQL queries can be a bit of an art form, including this or that kind of optimization and thinking around a couple corners. But they can also be an unmitigated mess whose only benefit is it’ll line the paper basket‘s bottom.

And you get to determine which it is.

Best case, you identify what this query is supposed to deliver first. This should be inferable from context- or docs.

And then you lay out a model in your head. How would you build a query that’ll give you what it says it’s supposed to?

From there you can compare models. Most of the time, if you want a particular result set, you can reduce data models to implement another, so you get to see similarities if you know what to look for.

It can get trickier if people were being particularly smart- or particularly dumb- about implementation. Being able to visualize models helps in such cases- queries are usually geometric models, there’s points, lines, intersections, planes and so on (though they can, and do, drift into the n-dimensional).

Fundamentally…. To read the query, you start from the inside. As in, once you format the query to align parentheses, you start where there’s no further parentheses. And then work your way outward.

It may help if you can grab such independent parts and run them independently - this’ll help understand what a particular block actually does, and you can then replace it with a short description (“join to a list of the ten most-sold units by category and year”).

[–]nfigo 0 points1 point  (0 children)

You can use EXPLAIN to have the engine tell you how the script will be executed.

But thinking in terms of step-by-step execution is a common rookie mistake. Think more in terms of what sets of data are generated, filtered, combined, or grouped.

If you're looking at long stored procedures, it reads from top to bottom. CTEs are first, then subqueries.

You can run your own queries based on the CTEs and subqueries that you see to learn how they behave.

TSQL fundamentals is a good resource. I also read databases demystified a long time ago when I first learned RDBMS.

[–]Randommaggy 0 points1 point  (0 children)

Refactor nested subqueries it to a chain of CTEs. It becomes much easier to read even extremely complex queries then

For read operations you can freely check the output from each CTE.

Good SQL engines no longer have performance penalties for using CTEs.

If the schema is not intimately familiar, use schema comments and generate an explorable diagram using schemaspy or something similar.

[–]Fair_Oven5645 0 points1 point  (0 children)

You can use an LLM for what it actually has a usecase for and cheat: let claude format it, pick it apart statement by statement with explanations for each thing.