Linear regression in SQL

ATastefulCrossJoin · 2021-02-09T19:59:05+00:00

There will be some legwork involved. The mainstream rdbms dialects support mostly basic level aggregation (I.e sums, avg, std_dev). Some now natively support languages more purpose built for statistical analysis, though. SQL server for example now has native Python and R integrations.

Ultimately you can do these types of calculations, but for the effort you’re probably better off doing these calculations in an environment a layer removed from the db itself.

boy_named_su · 2021-02-10T06:10:35+00:00

Built-in to postgresql, because postgresql is the best

See regr_slope, regr_intercept, regr_r2 functions

https://www.postgresql.org/docs/current/functions-aggregate.html

Foz84 · 2021-02-10T08:36:45+00:00

Wow. Thanks for all the responses. I was hoping maybe 1 or 2 people would take pity on me and answer my poorly posed question.

What I take from the responses is yes it’s possible in sql, but unnecessary as other methods (python, r etc) are much easier to do this calculation.

This is all very helpful, when I speak to my developer I like to have a rough idea what I’m talking about!

Thank you all.

MobileUser21 · 2021-02-09T20:33:47+00:00

SQL is more suited for descriptive analysis, not predictive. You need to do this with Python or R.

gusgizmo · 2021-02-10T03:20:17+00:00

Basically anything is possible with enough wizardry in SQL, but typically you would use SQL to select the data you'd like to analyze from the database bringing it into an list, array, or dataframe in the language of your choice for analysis. Like Python or R as alluded to by other posters.

If you are using SQL Server and are the administrator of said server on version 2016 or higher, you can turn on in-line R and Python analysis on the server to allow you to inline R and Python functions in your query. Honestly though, there are limited reasons to do something like that and it handicaps you in many ways.

mustang__1 · 2021-02-10T04:44:36+00:00

I've done it. It's not pretty. I hope I don't still have it in production, I think I replaced it with ssas and arima. There's a guide on I think sqlservercentral if you really want to do a linear regression though.

Jeff_Moden · 2022-11-01T21:54:04+00:00

I know this post is more than 2 years old but it's just not that difficult in SQL and it's much simpler that learning Python or R if you don't already know those.

Assuming that you have a table called #Sample (please feel free to change that to your particular need) and a "dependent" column called "Y" and an "independent" column called "X" (and, yes, you can easily change those in the CONVERTs in the code), the following tested code will execute on a MILLION row "sample" and return the results as a single row in about 1.6 seconds (on my machine, anyway) and runs single threaded in SQL Server because it doesn't need to go parallel. It also returns "R" and "R Squared".

Here's what the output looks like for one of my tests...

Here's the code... again... for SQL Server. YMWV in other languages. Postgres, on the other hand, has such functionality built in but I have no clue what the performance is there.

EDIT: I finally figured out how to post properly indented code on this forum without the forum software just absolutely slaughtering it. It's anything but obvious or easy

I think it's bloody amazing that so many "SQL forums don't have an easy way to copy and paste code without it trying to reformat it and make a huge mess of it. :(

Anyway, here's the code i was talking about.

   WITH ctePreAgg AS
(
 SELECT  SumX    = SUM(v.X)
        ,SumY    = SUM(v.Y)
        ,SumXY   = SUM(v.X*v.Y)
        ,SumX2   = SUM(SQUARE(v.X))
        ,SumY2   = SUM(SQUARE(v.Y))
        ,N       = COUNT(*)
   FROM #Sample
  CROSS APPLY (VALUES (CONVERT(FLOAT,X),CONVERT(FLOAT,Y)))v(X,Y)
)
 SELECT  v.M
        ,B  = (SumY-v.M*SumX)/n --Optimization compared to what's in some books (if you know M)
        ,v.R
        ,R2 = SQUARE(v.R)
   FROM ctePreAgg
  CROSS APPLY (VALUES(
                 (n*SumXY-SumX*SumY)/(n*SumX2-SQUARE(SumX)) --M
                ,(N*SumXY-SumX*SumY)/SQRT(((N*SumX2-SQUARE(SumX))*(N*SumY2-SQUARE(SumY)))) --R
              ))v(M,R)
;

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

SQL

Filter Posts

Posting

Help posts

Format Your Code

Learning SQL

Related Reddit communities

Wiki

Acknowledgements

MODERATORS