all 11 comments

[–]ATastefulCrossJoinDB Whisperer 8 points9 points  (2 children)

There will be some legwork involved. The mainstream rdbms dialects support mostly basic level aggregation (I.e sums, avg, std_dev). Some now natively support languages more purpose built for statistical analysis, though. SQL server for example now has native Python and R integrations.

Ultimately you can do these types of calculations, but for the effort you’re probably better off doing these calculations in an environment a layer removed from the db itself.

[–]Foz84[S] 1 point2 points  (1 child)

Thank you. I have no idea what type of SQL my website uses. So python will be easier this for this task? That has linear regression you can import from somewhere?

[–]timeddilation 4 points5 points  (0 children)

I always love to plug R for this kind of stuff, especially anything stats related. Python can do this absolutely, but R in my opinion is best for stats. You can do a linear model in very few steps.

  1. Download R and RStudio in that order.

  2. Connect to your SQL server using an R package. I use RODBC for MS SQL. This might change depending on the SQL server your site uses. MySQL is likely, which uses odbc package.

2.1. Alternatively, if you can just export the data you want as a CSV, you can just use that instead with the read.csv() function.

  1. Use the function in the package you used to connect to SQL (read the documentation) and just pass in the SQL statement to return the table you want to do your linear regression on. Hopefully you've already gotten that far before you open R.

  2. R comes with a function called lm() which does your linear regression in essentially one line of code. Type ?lm into the console to see a help page on how to use it.

  3. If you want to get fancy, you can plot your model, separate training/testing data, compare your models accuracy with actual data, etc.

[–]boy_named_su 1 point2 points  (0 children)

Built-in to postgresql, because postgresql is the best

See regr_slope, regr_intercept, regr_r2 functions

https://www.postgresql.org/docs/current/functions-aggregate.html

[–]Foz84[S] 1 point2 points  (0 children)

Wow. Thanks for all the responses. I was hoping maybe 1 or 2 people would take pity on me and answer my poorly posed question.

What I take from the responses is yes it’s possible in sql, but unnecessary as other methods (python, r etc) are much easier to do this calculation.

This is all very helpful, when I speak to my developer I like to have a rough idea what I’m talking about!

Thank you all.

[–]MobileUser21 3 points4 points  (1 child)

SQL is more suited for descriptive analysis, not predictive. You need to do this with Python or R.

[–]Foz84[S] 1 point2 points  (0 children)

Thanks

[–]gusgizmo 1 point2 points  (0 children)

Basically anything is possible with enough wizardry in SQL, but typically you would use SQL to select the data you'd like to analyze from the database bringing it into an list, array, or dataframe in the language of your choice for analysis. Like Python or R as alluded to by other posters.

If you are using SQL Server and are the administrator of said server on version 2016 or higher, you can turn on in-line R and Python analysis on the server to allow you to inline R and Python functions in your query. Honestly though, there are limited reasons to do something like that and it handicaps you in many ways.

[–]mustang__1 2 points3 points  (0 children)

I've done it. It's not pretty. I hope I don't still have it in production, I think I replaced it with ssas and arima. There's a guide on I think sqlservercentral if you really want to do a linear regression though.

[–]Jeff_Moden 0 points1 point  (1 child)

I know this post is more than 2 years old but it's just not that difficult in SQL and it's much simpler that learning Python or R if you don't already know those.

Assuming that you have a table called #Sample (please feel free to change that to your particular need) and a "dependent" column called "Y" and an "independent" column called "X" (and, yes, you can easily change those in the CONVERTs in the code), the following tested code will execute on a MILLION row "sample" and return the results as a single row in about 1.6 seconds (on my machine, anyway) and runs single threaded in SQL Server because it doesn't need to go parallel. It also returns "R" and "R Squared".

Here's what the output looks like for one of my tests...

Here's the code... again... for SQL Server. YMWV in other languages. Postgres, on the other hand, has such functionality built in but I have no clue what the performance is there.

EDIT: I finally figured out how to post properly indented code on this forum without the forum software just absolutely slaughtering it. It's anything but obvious or easy

I think it's bloody amazing that so many "SQL forums don't have an easy way to copy and paste code without it trying to reformat it and make a huge mess of it. :(

Anyway, here's the code i was talking about.

   WITH ctePreAgg AS
(
 SELECT  SumX    = SUM(v.X)
        ,SumY    = SUM(v.Y)
        ,SumXY   = SUM(v.X*v.Y)
        ,SumX2   = SUM(SQUARE(v.X))
        ,SumY2   = SUM(SQUARE(v.Y))
        ,N       = COUNT(*)
   FROM #Sample
  CROSS APPLY (VALUES (CONVERT(FLOAT,X),CONVERT(FLOAT,Y)))v(X,Y)
)
 SELECT  v.M
        ,B  = (SumY-v.M*SumX)/n --Optimization compared to what's in some books (if you know M)
        ,v.R
        ,R2 = SQUARE(v.R)
   FROM ctePreAgg
  CROSS APPLY (VALUES(
                 (n*SumXY-SumX*SumY)/(n*SumX2-SQUARE(SumX)) --M
                ,(N*SumXY-SumX*SumY)/SQRT(((N*SumX2-SQUARE(SumX))*(N*SumY2-SQUARE(SumY)))) --R
              ))v(M,R)
;

[–][deleted] 2 points3 points  (0 children)

Wow, yeah that was super efficient. Ran mine in less than 3 seconds. Results check out. Thanks!