Is there a linear regression function in SQL Server?

8

32

Is there a linear regression function in SQL Server 2005/2008, similar to the linear regression functions in Oracle?

Spearman answered 29/3, 2010 at 9:31 Comment(0)
47

To the best of my knowledge, there is none. Writing one is pretty straightforward, though. The following gives you the constant alpha and slope beta for y = Alpha + Beta * x + epsilon:

-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(       SELECT 1,  1,  1    UNION SELECT 1,  2,  2    UNION SELECT 1,  3,  1.3  
  UNION SELECT 1,  4,  3.75 UNION SELECT 1,  5,  2.25 UNION SELECT 2, 95, 85    
  UNION SELECT 2, 85, 95    UNION SELECT 2, 80, 70    UNION SELECT 2, 70, 65    
  UNION SELECT 2, 60, 70    UNION SELECT 3,  1,  2    UNION SELECT 3,  1, 3
  UNION SELECT 4,  1,  2    UNION SELECT 4,  2,  2),
 -- linear regression query
/*WITH*/ mean_estimates AS
(   SELECT GroupID
          ,AVG(x * 1.)                                             AS xmean
          ,AVG(y * 1.)                                             AS ymean
    FROM some_table
    GROUP BY GroupID
),
stdev_estimates AS
(   SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE      SUM(SQUARE(x - xmean)) WHEN 0 THEN 1 
           ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,     SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))     AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS                   -- increases numerical stability
(   SELECT pd.GroupID
          ,(x - xmean) / xstdev                                    AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(   SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END         AS betastd
    FROM standardized_data pd
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev                   AS Alpha
      ,betastd * ystdev / xstdev                                   AS Beta
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates  pm ON pm.GroupID = pb.GroupID

Here GroupID is used to show how to group by some value in your source data table. If you just want the statistics across all data in the table (not specific sub-groups), you can drop it and the joins. I have used WITH clauses (common table expressions) for the sake of clarity. As an alternative, you can use sub-queries instead. Please be mindful of the precision of the data type used in your tables, as the numerical stability can deteriorate quickly if the precision is not high enough relative to your data.
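For illustration, here is a minimal sketch of the ungrouped case described above. It assumes a hypothetical table dbo.xy_data with numeric columns x and y (the table name is not from the original answer), and it only centers the data rather than standardizing it, so it is shorter but possibly less robust numerically than the full query:

```sql
-- Hypothetical source table: dbo.xy_data(x, y). The name is an assumption.
WITH mean_estimates AS
(   SELECT AVG(CAST(x AS FLOAT)) AS xmean   -- CAST avoids integer averaging
          ,AVG(CAST(y AS FLOAT)) AS ymean
    FROM dbo.xy_data
),
centered_sums AS
(   SELECT MAX(pm.xmean)                              AS xmean
          ,MAX(pm.ymean)                              AS ymean
          ,SUM((pd.x - pm.xmean) * (pd.y - pm.ymean)) AS sxy
          ,SUM(SQUARE(pd.x - pm.xmean))               AS sxx
    FROM dbo.xy_data pd
    CROSS JOIN mean_estimates pm   -- single row of means, so no GroupID join
)
SELECT ymean - xmean * sxy / NULLIF(sxx, 0) AS Alpha   -- NULL if x has no variance
      ,sxy / NULLIF(sxx, 0)                 AS Beta
FROM centered_sums;
```

Unlike the grouped query, this sketch returns NULL (rather than a slope of 0) when all x values are identical; wrap the expressions in COALESCE if 0 is preferred.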

EDIT: (in answer to Peter's question for additional statistics like R2 in the comments)

You can easily calculate additional statistics using the same technique. Here is a version with R2, correlation, and sample covariance:

-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(       SELECT 1,  1,  1    UNION SELECT 1,  2,  2    UNION SELECT 1,  3,  1.3  
  UNION SELECT 1,  4,  3.75 UNION SELECT 1,  5,  2.25 UNION SELECT 2, 95, 85    
  UNION SELECT 2, 85, 95    UNION SELECT 2, 80, 70    UNION SELECT 2, 70, 65    
  UNION SELECT 2, 60, 70    UNION SELECT 3,  1,  2    UNION SELECT 3,  1, 3
  UNION SELECT 4,  1,  2    UNION SELECT 4,  2,  2),
 -- linear regression query
/*WITH*/ mean_estimates AS
(   SELECT GroupID
          ,AVG(x * 1.)                                             AS xmean
          ,AVG(y * 1.)                                             AS ymean
    FROM some_table pd
    GROUP BY GroupID
),
stdev_estimates AS
(   SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE      SUM(SQUARE(x - xmean)) WHEN 0 THEN 1 
           ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,     SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))     AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS                   -- increases numerical stability
(   SELECT pd.GroupID
          ,(x - xmean) / xstdev                                    AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(   SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END         AS betastd
    FROM standardized_data
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev                   AS Alpha
      ,betastd * ystdev / xstdev                                   AS Beta
      ,CASE ystdev WHEN 0 THEN 1 ELSE betastd * betastd END        AS R2
      ,betastd                                                     AS Correl
      ,betastd * xstdev * ystdev                                   AS Covar
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates  pm ON pm.GroupID = pb.GroupID

EDIT 2 improves numerical stability by standardizing data (instead of only centering) and by replacing STDEV because of numerical stability issues. To me, the current implementation seems to be the best trade-off between stability and complexity. I could improve stability by replacing my standard deviation with a numerically stable online algorithm, but this would complicate the implementation substantially (and slow it down). Similarly, implementations using e.g. Kahan(-Babuška-Neumaier) compensations for the SUM and AVG seem to perform modestly better in limited tests, but make the query much more complex. And as long as I do not know how T-SQL implements SUM and AVG (e.g. it might already be using pairwise summation), I cannot guarantee that such modifications always improve accuracy.

Lacroix answered 29/3, 2010 at 9:58 Comment(17)
Thanks!! had to use this to solve my problem. Problem, in a broader perspective, was to get a trend line in SSRS (2005) report. This was the only way.Spearman
@pavanrao: you are welcome. Added estimate for constant alpha to the queryLacroix
I realize the thread is 2 years old, but is it possible for you to get the r-squared value with this method as well?Laruelarum
@Peter: sure, easy. See amended answer.Lacroix
Is it possible to expand your solution for multiple regression?Duke
@sqluser: in theory, yes, in reality no. Long story short, you have to invert a matrix which gets ugly quickly. There are at least three alternatives: 1) implement the linear regression as function minimization or a less direct method not relying on full matrix inversion which is easy to implement (e.g. the answer https://mcmap.net/q/446222/-are-there-any-linear-regression-function-in-sql-server by Colin); 2) the SQL Server Analysis Service (msdn.microsoft.com/en-US/library/ms174824.aspx); 3) some third-party package, possibly outside of SQL Server (with a lot of data, you might want "online linear regression").Lacroix
select (avg(xy) - avg(x)*avg(y))/VARIANCE(X) as slope , avg(y) - ((avg(xy) - avg(x)*avg(y))/VARIANCE(X)) * avg(x) as intercept from #rawdata --FARSHORTERGigue
@Chris: not sure where to start, so I start with the nitpicking ;). You obviously mean avg(x*y) and VARP(X) in your formula. Another minor point is that avg(expression) gives you an integer if your input data has type integer. But now for the real issue: your code is not numerically stable, see code comments and starting at "Edit 2". Also feel free to look at the revision history of the answer and you will notice that the first version is pretty close to yours. Long story short: I would never use your version because I would not trust it in many numerically well-behaved situations.Lacroix
@Lacroix Thank you for the instability alert. I noticed that I used this in a project before; it calculates the same as what you have but is more concise: select avg(y) - avg(x) * ((count(*) * sum(x * y)) - (sum(x) * sum(y)))/ ((count(*) * sum(x * x)) - (Sum(x) * Sum(x))) as intercept, ((count(*) * sum(x * y)) - (sum(x) * sum(y)))/ ((count(*) * sum(x * x)) - (sum(x) * sum(x))) AS slope from tablexy --The FARSHORTER version 2 comments above does suffer from instability: I tested it and it produces different results. I validated your solution and the concise solution in R and both match lm()Gigue
@Chris: agree, this is much better. The only two differences of the code above relative to yours are 1) I coerce the type (that strange AVG(x * 1.) hack) - I believe your version gives the wrong result if x and y are integers; 2) the version in my answer standardizes the data, which might help with some idiosyncrasies / edge-cases of floating-point arithmetic. But in any normal use-case your version looks fine to me.Lacroix
@Lacroix With regards to the multiple regression question I agree partly. You don't need to go the matrix multiplication route though. You can actually estimate the parameters of multiple regression by using multiple simple linear regressions. stats.stackexchange.com/a/166718/4737 It would be incredibly tedious but it is possible...Indohittite
It might not hurt to add a tolerance in your comparison when checking if the standard deviation of x is different from 0. I'm using essentially your code but had some cases where the data looks constant for x but due to some numerical issues it wasn't exactly the same (even though it should be) and that blew the slope estimate up since the x standard deviation was just a hair above 0.Indohittite
@Dason: that's a very good point. Never happened to me, but can very well imagine that it can happen. Let me think how to best do this. Any suggestions?Lacroix
@Lacroix To do it generally might be tough. In my particular case I knew the smallest the gap between consecutive x values could be so I tested if the range was smaller than that. I guess one could make a test by comparing the range of x values to some multiple of the machine's smallest floating point or something along those lines.Indohittite
In the AVG(x * 1.) line of code, what does the 1. mean? Is 1. the same as 1.0?Sine
@Jackson: 1. is indeed the same as 1.0. This is a hack to force a non-integer data type even if the data type of x is integer (in T-SQL, AVG for integers returns an integer which is not what we want here). It would probably have been cleaner to write CAST(x as FLOAT), but this is shorter, and IIRC it also works better with NUMERIC (no guarantee). This is RDBMS-dependent, I have only tested this with MS SQL Server.Lacroix
Good to know thanks. I’ll keep using 1.0 over 1. though because I feel like it makes the code more readable. Agreed that maybe casting is better.Sine
27

This is an alternate method, based off a blog post on Linear Regression in T-SQL, which uses the following equations:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\,\bar{x}$$

The SQL suggestion in the blog uses cursors though. Here's a prettified version of a forum answer that I used:

table
-----
X (numeric)
Y (numeric)

/**
 * m = (nSxy - SxSy) / (nSxx - SxSx)
 * b = Ay - (Ax * m)
 * N.B. S = Sum, A = Mean
 */
DECLARE @n INT
SELECT @n = COUNT(*) FROM table
SELECT (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS M,
       AVG(Y) - AVG(X) *
       (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS B
FROM table
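
As the comments below point out, this query can fall into integer arithmetic when the inputs are integers. The following is a hedged sketch of one way to avoid that, by casting to FLOAT and collecting all sums in a single pass. The table name dbo.xy_data is hypothetical (the original uses the placeholder name table, which is a reserved word), and the NULLIF guard is an addition, not part of the original answer:

```sql
-- Hypothetical source table: dbo.xy_data(X, Y). The name is an assumption.
WITH sums AS
(   SELECT CAST(COUNT(*) AS FLOAT)   AS n
          ,SUM(CAST(X AS FLOAT))     AS sx
          ,SUM(CAST(Y AS FLOAT))     AS sy
          ,SUM(CAST(X AS FLOAT) * X) AS sxx
          ,SUM(CAST(X AS FLOAT) * Y) AS sxy
    FROM dbo.xy_data
)
SELECT (n * sxy - sx * sy) / NULLIF(n * sxx - sx * sx, 0) AS M
       -- b = Ay - Ax * m, written as (Sy - m * Sx) / n
      ,(sy - sx * (n * sxy - sx * sy) / NULLIF(n * sxx - sx * sx, 0)) / n AS B
FROM sums;
```

The slope comes back as NULL instead of raising a divide-by-zero error when all X values are equal.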
Mini answered 19/7, 2010 at 16:51 Comment(4)
This proves the answer with the second most votes is best.Gigue
@Mini - Contrary to what Chris posted, your solution is a much better answer than what the currently accepted answer is because it's nasty fast and only makes two passes on the table AND IT'S SIMPLE! The only problem is that you didn't consider the effects of "Integer Math" but that can be easily fixed by changing the datatype of @n to DECLARE @n DECIMAL(19,6) .Wadding
@JeffModen thank you, Chris is suggesting that my answer is the better answer :) At least when he posted that comment I had the second most votes.Mini
@Icc97 - Ah... you're correct. I misread his comment. Thank you for the feedback. It still needs the tweak to avoid the integer math problem. And, THANK YOU for posting the formulas, as well. Real nice job you did there.Wadding
5

I've actually written an SQL routine using Gram-Schmidt orthogonalization. It, as well as other machine learning and forecasting routines, is available at sqldatamine.blogspot.com

At the suggestion of Brad Larson I've added the code here rather than just direct users to my blog. This produces the same results as the linest function in Excel. My primary source is Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.

--Create a table of data
create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

--Insert the data into x & y vectors
select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata

--create a residuals table and insert the intercept (1)
create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

--create a table for the orthogonal (#c) & regression (#b) parameters
create table #c(cxn int, czn int, cv float) 
create table #b(bn int, bv float) 


--@p is the number of independent variables including the intercept (@p = 0)
declare @p int
set @p = 1


--Loop through each independent variable and estimate the orthogonal parameter (#c)
-- then estimate the residuals and insert into the residuals table (#z)
while @p <= (select max(xn) from #x)
begin   
        insert into #c
    select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
        from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn

    insert into #z
    select zid, xn,xv- sum(cv*zv) 
        from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv

    set @p = @p +1
end

--Loop through each independent variable and estimate the regression parameter by regressing the orthogonal
-- residuals on the dependent variable y
while @p>=0 
begin

    insert into #b
    select zn, sum(yv*zv)/ sum(zv*zv) 
        from #z  join 
            (select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
        on zid = yid where zn = @p  group by zn

    set @p = @p-1
end

--The regression parameters
select * from #b

--Actual vs. fit with error
select yid, yv, fit, yv-fit err from #y join 
    (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
     on yid = xid

--R Squared
select 1-sum(power(err,2))/sum(power(yv,2)) from 
(select yid, yv, fit, yv-fit err from #y join 
    (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
     on yid = xid) d
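
A later comment in this thread observes that the R-squared query above divides by the raw sum of squares of y instead of the sum of squared deviations from the mean of y, which inflates the result. A minimal sketch of the centered version follows; it assumes the temp tables #x, #y and #b built above still exist in the session, and the aliases ybar, m and r_squared are introduced here for illustration only:

```sql
--R Squared using deviations of y from its mean (sketch)
select 1 - sum(power(err, 2)) / sum(power(yv - ybar, 2)) as r_squared
from (select yid, yv, yv - fit err
      from #y join
          (select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
          on yid = xid) d
cross join (select avg(yv) ybar from #y) m   -- one-row table holding the mean of y
```

The mean is brought in through a cross join because SQL Server does not allow a subquery inside an aggregate function.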
Granitite answered 7/1, 2014 at 18:7 Comment(3)
Rather than just posting a link to your blog (which could go away at some point in the future), could you summarize the relevant information from your blog in your answer here?Fluoroscopy
I have a dataset and when I use your code, everything looks what I expected except R Squared. Are you sure the calculation is fine in R2. I am comparing the result with excel regression and they are different.Duke
Also can you expand your solution to include p-values for each variable(X)?Duke
3

There are no linear regression functions in SQL Server. But to calculate a Simple Linear Regression (Y' = bX + A) between pairs of data points x,y - including the calculation of the Correlation Coefficient, Coefficient of Determination (R^2) and Standard Estimate of Error (Standard Deviation), do the following:

For a table regression_data with numeric columns x and y:

declare @total_points int 
declare @intercept DECIMAL(38, 10)
declare @slope DECIMAL(38, 10)
declare @r_squared DECIMAL(38, 10)
declare @standard_estimate_error DECIMAL(38, 10)
declare @correlation_coefficient DECIMAL(38, 10)
declare @average_x  DECIMAL(38, 10)
declare @average_y  DECIMAL(38, 10)
declare @sumX DECIMAL(38, 10)
declare @sumY DECIMAL(38, 10)
declare @sumXX DECIMAL(38, 10)
declare @sumYY DECIMAL(38, 10)
declare @sumXY DECIMAL(38, 10)
declare @Sxx DECIMAL(38, 10)
declare @Syy DECIMAL(38, 10)
declare @Sxy DECIMAL(38, 10)

Select 
@total_points = count(*),
@average_x = avg(x),
@average_y = avg(y),
@sumX = sum(x),
@sumY = sum(y),
@sumXX = sum(x*x),
@sumYY = sum(y*y),
@sumXY = sum(x*y)
from regression_data

set @Sxx = @sumXX - (@sumX * @sumX) / @total_points
set @Syy = @sumYY - (@sumY * @sumY) / @total_points
set @Sxy = @sumXY - (@sumX * @sumY) / @total_points

set @correlation_coefficient = @Sxy / SQRT(@Sxx * @Syy) 
set @slope = (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2))
set @intercept = @average_y - (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2)) * @average_x
set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points)

-- calculate standard_estimate_error (standard deviation)
Select
@standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points)
From regression_data
Exocarp answered 2/4, 2014 at 1:27 Comment(2)
Can you expand your solution to include p-value as well? Also how can we make a multiple linear regression based on your answer?Duke
@Duke - The R-squared is too large because the total sum of squares uses raw Y values rather than deviations from the mean. In the following, yv should be replaced by yv-@meanY select 1-sum(power(err,2))/sum(power(yv,2)) fromConlin
2

Here it is as a function that takes a table-valued parameter of a table type defined as table (Y float, X float), called XYDoubleType, and assumes the linear function has the form AX + B. It returns A and B (and R squared) as table columns, in case you want to use it in a join or similar.

CREATE FUNCTION FN_GetABForData(
 @XYData as XYDoubleType READONLY
 ) RETURNS  @ABData TABLE(
            A  FLOAT,
            B FLOAT, 
            Rsquare FLOAT )
 AS
 BEGIN
    DECLARE @sx FLOAT, @sy FLOAT
    DECLARE @sxx FLOAT,@syy FLOAT, @sxy FLOAT,@sxsy FLOAT, @sxsx FLOAT, @sysy FLOAT
    DECLARE @n FLOAT, @A FLOAT, @B FLOAT, @Rsq FLOAT

    SELECT @sx =SUM(D.X) ,@sy =SUM(D.Y), @sxx=SUM(D.X*D.X),@syy=SUM(D.Y*D.Y),
        @sxy =SUM(D.X*D.Y),@n =COUNT(*)
    From @XYData D
    SET @sxsx =@sx*@sx
    SET @sxsy =@sx*@sy
    SET @sysy = @sy*@sy

    SET @A = (@n*@sxy -@sxsy)/(@n*@sxx -@sxsx)
    SET @B = @sy/@n  - @A*@sx/@n
    SET @Rsq = POWER((@n*@sxy -@sxsy),2)/((@n*@sxx-@sxsx)*(@n*@syy -@sysy))

    INSERT INTO @ABData (A,B,Rsquare) VALUES(@A,@B,@Rsq)

    RETURN 
 END
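
The answer does not show how XYDoubleType is created or how the function is called, so here is a hedged sketch of both. The type definition is inferred from the description above, and the dbo schema and the sample values are assumptions (table types and multi-row VALUES need SQL Server 2008 or later):

```sql
-- Table type assumed from the description: two FLOAT columns named Y and X.
CREATE TYPE dbo.XYDoubleType AS TABLE (Y FLOAT, X FLOAT);
GO  -- batch separator used by SSMS/sqlcmd, not a T-SQL statement

-- Sample points lying exactly on y = 2x + 1
DECLARE @data dbo.XYDoubleType;
INSERT INTO @data (Y, X) VALUES (3, 1), (5, 2), (7, 3);

-- Assumes the function above was created in the dbo schema.
SELECT A, B, Rsquare FROM dbo.FN_GetABForData(@data);
-- For these points: A = 2, B = 1, Rsquare = 1
```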
Merrow answered 25/4, 2013 at 22:16 Comment(0)
2

I hope the following answer helps one understand where some of the solutions come from. I am going to illustrate it with a simple example, but the generalization to many variables is theoretically straightforward as long as you know how to use index notation or matrices. For implementing the solution for anything beyond 3 variables, you'll need Gram-Schmidt (see Colin Campbell's answer above) or another matrix inversion algorithm.

Since all the functions we need (variance, covariance, average, sum, etc.) are aggregation functions in SQL, one can easily implement the solution. I've done so in HIVE to do linear calibration of the scores of a logistic model. Amongst many advantages, one is that you can work entirely within HIVE without going out and back in from some scripting language.

The model for your data (x_1, x_2, y) where your data points are indexed by i, is

y(x_1, x_2) = m_1*x_1 + m_2*x_2 + c

The model appears "linear", but needn't be. For example, x_2 can be any non-linear function of x_1, as long as it has no free parameters in it, e.g. x_2 = Sinh(3*(x_1)^2 + 42). Even if x_2 is "just" x_2 and the model is linear, the regression problem isn't. Only when you decide that the problem is to find the parameters m_1, m_2, c such that they minimize the L2 error do you have a Linear Regression problem.

The L2 error is sum_i( (y[i] - f(x_1[i], x_2[i]))^2 ). Minimizing this w.r.t. the 3 parameters (set the partial derivatives w.r.t. each parameter = 0) yields 3 linear equations for 3 unknowns. These equations are LINEAR in the parameters (this is what makes it Linear Regression) and can be solved analytically. Doing this for a simple model (1 variable, linear model, hence two parameters) is straightforward and instructive. The generalization to a non-Euclidean metric norm on the error vector space is straightforward, the diagonal special case amounts to using "weights".

Back to our model in two variables:

y = m_1*x_1 + m_2*x_2 + c

Take the expectation value E[.] of both sides =>

E[y] = m_1*E[x_1] + m_2*E[x_2] + c (0)

Now take the covariance w.r.t. x_1 and x_2, and use cov(x,x) = var(x):

cov(y, x_1) = m_1*var(x_1) + m_2*covar(x_2, x_1) (1)

cov(y, x_2) = m_1*covar(x_1, x_2) + m_2*var(x_2) (2)

These are two equations in two unknowns, which you can solve by inverting the 2X2 matrix.

In matrix form: ... which can be inverted to yield ... where

det = var(x_1)*var(x_2) - covar(x_1, x_2)^2

(oh barf, what the heck are "reputation points? Gimme some if you want to see the equations.)

In any case, now that you have m1 and m2 in closed form, you can solve (0) for c.
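
Since the matrix equations could not be shown above, here is a reconstruction derived from equations (1) and (2) using the standard 2x2 inverse. It is a sketch of the algebra, not the author's original rendering, and it writes E[.] for the expectation value:

```latex
\begin{pmatrix} \mathrm{cov}(y, x_1) \\ \mathrm{cov}(y, x_2) \end{pmatrix}
=
\begin{pmatrix}
  \mathrm{var}(x_1)       & \mathrm{cov}(x_1, x_2) \\
  \mathrm{cov}(x_1, x_2)  & \mathrm{var}(x_2)
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}

\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}
=
\frac{1}{\det}
\begin{pmatrix}
  \mathrm{var}(x_2)        & -\mathrm{cov}(x_1, x_2) \\
  -\mathrm{cov}(x_1, x_2)  & \mathrm{var}(x_1)
\end{pmatrix}
\begin{pmatrix} \mathrm{cov}(y, x_1) \\ \mathrm{cov}(y, x_2) \end{pmatrix},
\qquad
\det = \mathrm{var}(x_1)\,\mathrm{var}(x_2) - \mathrm{cov}(x_1, x_2)^2

m_1 = \frac{\mathrm{var}(x_2)\,\mathrm{cov}(y, x_1) - \mathrm{cov}(x_1, x_2)\,\mathrm{cov}(y, x_2)}{\det},
\qquad
m_2 = \frac{\mathrm{var}(x_1)\,\mathrm{cov}(y, x_2) - \mathrm{cov}(x_1, x_2)\,\mathrm{cov}(y, x_1)}{\det}

c = E[y] - m_1\,E[x_1] - m_2\,E[x_2]
```

Substituting m_1 and m_2 back into (1) gives cov(y, x_1) * det / det, which confirms the inverse.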

I checked the analytical solution above against Excel's Solver for a quadratic with Gaussian noise and the residual errors agree to 6 significant digits.
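
To tie this back to the question, here is a rough T-SQL sketch of the two-variable closed form from equations (1) and (2). It assumes a hypothetical table dbo.xxy_data with numeric columns x1, x2 and y; all names are illustrative. Because T-SQL has no covariance aggregate, it uses sums of centered products, whose common 1/n factor cancels in m_1 and m_2:

```sql
-- Hypothetical source table: dbo.xxy_data(x1, x2, y). The name is an assumption.
WITH means AS
(   SELECT AVG(CAST(x1 AS FLOAT)) AS mx1
          ,AVG(CAST(x2 AS FLOAT)) AS mx2
          ,AVG(CAST(y  AS FLOAT)) AS my
    FROM dbo.xxy_data
),
moments AS
(   -- s11, s22 ~ var(x_1), var(x_2); s12 ~ cov(x_1, x_2); s1y, s2y ~ cov(y, x_i)
    SELECT MAX(m.mx1) AS mx1, MAX(m.mx2) AS mx2, MAX(m.my) AS my
          ,SUM((d.x1 - m.mx1) * (d.x1 - m.mx1)) AS s11
          ,SUM((d.x2 - m.mx2) * (d.x2 - m.mx2)) AS s22
          ,SUM((d.x1 - m.mx1) * (d.x2 - m.mx2)) AS s12
          ,SUM((d.x1 - m.mx1) * (d.y  - m.my))  AS s1y
          ,SUM((d.x2 - m.mx2) * (d.y  - m.my))  AS s2y
    FROM dbo.xxy_data d
    CROSS JOIN means m
),
slopes AS
(   -- Inverse of the 2x2 system; NULL when the determinant is zero
    SELECT mx1, mx2, my
          ,(s22 * s1y - s12 * s2y) / NULLIF(s11 * s22 - s12 * s12, 0) AS m1
          ,(s11 * s2y - s12 * s1y) / NULLIF(s11 * s22 - s12 * s12, 0) AS m2
    FROM moments
)
SELECT m1, m2
      ,my - m1 * mx1 - m2 * mx2 AS c   -- equation (0) solved for c
FROM slopes;
```

A zero determinant means x1 and x2 are perfectly collinear (or one of them is constant), in which case the coefficients are not uniquely determined and the sketch returns NULL.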

Contact me if you want to do Discrete Fourier Transform in SQL in about 20 lines.

Peacetime answered 19/9, 2015 at 7:4 Comment(0)
2

To add to @icc97's answer, I have included the weighted versions of the slope and the intercept. If the values are all constant, the slope will be NULL (with the appropriate settings SET ARITHABORT OFF; SET ANSI_WARNINGS OFF;) and will need to be replaced with 0 via coalesce().

Here is a solution written in SQL:

with d as (select segment,w,x,y from somedatasource)
select segment,

avg(y) - avg(x) *
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (Sum(x)*Sum(x)))   as intercept,

((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (sum(x)*sum(x))) AS slope,

avg(y) - ((avg(x*y) - avg(x)*avg(y))/var_samp(X)) * avg(x) as interceptUnstable,
(avg(x*y) - avg(x)*avg(y))/var_samp(X) as slopeUnstable,
(Avg(x * y) - Avg(x) * Avg(y)) / (stddev_pop(x) * stddev_pop(y)) as correlationUnstable,

(sum(y*w)/sum(w)) - (sum(w*x)/sum(w)) *
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
  ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w)))   as wIntercept,

((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
  ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wSlope,

(count(*) * sum(x * y) - sum(x) * sum(y)) / (sqrt(count(*) * sum(x * x) - sum(x) * sum(x))
* sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) as correlation,

(sum(w) * sum(x*y*w) - sum(x*w) * sum(y*w)) /
(sqrt(sum(w) * sum(x*x*w) - sum(x*w) * sum(x*w)) * sqrt(sum(w) * sum(y*y*w)
- sum(y*w) * sum(y*w))) as wCorrelation,

count(*) as n

from d where x is not null and y is not null group by segment

Where w is the weight. I double checked this against R to confirm the results. One may need to cast the data from somedatasource to floating point. I included the unstable versions to warn you against those. (Special thanks goes to Stephan in another answer.)

Update: added weighted correlation

Gigue answered 14/4, 2017 at 16:55 Comment(1)
+1 The weighted version is helpful but the excess brackets make it harder to read. It's also much cleaner to define the intercept using the gradient.Hindward
1

I have translated the linear regression function used by the FORECAST function in Excel and created an SQL function that returns a, b, and the forecast. You can see the complete theoretical explanation in the Excel help for the FORECAST function. First of all, you will need to create the table data type XYFloatType:

 CREATE TYPE [dbo].[XYFloatType] 
AS TABLE(
[X] FLOAT,
[Y] FLOAT)

Then write the following function:

    /*
-- =============================================
-- Author:      Me      :)
-- Create date: Today   :)
-- Description: (Copied Excel help): 
--Calculates, or predicts, a future value by using existing values. 
The predicted value is a y-value for a given x-value. 
The known values are existing x-values and y-values, and the new value is predicted by using linear regression. 
You can use this function to predict future sales, inventory requirements, or consumer trends.
-- =============================================
*/

CREATE FUNCTION dbo.FN_GetLinearRegressionForcast

(@PtXYData as XYFloatType READONLY ,@PnFuturePoint int)
RETURNS @ABDData TABLE( a FLOAT, b FLOAT, Forecast FLOAT)
AS

BEGIN 
    DECLARE  @LnAvX Float
            ,@LnAvY Float
            ,@LnB Float
            ,@LnA Float
            ,@LnForeCast Float
    Select   @LnAvX = AVG([X])
            ,@LnAvY = AVG([Y])
    FROM @PtXYData;
    SELECT @LnB =  SUM ( ([X]-@LnAvX)*([Y]-@LnAvY) )  /  SUM (POWER([X]-@LnAvX,2))
    FROM @PtXYData;
    SET @LnA = @LnAvY - @LnB * @LnAvX;
    SET @LnForeCast = @LnA + @LnB * @PnFuturePoint;
    INSERT INTO @ABDData ([A],[B],[Forecast]) VALUES (@LnA,@LnB,@LnForeCast)
    RETURN 
END

/*
your tests: 

 (I used the same values that are in the excel help)
DECLARE @t XYFloatType 
INSERT @t VALUES(20,6),(28,7),(31,9),(38,15),(40,21)        -- x and y values
SELECT *, A+B*30 [Prueba] FROM dbo.FN_GetLinearRegressionForcast(@t,30);
*/
Pettus answered 15/1, 2015 at 17:52 Comment(0)
