How to calculate the likelihood of curve-fitting in scipy?

I have a nonlinear model fit that looks like this:

Curve fit

The dark solid line is the model fit, and the grey part is the raw data.

Short version of the question: how do I get the likelihood of this model fit, so I can perform a log-likelihood ratio test? Assume that the residuals are normally distributed.

I am relatively new to statistics, and my current thoughts are:

  1. Get the residuals from the curve fit, and calculate the variance of the residuals;

  2. Use the log-likelihood formula for a normal distribution,

     \ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i}(x_i - \mu)^2,

     and plug the variance of the residuals in for sigma-squared, the experimental data in for x_i, and the model fit in for mu (a rough sketch of these first two steps follows the list);

  3. Calculate the log-likelihood ratio.
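
For concreteness, here is a rough sketch of what I mean by steps 1 and 2. The exponential model and the synthetic data are just placeholders, not my actual model and data:

import numpy as np
from scipy import optimize

def model(x, a, b):                        # placeholder model, not my real one
    return a * np.exp(-b * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = model(x, 2.0, 1.3) + rng.normal(scale=0.1, size=x.size)

params, _ = optimize.curve_fit(model, x, y, p0=(1.0, 1.0))

resid = y - model(x, *params)              # step 1: residuals of the fit
sigma2 = resid.var()                       # step 1: variance of the residuals
n = len(y)

# step 2: plug sigma2, the data (x_i) and the model fit (mu) into the
# normal log-likelihood formula above
loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - (resid**2).sum() / (2 * sigma2)
print(loglik)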

Could anyone help me with these two full-version questions?

  1. Is my method correct? (I think so, but it would be really great to make sure!)

  2. Are there any ready-made functions in python/scipy/statsmodels to do this for me?

Cobaltous answered 11/4, 2014 at 5:42 Comment(5)
If your residuals are normally distributed, you just need to use least squares to get the model that has the highest likelihood. Can you show what you already tried? Just to know that this is not homework?Shepherd
@Shepherd 1) Lol - this is not homework, just trying to address a review comment for a research paper; 2) the model fitting has already been done via least squares on the residuals, and I'm trying to perform a likelihood-ratio test; 3) to do any likelihood-based job, I need to get the likelihood first, which is my problem; 4) I have written what I've tried under "my current thoughts are...". Lastly, sorry for being really naive in statistics :(Cobaltous
While the question is well written and articulated, it may be worth moving the question over to stats.stackexchange.com as this is a programming site.Rainie
@Rainie Thanks for the suggestion! Could you please instruct on how...? Do I manually copy and paste this there?Cobaltous
You could simply delete this question and repost it (the formatting is already done!). It wouldn't hurt to incorporate the comments into the new question. Good luck!Rainie

Your likelihood function

\ln L(\mu, \sigma) = \sum_{i=1}^{n} \ln f(x_i \mid \mu, \sigma),

which is simply the sum of the log of the probability density function of the Gaussian distribution,

\ln L(\mu, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,

is the likelihood of fitting a mu and a sigma to your residuals, not the likelihood of your model given your data. In short, your approach is wrong.

Since you are doing non-linear least squares, following what @usethedeathstar already mentioned, you should go straight for the F-test. Consider the following example, modified from http://www.walkingrandomly.com/?p=5254, in which we conduct the F-test using R; at the end we will discuss how to translate it into Python.

# construct the data vectors using c()
> xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
> ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
# some starting values
> p1 = 1
> p2 = 0.2
> p3 = 0.01

# do the fit
> fit1 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))
> fit2 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata)+p3*xdata, start=list(p1=p1,p2=p2,p3=p3))

# summarise
> summary(fit1)

Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
p1 1.881851   0.027430   68.61 2.27e-12 ***
p2 0.700230   0.009153   76.51 9.50e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08202 on 8 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 2.189e-06

> summary(fit2)

Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
p1  1.90108    0.03520  54.002 1.96e-10 ***
p2  0.70657    0.01167  60.528 8.82e-11 ***
p3  0.02029    0.02166   0.937     0.38    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08243 on 7 degrees of freedom

Number of iterations to convergence: 9 
Achieved convergence tolerance: 2.476e-06

> anova(fit2, fit1)
Analysis of Variance Table

Model 1: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Model 2: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
  Res.Df Res.Sum Sq Df     Sum Sq F value Pr(>F)
1      7   0.047565                             
2      8   0.053813 -1 -0.0062473  0.9194 0.3696

Here we have two models: fit1 has 2 parameters, so its residuals have 8 degrees of freedom; fit2 has one additional parameter, so its residuals have 7 degrees of freedom. Is model 2 significantly better? No: the F value is 0.9194 on (1, 7) degrees of freedom, and it is not significant.

To reproduce the ANOVA table: the residual DF is easy. Residual sum of squares: 0.08202*0.08202*8 = 0.05381 and 0.08243*0.08243*7 = 0.04756293 (notice 'Residual standard error: 0.08243 on 7 degrees of freedom', etc. in the summaries above). In Python you can get it as the sum of (y_observed - y_fitted)**2, since scipy.optimize.curve_fit() doesn't return the residual sum of squares.

The F-ratio is 0.0062473/(0.047565/7) = 0.9194, and the P-value is 1 - scipy.stats.f.cdf(0.9194, 1, 7).

Putting it all together, the Python equivalent is:

In [1]:

import numpy as np
import scipy.optimize as so
import scipy.stats as ss

xdata = np.array([-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9])
ydata = np.array([0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001])

def model0(x, p1, p2):                       # 2-parameter model
    return p1*np.cos(p2*x) + p2*np.sin(p1*x)

def model1(x, p1, p2, p3):                   # 3-parameter model (adds a linear term)
    return p1*np.cos(p2*x) + p2*np.sin(p1*x) + p3*x

p1, p2, p3 = 1, 0.2, 0.01
fit0 = so.curve_fit(model0, xdata, ydata, p0=(p1, p2))[0]
fit1 = so.curve_fit(model1, xdata, ydata, p0=(p1, p2, p3))[0]

yfit0 = model0(xdata, *fit0)
yfit1 = model1(xdata, *fit1)
ssq0 = ((yfit0 - ydata)**2).sum()            # residual sum of squares, 2-parameter model
ssq1 = ((yfit1 - ydata)**2).sum()            # residual sum of squares, 3-parameter model

df = len(xdata) - 3                          # residual degrees of freedom of the larger model
f_ratio = (ssq0 - ssq1) / (ssq1 / df)        # the extra parameter costs 1 degree of freedom
p = 1 - ss.f.cdf(f_ratio, 1, df)

In [2]:

print(f_ratio, p)
0.919387419515 0.369574503394

As @usethedeathstar pointed out: when the residuals are normally distributed, nonlinear least squares IS the maximum-likelihood fit. Therefore the F-test and the likelihood-ratio test are equivalent, because the F-ratio is a monotone transformation of the likelihood ratio λ.

Or in a descriptive way, see: http://www.stata.com/support/faqs/statistics/chi-squared-and-f-distributions/
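
To see this equivalence numerically, here is a quick sketch that re-uses the residual sums of squares from the example above (the numbers are copied in so it runs on its own):

import numpy as np
import scipy.stats as ss

n = 10            # number of data points
ssq0 = 0.053813   # residual sum of squares of the 2-parameter model (fit0)
ssq1 = 0.047565   # residual sum of squares of the 3-parameter model (fit1)
df = n - 3        # residual degrees of freedom of the larger model

# F-ratio, exactly as computed in the session above
f_ratio = (ssq0 - ssq1) / (ssq1 / df)

# likelihood-ratio statistic for normal errors with sigma profiled out:
# 2*(loglik1 - loglik0) = n * ln(RSS0 / RSS1)
lr_stat = n * np.log(ssq0 / ssq1)

print(f_ratio, 1 - ss.f.cdf(f_ratio, 1, df))   # F-test
print(lr_stat, ss.chi2.sf(lr_stat, 1))         # asymptotic likelihood-ratio test

Both statistics are monotone increasing functions of RSS0/RSS1, so they order the evidence the same way, even though the small-sample p-values differ a little (the F-test is exact under normality, while the chi-square approximation for the likelihood-ratio test is asymptotic).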

Monetmoneta answered 14/4, 2014 at 1:42 Comment(2)
Thank you so much! That saved me a week. A really quick follow-up question: if the residual of the fitting is not normally distributed (although least square was used in the fitting), would the f-test still be valid?Cobaltous
People over at stats.stackexchange.com will have a better say on this. In theory it might be a concern, but in practice the F-test is quite robust with respect to the normality assumption, so you should still be able to use it most of the time. Hope this helps. Good luck with your revision!Monetmoneta

Your formula looks correct to me. It should give you the same result as summing scipy.stats.norm.logpdf(x, loc=mu, scale=sigma) over your observations.
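
For example, a minimal sketch with made-up numbers, just to show the call:

import numpy as np
from scipy import stats

x = np.array([0.9, 1.1, 1.3, 0.7])    # observed values (hypothetical)
mu = np.array([1.0, 1.0, 1.2, 0.8])   # fitted values (hypothetical)
sigma = 0.15                          # estimated residual std. dev. (hypothetical)

loglik = stats.norm.logpdf(x, loc=mu, scale=sigma).sum()   # total log-likelihood
print(loglik)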

Since you already have your estimates of mu and sigma, I don't think there is a function for the likelihood ratio test where you can plug your results in.

If you have the estimates of two models, where one is nested in the other, then you can easily calculate it yourself.

http://en.wikipedia.org/wiki/Likelihood-ratio_test
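
For instance, a minimal sketch of that calculation, with placeholder log-likelihood values for the two nested fits:

from scipy import stats

llf_restricted = -12.7   # log-likelihood of the smaller (null) model, hypothetical
llf_full = -10.3         # log-likelihood of the larger (alternative) model, hypothetical
df_diff = 1              # difference in the number of fitted parameters

lr_stat = 2 * (llf_full - llf_restricted)   # likelihood-ratio statistic
p_value = stats.chi2.sf(lr_stat, df_diff)   # asymptotic chi-square p-value
print(lr_stat, p_value)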

Here is the part of a method in statsmodels that calculates the LR-test for comparing two nested linear models https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1531

Laticialaticiferous answered 13/4, 2014 at 23:42 Comment(0)
