03 Feb Beyond Linear Modeling using R (part-1)
Many problems in econometrics are based on the assumption that two (or more) observed variables have linear dependency with each other. However, this is not applicable in every case. Non-linear relations are equally probable. In this series of posts, we will highlight some cases and ways to fit non-linear models using R.
First, we will generate some data for two variables x and y:
# generate data > set.seed(1) > length <- 30 > x <- runif(length) > y <- x^3 + rnorm(length, 0, 0.06) > ds <- data.frame(x = x, y = y) > str(ds) 'data.frame': 30 obs. of 2 variables: $ x: num 0.266 0.372 0.573 0.908 0.202 ... $ y: num 0.016 0.0506 0.2446 0.7984 0.0438 ...
Now we will visualise the dataset in a scatter plot to see the non-linear relationship and also we will draw quadratic, cubic and quartic lines with red, green and yellow colors respectively:
# plotting > plot(y ~ x, main = "Known cubic, with noise") # Add lines > lines(s, y1, col="red", lty=1) > lines(s, y2, col="green", lty=2) > lines(s, y3, col="yellow", , lty=3) # Add a legend > legend("topleft", legend=c("cubic", "quadratic", "quartic"), col=c("red", "green", "yellow"), lty=1:3, cex=0.8, box.lty=1, box.lwd=1)
Clearly, the relationship between x, y is exponential and can be described by the following mathematical form:
In order to built a model that will appropriately describe the above mathematical formula, we will use the function nls() as follows:
# non-linear modeling > nlmodel <- nls(y ~ I(x^power), data = ds, start = list(power = 1),trace = F) # summary > summary(nlmodel) Formula: y ~ I(x^power) Parameters: Estimate Std. Error t value Pr(>|t|) power 3.0470 0.1188 25.65 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.04676 on 29 degrees of freedom Number of iterations to convergence: 5 Achieved convergence tolerance: 4.257e-07
Inspecting the output of the summary() result, we can see that the value of the power is 3.047 ± 0.1188 and is highly significant. The magnitude of the standard error is a measure of uncertainty for the estimated value of the exponent.
Fitting the model obtained above and stored in nlmodel and plotting the lines we have the following graph:
plot(y ~ x, main = "Non-linear model") lines(s, s^3, lty = 2, col = "red") lines(s, predict(nlmodel, list(x = s)), lty = 1, col = "blue") legend("topleft", legend=c("fitted", "cubic"), col=c("blue", "red"), lty=1:2, cex=0.8, box.lty=1, box.lwd=1)
We can see that the fitted model is very close to the cubic model that we used to create the data with the additional noise.
Goodness of fit
To examine the goodness of fit of the model, we will calculate R squared, seen in a previous post for linear regression:
# sum of squared error > SSE <- sum(residuals(nlmodel)^2) # total sum of squares > SST <- sum((y - mean(y))^2) # Rsquared > Rsquared <- 1-(SSE/SST)
R squared is calculated at 0.9778 which is a very good result highlighting the overall good performance of the model.