10/30/2009 Recitation

# Fourier Series

Given a function f(x), we can write it as an infinite sum:

    f(x) = \sum (coefficients * basis terms)
         = c_1*sin(\theta_1) + c_2*sin(\theta_2) + ...

The c_i's can be thought of as regression coefficients that measure how important each basis term is.

# Linear Regression

What is the association between wage (y) and education (x)?

Some possible scenarios:

- The avg y given some x is linear:
      E[y|x] = a + bx
  This is true in the aggregate. For individual i, you have
      y_i = a + bx_i + e_i
- The avg y given some x is not linear:
      E[y|x] = some nonlinear function of x
  So we approximate this nonlinear function with a linear function, and want to fit a "best" line to it.

How to find the best line? Estimate "best" by least squares.

Least squares: over all (y_i, x_i), find the line (a, b) such that the sum of squared residuals

    \sum_i (y_i - (a + bx_i))^2

is smallest.

Given y = [y_1 ... y_N]^{T} and X = [x_1 ... x_N]^{T} (both column vectors), write

    y = X\beta + \epsilon        [\epsilon is the error vector]

Here \beta plays the role of the slope b, and the intercept a is dropped for now. So minimize

    \sum_i (y_i - \beta x_i)^2 = ||y - X\beta||^2 = ||\epsilon||^2        (minimize the error)

    \beta_min = (X^{T}X)^{-1}X^{T}y
              = (x dot y) / (x dot x)
              = \sum x_i y_i / \sum (x_i)^2
              = (1/N)\sum x_i y_i / (1/N)\sum (x_i)^2
              = Cov(x,y) / Var(x)

The last step holds when x and y are measured as deviations from their means.

This generalizes to higher dimensions of the input space: you can have many input variables x, z, etc., and each will have a coefficient \beta_x, \beta_z.

Note: we dropped a (the y-intercept) from this analysis. In practice,

    X = [ 1   1   1   1   1   ...
          x1  x2  x3  x4  x5  ...
          z1  z2  z3  z4  z5  ... ]^{T}

The 1's serve to get a coefficient \alpha (the y-intercept, called a above), which estimates the average of all of the unexplained terms. Ignore this term; it doesn't mean anything.

As the number of measurements N grows, your estimate of \beta falls into a smaller and smaller range around the real value.
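
The closed-form slope above can be checked numerically. The following is a minimal sketch in plain Python, not code from the recitation: the function names `fit_through_origin` and `fit_line` and the education/wage numbers are made up for illustration. It computes the no-intercept slope \sum x_i y_i / \sum (x_i)^2, and the slope with an intercept as Cov(x,y)/Var(x).

```python
def fit_through_origin(x, y):
    """Slope b minimizing sum_i (y_i - b*x_i)^2 (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_line(x, y):
    """Intercept a and slope b minimizing sum_i (y_i - (a + b*x_i))^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Covariance and variance are taken around the sample means.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    b = cov_xy / var_x
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data: years of education (x) and hourly wage (y).
educ = [8, 10, 12, 14, 16]
wage = [9.0, 11.0, 15.0, 16.0, 19.0]

a, b = fit_line(educ, wage)
print(f"wage = {a:.2f} + {b:.2f} * educ")

# Demeaning both variables and fitting through the origin gives the same slope.
educ_dev = [xi - sum(educ) / len(educ) for xi in educ]
wage_dev = [yi - sum(wage) / len(wage) for yi in wage]
print(fit_through_origin(educ_dev, wage_dev))
```

On the demeaned data the through-origin slope matches the Cov(x,y)/Var(x) slope, which is why the last equality in the derivation needs x and y measured as deviations from their means.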

So now, to interpret a regression table:

                   (1) LS     (2) IV
    education      1.8        ---
                   [.9]       ---
    IQ             2          4
                   [3.3]      [.07]
    wealth         ---        10
                   ---        [11.1]

So for column (1), y = 1.8*educ + 2*IQ.

The [values] are the standard errors.

General rule of thumb: if |coefficient| > 2 * standard error, then it's likely that the coefficient is different from 0. If the true coefficient were 0, the estimate would land more than 1.96 standard errors away from the assumed mean (0) only about 5% of the time.
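
As a rough illustration, the rule of thumb can be applied to each entry of the table. This is a sketch under stated assumptions: the coefficients and standard errors are copied from the table, while the dictionary layout and the helper names `t_ratio` and `likely_nonzero` are invented for this example.

```python
# (coefficient, standard error) pairs copied from the table; "---" cells are omitted.
table = {
    "LS": {"education": (1.8, 0.9), "IQ": (2.0, 3.3)},
    "IV": {"IQ": (4.0, 0.07), "wealth": (10.0, 11.1)},
}


def t_ratio(coef, se):
    """How many standard errors the estimate lies from 0."""
    return coef / se


def likely_nonzero(coef, se, threshold=2.0):
    """Rule of thumb: |coefficient| must exceed threshold * standard error."""
    return abs(t_ratio(coef, se)) > threshold


results = {
    (column, name): likely_nonzero(coef, se)
    for column, rows in table.items()
    for name, (coef, se) in rows.items()
}

for (column, name), ok in results.items():
    coef, se = table[column][name]
    print(f"{column:>2} {name:<10} ratio = {t_ratio(coef, se):6.2f}  likely nonzero: {ok}")
```

Education in column (1) sits right at the boundary: its ratio is 2.0, which does not strictly exceed 2 but does exceed 1.96. IQ in column (2) clears the bar easily, while IQ in column (1) and wealth in column (2) do not.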