## Regression Bask

[**UPDATE** below.]

I’m seeing someone do something in a paper that strikes me as odd. So let me ask some of you stats guys what you think:

Suppose I run a regression to see what effect independent variables X1, X2, …, Xn have on Y. I come up with my regression coefficients on each of them. The coefficient on X3 is (say) 5.

Then I take out X3 from my list, and run the regression again. Obviously the coefficients on the other variables change.

The thing is, the R-squared in both regressions is about the same. I.e., when I took out variable X3, the “fit” of my predicted curve to the actual curve is about the same.

Would it be correct for me to say, “According to my regression analysis, X3 has no effect on Y”?
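To make the setup concrete, here is a toy simulation (pure Python, no libraries; all the data and variable names are invented for illustration). X3 genuinely drives Y, yet removing it barely moves the R-squared, because X3 is nearly a copy of X1:

```python
def ols(cols, y):
    """OLS with an intercept, via the normal equations and Gaussian elimination."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):                                # forward elimination with pivoting
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    coef = [0.0] * k
    for p in range(k - 1, -1, -1):                    # back substitution
        coef[p] = (b[p] - sum(A[p][q] * coef[q] for q in range(p + 1, k))) / A[p][p]
    return coef

def r2(cols, y, coef):
    """R-squared = 1 - SSR/SST for a fitted model."""
    n = len(y)
    yhat = [coef[0] + sum(coef[j + 1] * cols[j][i] for j in range(len(cols))) for i in range(n)]
    ybar = sum(y) / n
    return 1 - sum((y[i] - yhat[i]) ** 2 for i in range(n)) / sum((v - ybar) ** 2 for v in y)

n = 50
x1 = [i / 10 for i in range(n)]
x2 = [float((i * 7) % 13) for i in range(n)]
x3 = [x1[i] + 0.01 * ((i * 3) % 5 - 2) for i in range(n)]   # x3 nearly duplicates x1
eps = [0.1 * ((i * 11) % 7 - 3) for i in range(n)]          # small deterministic "noise"
y = [2 * x1[i] + x2[i] + 5 * x3[i] + eps[i] for i in range(n)]

r2_full = r2([x1, x2, x3], y, ols([x1, x2, x3], y))   # with X3
r2_reduced = r2([x1, x2], y, ols([x1, x2], y))        # without X3
```

In this toy run both R-squared values come out above 0.99 and differ by a hair, even though X3's true coefficient is 5, because X1 carries essentially the same information.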

**UPDATE:** I didn’t want to say what the regression was initially, because now politics will get involved. But it’s from this paper (Tables 2 and 3 at the end). My obvious concern with the economics of it, is that “refiner margin” is closely related to the implementation of Tier 2. I.e. the *way* Tier 2 regulations would (possibly) raise gasoline prices, is by first reducing refiner margins.

What was the coefficient on X3 compared to the other variables? If it was low, then yes you might be able to say that. BUT, you should also look at the covariance matrix for all the variables. If X3 is highly dependent on the others, then it’s possible to take it out and see little difference.

So my question to you is how independent is X3 to the other variables?

Speaking as an engineer (not a math genius):

You should plot every input variable against every other input variable and stare at it for at least a minute each pair. It is entirely possible that X3 closely tracks some other input variable (say X1 for example) and so when you plot X1 vs X3 you get a straight line.

The result is that X1 contains information, and X3 contains the same information, but X1 and X3 together contain no additional information that would not be available from one or the other. This is related to trying to discover a minimal set of independent input variables.
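That pairwise check can also be done numerically. A quick sketch (pure Python; the data are invented so that X3 closely tracks X1, as in the scenario above):

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
    va = sum((v - ma) ** 2 for v in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

n = 50
x1 = [i / 10 for i in range(n)]
x3 = [x1[i] + 0.01 * ((i * 3) % 5 - 2) for i in range(n)]  # X3 closely tracks X1
r13 = pearson(x1, x3)
```

Here `r13` comes out essentially 1.0: plotting X1 against X3 would give a straight line, which is exactly the "same information twice" situation described above.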

Note that you can get bunches of variables where there are three variables, but only contain two variables worth of information (i.e. any single one of the three can be left out, but not two). Or four variables that between them only contain three variables worth of information… etc.

Or it could simply be that X3 has no effect, but then the regression algorithm should give it a low score (I would expect).

Concur.

First off, the obvious answer: in a pure statistical sense, if the coefficient is anything but zero with ALL variables in (at some X% confidence interval), then it is non-significant. However, here you are saying it’s “5”. Is that 5 significant? (What is its t-stat?) If it is significant with all variables in, then yes, to the best of the model’s knowledge it is significant.

The reason the R^2 is remaining the same is likely (again, confidence interval) due to Omitted Variable Bias.

If you want, I’ll explain more formally.

“Then it is non-significant” should be:

“Then it **is significant**.” That’s a pretty major f***up.

Btw, Grant’s explanation is much better than mine. I was rushing to get off the computer and that was the fastest I could type. Re-reading it, I can’t even understand what I was trying to say, although you can see it’s the same idea as Grant’s.

Would it be correct for me to say, “According to my regression analysis, X3 has no effect on Y”?

No.

At least, the information that you have given doesn’t allow you to make that claim. For instance, if the other variables change with the omission of X3, then that indicates that you are probably dealing with a problem of omitted variable bias. The included regressors are unduly significant because they co-vary with the omitted X3 and that feeds through to the overall goodness-of-fit measurement (i.e. R-squared).
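That feed-through is easy to see in a toy example (pure Python; data and numbers are invented for illustration). X3 is built to co-vary with X1, so when X3 is omitted, X1's coefficient silently absorbs its effect:

```python
def ols(cols, y):
    """OLS with an intercept, via the normal equations and Gaussian elimination."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    coef = [0.0] * k
    for p in range(k - 1, -1, -1):
        coef[p] = (b[p] - sum(A[p][q] * coef[q] for q in range(p + 1, k))) / A[p][p]
    return coef

n = 50
x1 = [i / 10 for i in range(n)]
z = [float((i * 7) % 13 - 6) for i in range(n)]
x3 = [0.8 * x1[i] + 0.3 * z[i] for i in range(n)]        # X3 co-varies with X1
eps = [0.05 * ((i * 11) % 7 - 3) for i in range(n)]
y = [2 * x1[i] + 5 * x3[i] + eps[i] for i in range(n)]   # X1's true effect is 2

b_full = ols([x1, x3], y)     # b_full[1]: X1's coefficient with X3 included
b_reduced = ols([x1], y)      # b_reduced[1]: X1's coefficient with X3 omitted
```

With X3 in, X1's coefficient is close to its true value of 2; with X3 dropped, it balloons to roughly 2 + 5·0.8 ≈ 6, and the R-squared alone gives no warning that the reduced model's coefficients are biased.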

If you’re interested in whether X3 has an impact on Y, then why not simply check its t-value?

(Out of interest, what do the adjusted R-squares say?)

Grant’s answer is better than mine.

Following your update, the answer is clear from simply looking at the small t-value on “Tier 2” in Table 2 (i.e. 0.79 and corresponding p-value of 0.433). In other words, it is not a significant explanatory variable.

Moreover, your initial comment that “obviously the other coefficients all change” when Tier 2 is excluded is actually pretty misleading… The other variables hardly change at all! You’d have to conduct, say, a Chow test to be certain. However, a pretty good rule of thumb in this case is simply to look at the standard errors of “refiner margin” across the models and whether they overlap. In this case they clearly do, so you’d have a very tough time arguing that this “Tier 2” legislation is eating into the profit margins of refineries.

Where in the who is the whatsit?

Can you post output?

I’d say no, you can’t say that. It’s possible one of your other variables was endogenous so when you drop X3 that variable re-absorbs the effect of X3. You’re still explaining the same amount of variation but your new coefficients are biased because X3 did have an effect.

Can you think of why X3 would be correlated with some of your other regressors? If so – keep it in there.

Don’t let statistics dictate model specification!

The correct test for what you’re asking is a joint F-test. Does the paper mention it?
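For what it's worth, the mechanics behind that F-test are simple to sketch (pure Python; all data invented for illustration): fit the model with and without X3, and compare how much the residual sum of squares rises when X3 is dropped.

```python
def ols(cols, y):
    """OLS with an intercept, via the normal equations and Gaussian elimination."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    coef = [0.0] * k
    for p in range(k - 1, -1, -1):
        coef[p] = (b[p] - sum(A[p][q] * coef[q] for q in range(p + 1, k))) / A[p][p]
    return coef

def ssr(cols, y, coef):
    """Residual sum of squares for a fitted model."""
    n = len(y)
    yhat = [coef[0] + sum(coef[j + 1] * cols[j][i] for j in range(len(cols))) for i in range(n)]
    return sum((y[i] - yhat[i]) ** 2 for i in range(n))

def f_drop_one(cols_u, cols_r, y):
    """F-stat for dropping one regressor: ((SSR_r - SSR_u)/1) / (SSR_u/(n-k))."""
    n, k = len(y), len(cols_u) + 1
    ssr_u = ssr(cols_u, y, ols(cols_u, y))
    ssr_r = ssr(cols_r, y, ols(cols_r, y))
    return (ssr_r - ssr_u) / (ssr_u / (n - k))

n = 50
x1 = [i / 10 for i in range(n)]
z = [float((i * 7) % 13 - 6) for i in range(n)]
x3 = [0.8 * x1[i] + 0.3 * z[i] for i in range(n)]
eps = [0.05 * ((i * 11) % 7 - 3) for i in range(n)]

y_matters = [2 * x1[i] + 5 * x3[i] + eps[i] for i in range(n)]   # X3 truly matters
y_not = [2 * x1[i] + eps[i] for i in range(n)]                   # X3 irrelevant

f_matters = f_drop_one([x1, x3], [x1], y_matters)
f_not = f_drop_one([x1, x3], [x1], y_not)
```

When X3 genuinely matters, the F-statistic is enormous; when it doesn't, it stays down near the conventional critical values (around 4 at the 5% level for one restriction), which is the comparison the test formalizes.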

It’s all very vague, and stats was a long time ago, but I’d say no. DK is right about one variable being endogenous or proxy for another. Consider a regression on 5 things including being born in Poland in the 1950s, and being Catholic. For at least some data sets I can imagine the same effect you describe. More realistically I have seen people toss in a whole suite of things that together effectively proxy race and then conclude race can be eliminated and has no effect. Suspect.

Classic symptoms of high collinearity between the X variables. For example, suppose X3 is very highly correlated with X4. Then both X3 and X4 would have low statistical significance, and dropping either X3 or X4 would leave R^2 almost the same, but dropping both X3 and X4 might cause R^2 to drop a lot.

My guess is that if you regress X3 on all the other Xs (leave out Y), you would get a high R^2. That way you would also find out which other Xs X3 is correlated with.
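That auxiliary regression is exactly how a variance inflation factor (VIF) is computed. A quick sketch (pure Python; the invented data make X3 almost a copy of X1, and Y is not involved at all):

```python
def ols(cols, y):
    """OLS with an intercept, via the normal equations and Gaussian elimination."""
    n, k = len(y), len(cols) + 1
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    coef = [0.0] * k
    for p in range(k - 1, -1, -1):
        coef[p] = (b[p] - sum(A[p][q] * coef[q] for q in range(p + 1, k))) / A[p][p]
    return coef

def r2(cols, y, coef):
    """R-squared = 1 - SSR/SST for a fitted model."""
    n = len(y)
    yhat = [coef[0] + sum(coef[j + 1] * cols[j][i] for j in range(len(cols))) for i in range(n)]
    ybar = sum(y) / n
    return 1 - sum((y[i] - yhat[i]) ** 2 for i in range(n)) / sum((v - ybar) ** 2 for v in y)

n = 50
x1 = [i / 10 for i in range(n)]
x2 = [float((i * 7) % 13) for i in range(n)]
x3 = [x1[i] + 0.01 * ((i * 3) % 5 - 2) for i in range(n)]  # X3 ~ near-copy of X1

# Regress X3 on the OTHER regressors (no Y anywhere).
r2_aux = r2([x1, x2], x3, ols([x1, x2], x3))
vif = 1 / (1 - r2_aux)   # variance inflation factor for X3
```

Here `r2_aux` is essentially 1 and the VIF is in the thousands. The coefficients of the auxiliary regression also tell you *which* of the other Xs are doing the duplicating, just as suggested above.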

I’m saying about the same as Daniel. Whether you want to drop X3 or not depends on what you want the regression for. And remember, if all the Xs were totally uncorrelated with each other, we wouldn’t need multiple regression. Simple regressions of Y on each of the Xs in turn would do fine. We do multiple regression because we think there’s a chance the Xs might be correlated with each other.

I’m not very good at econometrics though, so don’t trust me.

It depends on whether or not we are in a liquidity trap.

+1

Colinearity is a definite possibility (as others have said).

I’m curious what you’d need before you *could* make the case that X3 doesn’t matter. I guess if all your X3 sample values [not the coefficient] were 0, you could make this case. Or maybe if the X3’s were all just the same value? Then your linear fit would be the same plane, just offset by that fixed value (so it’d matter in terms of contribution to Y, but not in terms of sensitivity).

“Would it be correct for me to say, “According to my regression analysis, X3 has no effect on Y”?”

No. The only way to say for sure that X3 has no effect on Y is if the regression analysis results in a zero (or near-zero) coefficient. Otherwise, it is having some effect that can be duplicated by other variables, and as others have pointed out, you should look for collinearity. It might be that in this case that X3 is caused by another variable you have already included (or a set of them), as well, which could invalidate the whole analysis.

Bob, a few other thoughts. Though the authors include a lag and diff on Brent, it’s not clear if they looked into any cointegration issues. Gas price looks non-stationary at least through 2008. Modelling the sensitivity of changes in price might then be more in order. I think they stacked the deck against Tier2-05 by using the Shocks0304 variable and hand-waving away any sulphur regulation impact by saying that other stuff was going on and that DOE said so. I bet if Tier2 were 04 instead, even if the dummy variable ramped to 1 by 05, the t-stat would at least be over 1 and thus contributing to R-bar squared. I would also estimate the model through mid-08 first, given the disruption. There are many other things to check if one only had the data! More later when time permits, including trans-log cost functions perhaps. Who knows, there may have been scale effects impacting price or margin!

“it’s not clear if they looked into any cointegration issues”

Sorry gofx, but they do. From FN # 12 on p. 6 of the technical appendix:

Formal statistical tests indicate that the “residuals” from the regressions are stationary. This indicates that they represent a “cointegrating vector” among the explanatory variables so that standard errors are consistently estimated, and that the regression results are meaningful (and not spurious). In particular, while the time series used in the regressions (i.e., gasoline price, Brent price, and refinery margin) are individually non-stationary, they are cointegrated.

At any rate, I would have been extremely surprised to find that gasoline prices were not cointegrated with the price of Brent crude for starters 🙂

On the subject of cointegration, here’s an older post that some of you may find interesting. (If the topic of cointegration doesn’t grab you, does it help to say that it is about gold prices and money supply?)

Tier 3 is a technology for protecting property rights. It improves property protection by destroying part of the sacrificing community’s wealth, which depresses the expected payoff of plundering them.

That comment made up for everything you wrote about Krugman and the national debt.

Bob,

Your thinking is right. The author would be well served to look at the mathematical definition of R^2 which is the amount of variance in Y explained by the model divided by the total variance in Y. So it’s not correct to say “According to my regression analysis, X3 has no effect on Y.” It’s correct to say that only a tiny portion of the variation in Y is explained by variation in X3.

Beyond that, saying “my regression analysis shows that X3 is unimportant” is a variation on the theme of “based on these results, I accept the null hypothesis.” Also, your author neglects that the regression analysis only tests the linear relationship. If the reality is, just to give one example, that X^2 + Y^2 = 1 (a circle), then X and Y are definitely related, but their relationship will not manifest itself in a high R^2 value.
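That nonlinearity point is easy to check numerically. Take points on the unit circle, so X and Y are tightly related by construction, and run a simple linear regression of Y on X (pure Python, invented data):

```python
import math

n = 200
xs = [math.cos(2 * math.pi * k / n) for k in range(n)]  # points on the unit circle:
ys = [math.sin(2 * math.pi * k / n) for k in range(n)]  # x^2 + y^2 = 1 for every pair

# Simple linear regression of Y on X.
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((xs[i] - mx) * (ys[i] - my) for i in range(n)) / sum((v - mx) ** 2 for v in xs)
resid = [ys[i] - (my + slope * (xs[i] - mx)) for i in range(n)]
r2_linear = 1 - sum(r * r for r in resid) / sum((v - my) ** 2 for v in ys)
```

The linear R^2 comes out essentially zero, even though every (X, Y) pair sits exactly on the circle: a perfect deterministic relationship that a linear fit simply cannot see.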

In fairness to your author, I think the author may be misapplying the valid procedure of comparing the F-stats, not R^2 values, from different regressions to jointly test whether two or more of the independent variables are significant. An example of the above is here.