05 Mar 2011

## Econometrics Bask

This is a geek question: I am working with a data set where we are trying to determine if more occurrences of something lead to a higher probability that another condition will occur. This calls for a logit model and that’s what we’re doing. These aren’t the numbers but they’re close:

| Occurrences for a person | % of people with condition |
|---|---|
| 1 | 68% |
| 2 | 71% |
| 3 | 74% |
| 4 | 77% |
| 5 | 81% |
| … | … |
| 10 | 85% |
| 11 | 85% |
| 12 | 85% |

So in other words, if you look at all the people in the sample who had only one occurrence of the independent variable, then 68% of them had the condition, and 32% didn’t have it. If you look at all the people who had two occurrences, then 71% had the condition and 29% didn’t. Etc.

So it looks like there is definitely an effect as we go from 1 through 8 or 9 of the occurrences, but then it plateaus and further occurrences don’t increase the % in the population who have the condition.

Any thoughts on this? Is there any justification for just running a logit with the data from 1 to 8 (or whatever) occurrences, coming up with the constant and the other coefficient, and then just not extrapolating beyond that range of occurrences?

Or do you always have to include the whole data set, even though it seems throwing in the rest of the data points will swamp the effect and make it seem as if the marginal impact of occurrences is lower than it really is upfront?

Finally, is it better just to show the actual averages (as I’ve done hypothetically above) and use that to estimate the marginal impact of occurrences, from 1 to 8, etc.?
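A logit that is linear in occurrences assumes each extra occurrence adds the same amount to the log-odds of having the condition. As a rough check, the sketch below converts the shares in the table above to log-odds and prints the average step between consecutive listed rows. It uses only the rows shown in the post (which are themselves approximations, and rows 6 through 9 are not listed); the names `shares`, `log_odds`, and `steps` are just illustrative.

```python
import math

# Shares copied from the table in the post; rows 6-9 are not listed there.
shares = {1: 0.68, 2: 0.71, 3: 0.74, 4: 0.77, 5: 0.81,
          10: 0.85, 11: 0.85, 12: 0.85}

def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

logits = {k: log_odds(p) for k, p in shares.items()}

# Average change in log-odds per extra occurrence between listed rows.
rows = sorted(shares)
steps = {}
for lo, hi in zip(rows, rows[1:]):
    steps[(lo, hi)] = (logits[hi] - logits[lo]) / (hi - lo)
    print(f"{lo:>2} -> {hi:>2}: {steps[(lo, hi)]:+.3f} log-odds per occurrence")
```

If the steps are roughly constant at low counts and then fall to zero, a single linear index will have trouble matching both ranges at once.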

#### 9 Responses to “Econometrics Bask”

1. Daniel Kuehn says:

As for your last question – what is it that you’re looking at exactly? That would largely determine it.

Do the log of the occurrence number, or some kind of polynomial – that will allow you to incorporate the plateau. Or even more fancy – do a local linear regression. I would not simply drop data.
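One crude way to see whether a log transform helps is to compare how linearly the log-odds line up against occurrences versus against log(occurrences). The sketch below does that with a plain Pearson correlation on the rows listed in the post. It is not a fitted model: group sizes are not given, so every row is weighted equally, and `pearson` is a small helper written here for illustration.

```python
import math

# Shares copied from the table in the post; rows 6-9 are not listed there.
shares = {1: 0.68, 2: 0.71, 3: 0.74, 4: 0.77, 5: 0.81,
          10: 0.85, 11: 0.85, 12: 0.85}

def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def pearson(xs, ys):
    """Unweighted Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = sorted(shares)
ys = [log_odds(shares[x]) for x in xs]

r_linear = pearson(xs, ys)                      # index linear in occurrences
r_log = pearson([math.log(x) for x in xs], ys)  # index linear in log(occurrences)

print(f"corr(log-odds, occurrences)      = {r_linear:.3f}")
print(f"corr(log-odds, log(occurrences)) = {r_log:.3f}")
```

A higher correlation for the log version would suggest the transformed regressor tracks the flattening better, though the real test is a logit fitted to the individual-level data.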

2. hawk30 says:

If there’s a nonlinear relationship, include the squared term:

Y = B1 + B2X + B3X^2

Then, the marginal impact is nonconstant and varies with X.
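To make the "nonconstant" point concrete, differentiate the specification above with respect to X. If the same quadratic is used as the index inside a logit (an assumption here, since the comment writes it as a linear equation), the marginal effect on the probability P picks up an extra P(1 − P) factor:

```latex
\frac{\partial Y}{\partial X} = B_2 + 2 B_3 X

% Logit case, with P = \frac{1}{1 + e^{-(B_1 + B_2 X + B_3 X^2)}}:
\frac{\partial P}{\partial X} = P\,(1 - P)\,\bigl(B_2 + 2 B_3 X\bigr)
```

With B2 > 0 and B3 < 0 the effect shrinks as X grows and reaches zero at X = −B2 / (2·B3). Past that point a fitted quadratic turns downward rather than staying flat, which is worth checking against the plateau in the data.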

3. Matt Flipago says:

Maybe I’m missing something, but why don’t you normalize the data? Simply take the data and divide it by .85. Then, once you do the regression, rescale it. Do this only if the probability really has an upper bound less than 1 and occurrences have a marginal impact below that bound.
As for averaging the data to have single points, I don’t think I can give an answer there.
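A minimal sketch of the rescaling idea, assuming the ceiling really is 0.85 (the plateau value in the post's table; with real data it would have to be estimated rather than read off). `CEILING`, `rescaled`, and `predicted_share` are illustrative names. One thing the sketch surfaces is that the plateau rows rescale to exactly 1, where the log-odds are not finite.

```python
import math

CEILING = 0.85  # assumed upper bound, read off the plateau in the post's table

# Shares copied from the table in the post; rows 6-9 are not listed there.
shares = {1: 0.68, 2: 0.71, 3: 0.74, 4: 0.77, 5: 0.81,
          10: 0.85, 11: 0.85, 12: 0.85}

# Step 1: divide by the ceiling so the rescaled shares run up to 1.
rescaled = {k: p / CEILING for k, p in shares.items()}

def log_odds(p):
    """Log-odds of p; infinite at the boundaries 0 and 1."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1.0 - p))

def predicted_share(index, ceiling=CEILING):
    """Step 2: map a logit index back to the original scale, ceiling * sigmoid(index)."""
    return ceiling / (1.0 + math.exp(-index))

for k in sorted(rescaled):
    print(f"{k:>2}: rescaled share = {rescaled[k]:.3f}, log-odds = {log_odds(rescaled[k]):.3f}")
```

Because of those boundary rows, this approach would more naturally be fitted as a logistic curve with a ceiling parameter on the original data than by transforming the averaged shares.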

4. Captain_Freedom says:

It looks like your data follow a “sigmoid” probability function. A regular logit function, the one you want to use, is actually the inverse of a sigmoid function.

Sigmoid functions are functions whose initial increases rise modestly in relation to an independent variable (in your case it looks to be 3% increase with each occurrence), then the probability increases at a growing rate, hitting a maximum rate of increase (in your case it is 4% per occurrence from 4 to 5 occurrences), and then the rate of increase again falls back down, approaching zero, plateauing asymptotically to a maximum (in your case it is 85%).

If you visualize your data, it looks like an “S” letter shape. A logit function on the other hand has a “Z” letter shape. That means we should use sigmoid, which will be of the form P(occurrence) = 1/[1 + e^(-occurrence)]
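For reference, under the usual definitions the logistic (sigmoid) function maps any real number to a probability, and the logit is its inverse, mapping a probability back to log-odds. A logit regression fits the S-shaped curve to the probabilities, typically with a linear index such as `a + b * occurrences` inside it rather than the raw count. The coefficients `a` and `b` below are made-up values for illustration, not estimates from the post's data.

```python
import math

def sigmoid(z):
    """Logistic function: real number -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of sigmoid: probability in (0, 1) -> log-odds."""
    return math.log(p / (1.0 - p))

# Round trip: applying one after the other returns the starting value.
print(sigmoid(logit(0.85)))
print(logit(sigmoid(2.0)))

# Hypothetical coefficients, for illustration only.
a, b = 0.6, 0.15
for occurrences in (1, 5, 10):
    p = sigmoid(a + b * occurrences)
    print(f"{occurrences:>2} occurrences -> predicted probability {p:.3f}")
```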

If the probability plateaus at 10 occurrences, and remains at 85% beyond 10 occurrences, then a model that includes ALL data will get weaker and weaker in terms of being able to predict in the lower occurrences range. I would include data only up to the first time it hits 85%, since you don’t need to predict beyond that. There is no reason to include all of the 85%’s above that, because you will not be predicting beyond 10 or so occurrences, as you already know it will be 85%.

I agree with your intuition that including all data will swamp out the initial data and thus make your predictive model weaker. Imagine having data up to 1,000,000,000 occurrences. There is variability of probability only in the first 10 occurrences, and from 10 occurrences to 1 billion occurrences, it remains at 85% probability. A fitted predictive sigmoid function that includes ALL the data will resemble a straight line with almost zero positive slope, something like the function P(occurrence) = 85% + 10^(-99) * (occurrence). The predicted probability implied here will be close to 85% no matter what the number of occurrences is, because your function “thinks” that because you gave it so many examples of occurrences where the probability is 85%, then the 10 occurrences and below would be treated more like “outliers”, when that range is actually the most important range you want to analyze.
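The swamping intuition can be checked on the post's illustrative rows with a small grouped logit fitted by Newton's method. This is only a sketch under loud assumptions: the post gives no group sizes, so every row is treated as 100 people (`n_per_group` is hypothetical), rows 6 through 9 are missing, and `fit_grouped_logit` is a helper written here, not a library routine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_grouped_logit(xs, shares, n_per_group=100, iterations=25):
    """Fit P = sigmoid(a + b*x) to grouped shares by Newton's method.

    n_per_group is a hypothetical, equal group size for every row.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, shares):
            p = sigmoid(a + b * x)
            r = n_per_group * (y - p)        # score contribution
            w = n_per_group * p * (1.0 - p)  # information weight
            g_a += r
            g_b += r * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Solve the 2x2 Newton system by hand.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Rows listed in the post's table (rows 6-9 are not given there).
low_x = [1, 2, 3, 4, 5]
low_y = [0.68, 0.71, 0.74, 0.77, 0.81]
all_x = low_x + [10, 11, 12]
all_y = low_y + [0.85, 0.85, 0.85]

a_low, b_low = fit_grouped_logit(low_x, low_y)
a_all, b_all = fit_grouped_logit(all_x, all_y)

print(f"rows 1-5 only: constant = {a_low:.3f}, slope = {b_low:.3f}")
print(f"all rows:      constant = {a_all:.3f}, slope = {b_all:.3f}")
```

On these illustrative numbers the slope from all rows comes out smaller than the slope from rows 1 to 5, in line with the intuition above. That points to a linear index being the wrong shape over the whole range; it does not by itself settle whether rows should be dropped or a nonlinear term added.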

I think your last question is best answered once you have a fitted sigmoid regression model that fits the data.

5. RG says:

I’ll usually include the entire data set in the packet as an appendix.

6. Doc Merlin says:

Better to include the entire sample and add a squared term.
So you regress

Y=A*X1+B*X2+C*Z

where X2=X1^2
and Z=control

This will probably work if you want to use a logit model.

7. Doc Merlin says:

bah, forgot to include a constant term.

8. GSL says:

When modeling event counts I usually go with a negative binomial regression framework rather than a logit. On the other hand, if the distinctions in each level of the dependent variable are qualitatively meaningful, and you have a large sample, you might try ordered probit.

• Captain_Freedom says:

> On the other hand, if the distinctions in each level of the dependent variable are qualitatively meaningful, and you have a large sample, you might try ordered probit.

Don’t think that would work, since in Murphy’s dataset, the probabilities are not mutually exclusive “bins”. They’re cumulative, i.e. overlapping. But then again, I could have no idea what I’m talking about.