Maria Cuellar and I were on a long drive back from a conference recently, and to keep ourselves entertained we had a wide-ranging argument about the difference between prediction, inference and causal inference. Yeah, this really is how statisticians have fun.

I was confused about where inference fit in the whole story. I figured, prediction is just fitting a model to get the best $\hat{y}$, regardless of the “truth” of the model. If I find some coefficients $\hat{\beta}$, I’m only saying that if I plug in some new $x$, I’ll predict a new $y$ according to this model. Easy.

If I care what the real relationship is between variables, I’m doing inference, right? That is, I claim $y = \beta x + \epsilon$ because I think that every unit increase in $x$ really implies a $\beta$ increase in $y$, with some normal error. In other words, I think that when $y$ was being generated, it really was generated from a normal distribution with mean $\beta x$ and some variance. I’ll get confidence intervals around my coefficients and say that I’m 95% sure about my conclusions.

But I’m playing fast and loose with language here. When I say “implies” do I mean “causes”? Most people will quickly and firmly say no to that can of worms. But when people talk about regression, they will often say that $x$ *affects* $y$. “Affects” is just a different word for “causes,” so… what’s the deal? How is this not (poor) causal inference?

Well, it’s sort of still my impression that it is. But that doesn’t mean there isn’t such a thing as inference that’s totally separate from causal inference.

Inference asks the question: from this sample, what can I learn about a parameter of the entire population? So if I estimate the median of a sample, I can have some idea of what the median is in the whole population. I can put a confidence interval around it and be totally happy. This isn’t the same as prediction and prediction intervals, because I’m not asking about the median of some future sample and how sure I am that my guess of it will be in the right range. I’m asking about the real, true, underlying median in the population.
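To make the median example concrete, here is a minimal sketch of one way to get such an interval: a percentile bootstrap. Everything in it is an assumption for illustration, not something from the car argument: the exponential “population,” the sample size, the helper name `bootstrap_median_ci`, and the choice of the bootstrap itself rather than some other interval method.

```python
import random
import statistics


def bootstrap_median_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the population median.

    Hypothetical helper for illustration: resample the data with
    replacement, recompute the median each time, and read off the
    alpha/2 and 1 - alpha/2 quantiles of those resampled medians.
    """
    rng = random.Random(seed)
    n = len(sample)
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Assumed population: exponential with rate 1 (its true median is ln 2).
rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(500)]

point = statistics.median(sample)
ci_low, ci_high = bootstrap_median_ci(sample)
print(f"sample median: {point:.3f}")
print(f"95% CI for the population median: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval is a statement about the population median, a fixed number out there in the world. It says nothing about where the next individual observation, or the median of the next sample, will land; that would be a prediction interval, which is a different calculation.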

So what about that regression example? Well, inference will say, there is a true $\beta$ in the population, such that if I fit the same regression to the entire population I would get back $\beta$. Does that mean that $\beta$ has any real meaning? No. It’s some number that exists and I can get a confidence interval around. But if my model is wrong, the coefficients don’t say anything particularly interpretable about the relationship between $x$ and $y$.
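Here is a small simulation of that last point, as a sketch under made-up assumptions: a finite “population” in which $y$ depends on $x$ only through $x^2$, and a straight line fit to it anyway. The data-generating process, the sizes, and the helper `ols_slope` are all hypothetical choices for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(0)

# Assumed population: y is driven entirely by x, but through x**2.
pop_x = [rng.uniform(-1, 1) for _ in range(100_000)]
pop_y = [x ** 2 + rng.gauss(0, 0.1) for x in pop_x]

# The "true" slope: what the linear fit returns on the whole population.
beta_pop = ols_slope(pop_x, pop_y)

# What I actually have: a sample of 500, and an estimate of that slope.
idx = rng.sample(range(len(pop_x)), 500)
beta_hat = ols_slope([pop_x[i] for i in idx], [pop_y[i] for i in idx])

print(f"population slope: {beta_pop:.4f}")
print(f"sample estimate:  {beta_hat:.4f}")
```

The population slope here sits near zero, and the sample estimate is a perfectly reasonable estimate of it. So inference is working fine: there is a real number, and the sample tells me about it. But reading that near-zero slope as “$x$ has nothing to do with $y$” would be badly wrong, because in this setup $y$ is almost entirely determined by $x$. The number exists; the linear model just gives it no useful interpretation.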

All that to say, Maria was right and I’m sorry.