Model-Vis is the New Black
Written for STAT 503 after reading this paper on model visualization by Wickham, Cook, and Hofmann.
What struck me the most about this paper was how intuitive the idea of model visualization seemed to me, yet I don’t think I’ve ever used it in a statistical modeling course. (Outside of those taught by Drs. Cook and Hofmann, of course!) I really do think the key difference is that of “d-in-ms” (data in the model space) versus “m-in-ds” (model in the data space). Every model diagnostic tool I can think of off the top of my head (e.g., residual plots) looks at the data in the model space, instead of the model in the data space. On its face, “m-in-ds” doesn’t seem like a terribly revolutionary idea, but I think in some ways it might be. I’m reminded of the “old statistics vs. new statistics” idea that Granville wrote about in his blog (see my Jan. 14 post): this paper seems to be using some “new statistics” ideas (though still using a lot of “old statistics”), and I could see some “old statistics” proponents disliking it a lot.
I definitely think we can learn more about modeling and about data by using the “m-in-ds” approach instead of the “d-in-ms” approach. For instance, in Figure 6, they show the data, the boundaries between the groups, the prediction regions for each of the 3 groups separately, then the prediction regions for all 3 groups at the same time. I really appreciated this figure because it demonstrates the “huge-ness” of the 3D space that the data inhabit. Being able to see how the model fills up the 3D space is crucial if one wants to reach a true understanding of the behavior of the model.
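To make the “m-in-ds” idea concrete, here is a minimal sketch in Python. It is not the paper's code or data and does not use the classifly package: it assumes synthetic three-group data in 3D and a nearest-centroid classifier as a stand-in model, and the names `centers`, `predict`, and `axis_grid` are made up for illustration. Rather than looking only at predictions for the observed points, it evaluates the model on a grid that fills the data space, so the predicted region for each group can be inspected.

```python
import itertools
import math
import random

random.seed(503)

# Synthetic stand-in data (an assumption, not the paper's data):
# three groups of 50 points each, scattered around fixed centers in 3D.
centers = {"A": (0, 0, 0), "B": (3, 0, 1), "C": (0, 3, 2)}
data = [
    (label, tuple(random.gauss(c, 1.0) for c in center))
    for label, center in centers.items()
    for _ in range(50)
]


def centroid(points):
    """Coordinate-wise mean of a list of 3D points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


# Stand-in model: a nearest-centroid classifier fit to the data.
centroids = {
    label: centroid([p for lab, p in data if lab == label])
    for label in centers
}


def predict(point):
    """Return the label of the closest group centroid."""
    return min(centroids, key=lambda lab: math.dist(point, centroids[lab]))


def axis_grid(values, steps=15):
    """Evenly spaced values covering the observed range of one variable."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


# m-in-ds: evaluate the model over a grid spanning the whole data space,
# then collect the grid points that fall in each predicted region.
axes = [axis_grid(vals) for vals in zip(*(p for _, p in data))]
grid = list(itertools.product(*axes))
regions = {}
for point in grid:
    regions.setdefault(predict(point), []).append(point)

for label, points in sorted(regions.items()):
    print(label, len(points), "grid points")
```

Each entry of `regions` is a cloud of grid points that could be drawn alongside the raw data, one group at a time or all together, which is roughly the sequence of views described above.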
Another figure that really stood out to me was Figure 11. The plots are not doing anything terribly revolutionary: they’re just summarizing the model summary statistics on a standardized scale for a series of models fit to the data. In fact, I’ve done this type of thing before in a methods course, but I don’t think I’ve ever standardized and plotted all the statistics for each fitted model on one plot when doing model selection. (I usually just look at the numbers in a table.) I had a similar reaction to Figure 12. I’ve fit several models to data and compared the values of the coefficients, but it never before occurred to me to plot all these coefficients in order to determine which one is the most important to the model. (Perhaps I’m giving myself too much credit here. Should I have been doing everything in this way from the start of my statistics career?)
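The standardizing step is simple enough to sketch. The following Python example is a rough illustration under stated assumptions, not the paper's method or the meifly package: the data are simulated, the models are all subsets of three hypothetical predictors (`x1`, `x2`, `x3`) fit by ordinary least squares, and the helper names `solve` and `fit_stats` are invented here. Each summary statistic is converted to a z-score across the fitted models, so that R², adjusted R², AIC, and BIC land on a common scale and could share one plot.

```python
import itertools
import math
import random
import statistics

random.seed(503)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_stats(xs, y):
    """Fit OLS with an intercept; xs is a list of predictor columns."""
    n = len(y)
    cols = [[1.0] * n] + xs
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = statistics.mean(y)
    tss = sum((yi - ybar) ** 2 for yi in y)
    p = len(cols)  # number of coefficients, including the intercept
    r2 = 1 - rss / tss
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - p),
        # AIC and BIC up to an additive constant shared by all models.
        "aic": n * math.log(rss / n) + 2 * p,
        "bic": n * math.log(rss / n) + p * math.log(n),
    }


# Simulated data (an assumption): y depends on x1 and x2 but not on x3.
n_obs = 60
predictors = {
    name: [random.gauss(0, 1) for _ in range(n_obs)]
    for name in ("x1", "x2", "x3")
}
y = [
    2 * a - b + random.gauss(0, 1)
    for a, b in zip(predictors["x1"], predictors["x2"])
]

# Fit every non-empty subset of the predictors.
models = {}
for k in range(1, len(predictors) + 1):
    for combo in itertools.combinations(predictors, k):
        models["+".join(combo)] = fit_stats([predictors[c] for c in combo], y)

# Standardize each statistic across models (z-scores).
stat_names = ["r2", "adj_r2", "aic", "bic"]
standardized = {}
for stat in stat_names:
    values = [m[stat] for m in models.values()]
    mean, sd = statistics.mean(values), statistics.stdev(values)
    standardized[stat] = {
        name: (m[stat] - mean) / sd for name, m in models.items()
    }

for name in models:
    print(name, [round(standardized[s][name], 2) for s in stat_names])
```

Once the statistics share a scale, each model becomes a short profile of values, which is what makes a single comparison plot possible where a table of raw numbers would mix incompatible units.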
Finally, I was excited to learn about all of the R packages that are available to perform model visualization: classifly, clusterfly, meifly, in addition to ones I have already seen, like ggobi and ggvis. I look forward to exploring them more in this course!