Friend vs. Foe: Data Analysis Edition
Written for Stat 503 after reading
- This paper on the housing crisis in the Bay Area,
- This interview with Hadley Wickham,
- Naomi Robbins’ blog at Forbes,
- Some entries on the Beautiful Data blog, and
- This column on variation.
There was a lot of material to read through this week, and I focused on the housing crisis paper (the friend), the post on the importance of variation (the foe), and the interview with Wickham because they were the ones I found most interesting.
I really liked the way Wickham et al. presented the data in their paper. To begin, they spent over two pages discussing their data collection and cleaning methods, which I haven’t seen in many other papers. There may be a paragraph or two describing data collection and/or cleaning, but I can’t think of another example with such a detailed description. This paper will be a great resource to go back to when I start collecting and cleaning a data set and get stuck, especially when I have geographical data. They also structured their analysis in a very clear and logical way, starting with the aggregated average weekly data, then breaking the analysis up, first into deciles of home prices and then by city. By breaking the data up this way, they were able to see a trend that was invisible in the aggregate: cheaper homes lost value (measured as proportional change) at greater rates than the more expensive homes. I also really liked that they showed a smoothed curve of average home price change in each city, so it was easy to see the different trends in each location. As it turns out, one city never experienced a drop in home prices after the crash, even as all the others did! There are many other excellent findings and plots in this article, but the above two were by far my favorites.
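To make those two moves concrete for myself, here is a rough pandas sketch of the general idea: put observations into price deciles, and track a smoothed proportional-change curve per city. The data, city names, column names, and decline rates are all invented for illustration; this is not the paper's actual code, data, or smoothing method.

```python
import numpy as np
import pandas as pd

# Made-up weekly average sale prices for two hypothetical cities.
rng = np.random.default_rng(0)
weeks = pd.date_range("2006-01-01", periods=104, freq="W")
frames = []
for city, start_price, total_drop in [("CityA", 800_000, 0.15),
                                      ("CityB", 350_000, 0.40)]:
    trend = start_price * (1 - total_drop * np.linspace(0, 1, len(weeks)))
    noise = rng.normal(0, 0.02 * start_price, len(weeks))
    frames.append(pd.DataFrame({"city": city, "week": weeks,
                                "price": trend + noise}))
sales = pd.concat(frames, ignore_index=True)

# Proportional change in value relative to each city's first week,
# so cheap and expensive markets are on a comparable scale.
sales["prop_change"] = (
    sales.groupby("city")["price"].transform(lambda p: p / p.iloc[0] - 1)
)

# Split observations into price deciles and compare average change.
sales["decile"] = pd.qcut(sales["price"], 10, labels=False) + 1
print(sales.groupby("decile")["prop_change"].mean())

# A smoothed (4-week rolling mean) curve of proportional change per city,
# a crude stand-in for the smoothed curves shown in the paper.
smoothed = (
    sales.set_index("week")
         .groupby("city")["prop_change"]
         .rolling("28D").mean()
)
print(smoothed.groupby("city").tail(3))
```

With the fake decline rates above, the lower price deciles show larger proportional drops, which echoes (but obviously does not reproduce) the pattern the paper reports.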
The post on mistakes businesses make when handling data and variation stood out because it made me think of my summer internship. I worked as a statistical analyst for a small corporation, which was a completely new experience for me: I had never worked for a corporation before. After reading this post, I had an “Aha!” moment of sorts, because I finally understood why I struggled so much to figure out what they wanted from me. They wanted to know which locations, teams, projects, etc. were most or least effective, and I had a difficult time with that. I now realize that the disconnect comes from a misunderstanding of variation. As the author says, “business leaders have typically been taught to treat everything they don’t like as having a ‘special cause’ reason as to why it happened, and thus want to investigate what one thing or person was responsible for causing the ‘aberration.’” The company was looking for the one bad egg or the one good egg, instead of looking for good or bad trends in their overall processes and procedures. If I ever find myself working for a corporation again, I will refer back to this article to make sure I don’t make any of the six critical mistakes!
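To digest the common-cause vs. special-cause point, here is a tiny toy sketch of what I wish I had done with, say, weekly error counts: compute rough 3-sigma control limits and only chase the weeks that fall outside them. The metric, the numbers, and the limits are all my own invention; the column itself doesn't give code.

```python
import numpy as np

# Made-up weekly error counts for one team; most of the week-to-week
# wiggle here is ordinary common-cause variation.
rng = np.random.default_rng(1)
weekly_errors = rng.poisson(lam=20, size=30).astype(float)
weekly_errors[17] = 55  # plant one genuine special-cause spike

# Rough 3-sigma control limits around the overall mean.
center = weekly_errors.mean()
sigma = weekly_errors.std(ddof=1)
upper = center + 3 * sigma
lower = max(center - 3 * sigma, 0)

for week, count in enumerate(weekly_errors, start=1):
    if count > upper or count < lower:
        print(f"week {week}: {count:.0f} errors is outside the limits -> "
              "worth asking what happened that week")
# Every other week is just noise around the process average; singling out
# the "best" and "worst" of those weeks is the mistake the post describes.
```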
Finally, I just wanted to briefly say that I really hope Hadley Wickham and RStudio make that statistical learning package he speculated about in the interview. I think that package could be incredibly useful and important! So, Hadley, if you need some help writing it, let me know! (I’m mostly kidding, for now. But hopefully after this course and STAT 602, I will be able to contribute something! Nudge, nudge, wink, wink.)