When Heinrich Schliemann excavated Troy, his team found layers of past civilizations: a striking physical manifestation of how the present depends on past accomplishments. Today, almost everything we think we know depends on a layer cake of processes that IT people call “the stack.” At the bottom is an integrated circuit chip; at the top, if we extrapolate to neurology, through layers of hardware and software, inference, and visualization, is the academic leader trying to predict enrollment next year or optimize debt payments.
Futurists like Nick Bostrom muse that the software stack may become self-aware and turn the world into paperclips, but for a university’s C-suite the challenge of this epistemological parfait lies in two questions: “What do we know?” and “How do we know that?” Given trustworthy answers to those questions, it’s their job to figure out “What do we do?”
Plato was right about shadows on the wall. By one estimate, our eyeballs process about a megabyte of raw data per second. So we fudge, using hacks and approximations to get anywhere at all. Then, by necessity, we fool ourselves into thinking the shadows are real. This works until it doesn’t, until the Trojan horse makes us think one thing when the opposite is true. Statisticians call this a Type S error: you aren’t just imprecise about a cause and its effect; the sign of the effect is actually backwards. The horse isn’t a signifier of victory; it’s a signifier of defeat.
This essay is about my stack and how it can help you avoid Type S errors.
Being responsible for Institutional Research at Furman University, I sit spider-like at the nexus of information strands, at the last outpost of technology before the data hits what cyberpunk authors call the wetware. I have to translate questions into numbers and back into answers: questions like “How do our lower-SES students perform relative to other students?”, “What’s the likely population of residence halls by class and gender next year?”, “What’s the most affordable variable to change to improve rankings?”, “Did my intervention program have an effect?”, and “If we change the probation policy, what will happen to retention?”
When I started this job in the 1990s, few good tools were available, and I wasn’t very expert at using them. I would print correlation tables of 200 variables on nine pages, tape them to the wall, and manually highlight in yellow the largest coefficients. Cleaning and exploring data was a tedious job in Excel, augmented later with Perl scripts, and the inevitable consequence of any report was “This is all great, Dave, but what about separating out the athletes?” The only way to do it was to start all over.
This personal story illustrates the three requirements for an effective analytical stack: organization, integration, and perception. It’s all about speed and accuracy, just like with the computer chip.
An example of organization: your institution, like mine, probably has a large amount of student survey data. Is it all in one place and well-indexed? Can you pull threads across time or demographics to reveal trends in student perceptions, goals, and outcomes? If all that data is sitting in random folders around campus, there’s too much friction to get good use out of it. Surveys are messy, but there are ways to get them into accessible databases, and it pays off by reducing the barriers between questions and answers.
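As a concrete (and entirely hypothetical) illustration of what “in one place and well-indexed” can mean, the sketch below loads two toy survey extracts into a single indexed SQLite table that can then be queried across years. The table, column, and survey names are invented; a real setup would read the actual survey files and point at a shared database rather than an in-memory one.

```r
library(DBI)
library(RSQLite)

# An in-memory database for illustration; a real one would be a shared file or server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two toy survey extracts standing in for files read with read.csv()
survey_2019 <- data.frame(PersonID = c(1, 2), Belonging = c(4, 5), Year = 2019)
survey_2020 <- data.frame(PersonID = c(2, 3), Belonging = c(3, 4), Year = 2020)

# Stack the years into one table and index it, so threads can be
# pulled across time with a single query instead of a folder hunt
dbWriteTable(con, "Survey", rbind(survey_2019, survey_2020))
dbExecute(con, "CREATE INDEX idx_survey_person ON Survey (PersonID)")

# A trend question becomes one query
trend <- dbGetQuery(con, "SELECT Year, AVG(Belonging) AS AvgBelonging
                          FROM Survey GROUP BY Year")
print(trend)
dbDisconnect(con)
```

Once the surveys live in one indexed table, the friction between a question and its answer drops to a single query.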
Integration means integration between data sources, like having student IDs on surveys when possible, and between stack layers. Having to switch back and forth between SQL queries and analytical code is a failure of integration, and it can be avoided. These few lines of code query a database and set up the result for analysis, all in R:
# dbc is an open database connection
tbl(dbc, "CourseSections") %>%
  filter(Term %in% c("2019D1", "2019E1")) %>%
  select(PersonID, Last, First, Subject, Number) %>%
  group_by(PersonID, Last, First) %>%
  summarize(Courses = paste0(Subject, "-", Number, collapse = ","),
            NCourses = n())
It seamlessly integrates the database query with data clean-up to begin the statistical work, meaning I don’t have to switch back and forth between SQL and R, and it efficiently does work on the server side that doesn’t need to be done on my desktop. But the integration of the R/RStudio stack I use goes much further than that. The same code gets used to do the analysis, write the report, and push an interactive app out to the intranet for exploring the data. Instead of three different jobs, it’s all one thing.
It’s not just that I can work 100 times faster than I could in the early days, although that’s true; the change is more fundamental than speed. It’s like the difference between doing a jigsaw puzzle in the dark and doing it with the lights on.
Remember that research question about low-SES students? When I gave the first draft of the report to the financial aid director, he told me the number of Pell students was too high. To fix it, I had to change a single line of code and then regenerate the report. All the charts and tables updated automatically; I just had to check that the conclusions I’d reached were still valid.
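For illustration only, here is a hypothetical version of that kind of one-line fix; the data frame, column names, and the nature of the original mistake are invented, not the actual report code.

```r
# Toy stand-in for the report's student aid data
students <- data.frame(PersonID  = 1:4,
                       AwardType = c("Pell", "Loan", "Pell", "Merit"))

# First draft (count too high): treated every aid recipient as a Pell student
# pell <- students

# The one-line fix: restrict to actual Pell grants
pell <- subset(students, AwardType == "Pell")

nrow(pell)  # downstream charts and tables now use the corrected count
```

Because the report is generated from code rather than pasted together by hand, a correction like this propagates everywhere on the next render.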
The organization and integration possible now are like the digital revolution in photography. Shooting on film, it took at least hours, if not days, to see the results. With digital, it’s instant. That makes the learning process much, much faster: you can just try things out to see what happens, and tune the settings in real time to get the picture you’re after.
Similarly, the right software stack, whether it’s R or Python or something else, when combined with highly organized institutional data, creates the possibility of rapid exploration. Our intelligence is tied to the rate at which we can entertain reasonable ideas and sort out the bad ones from the good ones.
The third requirement is perception: the wetware at the top of the automation chain. This is a person, or a team of people, who understand the analytical stack, know how the data is organized, and are familiar with the data itself. I think it is easy to get distracted by the glitter of software systems that advertise “predictive analytics” (an expensive way to say “statistics”) when investing in the right people and their professional development might be the higher priority. If your internal research team is going to help you avoid Type S errors, you want them highly qualified. The possibilities for analyzing your institutional data are endless, and the software is cheap. Knowing how to use it is dear.
Don’t be like Troy.