Thinking Critically About Statistics and Their Sources

In the sciences, we use theory and methods to empirically assess
“reality”. While we can often play with data to explore the relationships
between our concepts (our variables), it is important to frame what we’re doing
with good theory.

An interesting graph has been making the rounds on social
media lately. It shows a strong relationship between Internet Explorer market
share and murders in the U.S.

Source: http://gizmodo.com/5977989/internet-explorer-vs-murder-rate-will-be-your-favorite-chart-today

I first encountered this graph on Facebook when a friend
sent it to me so I could use it in my statistics classes. When I post graphs or
other forms of data, I like to include the source so that students can assess
for themselves whether the data are accurate and can be trusted.

In my search for its source, I first thought it had originated
on Twitter (1/21/2013), but then traced it to reddit and imgur (1/18/2013),
where the posts cite each other as the source. Gizmodo picked it up on 1/22/2013
and it reached me in April 2013.

The comments on the Gizmodo site provide other examples of such
spurious relationships: telephone poles and rapes, temperature and number of
pirates, and other nonsensical pairs that co-exist but are not directly related.

These graphs show clearly that correlation does not equal
causation. Statistics may reveal a correlation between two variables, yet a
correlation alone does not prove that one variable causes a change in the
other.

This chart is a great example of how two sets of data may show
similar trends, which we may interpret as a relationship, although there is no
logical reason why a relationship would exist between them. The two trend lines
could match by simple coincidence. Or the apparent relationship could be
spurious, arising because both variables are related to other variables that
are unmeasured in this analysis.
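
To see how easily shared trends produce a strong correlation, here is a minimal Python sketch. The two series are made up for illustration (they are not the data behind the chart), and the names series_a and series_b are my own placeholders. Each series declines over time but is generated independently of the other. One common check, assumed here, is to correlate the period-to-period changes instead of the raw levels.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def changes(series):
    """Period-to-period changes; this removes a straight-line trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
periods = range(20)

# Hypothetical data: both series fall over time, but the noise in one
# has nothing to do with the noise in the other.
series_a = [100 - 3 * t + rng.gauss(0, 2) for t in periods]
series_b = [5000 - 120 * t + rng.gauss(0, 80) for t in periods]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The correlation of the levels comes out close to 1 simply because both series move in the same direction over time. Once the trend is removed, the correlation of the changes is much weaker, which is what we would expect from two unrelated series.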

In any case, I could not find the source of the data used to
create the graph, so this “data” could be fabricated and is most likely
false. I did find a site, geek.com, that posted a graph of “real” data, in
which the trend lines for these two variables are not as similar.

Source: http://www.geek.com/microsoft/does-internet-explorers-falling-market-share-mirror-the-drop-in-us-homicides-1537095/

The geek.com post, however, did not name a source for the data.
At the end of the post, they included a link to the Twitter post
mentioned previously and to a blog. The graph came from that blog, which cited
its data sources clearly and with links: Wikipedia and w3schools.com. Some of
the Wikipedia data is attributed to the Bureau of Justice Statistics, but it is
unclear what the source was on the “Crime in the United States” Wikipedia page.
w3schools is a web development site that logged its own tally of browsers
used. Used how? That isn’t clear.

Is murder rate the same as homicide rate? Is Internet
Explorer market share the same as browser usage? These are just some of the
problems with how these concepts were defined and measured.

OK, then. What do we know for sure here? Is murder rate
closely linked with Internet Explorer usage (or market share) or not?

Is there a good reason to find this out? Will it tell us
anything about those two phenomena?

Generally, this question raises the importance of theory to
guide our research and statistical analysis. Theory provides the reasons why
things in society may be connected and helps us understand how those
connections might work.

You may notice that high-quality published research starts
with a review of theory and the literature to assess what previous research has
discovered about the phenomena in question. We use that review to guide how we
investigate “reality”: we might replicate a study to confirm that its findings
can be found again, or we might pursue a new angle.

In any case, there must be some logical reason why we
connect two variables: why might they be related and how might that work? To
answer that, we need theory, clear definitions of what we’re looking at (or
for), and valid and reliable data. We might find that something as simple as
population density affects both of the original variables.
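
A small simulation can illustrate how a third variable creates an apparent relationship. This is only a sketch under stated assumptions: the numbers are simulated, "density" stands in for a hypothetical common cause, and the effect sizes (2 and 3) are arbitrary choices of mine. It compares the raw correlation between two outcomes with the first-order partial correlation, which holds the third variable constant.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical common cause (think of a standardized density score).
density = [rng.gauss(0, 1) for _ in range(n)]

# Each outcome depends on density plus its own independent noise.
# Neither outcome has any direct effect on the other.
outcome_1 = [2 * z + rng.gauss(0, 1) for z in density]
outcome_2 = [3 * z + rng.gauss(0, 1) for z in density]

r_raw = pearson(outcome_1, outcome_2)
r_partial = partial_corr(outcome_1, outcome_2, density)

print(f"raw correlation:                 {r_raw:.2f}")
print(f"partial correlation given cause: {r_partial:.2f}")
```

The raw correlation is strong even though the two outcomes never influence each other. Once the common cause is held constant, the partial correlation falls to roughly zero. Of course, this only works when theory tells us which third variable to measure in the first place.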

If we cannot find the source of the data or details of the
research, we should not trust that the illustrated relationship is real and
accurate.

Have you heard of other relationships that sound odd or
illogical? If so, has research uncovered the reason(s) why they might be
related? Was this relationship established by empirical data and is its source
reliable?