By Roger Weber
Sabermetrics has been improving the way fans enjoy baseball for
30 years. Bill James has been publishing Baseball Abstracts since 1977 and now books are being released with findings based
on watching and studying every baseball game since 2003. So I don't think I have to invent new statistics. Even if I tried,
my fairly amateur approach would pale next to the work of some of the great "sabermetricians".
But I can use some existing statistics to make some findings about
other statistics. Introducing correlation tables: tables of correlation values between certain stats. Correlation is
easy to see on a graph. If you have, say, batting averages and hit totals for a set of players, you can plot those
two sets of data against each other. Most likely, with those two stats, you will get a set of dots that fall close
to a straight line. A graph like that shows a high correlation between the variables being measured.
Any advanced computer graphing system or calculator can measure the correlation
between two sets of data using a figure called a correlation coefficient. This is a value between -1 and 1 that tells how
closely the sets of data correlate. For this project it makes most sense to measure a linear correlation because we want to
see whether the points form a straight line, that is, whether the other variable steadily increases or decreases as one
variable increases. If the correlation coefficient is positive, it means that as one variable increases, so does the other.
If it is negative, it means that as one variable increases, the other decreases. A coefficient farther away from zero means
the data correlate more closely to a straight line. A correlation coefficient close to zero indicates that there is no
discernible linear relationship between these two sets of data. It doesn't mean there isn't a relationship between the two
variables. It just means that it doesn't show up as a straight line in these sets of data. (Strictly, the coefficient itself
is written "r". The "r^2" label used below is its square, which runs from 0 to 1 and so cannot show the direction of a
relationship.)
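For anyone curious what the calculator is doing, here is a minimal Python sketch of the usual (Pearson) linear correlation coefficient. The function name `pearson_r` and all of the sample numbers are made up for illustration; they are not the data behind this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each moves on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable never changes")
    return cov / math.sqrt(var_x * var_y)

# Invented hit totals and batting averages for five hypothetical players.
hits = [120, 150, 165, 180, 200]
batting_avg = [0.240, 0.265, 0.280, 0.295, 0.320]
print(f"hits vs. batting average: r = {pearson_r(hits, batting_avg):.2f}")

# A clear relationship that is not a straight line: y is x squared.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(f"x vs. x squared: r = {pearson_r(xs, ys):.2f}")
```

The second example is the situation described above: the two variables are obviously related, but not along a straight line, so the linear coefficient comes out at zero.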
Correlation doesn't necessarily imply a cause and effect relationship.
It just measures the numerical increases and decreases of variables against each other. The salary of schoolteachers in North
Dakota from 1990-2000 may closely positively correlate with the number of residents of Fresno, California each year during
the same period, but the two probably aren't causing each other. The first is likely due to inflation and the second due to
increases in population. But because we know population generally increases over time and inflation rises over time, we are
able to say with some degree of certainty that if we know that population is increasing, we can predict based on that fact
that inflation is rising. And we'd likely be right.
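To see how a shared upward trend can produce a high correlation on its own, here is a rough Python sketch. Every figure in it is invented (these are not real North Dakota salaries or Fresno population counts), and the helper names are my own. It compares the correlation of the yearly totals with the correlation of the year-to-year changes.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def yearly_changes(series):
    """Difference between each year's value and the previous year's."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Invented figures for 1990 through 2000; both simply drift upward over time.
teacher_salary = [20000, 20900, 21500, 22600, 23200, 24300,
                  24900, 26000, 26500, 27700, 28300]
fresno_population = [354000, 362000, 367000, 377000, 383000, 389000,
                     399000, 404000, 414000, 420000, 427000]

r_levels = pearson_r(teacher_salary, fresno_population)
r_changes = pearson_r(yearly_changes(teacher_salary),
                      yearly_changes(fresno_population))
print(f"yearly totals:  r = {r_levels:.2f}")
print(f"yearly changes: r = {r_changes:.2f}")
```

With these invented numbers the yearly totals correlate almost perfectly, while the year-to-year changes correlate only weakly. That is what you'd expect if the passage of time, and not any link between the two, is doing the work.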
The same is true of baseball statistics. A player with many home
runs will also likely have many RBI. While home runs often result in RBI, the player doesn't have a high RBI count solely
because he hits home runs, and he certainly doesn't have many home runs because he has many RBI. But we can pretty safely say
that a player with many home runs has many RBI, and with slightly less certainty we can say that a player with
few RBI probably has few home runs.
So we can use these correlations to tell us information about
the value of certain statistics. Sabermetrics has shown fans that Bill James' invented statistic, "runs created", and the
more common statistic, "On base percentage plus slugging average", are valuable measures of a player's offensive ability to
produce runs and thus help his team win. But these are harder to calculate and more obscure than
the easier to use and more commonly cited "Triple Crown statistics": batting average, home runs and runs batted in.
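To give a sense of what goes into those two statistics, here is a small Python sketch. It assumes the basic version of Bill James' runs created, (hits + walks) x total bases / (at bats + walks), and a simplified on base percentage that ignores hit-by-pitch and sacrifice flies. I can't promise these are the exact versions behind the figures in this article, and the `BattingLine` class and the season line are invented.

```python
from dataclasses import dataclass

@dataclass
class BattingLine:
    """One hypothetical player's season totals (all values invented)."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int

    def total_bases(self) -> int:
        singles = self.hits - self.doubles - self.triples - self.home_runs
        return singles + 2 * self.doubles + 3 * self.triples + 4 * self.home_runs

    def ops(self) -> float:
        # Simplified on base percentage: no hit-by-pitch or sacrifice flies.
        on_base = (self.hits + self.walks) / (self.at_bats + self.walks)
        slugging = self.total_bases() / self.at_bats
        return on_base + slugging

    def runs_created(self) -> float:
        # Basic runs created: (H + BB) * TB / (AB + BB).
        return (self.hits + self.walks) * self.total_bases() / (self.at_bats + self.walks)

player = BattingLine(at_bats=500, hits=150, doubles=30,
                     triples=5, home_runs=25, walks=50)
print(f"OPS: {player.ops():.3f}")
print(f"Runs created: {player.runs_created():.1f}")
```

Both formulas need walks and a full breakdown of extra-base hits, which is part of why they are less handy than reading a batting average off a box score.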
But are these common statistics, which supposedly tell us how
good a hitter a player is, really valuable measures? Obviously they tell us something. They tell us how often a player can
get a hit (of any type) when he has an at bat. And they tell us how many players already on base a hitter can cause to
score with his hits. But they don't tell us quite so much about a player's overall ability to produce runs.
The correlation coefficients of these statistics with the more
"valuable" statistics follow:
                   | BA  | HR  | RBI |
On Base + Slugging | .35 | .57 | .55 |
Runs Created       | .49 | .44 | .45 |
Before reading too much into these correlations, I should note
that these figures probably aren't perfectly accurate. They cover only the players with the most at
bats in 2004 and 2005, and the sample size is fairly small: just 150 batters.
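One rough way to put a number on that uncertainty is the standard Fisher z-transform, sketched below in Python. This is my own illustration with some big assumptions: it treats the figure as a plain correlation coefficient "r" (if a figure is really an "r^2", take its square root first), and it treats the 150 batters as a random sample, which a group picked by at bats is not. The function name `correlation_interval` is made up.

```python
import math

def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n pairs.

    Uses the Fisher z-transform: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 data points")
    z = math.atanh(r)
    margin = z_crit / math.sqrt(n - 3)
    # Transform the interval's end points back to the correlation scale.
    return math.tanh(z - margin), math.tanh(z + margin)

low, high = correlation_interval(0.57, 150)
print(f"r = .57 with 150 batters: roughly {low:.2f} to {high:.2f}")
```

Under those assumptions, a correlation of .57 from 150 batters could plausibly be anywhere from about .45 to about .67, so small gaps between the figures in the table shouldn't be taken too seriously.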
But assuming they are fairly close, we can see that none of the
Triple Crown stats is all that accurate at telling us a player's run producing ability. Together they give us a slightly more
accurate picture, but with no "r^2" over .57, the correlations don't seem to be that strong. And none of the three seems to be
much more telling than the other two.
But here's some food for thought. The ten commonly used statistics that correlate most strongly
and positively with "runs created", and so seem to be the most accurate, are:
Stat             | R^2 |
OPS              | .95 |
Total Bases      | .86 |
Slugging average | .80 |
On base pct.     | .61 |
Batting average  | .49 |
Runs scored      | .46 |
RBI              | .45 |
HR               | .44 |
Hits             | .33 |
Walks            | .20 |
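A ranking like this one could be put together along the following lines. This is only a sketch in Python: the six players and all of their numbers are invented, the names `season_stats` and `rank_by_correlation` are my own, and it reports the plain correlation coefficient "r" and not its square.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_by_correlation(stats, target):
    """Correlate every stat with the target stat and sort, strongest positive first."""
    pairs = [(name, pearson_r(values, stats[target]))
             for name, values in stats.items() if name != target]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Invented season totals for six hypothetical players (one list entry per player).
season_stats = {
    "Batting average": [0.310, 0.275, 0.290, 0.250, 0.300, 0.265],
    "HR": [35, 20, 12, 28, 40, 8],
    "RBI": [110, 75, 60, 90, 120, 45],
    "Runs created": [125, 85, 80, 88, 130, 60],
}

ranking = rank_by_correlation(season_stats, "Runs created")
print("Stat             | r   |")
for name, r in ranking:
    print(f"{name:<16} | {r:.2f} |")
```

With only six made-up players the numbers themselves mean nothing; the point is just the procedure of correlating each stat with runs created and sorting the results.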