Recalculating a tainted record
By Roger Weber
Current highest single-season home run totals:
Bonds, 73, 2001
McGwire, 70, 1998
Sosa, 66, 1998
McGwire, 65, 1999
Sosa, 64, 2001
Sosa, 63, 1999
Maris, 61, 1961
Ruth, 60, 1927
Ruth, 59, 1921
Foxx, 58, 1932
Greenberg, 58, 1938
McGwire, 58, 1997
Gonzalez, 57, 2001
Rodriguez, 57, 2002
Griffey Jr., 56, 1998
Griffey Jr., 56, 1997
Wilson, 56, 1930
Kiner, 54, 1949
Mantle, 54, 1961
It has been called the most coveted record in baseball. The single-season home run record stood at 60 for 34 years,
then at 61 for the next 37. Since then that mark has been exceeded six times. This spike in home runs has led many to
feel that the home run today is a completely different stat from what it once was. Many fans feel that steroids have
destroyed the home run and devalued the single-season record.
Steroids, though, aren't the only variable that can add to or subtract from a player's home run total. There has been
a steady rise in total home runs in the major leagues since the beginning of the twentieth century. This can be expected.
As more players join the game, the talent is sure to improve. Equipment improves, style of play improves and more is done
to make the game more exciting, which often means adding more home runs.
Four major events have drastically affected home run totals. Prior to the 1930 season, baseballs were made lighter
in a deliberate attempt to increase home run production. The total number of home runs rose at different rates after the
change, depending on the time period used for comparison. These differences average out to about a 6% rise in home run production
on top of the natural rise. I arrived at this figure by finding the percentage by which home runs ran over or under the linear model
of natural home run rise (described later) for the ten years before and the ten years after the re-weighting of the ball, giving
more emphasis to years closer to 1930. This makes older home run hitters' totals more impressive. It is interesting to note,
though, that in 1931 a drastic dip in National League home runs occurred. Some suspect this was due to secret changes to
the ball. This dip should be balanced out, though, by a statistic described a few paragraphs later.
In 1947, perhaps the most important change to baseball ever occurred: black players began playing in the major leagues. By 1957 they made
up 11.5% of major league rosters. By comparing the non-war years leading up to 1947 with the years shortly
after it, and weighing the percentage of minority players in the game, I have determined that adding minorities to the game increased
the quality of play by about 13%. For example, in 1954 20% more home runs were hit than during a comparable year before integration.
Of course, these calculations measure the number of home runs hit over the number naturally predicted by the progression of
baseball itself. They do not include post-1960 results, which were skewed by other factors. At that point, blacks made up
7% of the majors. This means that pre-1947 play is theoretically only about 88% as strong (1 / 1.13) as baseball when minorities make
up a decent percentage of rosters.
Two big events happened in 1961. The regular season was lengthened from 154 to 162 games, giving a slight benefit to
modern players as far as accumulating records like home run totals. Also in 1961, the leagues were increased in size, spreading
out pitching talent, which caused home run production to rise. Each time the league expands, home run production tends to
rise.
The latter of these two 1961 changes can be mapped, and the same approach applied to other confounding variables and league expansions,
by using the linear regression model for total home runs hit per team per year. During a year like 1961, when 8% more home
runs were hit than predicted, there were clearly variables that caused totals to rise. In this study, home run totals are
multiplied by a correction factor to make up for these variables.
Another factor that greatly influences a home run total is a player's home ballpark. Since 81 games are played in that park, a player
benefits greatly if fences are short.
The study
The purpose is to put each of the top home run totals on a balanced scale so they can be effectively compared between
years. The study is not exact and does not account for every variable. That must be expected. There will never be two equal
years and thus it does not make sense to account for every possible variable. It can also be expected that the total number
of home runs hit in the major leagues rises at a fairly linear rate. The r^2 for that linear graph is very high.
The downloadable report includes a graph of the total number of MLB home runs per year. This graph does not account for differences
in the number of teams in the league, but does show the linear relationship. An exponential model has an even higher correlation
coefficient, but that appears to be just because of the recent rise in home runs, which most attribute to increased steroid
use and smaller ballparks. For totals pre-1990 a linear model works best.
But for the adjustments in the study, four separate linear models are fitted, one for each period between the most noticeable shifts
caused by the variables described above. Each home run total in this comparison is adjusted by the percentage of home runs hit
versus home runs predicted, both by these individual models and by the overall linear model. This should make up for confounding
variables. Adjustments are made for years during the shifts.
Click here to download report with graphs.
Following are the four equations of the linear models for the eras of home run production:
Years | Equation
1900-1929 | y = 31.634x - 59995
1930-1941 | y = 17.098x - 31731
1947-1960 | y = 53.354x - 102240
1961-2005 | y = 64.677x - 124703
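As a rough illustration, the four era equations above can be evaluated directly. The sketch below assumes that x is the calendar year and y is the predicted league-wide home run total, which the article does not state outright. The function names are hypothetical, and the 5,458 figure used for 2001 is an outside league total supplied only for the example, not a number from this study. The predicted-over-actual direction follows the study's own description of dividing predicted home runs by actual home runs.

```python
# Era-specific linear models, copied from the equations above.
# Assumption: x is the calendar year, y is the predicted league-wide total.
ERA_MODELS = [
    (1900, 1929, 31.634, -59995.0),
    (1930, 1941, 17.098, -31731.0),
    (1947, 1960, 53.354, -102240.0),
    (1961, 2005, 64.677, -124703.0),
]


def predicted_home_runs(year):
    """Return the era model's prediction for a year, or None if no era covers it."""
    for first, last, slope, intercept in ERA_MODELS:
        if first <= year <= last:
            return slope * year + intercept
    return None  # for example 1942-1946, which none of the four models cover


def year_factor(year, actual_home_runs):
    """Predicted divided by actual: below 1 when a year ran hotter than the model."""
    predicted = predicted_home_runs(year)
    if predicted is None:
        raise ValueError(f"no era model covers {year}")
    return predicted / actual_home_runs


if __name__ == "__main__":
    print(round(predicted_home_runs(2001), 1))
    # 5458 is an assumed outside figure for the 2001 league total.
    print(round(year_factor(2001, 5458), 3))
```

Under those assumptions the 2001 factor comes out near 0.864, which agrees with the "HR for yr." value listed for 2001 in the variables table. Other years may not match as closely, since the study also applies the overall linear model.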
Each of the variables described earlier in the study is measured.
The park factor is measured for home games only. Predicted home runs per team divided by actual home runs per team
is used. Other adjustments are made for season length, integration of races and adjustments to the baseball. Each total is
raised or lowered according to whether a given variable helped or hurt that player
compared to the norm today. Theoretically, the final totals reflect what players would have hit in a neutral situation
in the modern game.
Also, a measurement is taken using a quadratic model of each player's career excluding the year of his large home run
total and any other years in question for injuries, steroid use, etc., predicting the home run total he would have had based
on the other years of his career.
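The career prediction described above can be sketched as a least-squares quadratic fit that leaves out the seasons in question. This is only an illustration of the general technique, not the study's actual calculation. The function names and the sample career are made up, and seasons are indexed 1, 2, 3, ... rather than by calendar year to keep the arithmetic well behaved.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Power sums of x and the matching moments of y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                ratio = m[row][col] / m[col][col]
                m[row] = [v - ratio * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def predict_excluding(seasons, excluded):
    """Fit on the seasons not in `excluded`, then predict each excluded season."""
    kept = [(x, hr) for x, hr in seasons if x not in excluded]
    a, b, c = fit_quadratic([x for x, _ in kept], [hr for _, hr in kept])
    return {x: a * x * x + b * x + c for x in excluded}


if __name__ == "__main__":
    # Hypothetical career: (season number, home runs). Season 6 is the spike year.
    career = [(1, 16), (2, 25), (3, 32), (4, 37), (5, 40),
              (6, 60), (7, 40), (8, 37), (9, 32), (10, 25)]
    print(predict_excluding(career, {6}))
```

In this made-up career the other nine seasons lie exactly on a parabola, so the fit predicts about 41 home runs for season 6 against the 60 actually hit. Real careers are noisier, which is why the study also reports a reasonable maximum alongside the prediction.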
The table below lists the factor by which each variable affects the player's home run total, used to balance totals between years. At
the right side of the table are each player's predicted home run total for the year in which he made the list of top single-season
totals and the highest total that could reasonably be expected of that player given the variation among
his totals in other years of his career. Several players exceed that maximum, which is understandable. It just means
a player's total for that year was unusually high, which can be attributed to many variables.
Table: Variables included and (total x value) to yield result
Table: predicted total for career and 95th% possible total
player | homers | year | BPK / 2 | HR for yr. | adj. seas. | race adj. | ball adj. | career pred | career max
Bonds | 73 | 2001 | 1.0485 | 0.864 | 1 | 1 | 1 | 33 | 46
McGwire | 70 | 1998 | 1.005 | 0.862 | 1 | 1 | 1 | 40 | 59
Sosa | 66 | 1998 | 0.985 | 0.862 | 1 | 1 | 1 | 40 | 48
McGwire | 65 | 1999 | 1 | 0.83 | 1 | 1 | 1 | 38 | 58
Sosa | 64 | 2001 | 1.02 | 0.864 | 1 | 1 | 1 | 39 | 47
Sosa | 63 | 1999 | 0.966 | 0.83 | 1 | 1 | 1 | 40 | 48
Maris | 61 | 1961 | 1.025 | 0.922 | 1 | 1 | 1 | 27 | 40
Ruth | 60 | 1927 | 0.985 | 1.047 | 1.0519 | 0.88 | 1.06 | 44 | 57
Ruth | 59 | 1921 | 1.01 | 0.826 | 1.0519 | 0.88 | 1.06 | 37 | 47
Foxx | 58 | 1932 | 0.98 | 0.974 | 1.0519 | 0.88 | 1 | 39 | 52
Greenberg | 58 | 1938 | 0.97 | 0.995 | 1.0519 | 0.88 | 1 | 27 | 49
McGwire | 58 | 1997 | 1.005 | 0.897 | 1 | 1 | 1 | 40 | 58
Gonzalez | 57 | 2001 | 0.97 | 0.864 | 1 | 1 | 1 | 23 | 36
Rodriguez | 57 | 2002 | 0.94 | 0.867 | 1 | 1 | 1 | 49 | 56
Griffey Jr. | 56 | 1998 | 0.971 | 0.862 | 1 | 1 | 1 | 40 | 59
Griffey Jr. | 56 | 1997 | 0.971 | 0.897 | 1 | 1 | 1 | 40 | 58
Wilson | 56 | 1930 | 0.995 | 0.81 | 1.0519 | 0.88 | 1 | 29 | 40
Kiner | 54 | 1949 | 0.98 | 1.09 | 1.0519 | 0.9 | 1 | 42 | 59
Mantle | 54 | 1961 | 1.025 | 0.922 | 1 | 1 | 1 | 37 | 56
In the table below, in bold (the "pred. total" column), is the calculation of the player's total if the variables already discussed are adjusted
to create a balanced field over the years. Of course, that total does not account for variables like steroids. On the right
side is the number of home runs by which the adjusted total exceeds the value predicted for that player at that time in his
career. It is understandable that these are fairly significant numbers, since most players' careers do not conform exactly
to a quadratic model. The large differences, though, like those of Bonds, Sosa and McGwire, are likely attributable to their
use of steroids. Remember that the likely steroid years were excluded from each player's figures used to create the quadratic
model.
It is also interesting to see how high Roger Maris' and Luis Gonzalez' totals are compared to what the rest of their
careers predicted. While Maris' can be attributed in part to the changes to the game that occurred in 1961, and while there
is a chance Luis Gonzalez' total was affected by steroids, these appear to be simply stellar years by good baseball players.
player | year | avg. var. | pred. total | # over career max | # over career pred.
Bonds | 2001 | 0.9125 | 66.613 | -20.61 | 33.613
McGwire | 1998 | 0.867 | 60.69 | -1.69 | 20.69
Sosa | 1998 | 0.847 | 55.902 | -7.902 | 15.902
McGwire | 1999 | 0.83 | 53.95 | 4.05 | 15.95
Sosa | 2001 | 0.884 | 56.576 | -9.576 | 17.576
Sosa | 1999 | 0.796 | 50.148 | -2.148 | 10.148
Maris | 1961 | 0.947 | 57.767 | -17.77 | 30.767
Ruth | 1927 | 1.0239 | 61.437 | -4.437 | 17.437
Ruth | 1921 | 0.8279 | 54.051 | -7.051 | 17.051
Foxx | 1932 | 0.8859 | 51.385 | 0.615 | 12.385
Greenberg | 1938 | 0.8969 | 52.023 | -3.023 | 25.023
McGwire | 1997 | 0.902 | 52.316 | 5.684 | 12.316
Gonzalez | 2001 | 0.834 | 47.538 | -11.54 | 24.538
Rodriguez | 2002 | 0.807 | 45.999 | 10.001 | -3.001
Griffey Jr. | 1998 | 0.833 | 46.648 | 12.352 | 6.648
Griffey Jr. | 1997 | 0.868 | 48.608 | 9.392 | 8.608
Wilson | 1930 | 0.7369 | 41.269 | -1.269 | 12.269
Kiner | 1949 | 1.0219 | 55.185 | 3.8148 | 13.185
Mantle | 1961 | 0.947 | 51.138 | 4.862 | 14.138
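The study does not spell out how the individual factors are combined into the "avg. var." figure. The listed values appear consistent with adding up each factor's distance from 1, so the sketch below assumes that rule. It also assumes the 1.0519 season adjustment is 162/154, the ratio of the new schedule length to the old. The function names are hypothetical.

```python
# Assumption: the "adj. seas." value 1.0519 is 162/154 (new schedule over old).
SEASON_FACTOR = 162 / 154


def average_variable(factors):
    """Combine multipliers by summing each one's deviation from 1 (assumed rule)."""
    return 1 + sum(f - 1 for f in factors)


def adjusted_total(home_runs, factors):
    """Scale a raw season total by the combined factor."""
    return home_runs * average_variable(factors)


if __name__ == "__main__":
    # Factor order: park, HR for year, season length, race, ball.
    bonds_2001 = [1.0485, 0.864, 1, 1, 1]
    ruth_1927 = [0.985, 1.047, SEASON_FACTOR, 0.88, 1.06]
    print(round(adjusted_total(73, bonds_2001), 2))  # 66.61
    print(round(adjusted_total(60, ruth_1927), 2))   # 61.44
```

Both results line up with the adjusted totals reported for Bonds in 2001 and Ruth in 1927. Subtracting the "career pred" figure from an adjusted total then gives the "# over career pred." column.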
Several of the top totals after adjustment still belong to the big three: McGwire, Sosa and Bonds. If these three were
genuinely clean home run hitters, their feats could be recognized as incredible. If the steroid allegations are truly false,
then the totals in bold are the final numbers for this study. Bonds, McGwire and Sosa would hold four of the top six spots. Most evidence,
though, points to steroid use. The fact that the big three have totals that deviate so far from the pattern of
their early-career home runs supports the claims so widely publicized.
Until the extent to which they used steroids, and the extent to which the steroids affected their play, can be made
public and clearly calculated, it is my contention that for the purposes of a mathematical study there needs to be a list, in
addition to the bold totals above, that eliminates the players involved in the scandals.
If they are included in the list, the following is the adjusted list of the top single season home run totals:
Rank | Player | Actual | Year | Adjusted
1 | Bonds | 73 | 2001 | 66.6125
2 | Ruth | 60 | 1927 | 61.43688
3 | McGwire | 70 | 1998 | 60.69
4 | Maris | 61 | 1961 | 57.767
5 | Sosa | 64 | 2001 | 56.576
6 | Sosa | 66 | 1998 | 55.902
7 | Kiner | 54 | 1949 | 55.18519
8 | Ruth | 54 | 1920 | 54.05119
9 | McGwire | 65 | 1999 | 53.95
10 | Mantle | 52 | 1956 | 52.676
11 | McGwire | 58 | 1997 | 52.316
12 | Greenberg | 58 | 1938 | 52.02299
13 | Foxx | 58 | 1932 | 51.38499
14 | Mantle | 54 | 1961 | 51.138
15 | Mays | 52 | 1965 | 50.232
When the steroid suspects are removed, the list looks like this:
Rank | Player | Actual | Year | Adjusted
1 | Ruth | 60 | 1927 | 61.43688
2 | Maris | 61 | 1961 | 57.767
3 | Kiner | 54 | 1949 | 55.18519
4 | Ruth | 59 | 1921 | 54.05119
5 | Mantle | 52 | 1956 | 52.676
6 | Greenberg | 58 | 1938 | 52.02299
7 | Foxx | 58 | 1932 | 51.38499
8 | Mantle | 54 | 1961 | 51.138
9 | Mays | 52 | 1965 | 50.232
10 | Ruth | 54 | 1920 | 48.84894
11 | Griffey Jr. | 56 | 1997 | 48.608
12 | Mays | 51 | 1955 | 47.98835
13 | Kiner | 51 | 1947 | 47.78435
14 | Gonzalez | 57 | 2001 | 47.538
15 | Killebrew | 49 | 1964 | 46.844
So Ruth wins out. Of course, Ruth played in an era that is difficult to compare to even the era three decades after
he played. For percentage of total home runs hit by one man, Ruth is far and away at the top of the list. His 1921 total,
in my mind, should actually be considered the most impressive ever because he did eclipse so many teams in the league and
so greatly overshadowed all other players in the game at that time. Maris' one great year, though, also stands out well above
many of the other contenders. Ralph Kiner had most of his best seasons in what some call the best era ever for baseball (1947-1961)
because it included integration and was before the league expanded.
These totals may seem low. This is largely because more home runs are lost to the variable adjustments
than are gained. Much of this is based on the assumption that the recent rise in home runs is just that, a rise, caused by
steroids, small ballparks and other factors. Some may fault the study for this, but the top home run totals are likely helped
more than hurt by lurking variables, in the same way that this formula would lift the lowest home run totals. It is only
natural that the highest totals come from players in the smallest parks during the longest seasons during the weakest pitching
eras. These effects are negated by these rankings. On the other side, though, players who play in pitchers' parks during shortened
seasons see their totals rise when adjusted.
A few other effects aren't accounted for by the rankings. There was no easy or clear way to determine the differences
caused by night games without sifting through many box scores. It is also difficult to give a value to lineup position or
equipment quality.
As with all baseball questions, there is no clear answer and plenty of room for debate. In the coming years more light
should be shed on the importance of steroids to the recent game. Whenever there is a scandal, trust is lost, even in a baseball
record. Ultimately, though, the great players should stand out. Their performances are inspiring to millions.
"It's a long drive…the Giants win the pennant, the Giants win the pennant…"