Another look at UZR
I've had a fairly specific concern about UZR/150 for a little while, and I think this is as good a place as any to bring it up. I know there has been some heated discussion about the metric's ability to accurately measure individual value. My question has more to do with the metric at a team level. I'm hoping people can look at it and let me know their thoughts. Is my interpretation right? Are my methods of measuring reasonable? I'm not a statistical expert, so I'll have pretty thick skin about it. I'm posting this more to learn myself than to educate anyone else.
Anyway, my general view of UZR/150 is that it is a step in a positive direction but still a long way from what we have when we quantify offense and pitching. I accept that you need a very large sample to measure at an individual level. At a team level, I would think that you should be able to interpret the data sooner. More specifically, I wanted to look at how the difference between a team's FIP and ERA correlates with the team's UZR/150. Presumably, all other things being equal, a team that plays good defense will have an ERA that is better than its FIP, and a team that plays poor defense will have an ERA worse than its FIP. With over half the season behind us, I expected to see a pretty tight correlation between where a team ranks in UZR/150 and where it ranks in the differential between ERA and FIP ("E-F Rank"). I'd say there is a correlation, but I was surprised to see how weak it is.
| Team | UZR Rank | E-F Rank | Difference |
| Angels | 11 | 22 | 11 |
| Astros | 15 | 13 | 2 |
| Athletics | 12 | 20 | 8 |
| Blue Jays | 17 | 14 | 3 |
| Braves | 27 | 19 | 8 |
| Brewers | 7 | 10 | 3 |
| Cardinals | 13 | 7 | 6 |
| Cubs | 14 | 2 | 12 |
| Diamondbacks | 10 | 27 | 17 |
| Dodgers | 16 | 5 | 11 |
| Giants | 1 | 8 | 7 |
| Indians | 24 | 30 | 6 |
| Mariners | 4 | 1 | 3 |
| Marlins | 21 | 23 | 2 |
| Mets | 30 | 16 | 14 |
| Nationals | 29 | 29 | 0 |
| Orioles | 19 | 21 | 2 |
| Padres | 23 | 25 | 2 |
| Phillies | 9 | 6 | 3 |
| Pirates | 3 | 12 | 9 |
| Rangers | 8 | 3 | 5 |
| Rays | 2 | 15 | 13 |
| Red Sox | 25 | 24 | 1 |
| Reds | 6 | 4 | 2 |
| Rockies | 22 | 28 | 6 |
| Royals | 28 | 26 | 2 |
| Tigers | 5 | 9 | 4 |
| Twins | 26 | 18 | 8 |
| White Sox | 18 | 17 | 1 |
| Yankees | 20 | 11 | 9 |
| Average | | | 6.00 |
| Std. Dev. | | | 4.448944 |
***The standard deviation is computed on the absolute rank differences rather than on signed (positive and negative) differences. I think that is the "right way" to do it, but I'm not sure. I think the key is to keep comparisons between standard deviations apples to apples, so that's what I'll do here: every comparison uses absolute differences.***
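For anyone who wants to check the table, here is a minimal Python sketch that recomputes the average and standard deviation from the two rank columns. The function names are made up for illustration. It also prints two reference points that are not part of the original analysis: Spearman's rank correlation (a standard statistic for comparing two rankings, valid in this simple form because neither column has ties) and the average absolute difference you would expect if the two rankings were completely unrelated, which for n teams works out to (n^2 - 1) / (3n).

```python
from statistics import mean, stdev

# (UZR/150 rank, ERA-minus-FIP rank) per team, copied from the table above.
RANKS = {
    "Angels": (11, 22), "Astros": (15, 13), "Athletics": (12, 20),
    "Blue Jays": (17, 14), "Braves": (27, 19), "Brewers": (7, 10),
    "Cardinals": (13, 7), "Cubs": (14, 2), "Diamondbacks": (10, 27),
    "Dodgers": (16, 5), "Giants": (1, 8), "Indians": (24, 30),
    "Mariners": (4, 1), "Marlins": (21, 23), "Mets": (30, 16),
    "Nationals": (29, 29), "Orioles": (19, 21), "Padres": (23, 25),
    "Phillies": (9, 6), "Pirates": (3, 12), "Rangers": (8, 3),
    "Rays": (2, 15), "Red Sox": (25, 24), "Reds": (6, 4),
    "Rockies": (22, 28), "Royals": (28, 26), "Tigers": (5, 9),
    "Twins": (26, 18), "White Sox": (18, 17), "Yankees": (20, 11),
}


def abs_rank_diffs(ranks):
    """Absolute difference between the two ranks for every team."""
    return [abs(uzr - ef) for uzr, ef in ranks.values()]


def spearman_rho(ranks):
    """Spearman rank correlation; this shortcut formula assumes no tied ranks."""
    n = len(ranks)
    sum_d2 = sum((uzr - ef) ** 2 for uzr, ef in ranks.values())
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


def unrelated_baseline(n):
    """Expected absolute rank difference if the two rankings were unrelated."""
    return (n * n - 1) / (3 * n)


diffs = abs_rank_diffs(RANKS)
print(f"average |difference|: {mean(diffs):.2f}")
print(f"sample std dev:       {stdev(diffs):.6f}")
print(f"Spearman rho:         {spearman_rho(RANKS):.3f}")
print(f"unrelated baseline:   {unrelated_baseline(len(RANKS)):.2f}")
```

With 30 teams the unrelated baseline is just under 10, so an average difference of 6.00 sits somewhere between "no relationship" and "perfect agreement". The same helper functions could be pointed at any of the other rank pairs (BABIP, LOB%, wOBA) to keep the comparisons apples to apples.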
I then wondered if I really needed to be looking at the connection between UZR/150 and a pitching staff's BABIP. This link is even a little bit weaker, with an average difference of 6.87 and a standard deviation of 4.58.
I would not be considered a great advocate of UZR, but even so, these connections are much looser than I expected them to be. Translating it into offensive terms, these items have about the same connection that batting average has to a team's ability to score runs (average difference 6.47, standard deviation 3.80). For another point of reference, wOBA ties much closer to the ability to score runs (average difference 2.2, standard deviation 1.88).
Most surprising, though, was how much more tightly the measurements of luck (LOB%, BABIP) tied to ERA minus FIP. Judged by average rank difference, the luck factors track it about twice as closely as the defense behind the pitching (as measured by UZR/150). LOB% has an average difference of 3.33, with a standard deviation of 2.75. BABIP has a 2.73 average difference and a 2.24 standard deviation.
This really wasn't what I expected to see. I've sat on this for a couple of weeks, deciding whether to bring it here. I'm hoping it will either draw some good discussion or point me to factors that I should (or shouldn't) be looking at. Maybe they are obvious to all but me. I've just started trying to substantiate my general position on the metric. Maybe this is as much a conversation about FIP as it is about UZR. Not sure. I just found this all interesting and a little surprising. We probably have the resources here to make sense of what I am unable to, and I wanted to give it a shot.
Comments
I'm probably the last guy...
that should post on a heavy sabr topic, but I wonder how much a team’s HR/FB% and GB% affect these numbers. These numbers have a strong tie to FIP if I’m not mistaken.
"Don't do anything till I get back!" - Jesus to the Cubs
by cardzfanbub on Jul 21, 2009 6:18 PM EDT
I would think that the
“luck” factor being included might help to get a higher correlation.
Also, for Merry, what are the p-values when you regress this data?
"I just wish that the late Harry Caray were still around so I could hear him mispronounce 'Kosuke Fukudome' every fukun' night" -- Dennis Miller
by fourstick on Jul 21, 2009 8:20 PM EDT
I'm wading into way too deep of water here
My limited knowledge of P level comes from statistics classes about 15 years ago. My understanding of it is that it is a measure of testing the significance of the sample. And that’s about where it ends for me.
I don’t think this would necessarily be considered statistically significant, but I would think a half season nears a threshold of relevance. That is completely intuitive though, and I don’t know how to measure it. This is part of why I was reluctant to bring it here…I’m not going to be the guy to answer most follow-ups. I decided to mention it anyway because the team rankings did not hover nearly as close to each other as I would have anticipated. I have not looked at past years yet, but I had intended to. It looks like VEP is already working on much of this, though, thankfully.
What I was really wondering was whether anyone is as surprised as I am with the extent of separation. I would not expect the 2 values to be completely in sync with each other, but I would not expect there to be this amount of variability either.
by Merry CRasmus on Jul 21, 2009 11:02 PM EDT
If you did the work, you should post it.
“This is part of why I was reluctant to bring it here…I’m not going to be the guy to answer most follow ups.”
There are plenty of us here who find stuff like this fascinating, and we’ll always be able to help shore up whatever questions or inaccuracies we see. Better to post it and learn if you’ve made a mistake while the rest of us learn about your topic than to keep it to yourself, imho.
by fourstick on Jul 22, 2009 9:53 AM EDT
I would think after half a season of games
a team’s HR/FB% should pretty much have stabilised at its “true” level.
Felonius Monk - bitching to contact since 2008
by Felonius_Monk on Jul 22, 2009 5:21 AM EDT
I agree...
but HRs have no effect on UZR for obvious reasons, and a great impact on ERA-FIP. A team that gives up a good number of HRs is likely to see a large difference between these two categories.
by cardzfanbub on Jul 22, 2009 9:05 AM EDT
The best thing to do
Would be to run a regression between tRA (which takes batted ball factors into account) and TIPS (timing independent pitching). Of course I just made up TIPS, but it would essentially be a component ERA based off of actual events (doubles, triples, homers). That would effectively take out the timing aspect of UZR.
Derosa.
by vivaelpujols on Jul 21, 2009 6:42 PM EDT
Good idea
“Would be to run a regression between tRA (which takes batted ball factors into account)…”
Would BIS adding a batted ball timer, as was reported in the post you linked earlier today in the main thread, give you data similar to what your TIPS stat would, only through UZR?
by fourstick on Jul 21, 2009 8:24 PM EDT
I'm not sure what you're asking
TIPS would just be a timing independent run estimator, used solely to compare with tRA to test UZR. I’m not sure how hang time would be factored into it.
by vivaelpujols on Jul 21, 2009 9:29 PM EDT
Never mind
I was thinking about something completely different, my bad.
by fourstick on Jul 22, 2009 9:53 AM EDT
Just for anyone who cares...
Here is the regression line for UZR and ERA-FIP this year:

Obviously that’s pretty bad. However, Jack Moore looked at it last year and found it was higher:

by vivaelpujols on Jul 21, 2009 7:06 PM EDT
What would account for
such a large difference in R^2 values from year to year? I’m not a huge proponent of UZR, but I would think that the data would correlate similarly from year to year, even if it’s really noisy.
by fourstick on Jul 21, 2009 8:33 PM EDT
Could be just small sample size
When I have time later tonight, I’ll run a regression with all of the UZR data (since 2002). Should only take me a couple of minutes.
by vivaelpujols on Jul 21, 2009 9:23 PM EDT
the Moore data...
… reports an “r” rather than “r^2”. is that just for simplicity’s sake? because if you square .7 you get .49, which is at least in the same ballpark as .37. that difference could easily come from different sample sizes.
i have no idea why they’d report an “r” that isn’t squared, but maybe he did?
by kindred on Jul 21, 2009 10:49 PM EDT
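The r versus r^2 point in the comment above comes down to putting both figures on the same scale before comparing them. A minimal Python sketch of the conversion, with function names invented for illustration:

```python
import math


def r_squared_from_r(r: float) -> float:
    """Square a correlation coefficient to get the share of variance explained."""
    return r * r


def abs_r_from_r_squared(r_squared: float) -> float:
    """Recover the magnitude of r from r^2 (the sign cannot be recovered)."""
    return math.sqrt(r_squared)


# The two figures quoted in the comment above, converted both ways.
print(round(r_squared_from_r(0.7), 2))       # 0.49
print(round(abs_r_from_r_squared(0.37), 3))  # 0.608
```

So an r of .7 against an r^2 of .37 is really .49 against .37, or equivalently .7 against roughly .61.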
Yeah, I was wondering about that too
Excel only offers R^2 on the trendline options, but he might not have been using Excel.
by vivaelpujols on Jul 22, 2009 12:05 AM EDT
I was squaring Moore's R
I agree it’s in the same ballpark, but I would think they would be closer together for a sample that large. Maybe if we had all of 2009 they would be. It would be interesting to see 2007 and 2008 side by side for this reason.
by fourstick on Jul 22, 2009 9:56 AM EDT
that R^2 is really low...
… last year’s is somewhat better but still not great. might someone with the data handy run a regression with the “luck” factors included on the right-hand side and see what happens to the model R^2? or even better, report the p-values?
by kindred on Jul 21, 2009 7:19 PM EDT
I think there's one major problem here
you’ve looked at ranks instead of actual values. If one team’s ERA and FIP differ by, say, 10x more than anyone else’s (hypothetically), the “E-F rank” value only puts them one place ahead of the second team, so ranking discards a lot of information and brings in potential bias.
You should list the actual team ERA, actual team FIP, the difference between the two, and the number of runs that works out as (for instance, if a team has played 90 innings with a 3 ERA and a 4 FIP, they’ve actually conceded 30 runs when FIP predicts they should’ve conceded 40). You can then see if their total team UZR makes up the difference (i.e. take the team’s cumulative UZR so far, not the UZR/150). You could then list the deficit or surplus between ERA and FIP in runs, and the number of runs saved or booted by total UZR. You’d hope the values would somewhat correlate.
But yeah, I think this is a really useful exercise.
by Felonius_Monk on Jul 22, 2009 5:25 AM EDT
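The runs conversion suggested in the comment above can be sketched in a few lines of Python. Everything named here (`TeamLine`, `era_fip_gap_runs`, the example team) is hypothetical, the `uzr_runs` figure is invented purely to show the comparison, and innings are assumed to be plain decimals rather than box-score notation such as 90.1.

```python
from dataclasses import dataclass


@dataclass
class TeamLine:
    name: str
    innings: float   # decimal innings pitched (assumption: not "90.1" notation)
    era: float
    fip: float
    uzr_runs: float  # cumulative team UZR in runs, not UZR/150


def runs_from_rate(rate_per_nine: float, innings: float) -> float:
    """Convert a per-nine-innings rate (ERA or FIP) into total runs."""
    return rate_per_nine * innings / 9


def era_fip_gap_runs(team: TeamLine) -> float:
    """Runs FIP expected minus earned runs actually allowed.

    Positive means the team allowed fewer runs than FIP predicted.
    """
    expected = runs_from_rate(team.fip, team.innings)
    actual = runs_from_rate(team.era, team.innings)
    return expected - actual


# The 90-inning example from the comment; the UZR value is made up.
example = TeamLine("Hypothetical", innings=90, era=3.0, fip=4.0, uzr_runs=6.0)

gap = era_fip_gap_runs(example)
print(f"ERA-FIP gap in runs:      {gap:.1f}")
print(f"credited to defense:      {example.uzr_runs:.1f}")
print(f"left unexplained by UZR:  {gap - example.uzr_runs:.1f}")
```

Run over all 30 teams with real ERA, FIP, innings, and cumulative UZR, the gap and the UZR column are the two series you would plot or correlate against each other, using actual values instead of ranks.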
Ah I see someone's basically already done that....
frankly, the only conclusion I can draw from that R^2 value is that either FIP or UZR isn’t hugely accurate. As there’s a fair bit of data in support of FIP, I think I’m going to have to view UZR as pretty vague from now on (although I guess it could be cumulative error from both metrics)….
Just looking at the graphs visually there’s a pretty huge difference; we’re talking multiple wins for a lot of the data points. Although I suppose you could argue that (per player) that’s not really going to be a huge amount of runs.
by Felonius_Monk on Jul 22, 2009 5:33 AM EDT
Perhaps we could take components of UZR
such as the % of GB turned into outs and the % of FB turned into outs and compare that with GB% and FB% for a pitching staff. It would take out the HR noise, since UZR doesn’t factor that in but FIP does.
by fourstick on Jul 22, 2009 10:17 AM EDT
maybe
the problem isn’t FIP or UZR, but ERA. We all know that the official scoring of errors is crap.
by cdb on Jul 22, 2009 12:08 PM EDT
sure...
… but as long as the scoring “mistakes” are somewhat random then they shouldn’t bias the result. the only way it’s a problem is if some teams were getting consistently different rulings than other teams, and i don’t know of any evidence that this actually occurs.
by kindred on Jul 22, 2009 5:25 PM EDT
as the scoring is done by the home scorer
I suspect there might be. There’s evidence that there is a significant difference in the way batted balls are recorded as LD/FB/GB in different stadiums, so I find it hard to believe there isn’t a similar difference in the stringency of the “error” ruling. I guess if you play half your games at home with a scorer who is very lenient on errors, you’ll end up with pitchers with a lower ERA than they would have in a context-neutral environment…
by Felonius_Monk on Jul 23, 2009 6:54 AM EDT
would it be better then to run against RA rather than ERA? i think i've seen some saber types
do that.
the truth can't hurt you, it's just like the dark/ it scares you witless, but in time you see things clear and stark -- macmanus
by tom s. on Jul 23, 2009 1:40 PM EDT
