Viva El WAR (Pitchers, part 1)

Last time, I used my mod powers to look at one of the primary evaluative tools for position players, WAR.  This time, we'll focus on pitching.


The basic formula for pitcher WAR goes something like this:

((((B/A)^(((A+B)/0.92)^0.28) / ((B/A)^(((A+B)/0.92)^0.28) + 1)) - C) * D/9) * E

Where A = the pitcher's ERA

And B = the league average ERA (which should be different for starters and relievers)

And C = a replacement level pitcher's neutral winning percentage, which is set around .38 for starters and .46 for relievers

And D = innings pitched

And E = the pitcher's leverage index

The formula really isn't that complicated when you break it down into individual steps.  That formula is just one that I used for the Cardinals WAR spreadsheet, and it has a lot of repetition in it.  When you break pitcher WAR down into individual steps, it goes like this:

1) Find the average run environment that a pitcher will pitch in.  This is equal to the pitcher's run average (RA) plus the league RA for his role (starter or reliever).  So for Chris Carpenter last year, he gave up 2.28 runs per 9 compared to a league average of 4.7 R/9 for starters.  Add those up, and you get a run environment of about 7 runs per game (2.28 + 4.7 = 6.98).

2) You then figure out the expected winning percentage (W%) of that pitcher.  This is done by using PythagenPat, which is a modification of the Pythagorean formula.

Remember, the Pythagorean formula is just a way to figure out the W% of a team given its runs scored and runs allowed.  PythagenPat is a slight modification of that formula, using a floating exponent that depends on the run environment of that team, instead of a fixed 2.  This is done to ensure that, for example, teams who play in a really low run environment (like the Padres) get more credit towards expected W% for each marginal run scored than a team like the Yankees.  The same concept can be applied to pitchers in the WAR formula.
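As a rough sketch of the team version, assuming the same 0.28 exponent as the WAR formula above (published PythagenPat versions use values closer to 0.285):

```python
def pythagenpat_wpct(runs_scored, runs_allowed, games):
    """Expected W% with an exponent that floats with the run environment."""
    # The exponent grows with total runs per game, so each marginal run
    # is worth less W% in high-scoring environments
    exponent = ((runs_scored + runs_allowed) / games) ** 0.28
    return runs_scored ** exponent / (runs_scored ** exponent + runs_allowed ** exponent)
```

A +50 run differential produces a higher expected W% for a low-scoring team than a high-scoring one, which is exactly the Padres/Yankees point.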

I won't go through the gory details, but Chris Carpenter's expected W% would be .77.

3) Once you have the pitcher's expected W%, you compare it to a replacement level pitcher's and multiply by innings pitched divided by 9.  So Carp's WAR would be 8.3 in 192.2 innings last year.

4) Add leverage.  This is primarily for relievers, whose innings are more important than those of starters.  Basically, you figure out the equivalent innings pitched of a reliever (so if a reliever pitches 60 innings, but on average they are 175% as important as a starter's, they count as 105 innings) and do the same thing as above.  I won't go into detail about leverage in this post, mainly because I have not yet solidified my views on the matter.
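Strung together, the four steps look something like this in Python.  This is just a sketch: the constants come straight from the formula at the top, and treating leverage as a plain multiplier on the whole thing is my simplification.

```python
def pitcher_war(ra, lg_ra, repl_wpct, ip, leverage=1.0):
    """ra: pitcher's runs allowed per 9; lg_ra: league RA/9 for his role;
    repl_wpct: ~.38 for starters, ~.46 for relievers; ip: innings pitched."""
    # Step 1: run environment feeds the floating PythagenPat exponent
    exponent = ((ra + lg_ra) / 0.92) ** 0.28
    # Step 2: expected winning percentage in that environment
    ratio = (lg_ra / ra) ** exponent
    exp_wpct = ratio / (ratio + 1)
    # Steps 3 and 4: compare to replacement, scale by innings and leverage
    return (exp_wpct - repl_wpct) * ip / 9 * leverage

# Carpenter's 2.28 RA over 192.2 IP lands in the same ballpark as the 8.3 above
war = pitcher_war(2.28, 4.7, 0.38, 192.2)
```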


That's the very basic construction of WAR.  However, figuring out a pitcher's value isn't as simple as my example with Carp.  I made a bunch of assumptions with that example that aren't necessarily right.  As you might have noticed, a pitcher's WAR depends heavily on three things:

1) His run average or estimated run average

2) A replacement level pitcher's run average

3) The way we handle leverage

I'll go through each of those things in detail.  

Run estimators

It turns out when Carpenter gave up 2.28 runs per 9 last year, some of that wasn't directly related to his performance.  The batters, umpires, fielders and ballpark each had something to do with how many runs Carpenter gave up last year, and attributing all of that to him kinda misses the point of pitcher evaluation.  Because so much of what goes into a pitcher's runs allowed is out of his control, there have been many attempts to model the pitcher's performance.  I'll go through each of them now.  

Before I start, let me say that there are two different goals of run estimators.  1) To credit the pitcher simply with what's under his control, or 2) To credit the pitcher with everything, except for the performance of his defenders. 

For a while, those two were thought to be pretty much the same thing.  However, I've been doing some research at THT that suggests, in my opinion, that pitchers really don't have that much control over even their defense independent outcomes.  When a pitcher strikes out a batter, that is usually a combination of good pitches from the pitcher, bad swings from the hitter, and often favorable calls from the umpire.  Every stat that a pitcher has is influenced by luck.

However, even if you concede my viewpoint on that matter, defense independent pitching stats (DIPS) can still be used for value purposes.  When retroactively analyzing a pitcher, you really care more about properly attributing blame.  For example, say Chris Carpenter gives up a double to left on a 95 MPH fastball down and away.  The hard contact arguably wasn't his fault, as the batter had to make a very special effort to hit such a pitch; however, you can't blame it on anyone else on the team, so even if it's just bad luck, you attribute that hard contact to the pitcher.  The fielder is another story.  Once the ball is hit in play, the responsibility shifts from the pitcher to the fielder, and so we can credit or debit both players accordingly.

There is also an argument for only crediting pitchers with the things they can control for retrospective value.  Why should we give a pitcher credit or blame for things that aren't in his control?  Aren't we only trying to measure the pitcher's value?  Either way, I think it's a matter of preference.  I personally would rather isolate the pitcher's performance, rather than just eliminate defensive performance.

Anyway... most of the DIPS estimators just try to eliminate defense rather than isolate pitcher performance.  However, some are obviously better than others, so consider the following your quick hit guide to DIPS.


FIP

Aah, FIP.  How can one stat that breaks down a pitcher's at bats into 4 possible outcomes, has conveniently rounded coefficients in the formula, and can be figured out on the back of a napkin be so good at estimating future performance and so commonly cited?

For one, it's the fact that it is simple and so easy to understand.  As I said above, FIP breaks down each at bat into 4 possible outcomes: strikeout, walk, home run and ball in play.  Basically, FIP assigns a value to each of those 4 outcomes based off of Linear Weights.  The formula multiplies those values by the number of each of the 4 outcomes that pitcher gives up, translates them to a per 9 innings basis and sets itself to the league average ERA.


Pros:

  • Simple, easy to remember: (13*HR + 3*(BB+HBP-IBB) - 2*K)/IP + 3.2
  • Rooted in logic
  • Can be calculated for almost all levels of the minors and historically


Cons:

  • Will underrate pitchers who are better than average at allowing less damaging balls in play
  • Will underrate pitchers who are better than average at sequencing their events (home run after walk vs. home run before walk, if that makes sense)
  • Will somewhat underrate good pitchers and overrate bad pitchers.  The coefficients in it are tailored to league average; however, each pitcher has its own run environment, so the values should change slightly based on how many expected runners are on base.
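The napkin formula above translates directly to code (the 3.2 constant actually floats a bit by season so league-average FIP matches league-average ERA, but I'll hard-code it here):

```python
def fip(hr, bb, hbp, ibb, k, ip, constant=3.2):
    """Fielding Independent Pitching: values HR, BB/HBP, and K per inning,
    scaled by the league constant so it reads like an ERA."""
    return (13 * hr + 3 * (bb + hbp - ibb) - 2 * k) / ip + constant
```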


tRA

tRA is very similar to FIP.  However, instead of assuming that all balls in play are created equal, it separates them based on whether they are fly balls, ground balls, line drives or popups.  It then multiplies each event (strikeout, walk, home run, and the four batted ball types previously mentioned) by its run value and divides by expected outs.  So it's estimated runs per out multiplied by 27 (cause there are 27 outs in a game).  That gets it to match up pretty much with the league average runs per 9.


Pros:

  • Like FIP, it's intuitive and rooted in logic
  • Unlike FIP, it accounts for a pitcher's ability to induce "better" balls in play (like ground balls and popups)
  • Park adjusted by individual component.  This helps to balance out park effects that can affect strikeouts, walks, home runs and even batted balls.


Cons:

  • Will underrate pitchers who can better sequence their events
  • Not all fly balls are created equal, ditto ground balls and line drives.  Some guys just allow weaker contact.
  • Cannot be calculated for many levels of the minors or historically
  • Will underrate good pitchers and overrate bad pitchers
  • Is prone to errors in batted ball classification (one guy's fly ball is another guy's line drive, and depending on the data source, this can have big implications on the final product)
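Here's a toy version of the tRA calculation.  The run values and out probabilities below are made-up placeholders just to show the mechanics; the real tRA derives them from league-wide data for each event type.

```python
# Hypothetical per-event run values and out probabilities (illustrative only)
RUN_VALUE = {'K': -0.11, 'BB': 0.31, 'HR': 1.40,
             'GB': 0.04, 'FB': 0.04, 'LD': 0.39, 'PU': -0.10}
OUT_PROB = {'K': 1.00, 'BB': 0.00, 'HR': 0.00,
            'GB': 0.80, 'FB': 0.85, 'LD': 0.30, 'PU': 0.99}

def tra(events):
    """events: dict mapping event type -> count for a pitcher's season."""
    exp_runs = sum(n * RUN_VALUE[e] for e, n in events.items())
    exp_outs = sum(n * OUT_PROB[e] for e, n in events.items())
    return exp_runs / exp_outs * 27  # expected runs per 27 outs, i.e. per game
```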


SIERA

This is BPro's brand new metric.  Its purpose is to compensate for interacting skills in a way that FIP and tRA do not, as they treat each stat independently of the others.  For example, guys who allow a lot of ground balls aren't hurt as much by walks as others, and SIERA should encapsulate that.

The problems with SIERA are that it hasn't been thoroughly tested yet, and doesn't have a lot of logical backing in the formula.  With FIP and tRA, we can understand their strengths and flaws, however, that's simply impossible at this point with SIERA. 

SIERA may be the most accurate; however, it's too early to tell.


xFIP

This is kind of a weird stat.  It's basically FIP, except it replaces the home runs with .11*FB.  This is done because pitchers have very little control over how many of their fly balls leave the yard, or at the very least, it's hard to identify such a skill based on the numbers.

The problem I have with xFIP is that it can't really be used for retrospective value, because you have to credit those extra home runs to someone.  It's a decent stat for identifying luck in small sample size; however, HR/FB and run value of BIP clearly aren't the only things that are influenced by luck.  As I've mentioned before, I really hate the idea of a "luck" stat.  Every stat contains a bit of luck and a bit of skill, and I would like to see xFIP do some sort of weighting for each event based on the estimated ratio of luck/skill as a function of innings pitched. 

Right now it's kind of a hybrid stat that's useful, but could (and should) be a lot better.
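Sketched out, xFIP is just the FIP formula with actual home runs swapped for fly balls times the league HR/FB rate (the .11 from above; in practice both that rate and the constant float by season):

```python
def xfip(fb, bb, hbp, ibb, k, ip, lg_hr_fb=0.11, constant=3.2):
    """FIP with expected home runs (fly balls * league HR/FB) in place
    of the home runs the pitcher actually allowed."""
    expected_hr = lg_hr_fb * fb
    return (13 * expected_hr + 3 * (bb + hbp - ibb) - 2 * k) / ip + constant
```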


tRA*

This is exactly the kind of stat I wish xFIP would be like.  Instead of assuming luck on one statistic (HR/FB), it regresses each statistic to the mean as a function of innings pitched.  So say a guy strikes out 26% of the batters he faces in 29 innings.  There is most likely a significant amount of luck in that percentage, so tRA* will regress it to the mean to account for that expected luck, and his "true" K% during that time will be around 18%.  tRA* will regress strikeouts less than HR/FB ratio, or line drive percentage, etc., as strikeouts are more in the pitcher's control than those stats.  However, from what I gather, tRA* treats no stat as a "luck" or "skill" stat and instead treats them all as somewhere in between.  I love it.  The problem is that I don't really know how the regression is calculated.
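I don't know tRA*'s actual regression math, but the standard shrinkage sketch looks like this.  The constant k (how many league-average "phantom" observations you mix in, with a bigger k for noisier stats like HR/FB) is a hypothetical stand-in, not tRA*'s published value.

```python
def regress_to_mean(observed_rate, n, league_rate, k):
    """Shrink an observed rate toward the league mean.  n is the sample
    size (e.g. batters faced), k the number of phantom league-average
    observations.  Small n or big k means more regression."""
    return (observed_rate * n + league_rate * k) / (n + k)
```

A 26% K rate over a ~120-batter sample gets pulled partway toward an 18% league mean; you'd tune k down for strikeouts (which stabilize quickly) and up for HR/FB (which doesn't).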


ERA and RA

RA is just how many runs the pitcher gave up per 9 innings.  A while ago, some dumbass had the idea that the only way pitchers can be affected by their defense was through errors, so he invented a stat that only debited pitchers when the run scored was deemed to be their fault.  However, errors are hilariously not correlated at all with actual defensive value (or at least estimated defensive value).  Anyway, the problems with ERA and RA are like I said above.  They give credit for everything the pitcher does, when that clearly shouldn't be the case in real life.

Still, they contain some information that all of the other metrics don't.  For one, they contain info related to sequencing of events, pickoff moves, runner holding ability, batted ball skill, and other things that none of the DIPS estimators capture.  And a lot of the discrepancies between ERA/RA and DIPS are just based on luck, rather than defense.  And I'm not really sure that we should count luck against the pitcher when judging retrospective value.

All of these metrics have some strengths and weaknesses.  I would personally use some sort of combination of RA and tRA for judging retrospective value, and basically just tRA* for judging retrospective skill.


Adjustments

These are key to putting players on an even playing field.  A pitcher who pitches in Coors is much more likely to have a higher ERA or DIPS than one who pitches in Petco.  Furthermore, a guy who faces the Yankees every couple of weeks is likely to have a higher ERA than a guy who faces the Royals.

Adjustments should include park at the minimum, and probably batters faced as well.
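A minimal sketch of what a park adjustment might look like; the half-home/half-road weighting and the factor values are illustrative assumptions, not how any particular WAR implementation does it:

```python
def park_adjusted_ra(ra, home_park_factor):
    """Neutralize a pitcher's RA/9 for his home park.  A factor above 1
    means a hitter-friendly park (Coors), below 1 pitcher-friendly
    (Petco).  Assumes roughly half his innings come at home."""
    blended_factor = (home_park_factor + 1.0) / 2
    return ra / blended_factor
```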

Replacement level

Whatever metric you decide to use for WAR, it will basically model how many runs a pitcher should have given up per 9 innings.  For that to be useful in a value sense, you have to compare that to some sort of baseline.  The baseline for WAR is "Replacement", which is, simply put, the expected production of the last resort guy in the minor leagues or on the waiver wire.

I'm not exactly sure how replacement level is calculated for pitchers; however, I assume it's determined in a similar manner to that of hitters.  Basically, you look at the projections of the tweeners (guys between the majors and the minors) and average them out.  For starters, it's around a 5.50 ERA and for relievers it's around 4.50.  


Leverage

Leverage is basically the weight of a pitcher's innings, determined by their volatility in expected winning percentage.  So the 9th inning will have a much greater impact on which team wins than the first inning.  For starting pitchers, leverage doesn't make much of a difference, but for relievers it is instrumental in how they are valued.  Like I said, I won't go into great detail about leverage and its effects on WAR in this post.

It's getting kind of late, so I'll sign off on this post.  Basically, WAR is pretty simple for starting pitchers.  The most important aspects are the run estimator you use and what kind of adjustments you make.  For relievers, it becomes a lot more complicated; I'll go into that, as well as describe some implementations of pitcher WAR, in the next post.