Site Features: More On Postseason Boxscores
Posted by Neil Paine on October 5, 2010
Yesterday I gave a big-picture overview of all the features we have in our Postseason Section, so today I wanted to talk about a really cool, underrated feature of our playoff boxscores...
As you probably know, our regular-season play-by-play data stretches back to 1950, which is remarkable; it is to Retrosheet's credit that anything close to that amount of information is available. But did you know that our postseason boxscores have play-by-play accounts for the entirety of the World Series era?
That's right, we have play-by-play descriptions of baseball games that happened 107 years ago. Not only that, but we have Win Probability statistics and graphs for those games! I don't know about you, but I think that's pretty amazing. In fact, basically anything you can do in a 2010 box score, you can do for postseason games going back to 1903.
So if you ever wondered how much WPA Babe Ruth cost the Yankees when he was caught stealing to end the 1926 World Series, we can answer that and many more questions. Play around for a while in the Postseason Section, and you may find a whole new way to look at games that happened a century ago.
October 5th, 2010 at 4:50 pm
What are the win probabilities for games from 100 years ago based on? This year's offensive numbers, or the small-sample numbers from 100 years ago?
October 5th, 2010 at 5:11 pm
Why do you say there is a small sample size?
October 5th, 2010 at 5:49 pm
#2, if it is win probability, how do you determine what the probability is in the 6th inning up 2 runs if all the data you have is 20 postseason games that season?
October 5th, 2010 at 5:56 pm
I think he's talking about the base-out-inning win expectancy that requires play-by-play data.
I don't know how much of that stuff is era dependent. You could likely adjust the runs->wins converter by simply using the level of offense. But the deadball era featured a much different style of play with more fielding miscues. How different was the base-out matrix back then?
October 5th, 2010 at 6:39 pm
Right, I can't really speak for how Sean does it, but I believe you can tweak the base-out matrix for the environment, much like Tango did here:
http://www.tangotiger.net/customlwts.html
The various offensive events are worth more or less based on the RPG environment in which they occurred.
October 5th, 2010 at 6:58 pm
Neil, gotta try it to appreciate it, I guess.
Am I understanding your post correctly to mean that there are more complete data available for 107-year-old postseason games than there are for comparable regular-season games?
I am aware of a lot of gaps in the minor offensive stats around that time for regular-season games.
October 5th, 2010 at 7:00 pm
What Neil said.
October 5th, 2010 at 7:36 pm
Isn't that approach a little flawed? They might have scored the same number of runs in 1930 as they do today, but they did it in a different way. Batting average, OBP, and doubles were higher, while today there are more HR, more K's, more BB, plus better pitchers as relievers at the end of the game changing the run distribution of late innings. Are the numbers for the 1920s taking that into account? Is the model you are basing it on empirical?
October 5th, 2010 at 11:48 pm
Sean, which Neil? There are two of them in this discussion.
October 6th, 2010 at 8:05 am
The model is not empirical. It is based on simulations given the run scoring environment. I'm not sure the manner in which the runs score will affect the team's win probability all that much.
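To make "simulations given the run scoring environment" concrete, here is a minimal sketch of the general idea. It is not the site's actual model: the function names, the Poisson assumption for runs scored, and the even split of runs across nine innings are all simplifications chosen for illustration.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def win_probability(lead, innings_left, runs_per_game, trials=20000, seed=0):
    """Estimate P(win) for a team ahead by `lead` runs (negative if behind)
    with `innings_left` full innings still to be played by both sides.

    Assumption: each team's runs per inning are Poisson with mean
    runs_per_game / 9. Real scoring is lumpier than that.
    """
    if runs_per_game <= 0:
        raise ValueError("runs_per_game must be positive")
    rng = random.Random(seed)
    per_inning = runs_per_game / 9.0
    wins = 0
    for _ in range(trials):
        # Runs over the remaining innings: a sum of Poissons is Poisson.
        ours = poisson(rng, per_inning * innings_left)
        theirs = poisson(rng, per_inning * innings_left)
        margin = lead + ours - theirs
        # Tied after regulation: play one extra inning at a time.
        while margin == 0:
            margin = poisson(rng, per_inning) - poisson(rng, per_inning)
        wins += margin > 0
    return wins / trials


if __name__ == "__main__":
    # Same two-run lead with three innings left, in two run environments.
    for rpg in (3.0, 5.0):
        print(rpg, win_probability(2, 3, rpg))
```

The only era-specific input here is `runs_per_game`, which is the point of the approach: a given lead is safer in a low-scoring environment than in a high-scoring one, and the simulation picks that up without needing play-by-play data from the era.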
As for the relievers, that cuts both ways. The team ahead now brings in relievers and the team behind will do so as well, so a lot of those effects cancel out.
I guess here is the question. Construct two teams, one that walks a lot and only hits home runs and the other that only hits singles. The tricky part will be creating the event rates such that their overall run scoring is the same. The question is then whether the two teams have different probabilities of scoring different numbers of runs. It's possible they do, but you'd need to show me a model that demonstrates that before I'm going to spend a lot of time building different event rates into the system. I'm not convinced that the probabilities of one run, two runs, etc. will change all that much.
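The two-team experiment described above could be set up roughly as follows. This is a sketch under loud assumptions, not a finished model: the event rates (0.12 walks and 0.08 home runs per plate appearance) are made up, every runner advances exactly one base on a single, and there are no steals, errors, or double plays. All function names are hypothetical. The singles rate is found by bisection so that both teams average the same runs per inning.

```python
from collections import defaultdict

EMPTY = (False, False, False)  # runner on (first, second, third)


def advance(bases, event):
    """Return (new_bases, runs_scored) for a walk 'W', single 'S', or homer 'H'."""
    first, second, third = bases
    if event == "H":
        return EMPTY, 1 + first + second + third
    if event == "S":  # assumption: every runner moves up exactly one base
        return (True, first, second), int(third)
    # Walk: only forced runners move.
    scored = int(first and second and third)
    return (True, first or second, third or (first and second)), scored


def run_distribution(p_walk, p_single, p_hr, max_pa=60):
    """Probability of scoring exactly r runs in one half-inning.

    Tracks (outs, bases, runs) one plate appearance at a time; the mass
    left unresolved after max_pa plate appearances is negligible.
    """
    p_out = 1.0 - p_walk - p_single - p_hr
    events = [(e, p) for e, p in (("W", p_walk), ("S", p_single), ("H", p_hr)) if p > 0]
    live = {(0, EMPTY, 0): 1.0}
    done = defaultdict(float)
    for _ in range(max_pa):
        nxt = defaultdict(float)
        for (outs, bases, runs), p in live.items():
            if outs == 2:
                done[runs] += p * p_out  # third out ends the inning
            else:
                nxt[(outs + 1, bases, runs)] += p * p_out
            for event, p_event in events:
                new_bases, scored = advance(bases, event)
                nxt[(outs, new_bases, runs + scored)] += p * p_event
        live = nxt
    return dict(done)


def mean_runs(dist):
    return sum(runs * p for runs, p in dist.items())


def match_singles_rate(target_mean, lo=0.0, hi=0.6, steps=30):
    """Bisect for the singles rate whose mean runs per inning hits target_mean."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if mean_runs(run_distribution(0.0, mid, 0.0)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    walkers = run_distribution(p_walk=0.12, p_single=0.0, p_hr=0.08)
    rate = match_singles_rate(mean_runs(walkers))
    singlers = run_distribution(0.0, rate, 0.0)
    for runs in range(4):
        print(runs, round(walkers.get(runs, 0.0), 4), round(singlers.get(runs, 0.0), 4))
```

Printing the two columns side by side shows how the chances of zero, one, two, and three runs compare once the averages are equal. The one-base-per-single rule is deliberately extreme, so any gap it produces is an upper bound on the effect rather than an estimate of it; more realistic baserunning would need to be added before drawing conclusions.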
October 6th, 2010 at 8:15 am
I dutifully went back to the Ruth game...and noticed that, at the end of the Yankee 9th, the play by play read "0 runs, 0 hits, 0 errors, 1 LOB. Cardinals 3, Yankees 2."
Does the site do this whenever a caught stealing ends an inning?
October 6th, 2010 at 8:16 am
Sorry, meant to bold just the 1 LOB
October 6th, 2010 at 8:20 am
Yes, it does. I'll look into that.