Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Abstract
WAR (Wins Above Replacement) is a baseball statistic that measures a player’s value against a replacement-level player, meaning a player who gives you the production of a minimum-salary player. It is an important statistic because it expresses a player’s value relative to their peers as the number of extra wins they provide.
We set out to run a regression analysis on this statistic to see which measures are integral to WAR. Traditionally, players were evaluated on classic baseball statistics such as batting average, home runs, stolen bases, and runs batted in, but these stats do not tell the whole story. For example, they do not tell you how often a player gets on base or how often he strikes out. We want to know which statistics are the best indicators of WAR.
Regression analysis is a good tool for this because it can account for multiple factors at once when predicting WAR. The standard error and t-value reported for each factor let us judge how statistically significant that variable is. We also try to limit the number of outliers, since outliers make the adjusted R-squared score worse, and we aim to make the adjusted R-squared as high as possible.
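To make these quantities concrete, here is a minimal sketch of an ordinary least squares fit that reports a t-value for each factor along with R-squared and adjusted R-squared. It is written in plain Python for illustration and is not the R code used in the notebook. The function names (`ols`, `invert`, `matmul`, `transpose`) and the small data set are made up: one predictor drives the response and the other is mostly noise.

```python
# Minimal OLS sketch in pure Python. The data is hypothetical, not the MLB data set.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(rows, y):
    """Fit y ~ intercept + predictors.

    Returns (coefficients, t_values, r_squared, adjusted_r_squared).
    """
    X = [[1.0] + list(r) for r in rows]
    n, k = len(X), len(X[0])                 # k counts the intercept column
    Xt = transpose(X)
    xtx_inv = invert(matmul(Xt, X))
    beta = [row[0] for row in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = ss_res / (n - k)                # residual variance estimate
    # t-value = coefficient / its standard error
    t_values = [b / (sigma2 * xtx_inv[i][i]) ** 0.5 for i, b in enumerate(beta)]
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)  # penalizes extra predictors
    return beta, t_values, r2, adj_r2

# Hypothetical data: y depends on x1 (slope about 2); x2 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 1, 4, 2, 8, 3, 7, 6]
y = [3.1, 4.8, 7.15, 8.95, 11.2, 12.9, 14.85, 17.05]

beta, t_values, r2, adj_r2 = ols(list(zip(x1, x2)), y)
for name, b, t in zip(["intercept", "x1", "x2"], beta, t_values):
    print(f"{name:>9}: coef={b:.3f}  t={t:.2f}")
print(f"R^2={r2:.4f}  adjusted R^2={adj_r2:.4f}")
```

In R, `summary(lm(...))` prints the same kinds of quantities (coefficient estimates, standard errors, t-values, and adjusted R-squared), so a sketch like this is only meant to show where those numbers come from.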
We see that the fit could be better. Delete A20 (TB) and A18 (OPS).
We got a better fit for the model. Delete A11 for a better model.
The model is off by about 1.8.
For the regression analysis, I used R and ran it in a CoCalc Jupyter notebook. This made it easier to add and delete variables while keeping the work clean. I started out with 25 different factors and was able to narrow them down to 11 by adding and deleting variables.
Over the course of the project, it became clear that finding a fit I was comfortable with was going to be both an art and a science. Some statistics are not great indicators of how good a batter someone is, yet they were getting high t-values. Keeping them would have yielded a better fit, but I could not keep them in good faith.
The first few that were deleted were obvious bad fits. These included Total Bases, On-Base Percentage, and Stolen Bases. Deleting them did help me get a better fit: I went from a fit of 0.5737 to a fit of 0.909, which is close to one. Although that is a great fit, as I mentioned before, there were some stats that I felt needed to be removed from the model. I will discuss two examples of the ones I deleted.
The first example is Hit by Pitch. This stat only tells us how many times a player is unintentionally hit by a pitch. At the major league level, getting hit by a pitch is not a common occurrence. The top five players had only 8 to 10 in this category, and most players had over 600 plate appearances. Players should not expect to be hit by a pitch when they step into a Major League batter’s box. Yet Hit by Pitch had a t-score of -3.698, indicating that it is a statistically significant variable.
Another example is Sacrifice Hits. Out of the 26 players with the highest WAR scores, 18 had a zero in this category. Sacrifice Hits are situational and are not great indicators of success per plate appearance. A Sacrifice Hit happens when a player attempts to move baserunners into scoring position by bunting. The problem is that we are looking at the top 26 players by WAR, which means they probably provide valuable offense to their teams. If they are among the best players, why would you not just let them try to get a hit? A player could also provide more value by getting a hit, because then both the batter and the baserunners could be on base. Despite this, Sacrifice Hits had a large t-value of -4.834.
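To get a rough sense of how strong these two t-values are, the sketch below converts a t-value into an approximate two-sided p-value. It assumes the sample is large enough to use the standard normal distribution in place of the t distribution, because the model's degrees of freedom are not given here. The helper name `approx_two_sided_p` is hypothetical.

```python
import math

def approx_two_sided_p(t_value):
    """Approximate two-sided p-value for a t-value.

    Assumption: uses the standard normal distribution as a large-sample
    stand-in for the t distribution, since the degrees of freedom are unknown.
    """
    return math.erfc(abs(t_value) / math.sqrt(2))

p_hbp = approx_two_sided_p(-3.698)  # Hit by Pitch
p_sh = approx_two_sided_p(-4.834)   # Sacrifice Hits

print(f"Hit by Pitch:   p ~ {p_hbp:.2e}")
print(f"Sacrifice Hits: p ~ {p_sh:.2e}")
```

Under this approximation both values fall far below the usual 0.05 threshold, which is why the regression flags these stats as significant even though baseball knowledge says they are weak indicators of batting ability.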
There were more stats like these that I removed, but I was able to settle on a model that included stats like Batting Average, Slugging Percentage, and On-Base Plus Slugging. What I like about these stats is that they tell you what a player is doing on a per-at-bat basis. For example, On-Base Plus Slugging takes into account many aspects of great hitting. Slugging Percentage tells us about a player’s ability to hit for average and power, using singles, doubles, triples, home runs, and at-bats in its calculation. On-Base Percentage takes into account the main ways a player can reach base, including hits, walks, and hit by pitch. Add On-Base Percentage and Slugging Percentage together and you get On-Base Plus Slugging, a high-quality statistic that tells us a lot about a batter.
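As an illustration of how these three stats fit together, here is a short Python sketch using the standard definitions: Slugging Percentage is total bases per at-bat, and On-Base Percentage is times on base (hits, walks, hit by pitch) divided by at-bats plus walks, hit by pitch, and sacrifice flies. The function names and the stat line are hypothetical and do not describe any real player.

```python
def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

# Hypothetical season line, made up for illustration.
singles, doubles, triples, home_runs = 90, 30, 5, 25
at_bats, walks, hit_by_pitch, sac_flies = 500, 70, 5, 5
hits = singles + doubles + triples + home_runs

slg = slugging_pct(singles, doubles, triples, home_runs, at_bats)
obp = on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies)
ops = obp + slg  # On-Base Plus Slugging

print(f"SLG={slg:.3f}  OBP={obp:.3f}  OPS={ops:.3f}")
```

Because On-Base Plus Slugging is just the sum of the other two, it is strongly correlated with both of them, which is worth keeping in mind when all three appear in the same regression.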
I was able to get an adjusted R-squared score of 0.7602, which is not as good as 0.909. When I tested the model on two of the best players in MLB in the 2019 season, Mike Trout and Cody Bellinger, I was off by 1.8 and 2.6. Obviously, this model has room for improvement. Perhaps I should not have taken out so many stats that had high t-scores, even though they might not be the greatest indicators of batting ability. The next time I do a project that involves regression analysis, I would like to build a model with a better balance of stats that gives me a better fit. Perhaps a regression analysis on a subject in which I have no background knowledge would improve my model, because I would be looking at the numbers with no bias. This project has shown me that the applications of regression analysis are endless: it applies to pretty much anything in which a number of independent factors lead to a statistic.