Stat Stability: When Can We Predict Season Outcomes (and Some Other Things)?

Identifying settling times of various batting statistics

Pace Balster


Jun 15, 2025

Regression
Predictive Analytics
Descriptive Statistics
Sample Size is a Buzzkill

If sample size were a person, "actually..." would be its favorite word. Sample size turns otherwise interesting storylines into dry statistical fact. Me: "That 12-game hit streak means he's finally breaking out of his slump, right?" Sample Size: "No, you need around X PA before drawing any relevant conclusions about hitting performance." Me: "Wow, I can't wait to watch Player X face off against Player Y tonight. Player X really seems to have his number." Sample Size: "Player X's success against Y is purely a product of chance. You should have no expectation that his performance against Y will continue."


Unfortunately, Sample Size is usually right, at least insofar as the number of data points we have access to is usually directly correlated with how much we can trust the observed trend for any probabilistic process. Without an adequate sample size it's impossible to separate signal from noise, and noise is enemy #1 of any data scientist. That said, I think there's some beauty in the noise we observe before the signal emerges. The stochastic nature of baseball that creates pockets of hit streaks, postseason tears, batters who have a pitcher's number, etc. is part of what we all love about the sport. Even if many of these are products of chance made possible by small sample sizes, the narratives and storylines are ultimately valuable for their own sake. And the cool thing is that every once in a while, one of these storylines turns out to be not simply a product of probability theory, but a legitimate trend. Unfortunately, baseball romanticism can't factor into the mindset of scouts, R&D teams, and the sports betting community, whose baseball analysis becomes the constant art of separating signal from noise; more accurately, of identifying that signal before their competitors do.


But how can we know when the signal emerges? When can we confidently say that a hitter's .350 average to start the season is sustainable? (Many people were asking this about Luis Arraez in 2023.) How many games do I need to see before I can trust a player's performance? How big does the sample size need to be? Well, this post is about tackling this problem systematically. Let's start with batters and define some specific questions. Namely, how many games do I need to see from a player before I can expect their Batting Average (BA), On-Base Percentage (OBP), Slugging Percentage (SLG), On-Base Plus Slugging (OPS), weighted On-Base Average (wOBA), Strikeout Percentage (K%), Walk Percentage (BB%), or weighted Runs Created Plus (wRC+) to hold for the season? Which of these stabilizes first? And how does this vary from player to player?

Drawing Premature Conclusions

Willson Contreras caught a lot of heat at the beginning of 2023. His new Cardinals team was losing and people wanted someone to blame. Contreras replacing HOFer and beloved Cardinal Yadier Molina gave fans an easy target, making him the de facto scapegoat for the spiraling team. Now, I'm a Cubs fan, and as much as I relished the newfound dysfunction of the Cardinals, I couldn't help but start to feel for the guy. It was obvious that the issues with the Cardinals ran much deeper than Contreras. But in some ways I get the Cardinals' frustration. Willson was brought in to add offense at the catcher position and he simply wasn't hitting. His OPS through the first two months of the year (March 30 - May 30) was 0.654. Not great. At the time, this was cause for panic among Cardinals fans. Two months is a large sample size, right? It looked like something had happened to one of the league's top offensive catchers. Nope. Contreras ended up raising his OPS to 0.826 by the end of the season, a full 0.172 points higher. The takeaway? We jumped to a conclusion too early, driven by trying to fit limited information to a story we wanted to be true rather than waiting for the requisite body of evidence to confirm it. Humans just cannot go without creating narratives. After all, the assertion "Contreras's 0.654 OPS is likely caused by some component of bad luck and random chance" isn't nearly as flashy as "Contreras is washed and can't handle the pressure of filling Molina's spot in St. Louis."


The interesting thing is that Contreras didn't even top the list of the largest differences between OPS through May and season-ending OPS for players with at least 100 PAs. In fact, he was 14th. Michael Harris II started the season with a 0.534 OPS through two months and finished at 0.808. That's an incredible run. In the other direction, Owen Miller started the season with a red-hot 0.871 OPS and finished the season at 0.674. I've outlined how these player stats can change drastically between what you'd think would be a solid two-month sample and the end of the year. So that begs the question: at what point during the season can I trust a player's OPS, or any statistic for that matter? This is the question I will be attempting to answer in this post.

Our Approach

Starting from first principles, let's define what causes noisy, untrustworthy stats: variance. Sampling a statistic with higher variance requires more samples to accurately estimate the mean, or the signal. So we'll start by saying that the standard deviation (the metric we'll use for variance) should be correlated in some way with the "stability", or "settling time", of a statistic.
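To make that intuition concrete, here's a quick back-of-the-envelope sketch (mine, not part of the analysis below) that treats each at-bat as a hit/no-hit coin flip with an assumed .300 true talent. The standard error of the observed average shrinks like sigma over the square root of the sample size, which is exactly why noisier stats need more playing time before we can trust them.

```python
import math

true_talent = 0.300                                   # assumed "true" batting average
sigma = math.sqrt(true_talent * (1 - true_talent))    # per-at-bat std dev of a hit/no-hit outcome

for n_ab in (50, 200, 600):
    standard_error = sigma / math.sqrt(n_ab)          # expected noise in the observed average
    print(f"{n_ab:>3} AB -> standard error of observed AVG ~ {standard_error:.3f}")
```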


What do I mean by "stability" and "settling time"? Simply put, how long does it take for a stat to "even out", or settle to a value that represents a player's true ability? There are many different ways to approach this question, and therefore to calculate stability as we've defined it. For this post, we are going to explore one such approach, which I'm choosing to call Sliding Window Stability. I may look at other methods in a future follow-up post.


So how does it work? Well, let's go back to our standard deviation variable. If we look at the running value of any given statistic as the season progresses, we expect the sample standard deviation to keep decreasing and the mean to progressively converge toward its final value, because our sample size, N, keeps increasing. This is pretty straightforward and perhaps even a bit obvious, but let's look at a specific example to really hammer home the idea. Below is a graphic of Ian Happ's OPS over the course of the 2024 season, with the x-axis representing games played.


Ian Happ 2024 OPS

The settling phenomenon becomes immediately clear when we look at a chart like this: as we accumulate more games into our sample, the trajectory stabilizes. Here's another example:


Elly De La Cruz 2024 OPS

It's pretty clear that there's a certain point where the stat becomes relatively close to the number it will eventually finish at for the season. So how do we determine quantitatively where this point is? Well, one way would be to apply a threshold to some type of rolling window - let's say the standard deviation over the last 10 games played. Note that even though we're only looking at the latest 10 games, each of those data points is still the season-to-date value of the statistic. As we progress through the season, the standard deviation of the season OPS over the latest 10 games will decrease, and if we apply some type of threshold value, say 0.04, we can find the exact game where our data smooths out to the point that we can call it stable.
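Here's a minimal sketch of that first criterion (assuming pandas, and assuming `game_ops` is a list of a player's cumulative season OPS after each game - the same series plotted in the charts above - with the 10-game window and 0.04 threshold taken from the example values in the text):

```python
import pandas as pd

def first_stable_game(game_ops, window=10, threshold=0.04):
    """Return the first (1-indexed) game where the rolling standard deviation
    of the running season OPS drops below the threshold, or None if it never does."""
    season_ops = pd.Series(game_ops)
    rolling_std = season_ops.rolling(window).std()   # std dev of the last `window` season-to-date values
    below = rolling_std[rolling_std < threshold]
    return None if below.empty else int(below.index[0]) + 1
```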


This is a good start, but it has some problems on its own. First, the size of the window and the threshold we choose are obviously going to have a huge impact on what we consider stable. How do we determine what those values should be? Let's put a pin in that for now. The second issue with using the standard deviation alone as a metric of stability is that temporary periods of perceived stability prior to actual stability are quite common in baseball. It's not unusual for a player to consistently get about one hit per game over a 10-game stretch, resulting in little change to a season-long rolling statistic such as OPS. Yet they might follow that run with a 10-game cold streak, logging only a couple of hits over the entire stretch and dragging the season OPS down significantly. Therefore, I propose we look at another parameter, which we'll call Standard Deviation Rate of Change. My assertion is that it's important to look not only at the current standard deviation, but also at how the standard deviation has changed recently - essentially a standard deviation velocity, or derivative, parameter. I calculate this by adding a second, trailing 10-game window behind our first window and comparing the current standard deviation with that of the trailing window. That way, if we just went from a highly unstable period to a weirdly or prematurely stable one, we don't mistake it for true stability. Essentially this eliminates false positives in our calculation, allowing us to avoid the illusion of stability before it actually occurs. Like our standard deviation criterion, we can apply a threshold to the std dev ROC. Now we have two parameters, and we want both to be satisfied before we say a stat has settled.
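Extending the earlier sketch, the two-window check might look something like this. The 0.02 ROC threshold is a placeholder of mine, not a tuned value from the analysis:

```python
import pandas as pd

def stability_checks(game_ops, window=10, std_thresh=0.04, roc_thresh=0.02):
    """Return a boolean Series that is True where both the rolling std dev and
    its recent rate of change are inside their thresholds."""
    season_ops = pd.Series(game_ops)
    current_std = season_ops.rolling(window).std()
    trailing_std = current_std.shift(window)          # std dev of the window one full window earlier
    std_dev_roc = current_std - trailing_std          # how much the std dev has moved recently
    return (current_std < std_thresh) & (std_dev_roc.abs() < roc_thresh)
```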


So this is a great start toward establishing statistical criteria for baseball stability, but I would argue we can go further. What's nice about our current approach is that it's a forward-looking metric: we don't require any prior knowledge beyond the data that exists up until today to make a prediction of stability. However, it doesn't take advantage of the fact that we have a lot of historical data to look back on and draw additional conclusions from. For instance, how long did it take Javier Baez's OPS to settle in 2022, or every other player's in the league for that matter? We should take advantage of the entire picture of players' performances over the years to see when a stat really settled. One thing we can look at is how the season statistic calculated through a given number of games compares to a player's final season value. It's as simple as mean_diff = season_mean - current_mean. Again, we can apply a threshold here to determine the game when the running value was close to the final value. Of course, on its own this metric has a lot of problems as a stability criterion. A player's current performance will likely bounce around their eventual season performance, producing periods of time where the criterion is satisfied well before the statistic has settled. However, when we combine it with our other criteria, we begin to get a more comprehensive way to evaluate stability. Essentially, we can look back at historical player performances, go game by game, calculate all of these metrics, and apply our thresholds. Our criteria will require all three metrics to cross their thresholds. When they do, we'll say the stat has stabilized for player X at game Y. We can then repeat this process for every player who reached some minimum number of plate appearances for the season and average the results, creating a general estimate of the games played needed for stability for each batting statistic mentioned earlier.
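Putting all three criteria together, the backward-looking procedure could be sketched roughly as follows. This is my own illustration, not the exact implementation: `player_series` (a dict of running season OPS lists per qualified player) and the specific threshold values are assumptions.

```python
import pandas as pd

def games_to_stability(game_ops, window=10,
                       std_thresh=0.04, roc_thresh=0.02, mean_thresh=0.03):
    """First (1-indexed) game where the rolling std dev, its rate of change,
    and the gap to the final season value all fall inside their thresholds."""
    season_ops = pd.Series(game_ops)
    current_std = season_ops.rolling(window).std()
    std_dev_roc = current_std - current_std.shift(window)
    mean_diff = season_ops.iloc[-1] - season_ops       # final season value minus value to date
    stable = ((current_std < std_thresh)
              & (std_dev_roc.abs() < roc_thresh)
              & (mean_diff.abs() < mean_thresh))
    hits = stable[stable]
    return None if hits.empty else int(hits.index[0]) + 1

def league_average_stability(player_series, **thresholds):
    """player_series: dict mapping player name -> list of running season OPS values.
    Returns the mean and standard deviation of games-to-stability across players."""
    results = pd.Series({name: games_to_stability(ops, **thresholds)
                         for name, ops in player_series.items()},
                        dtype="float64").dropna()
    return results.mean(), results.std()
```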


I created a couple of visualizations to articulate how this works. Let's look at Bryce Harper's OPS first:


Bryce Harper OPS settling animation

Here's the result for Willi Castro:


Willi Castro OPS settling animation


We can apply this same process to any player, giving us an automated way to try out different thresholding criteria. With the basic criteria I used to create those animations, it's noticeable that Harper's OPS reaches our settling criteria 20 games before Castro's. But how do we choose those thresholds? Well, I admit this requires a bit of trial and error. My approach is to iterate the process described above. When we look at stability across all players for a given set of stability criteria, there will inevitably be variance in the resulting games_to_stability metric we derive at the end: some players will reach the stability criteria faster than others. Aggregating across all players, we'll get a roughly normal spread of games_to_stability. We don't want this spread to be too large, otherwise our criteria are too loose. Too tight, and we might be waiting longer than we need to before saying a stat has stabilized. So I suggest we iterate our thresholds until we reach a desired macro-level spread of stability across all players. Here's where it gets a bit subjective. I found that a single standard deviation of 8 games struck a good balance between early predictability (the mean of games_to_stability is relatively low) and clear predictability of the end-of-season statistic. Below is a table that shows how changing the stability criteria thresholds impacts the macro-level statistics:


One way to orient ourselves before shooting in the dark is to look at the settling behavior of the stability metrics themselves. If we have an idea of where the difference between the final stat and the rolling value gets small, and where the std dev tends to settle, those will be good starting points for selecting our thresholds.


Here are the league-average OPS settling metrics plotted at each game for all qualified hitters. To use all the data, we're limited to the minimum number of total games played by any player in our dataset, which is why the charts don't go out to 162. Additionally, the charts start at 10 games, since that's the first point at which we can compute our rolling standard deviation; the std dev ROC compares two windows, which requires at least 20 games.

Macro Settling Trends of League-Wide OPS
How our Thresholds Change as a Function of Games Played for All Players

It looks like each metric decays exponentially at first before settling into a linear decline - at least that's the case for the rolling mean differential. The standard deviation plots may continue decaying asymptotically toward zero. Either way, in each chart we can see the elbow where the transition out of the steep decline occurs. I've plotted a reference line around where I think this happens: around 25 games for the rolling mean differential, about 30 games for the standard deviation, and 35 games for the std dev ROC. I can't be sure that the settling times for these threshold metrics will match the settling time of our statistic of interest, OPS; we'll simply use these charts as a guide for selecting our stability criteria.
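For what it's worth, league-wide curves like these could be assembled along the following lines, under the same assumptions as the earlier sketches (`player_series` maps each qualified hitter to their running season OPS by game):

```python
import pandas as pd

def macro_settling_curves(player_series, window=10):
    """Average the three settling metrics across players at each game number,
    truncated to the shortest season in the sample."""
    min_games = min(len(ops) for ops in player_series.values())
    per_player = []
    for ops in player_series.values():
        season_ops = pd.Series(ops)
        current_std = season_ops.rolling(window).std()
        std_dev_roc = (current_std - current_std.shift(window)).abs()
        mean_diff = (season_ops.iloc[-1] - season_ops).abs()   # gap to the final season value
        per_player.append(pd.DataFrame({"mean_diff": mean_diff,
                                        "std_dev": current_std,
                                        "std_dev_roc": std_dev_roc}).iloc[:min_games])
    return sum(per_player) / len(per_player)    # element-wise average at each game number
```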


So my approach will be to select a set of values, one for each of our three thresholds. We'll then iterate through every qualified player season in 2024, checking the threshold values against the computed metrics at each game played throughout the season. At some point all of the criteria will be met, and we'll log the game at which this occurred for that player. We'll then compile the results from every player season and compute the mean and standard deviation, giving us an overall "average games to stabilization" metric and the variation of that metric. Then we'll repeat this process for a new set of thresholds, and again until we have around 10 or so tests completed. This is an arbitrary number of cases to run; for a more thorough study we'd want to run more.
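A rough sketch of that sweep, reusing `league_average_stability` and `player_series` from the earlier sketches; the candidate threshold values below are placeholders, not the cases actually run:

```python
# Placeholder threshold cases: one value per criterion
# (rolling std dev, std dev rate of change, difference from final season value).
cases = [
    {"std_thresh": 0.05, "roc_thresh": 0.03, "mean_thresh": 0.04},
    {"std_thresh": 0.04, "roc_thresh": 0.02, "mean_thresh": 0.03},
    {"std_thresh": 0.03, "roc_thresh": 0.01, "mean_thresh": 0.02},
    # ...more cases, ~10 in total
]

results = []
for case in cases:
    mean_games, spread = league_average_stability(player_series, **case)
    results.append({**case, "mean_games_to_stability": mean_games, "spread": spread})

# Keep the loosest case whose spread lands near the ~8-game target discussed above.
for row in sorted(results, key=lambda r: r["mean_games_to_stability"]):
    print(row)
```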


Here are the cases I ended up running along with the results:


Conclusion

There are many ways to measure stat stability, and while this post looked at a relatively simple method built on statistical thresholds and a sliding game window, we're only scratching the surface. I may explore this topic again in the future using different, more complex methods. Hopefully this post illuminated one thing, though: sample size matters, whether we like it or not.