Resource Random Battles Level Balancing

Chains of Markov · May 10, 2023

Why?
The Random Battles formats include every fully evolved Pokemon, but not every Pokemon is equally strong. If every Pokemon was simply level 100, you could not make a fun format that includes Luvdisc, Miraidon and everything in between. The tool we use to mitigate this problem is level balancing. By setting Miraidon's level much lower than its peers, we hope to bring its power roughly in line with the rest of the lineup.

How?
Old method
The traditional way to balance levels is by assigning levels to tiers, and manually adding a couple of exceptions. For example, in the tier-based system for Gen 7 an OU Pokemon would get level 80, and a PU Pokemon would get 88. Exceptions were mainly used to increase the level of Pokemon weaker than the average PU, for example setting Unown and Luvdisc to 100.
This system is currently still used for Gen 2.

New method
For the first time in Gen 8, we used a new method to balance levels: winrates. By going through the database of games played, it was possible to find out how often each Pokemon won. Clearly the average winrate is 50%, so each Pokemon significantly below that line was buffed, and each Pokemon significantly above was nerfed. This scraping of the database was quite a bit of work, and was undertaken by Annika. The fruits of her labor are still visible here. The increased balance that this method brought to the format was widely positively received.
Random Battle winrates have since gotten their own database, implemented by Mia, which anyone can view by typing /rwr in a chatroom, greatly simplifying this process. This has made it possible to not just extend winrate balancing to Gen 9, but also to retroactively balance Gens 3, 4, 5, 6 and 7. The end of March of this year saw the first old-gen balance patch for Gens 3-7. Gen 2 is left out because it does not have enough games played to get accurate statistics, and Gen 1 because its unique qualities make it unsuited for winrate balancing, as explained here.
The exact implementation details:

Every month we take the wins/losses from the /rwr data for each Pokemon, which includes every game played above 1300 Elo (1500 for current gen, and 1150 for gen 2), and check if it deviates from 50% with sufficient magnitude and certainty. Specifically, for every 1.5% deviation from 50% that is significant at the p<0.01 level using a one-tailed binomial test we buff or nerf a Pokemon by one level. To avoid anomalies, we cap this at max 3 levels per month. For the least drastic action we can take, a 1 level change, we maintain less rigorous standards, allowing a 1% deviation at p < 0.05. These parameters are still subject to change.
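As a rough illustration of the rule described above, here is a minimal Python sketch. It is not the actual Showdown code: the function names are made up, and it assumes that "magnitude" means the observed deviation from 50% and "certainty" means the one-tailed p-value against a true 50% winrate.

```python
from fractions import Fraction
from math import comb


def one_tailed_p(wins: int, games: int) -> float:
    """Exact one-tailed binomial p-value against a true winrate of 50%."""
    # At 50% the distribution is symmetric, so we test the tail on whichever
    # side of 50% the observed result falls.
    k = max(wins, games - wins)
    term = comb(games, k)  # number of outcomes with exactly k wins
    tail = 0
    for i in range(k, games + 1):
        tail += term
        term = term * (games - i) // (i + 1)  # C(n, i+1) from C(n, i)
    return tail / 2 ** games


def level_change(wins: int, games: int) -> int:
    """Levels to add (buff, positive) or remove (nerf, negative)."""
    if games == 0:
        return 0
    # Exact fractions avoid floating point trouble right at the thresholds.
    deviation = abs(Fraction(wins, games) - Fraction(1, 2))
    p = one_tailed_p(wins, games)
    levels = 0
    if p < 0.01:
        # One level per full 1.5% of deviation, capped at 3 per month.
        levels = min(deviation // Fraction(3, 200), 3)
    if levels == 0 and deviation >= Fraction(1, 100) and p < 0.05:
        # Looser standard for the smallest possible change.
        levels = 1
    # Winning too much means a nerf, winning too little means a buff.
    return -levels if 2 * wins > games else levels
```

For example, under these assumptions a mon with 2660 wins in 5000 games (53.2%) would drop two levels, while one with 2565 wins in 5000 games (51.3%) would only qualify for the looser 1-level change.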

To increase sample size, each Arceus and Silvally forme gets the same level, allowing for grouped wins and losses. Current gen is the exception here, as its higher game count allows for individual balancing. To further increase old gen sample size, we have now started using multiple months of data per patch for Pokemon that haven't been buffed or nerfed for multiple months.

Results
The below plot shows a histogram of the Gen 9 winrates for February, March and April of 2023. Between each of these months there was a round of balancing, and we can clearly see a tightening towards 50% winrates, with accompanying decrease in standard deviation (σ). The biggest outlier is Luvdisc, which is already at level 100 and thus cannot be buffed further, unfortunately remaining weak in perpetuity.

In Gen 7, as shown below, the distribution is wider. Partly this is because Gen 7 is less balanced, and partly because the lower sample size amplifies noise. Regardless, there is a clear and drastic decrease in variance after March, which coincides with the first balance round.

In summary, winrate-based level balancing provides a convenient and effective balancing mechanism, and is now being applied not just to current gen, but Gens 3-7 as well. It is not without faults: Pokemon that are difficult to use may be weaker at lower ratings than higher, and of course level is but one aspect of balance. However, the huge outliers (Kyogre had a 66% winrate in Gen 3!) should be a thing of the past.

If you have any comments or questions, feel free to post in this thread!

MrSoup · May 10, 2023

Cool thread. How much does gen 2 lag to be insufficient in sample size?

Celever · May 10, 2023

MrSoup said:
Cool thread. How much does gen 2 lag to be insufficient in sample size?

The original plan was for gen 2 to have level balancing with all the gens after it, so we did actually track winrates. However, each Pokémon was being generated a maximum of sort of 40 or 50 times in a month. Considering the number of variables in Pokémon, particularly the other Pokémon on your team, the matchup, etc., it was absolutely not statistically significant.

In one of the months it was tracked, Stantler had the best winrate with like 63%, the kind of broken Kyogre was in gen 3 in its first month. That's obviously not representative of gen 2 rands as a format, y'know.

Chains of Markov · May 10, 2023

MrSoup said:
Cool thread. How much does gen 2 lag to be insufficient in sample size?

To decrease server load as much as possible, we've stopped gathering gen2 stats, so I don't have entirely up-to-date numbers. However, I saved an example from back in January: the 95% confidence interval (p<0.05) for Starmie's winrate was somewhere between 26.9% and 58.2%.
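To get a feel for why a few dozen games gives such a wide interval, here is a small Python sketch using the Wilson score interval. The choice of the Wilson interval and the win/game counts in the usage note are assumptions for illustration only; the post does not say which interval or which counts were actually used.

```python
from math import sqrt


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a winrate; z = 1.96 gives roughly 95%."""
    p_hat = wins / games
    denom = 1 + z ** 2 / games
    centre = (p_hat + z ** 2 / (2 * games)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / games + z ** 2 / (4 * games ** 2)) / denom
    return centre - half, centre + half
```

With a hypothetical 16 wins in 38 games, `wilson_interval(16, 38)` spans roughly 30 percentage points and comfortably includes 50%. With the same winrate over 4000 games, the interval shrinks to about 3 points.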

Betathunder · May 10, 2023

Is the process going to be limited strictly to ladder? Just curious if the addition of Smogon hosted tournament battles (eg, Rands Slam or RBTT) would be considered for the calculations as well, although I understand that developing code for this would be a difficult task.

Chains of Markov · May 10, 2023

Betathunder said:
Is the process going to be limited strictly to ladder? Just curious if the addition of Smogon hosted tournament battles (eg, Rands Slam or RBTT) would be considered for the calculations as well, although I understand that developing code for this would be a difficult task.

It's only ladder, yes. Smogon tours are a very small part of the total number of games. In RBTT, for example, each format is only played about 80 times, which in the format with the fewest unique Pokemon (gen1) would only add about 80 * 12 / 146 ≈ 6.6 games per mon (12 being the number of Pokemon across both teams in a game). Even ignoring that RBTT is spread out over multiple months, that just doesn't really matter given the 1000+ monthly games every mon already has.

There are other arguments to be had about how representative Smogon tour matches are, whether we want to balance for the highest level of play or for the common ladder player, and how much effort it would be to implement something like this, but the real argument for not including them is the small sample size, at least if you ask me.

sharpclaw · May 10, 2023

numbers go brrrr

(I'm continually in awe of the mathematically-minded members of the Randbats staff, and how they direct their intelligence and know-how towards improving our various formats. thanks for this awesome writeup, Mark!)

MrSoup · May 10, 2023

Appreciate the response and understand the lack of ladder activity. I guess it’s sort of a cool thing that gsc can have its own format. Although have y’all considered using multiple months for gsc? Or is that not in the cards?

edit: fyi stantler is broken though and might be the best in the format tbh

Chains of Markov · May 10, 2023

MrSoup said:
Appreciate the response and understand the lack of ladder activity. I guess it’s sort of a cool thing that gsc can have its own format. Although have y’all considered using multiple months for gsc? Or is that not in the cards?

edit: fyi stantler is broken though and might be the best in the format tbh

It was very silly of me to not save the gen2 stats, but from what I remember I think it was so much less that even multiple months didn't allow for any meaningful statistical conclusions. These stats scale worse than linearly w.r.t. games played, because Elo distributions are rather bottom heavy, so a larger percentage of gen2's games don't make the 1300 Elo threshold. On the other hand it's a complicated game largely unlike the others, so you really do want to maintain the Elo threshold to prevent easy-to-use mons from dominating.

Personally, I'd think that sitting down with some top gsc rands players and manually releveling might be a better option. This sort of individualized balancing was strongly discouraged for a long time for fear of opening the floodgates, but now we have a fairly individual system for every gen, so that may not be as bad anymore. That's not up to me to decide however, nor would I be qualified to help with that.

Estu · Jun 15, 2023

This is great! I implemented something similar to this a few years ago for Gen 1 in the Gold server, and transferred that system to the RBY server when Gold went down. I tried to get it adopted in the main Smogon server without success. I'm very happy that it looks like Smogon got around to implementing a similar system on their own - looking forward to playing it. My greatest irritation with my system was that, due to only being implemented on low-population servers, it never got much data. With this new system being on the main server, it will be exciting to see how the levels adjust over time.

One question - I've clearly been out of the loop, but I heard that the very worst pokemon, e.g. caterpie, were removed from randbats altogether (or at least from gen 1 randbats). With a level-adjusting system like this, I would think that it would be very doable to have them in the format and have them be reasonably balanced (e.g. if your worst mons are at level 99, mediocre mons could be level 30-40, and top mons could be level 10-20; then even a caterpie could perform well in such a world). Has there been any thought given to reintroducing them?

El químico del pueblucho · Jun 16, 2023

Ah Latios surely won't have a good winrate but ah is so strong...

Chains of Markov · Jun 16, 2023

Estu said:
One question - I've clearly been out of the loop, but I heard that the very worst pokemon, e.g. caterpie, were removed from randbats altogether (or at least from gen 1 randbats). With a level-adjusting system like this, I would think that it would be very doable to have them in the format and have them be reasonably balanced (e.g. if your worst mons are at level 99, mediocre mons could be level 30-40, and top mons could be level 10-20; then even a caterpie could perform well in such a world). Has there been any thought given to reintroducing them?

Yes, the shitmons (the four infamous bugs and Magikarp) were removed September of last year. It was cool to have every single pokemon in the format, but they were just too useless and hindered the surprisingly large competitive community for gen 1 rands.
Lowering the average levels to accommodate the worst ones is occasionally suggested—though usually in the context of gen 9 Luvdisc—but is thought to be not worth it. With a lower average level comes among other things less granular balancing, as a drop from level 60 to 59 is much smaller than 10 to 9. In the specific case of the shitmons, it also runs into another problem. They are not just weak, but also completely one-dimensional. The only thing a Magikarp can do is Tackle, and if you put the level difference such that Tackle does something it also becomes incredibly bulky. It would, frankly, be silly:

Magikarp Tackle vs. Lvl 30 Mewtwo: 35-42 (26.7 - 32%) -- guaranteed 4HKO
Lvl 30 Mewtwo Psychic vs. Magikarp: 30-36 (12.3 - 14.8%) -- possible 7HKO

Note that, because of Recover, Magikarp would still lose this 1v1; it'd just take a long time.
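The granularity point above (one level matters far more at level 10 than at level 60) can be illustrated with a short Python sketch. It uses the standard Gen 3+ formula for non-HP stats with natures ignored; Gen 1 uses a different formula, and the IV/EV values and the base stat of 100 are assumptions picked purely for illustration.

```python
def stat(base: int, level: int, iv: int = 31, ev: int = 84) -> int:
    """Non-HP stat from the Gen 3+ formula, ignoring nature."""
    return (2 * base + iv + ev // 4) * level // 100 + 5


def relative_drop(base: int, level: int) -> float:
    """Fraction of the stat lost when going down one level."""
    before = stat(base, level)
    return (before - stat(base, level - 1)) / before


for level in (60, 10):
    print(level, stat(100, level), f"{relative_drop(100, level):.1%}")
```

For a base 100 stat, going from level 60 to 59 costs about 2% of the stat, while going from level 10 to 9 costs 10%, so each balance step would be several times coarser.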

Chains of Markov · Jul 8, 2023

An update on things that changed this month!

The balancing has become less apprehensive in two ways. The first is that we've started using multiple months of data each month. If a mon does not get nerfed or buffed by looking at its stats for a given month, we now also consider the last three months of data combined. If that aggregate is enough to reach statistical significance and the mon hasn't been nerfed or buffed in that timeframe, then we use that combined data to make a change. This has a very limited impact on Gen 9, but for older gens with smaller sample sizes this allows for balancing of Pokemon that consistently under- or overperform but don't reach the certainty threshold in any given month.
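The multi-month fallback described above could look something like the following Python sketch. The function names, the data layout, and the stand-in `one_level_rule` are all hypothetical. In particular, the stand-in uses a normal approximation with only a 1-level outcome instead of the exact binomial test, just to keep the example short.

```python
from math import sqrt


def one_level_rule(wins: int, games: int) -> int:
    """Stand-in rule: +/-1 level if winrate is >1% off 50% at about p < 0.05."""
    if games == 0:
        return 0
    deviation = wins / games - 0.5
    z = (2 * wins - games) / sqrt(games)  # normal approximation, one-tailed
    if abs(deviation) > 0.01 and abs(z) > 1.645:
        return -1 if deviation > 0 else 1  # nerf winners, buff losers
    return 0


def decide(monthly, months_since_change, rule=one_level_rule, window=3):
    """monthly: list of (wins, games) tuples, most recent month last."""
    # First try the most recent month on its own.
    wins, games = monthly[-1]
    change = rule(wins, games)
    if change != 0:
        return change
    # Otherwise fall back to the combined window, but only if the mon was
    # not already buffed or nerfed within that timeframe.
    if months_since_change < window or len(monthly) < window:
        return 0
    recent = monthly[-window:]
    return rule(sum(w for w, _ in recent), sum(g for _, g in recent))
```

With three months of 520 wins in 1000 games each, no single month is significant under this stand-in rule, but the combined 1560 out of 3000 is, so `decide([(520, 1000)] * 3, months_since_change=3)` returns a 1-level nerf.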

The second is that we now maintain even looser standards for a 1-level change. It used to be that a Pokemon had to be either 1.5% winrate away from 50% with a p<0.05 certainty, or 1% with p<0.01. These have now been combined to allow for a 1-level nerf or buff if a mon has <49% or >51% with p<0.05.

Lowering the threshold below 1% was considered, as we thought that a 1-level change would generally only cause a ~0.5% difference in winrate. However, taking every 1-level change we've made in balancing so far, it turns out that a buff averages out to a 1.0% increase in winrate, and a nerf lowers winrate by 1.4% on average. Thus, the new standard maintains the 1% threshold, while being slightly less afraid of false positives. The thinking is that the occasional undeserved buff or nerf is not the end of the world.

Here's a pretty picture of the difference in winrates in the months before and after a 1-level balance change:

[IMAGE DIED BECAUSE DISCORD DOES NOT DO HOSTING ANYMORE]

As we can see, the winrates go nicely towards 50% while rarely going past it.

Finally, we now have winrate stats for gen2randombattle and gen8randombattle! Please don't read too much into them while they have very small sample sizes, but they can be interesting. We also have winrate data for gen9randomdoublesbattle. To see any of these, type /rwr <formatname> in any PS! chatroom.

Chains of Markov · Apr 2, 2024

Gen 1 no longer has winrate balancing!

Over time, a majority of the high-level Gen 1 Random Battle players have decided that they do not like the meta created by winrate balancing. For many of the Gen 1 Pokemon, making them equally strong leads to them being almost indistinguishable, as the limited movepools, lack of abilities and items, and inclusion of all NFEs make them inherently more similar. This can take away meaningful strategic decisions, where clicking Body Slam for ~30% damage is too often the dominant theme in a match. Separately, the small movepools lead to type matchups often being stricter in Gen 1 than in other gens. Many Normal types cannot hit Ghosts at all, and Electrics cannot hurt Ground. When balanced to a 50% winrate, this leads to these Pokemon being either very weak or very strong depending on matchup, and thus the balance never quite feeling right.

These reasons are not exhaustive of course, but are in my view the most important ones. Regardless, the levels will be reverted to a manual list that is close but not identical to what they were before winrate balancing started, made in consultation with the aforementioned top players. You can see the list here, where the first column is the Pokemon's name, then the old level, new level, and change.

We will keep tracking the winrates for Gen 1 (and you can still view them with /rwr 1), as it's valuable information and the only objective data we have access to, but for the foreseeable future, changes will be manual only.

This change presents a significant shakeup of the meta, and we hope you will all find it more enjoyable! Make sure to give the format a (new) chance!
