Grading anomalies 30/05/2014 4.2
Preface...

Chess grade 'stretching' is an ECF grading phenomenon in which, perhaps over a period of time, grade differences seem to become larger than the players' chess performances justify. A correction has been made once (in the July 2009 grading list) by increasing the grades of weaker players while leaving the grades of the top players virtually unchanged. It is believed that these anomalies were introduced by factors such as fast-improving juniors, estimated grades of new ungraded players, etc. Robert Jurjevic has found that the reason for the grade 'stretching' very likely lies in the ECF grade calculation method itself and that, if that is the case, a one-off correction will not solve the problem. Robert proposed three calculation methods which, according to him, would not 'stretch' the grades, with the so-called ÉGS6 as the best of the three. ÉGS6 stands for Élo Grading System six (there were a number of attempts and versions of new grade calculation methods, and each was clearly identified).

I think that I know why GS is stretching the grades! :-) :-)

The point...

My point in a nutshell is that the wrongly chosen 'k' factors in the current grading system cause the grade stretching, and that the amount of stretch due to the 'k' factors (in fact it is equal to '|p - q|') is larger than the grade fluctuations caused by other anomalies. Those other anomalies may be corrected by using FIDE's logistic relation for 'p = f(d)', by Glickman's idea of changing less trusted grades (based on frequency of play) faster than more trusted grades, or even by a solution to the "junior problem".

The main flaw (rule-wise) in the current grading system is that it applies the current grading rule to change the grades of both players in the game (it should apply the rule to change the grade of only one of the players, or use a different 'good' rule and apply it to both players).

The main flaw (formulae-wise) in the current grading system is that it uses 'ka = kb = 1' in its formulae (it should use 'ka = kb = 1/2', or variable 'ka' and 'kb' such that 'ka + kb = 1').

Factors 'k' and grade stretching...

The three relationships 'p = f(d)' (green, blue and red line) match pretty closely for '|d| <= 30' and all predict that for grade difference '|d|' of 30 grading points expected performance 'p' is approximately 80% (actually the relationships marked with green and blue line expect 80.0000% and with red line 78.8905%) (please see figure 1 below).

Figure 1
Figure 1: Relationship between expected performance 'p' and grade difference 'd' as defined in GS (green line), CGS, AGS and AGS2 (blue line) and ÉGS, ÉGS2, ÉGS3 and ÉGS4 (red line). Expected performance 'p' is a function of grade difference 'd', i.e., 'p = f(d)'.

Let us assume that two pools of players, both with an average grade of 100, play each other during the course of a season and that one of the pools scores 80%. Let us assume that each player plays only players of the other pool and that each player plays exactly 30 games (for simplicity we can assume that each pool has 30 players, each graded 100, and that each player from one pool plays each player from the other pool, totalling 900 games). Then it follows (from the relationships 'p = f(d)') that at the end of the season one of the pools should be regarded as stronger than the other by approximately 30 grading points (because it scored 80%, i.e., '50 + 30 = 80%' vs '50 - 30 = 20%').

(Note that it is unlikely that one of the pools would score so high in practice, though in order to please those who might be troubled with that, we could assume that, say, players of the well performing pool are all juniors who had been lucky enough to be coached by Garry Kasparov in the summer break before the start of the season.)

According to GS the new grades of the player pools in the above example are 130 and 70 (the pool grades drift apart by '130 - 70 = 60' grading points).

According to ÉGS the new grades of the player pools in the above example are 115 and 85 (the pool grades drift apart by '115 - 85 = 30' grading points).

As the grade drifts are '115 - 85 = 30' and '130 - 70 = 60', it is obvious that GS (the current grading system) stretched the grades by 30 grading points (the new grade difference calculated by GS is twice as big as it should have been)!
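The two-pool example can be reproduced with a few lines of Python. This is only a sketch: the function name 'new_grades' is my own illustration, and it simply applies the update formulae 'a2 = a + ka*(q - p)' and 'b2 = b + kb*((100 - q) - (100 - p))' given in the proof section further down, with 'p = 50' because both pools start at 100.

```python
def new_grades(a, b, q, p, ka, kb):
    """Season update for two pools (hypothetical helper, not an ECF routine).

    a, b   : old pool grades
    q, p   : actual and expected performance of pool A, in percent
    ka, kb : the 'k' factors of pool A and pool B
    """
    a2 = a + ka * (q - p)
    b2 = b + kb * ((100 - q) - (100 - p))
    return a2, b2


# Both pools start at 100, so every p = f(d) gives an expected performance of 50%.
a, b, p, q = 100.0, 100.0, 50.0, 80.0

gs = new_grades(a, b, q, p, ka=1.0, kb=1.0)    # current system: ka = kb = 1
egs = new_grades(a, b, q, p, ka=0.5, kb=0.5)   # EGS: ka = kb = 1/2

print("GS :", gs, "gap", gs[0] - gs[1])     # GS : (130.0, 70.0) gap 60.0
print("EGS:", egs, "gap", egs[0] - egs[1])  # EGS: (115.0, 85.0) gap 30.0
```

The gap of 30 matches the 30 grading points implied by an 80% score, while the gap of 60 is the stretched one.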

(You see how ÉGS is fair: it did not assign grades of 130 and 100, because it did not assume that the better pool improved and the other stayed as it was. Instead it guessed that the result was due to both one pool improving and the other worsening, though if Kasparov really did coach the juniors in the better pool, grades of 130 and 100 would have been a better guess. The GS grades of 130 and 70 make no sense at all: if the better pool is given 130, the other pool should be given 100, not 70. Giving 70 to the other pool is as if the better pool had scored approximately 94%, according to Élo's logistic 'p = f(d)'.)

(Note that ÉGS2 would assign grades of 130 and 100 if all of the players in the better pool were ungraded and if all of the players in the other pool were graded, that is because ÉGS2 changes less trusted grades more rapidly than more trusted grades, and in this extreme case the grades of graded players remain unaffected by games played against ungraded players. Well, it would be nice if we could take into account if, say, Kasparov was coaching a player, but...)

In my opinion, the main reason for the grade stretching is factor 'k', which is twice as big in GS as in ÉGS and ÉGS2 (please note that in the above example we eliminated the differences in 'p = f(d)', so 'p = f(d)' couldn't be the cause of the stretching).

Statement 1: If, for a grading system (of those mentioned so far), 'ka + kb = 1' holds in its formulae, the system neither stretches nor shrinks the grades; if 'ka + kb > 1' the system stretches the grades, and if 'ka + kb < 1' the system shrinks the grades.

Of the systems mentioned so far, GS, CGS, ÉGS3 and ÉGS4 stretch the grades, and AGS, AGS2, ÉGS and ÉGS2 neither stretch nor shrink the grades (none of the systems shrinks the grades).
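Statement 1 can be illustrated numerically. The sketch below uses a hypothetical helper 'drift' (my own name) that evaluates '(a2 - a) + (b - b2)' for the update formulae used throughout this page. The sample 'k' values are arbitrary and chosen only to cover the three cases.

```python
def drift(ka, kb, q, p):
    """(a2 - a) + (b - b2): how far the two grades move apart in one update."""
    da = ka * (q - p)                      # a2 - a
    db = kb * ((100 - q) - (100 - p))      # b2 - b
    return da - db


q, p = 80.0, 50.0   # actual and expected performance of player A, in percent

cases = [
    ("ka + kb = 2   (GS)", 1.0, 1.0),
    ("ka + kb = 1", 0.5, 0.5),
    ("ka + kb = 1", 0.75, 0.25),
    ("ka + kb = 1/2", 0.25, 0.25),
]
for label, ka, kb in cases:
    print(f"{label:20s} drift = {drift(ka, kb, q, p):5.1f}   q - p = {q - p:5.1f}")
```

Only the sum of the two factors matters for the drift: both 'ka = kb = 1/2' and 'ka = 3/4, kb = 1/4' give a drift equal to 'q - p'.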

"Equal grade for equal performance"...

Systems which neither stretch nor shrink the grades do not obey the rule known as "equal grade for equal performance". Say you have a 130 player who scores 50% against a pool of 160 players: the "equal grade for equal performance" rule requires that the 130 player becomes a 160 player, whereas according to the systems which neither stretch nor shrink the grades the 130 player becomes approximately a 145 player.

So it would seem that one could opt either for a system which obeys "equal grade for equal performance" rule and stretches the grades or a system which does not obey "equal grade for equal performance" rule and neither stretches nor shrinks the grades.

Let us assume that in the above "equal grade for equal performance" example the 130 player played 300 games during the course of a season and that each player in the pool (there are 10 players in the pool) played 30 games against the 130 player. Then, taking into account only the 300 games the pool players played against the 130 player, it follows (from the relationships 'p = f(d)') that at the end of the season the 130 player should be regarded as approximately equally strong as the pool of players he played.

The "equal grade for equal performance" rule requires that the new grade of the 130 player is 160 (assuming that the pool grade stays at approximately 160).

Taking into account only the 300 games the pool players played against the 130 player, a system which neither stretches nor shrinks the grades requires that the new grade of both the 130 player and the pool is approximately 145.

If all the pool players really performed at their level of 160 and the 130 player did improve, the 130 player should become a 160 player and the pool players should remain at 160. The problem with the current grading system is that, even though it assigns 160 to the 130 player, it penalizes the pool players for the games they played against the 130 player, which causes the grade stretching.

A system which does not stretch the grades has two options. If it assigns the 130 player a grade of 160, then when calculating the grades of the pool players it should ignore the games they played against the 130 player (as it is already assumed that they performed at the 160 level, those games should have no effect on their grades). Alternatively, it can assume that both the pool players worsened and the 130 player improved (splitting it 50/50), assigning the 130 player a grade of 145 and penalizing the pool players for their games against the 130 player (i.e., if the pool players played only the games against the 130 player, the pool grade would be lowered to 145), which does not stretch the grades.

The above argument is enough for me to claim that "equal grade for equal performance" rule is unsound and should be abandoned in favour of a system which neither stretches nor shrinks the grades.
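The 130-versus-160 example of this section can be checked with a short sketch. It assumes the ECF linear relationship 'p = 50 + d' (valid here because '|d| = 30' is inside the 40 point rule) and counts only the games between the 130 player and the pool, as in the text. The helper name 'update' is my own.

```python
def update(grade, k, q, p):
    """New grade from old grade, 'k' factor, actual (q) and expected (p) performance."""
    return grade + k * (q - p)


player, pool = 130.0, 160.0
d = player - pool               # -30, inside the 40 point rule
p_player = 50.0 + d             # ECF linear p = 50 + d, i.e. 20%
p_pool = 100.0 - p_player       # 80%
q_player = 50.0                 # the 130 player scores 50%
q_pool = 100.0 - q_player

# k = 1 for both sides (GS): the player gets 160, the pool is pushed down to 130.
gs_player = update(player, 1.0, q_player, p_player)
gs_pool = update(pool, 1.0, q_pool, p_pool)

# k = 1/2 for both sides: player and pool meet at 145.
half_player = update(player, 0.5, q_player, p_player)
half_pool = update(pool, 0.5, q_pool, p_pool)

print("k = 1  :", gs_player, gs_pool)       # 160.0 130.0
print("k = 1/2:", half_player, half_pool)   # 145.0 145.0
```

With 'k = 1' on both sides the two grades swap places instead of meeting, although a 50% score suggests equal strength. In a real season the pool players also play other games, so the effect on them is diluted, but it does not disappear.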

Factors 'k' and total system grade...

Statement 2: If, for a grading system (of those mentioned so far), 'ka = kb' holds in its formulae, the system preserves the total system grade, and if 'ka ≠ kb' the system does not preserve the total system grade.

Of the systems mentioned so far, GS, CGS, AGS, ÉGS and ÉGS4 preserve the total system grade, and AGS2, ÉGS2 and ÉGS3 do not preserve the total system grade.
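Statement 2 follows from adding the two update formulae: 'a2 + b2 = a + b + (ka - kb)*(q - p)'. A minimal numeric sketch is given below. The helper names are my own and the sample grades are taken from the 130-versus-160 example.

```python
def update_pair(a, b, q, p, ka, kb):
    """Apply a2 = a + ka*(q - p) and b2 = b + kb*((100 - q) - (100 - p))."""
    a2 = a + ka * (q - p)
    b2 = b + kb * ((100 - q) - (100 - p))
    return a2, b2


def total_change(a, b, q, p, ka, kb):
    """Change in the sum of the two grades caused by one update."""
    a2, b2 = update_pair(a, b, q, p, ka, kb)
    return (a2 + b2) - (a + b)


a, b, q, p = 130.0, 160.0, 50.0, 20.0
for ka, kb in [(1.0, 1.0), (0.5, 0.5), (0.75, 0.25)]:
    change = total_change(a, b, q, p, ka, kb)
    print(f"ka = {ka}, kb = {kb}: total changes by {change}, (ka - kb)*(q - p) = {(ka - kb) * (q - p)}")
```

Equal factors leave the total untouched whatever their sum is, so GS preserves the total while still stretching the grades. Unequal factors with 'ka + kb = 1' do not stretch, but move the total.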

In a nutshell...

In all the mentioned grading systems there are two factors, 'ka' (factor 'k' for player A) and 'kb' (factor 'k' for player B), that are used in the formulae which correct the grades based on the game results (i.e., based on the difference between actual and expected performance). In GS (Grading System, the current grading system) both 'k' factors are equal to 1, in ÉGS (Élo Grading System) both 'k' factors are equal to 1/2, and in ÉGS2 (Élo Grading System two) each of the 'k' factors can be between 0 and 1 inclusive (the less active the player is relative to the other player, the closer 'k' is to 1) but their sum is always 1.

I have found that a necessary and sufficient condition for a grading system (using these formulae) neither to stretch nor to shrink the grades is that the sum of the two factors 'k' is 1 (if the sum is greater than 1 the system stretches the grades, and if the sum is less than 1 the system shrinks the grades). As in GS both 'k' factors are equal to 1, their sum is 2 and GS stretches the grades. As in both ÉGS and ÉGS2 the sum of the 'k' factors is 1, they neither stretch nor shrink the grades.

Mathematical proof...

Let 'a' and 'b' be the grades of players 'A' and 'B', 'p' the expected performance of player 'A' (the expected performance of player 'B' is then '100 - p'), 'q' the actual performance of player 'A' (the actual performance of player 'B' is then '100 - q'), and 'a2' and 'b2' the new grades of players 'A' and 'B'.

'a2' and 'b2' are calculated using the following formulae (which hold for any grading system mentioned here, including the current one):

a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));

Then we examine the term '(a2 - a) + (b - b2)' (which is a measure of how much the grades drift apart due to the difference between actual and expected performance, 'q - p').

The mathematical requirement for a grading system (using the mentioned formulae) neither to stretch nor to shrink the grades is that '(a2 - a) + (b - b2)' is equal to 'q - p' for any 'a', 'b', 'p' and 'q'.

As

(* grade stretching GS *)
ClearAll[a, b, a2, b2, d, g, s, ka, kb, p, q];
g = 50; s = 40;
d = a - b;
ka = 1; kb = 1;
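(* ECF linear p = 50 + d, capped at 90 and 10 by the 40 point rule; note that with
   symbolic 'a' and 'b' this If stays unevaluated, so 'p' remains a free symbol below *)
If[d >= 0, If[d > s, p = 90, p = g*(1 + d/g)],
    If[d < -s, p = 10, p = g*(1 + d/g)]];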
a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));
Simplify[(a2 - a) + (b - b2) == q - p]

gives

q == p

i.e., '(a2 - a) + (b - b2)' is equal to 'q - p' only if 'p = q', so GS either stretches or shrinks the grades (it can be shown that GS stretches the grades).

As

(* grade stretching EGS2 *)
ClearAll[a, b, a2, b2, d, g, ka, kb, p, q, na, nb];
d = a - b;
g = (25*Log[10])/Log[3];
ka = If[na + nb > 0, nb/(na + nb), 1/2];
kb = If[na + nb > 0, na/(na + nb), 1/2];
p = 100/(1 + 10^(-d/g));
a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));
Simplify[(a2 - a) + (b - b2) == q - p, na + nb > 0]
Print[];
Simplify[(a2 - a) + (b - b2) == q - p, na + nb == 0]

gives

True
True

i.e., '(a2 - a) + (b - b2)' is equal to 'q - p' for any 'a', 'b', 'p' and 'q' in both cases, 'na + nb > 0' and 'na + nb = 0', so ÉGS2 neither stretches nor shrinks the grades.

Or in general, as

(* grade stretching *)
ClearAll[a, b, a2, b2, ka, kb, p, q];
a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));
Simplify[(a2 - a) + (b - b2) == q - p]
Print[];
Simplify[(a2 - a) + (b - b2) == q - p, ka + kb == 1]

gives

0 == (-1 + ka + kb)*(p - q)
True

i.e., '(a2 - a) + (b - b2)' is equal to 'q - p' for any 'a', 'b', 'p' and 'q' if and only if 'ka + kb = 1', so a grading system neither stretches nor shrinks the grades exactly when 'ka + kb = 1'.

AGS3...

The closest system to GS which does not stretch (nor shrink) the grades (due to 'k' factors) is AGS3.

Rule 2b: For a win you score average grade plus 25; for a draw, average grade; and for a loss, average grade minus 25. Average grade is half of the sum of your and your opponent's grade. Note that, if your opponent's grade differs from yours by more than 40 points, it is taken to be exactly 40 points above (or below) yours. At the end of the season an average of points-per-game is taken, and that is your new grade.


ÉGS5 and ÉGS6...

ÉGS and ÉGS2 use 'g = (25*Log[10])/Log[3] = 52.3975...' in 'p = 100/(1 + 10^(-d/g))'. It can be shown that in order to match FIDE's choice of the constant in their logistic curve it should be 'g = 50' (Mr David Welch has found that 'g = (25*Log[10])/Log[3] = 52.3975...' does not match FIDE's constant).

Therefore we introduce two new systems, ÉGS5 and ÉGS6: ÉGS5 is ÉGS with 'g = 50' and ÉGS6 is ÉGS2 with 'g = 50'.
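The two constants can be compared with a short sketch. It assumes only the logistic formula 'p = 100/(1 + 10^(-d/g))' quoted above. The function name 'logistic' is my own.

```python
import math


def logistic(d, g):
    """Expected performance in percent for grade difference d and constant g."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


g_old = 25.0 * math.log(10) / math.log(3)   # 52.3975..., used by EGS and EGS2
g_new = 50.0                                # used by EGS5 and EGS6

# g_old is the constant for which 25 grading points are worth exactly 75%.
print(round(logistic(25, g_old), 6))   # 75.0

# g_new is the constant for which 50 grading points mean ten-to-one odds.
print(round(logistic(50, g_new), 4))   # 90.9091

# At the 30 point difference used in the two-pool example the curves differ by about 1%.
print(round(logistic(30, g_old), 4), round(logistic(30, g_new), 4))
```

So 'g = (25*Log[10])/Log[3]' pins the curve to 75% at 'd = 25', while 'g = 50' pins it to odds of 10:1 at 'd = 50'.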

Figure 2
Figure 2: Relationship between expected performance 'p' and grade difference 'd' as defined in GS (green line), CGS, AGS and AGS2 (blue line), ÉGS, ÉGS2, ÉGS3 and ÉGS4 (red line), ÉGS5 and ÉGS6 (yellow line), and the normal relationship 'p = 100*(1 + Erf[d/50])/2' as originally defined by Élo (brown line above yellow), where the error function Erf[z] is the integral of the Gaussian distribution. Expected performance 'p' is a function of grade difference 'd', i.e., 'p = f(d)'. Note that both FIDE and USCF switched from the normal (brown line) to the logistic (yellow line) relationship 'p = f(d)', which they found provides a better fit for the actual results achieved.

It can be shown that 'g = (25*Log[10])/Log[3] = 52.3975...' minimizes the difference between logistic and linear 'p = f(d)' approximately in the interval '0 <= d <= 34' and 'g = 50' approximately in the interval '0 <= d <= 41'.

-------------------------------------------
                  p = f(d)           
    d   green   blue    red yellow  brown
-------------------------------------------
    0    50.0   50.0   50.0   50.0   50.0
    1    51.0   51.0   51.1   51.2   51.1
    2    52.0   52.0   52.2   52.3   52.3
    3    53.0   53.0   53.3   53.4   53.4
    4    54.0   54.0   54.4   54.6   54.5
    5    55.0   55.0   55.5   55.7   55.6
    6    56.0   56.0   56.6   56.9   56.7
    7    57.0   57.0   57.6   58.0   57.8
    8    58.0   58.0   58.7   59.1   59.0
    9    59.0   59.0   59.8   60.2   60.0
   10    60.0   60.0   60.8   61.3   61.1
   11    61.0   61.0   61.9   62.4   62.2
   12    62.0   62.0   62.9   63.5   63.3
   13    63.0   63.0   63.9   64.5   64.3
   14    64.0   64.0   64.9   65.6   65.4
   15    65.0   65.0   65.9   66.6   66.4
   16    66.0   66.0   66.9   67.6   67.5
   17    67.0   67.0   67.9   68.6   68.5
   18    68.0   68.0   68.8   69.6   69.5
   19    69.0   69.0   69.7   70.6   70.5
   20    70.0   70.0   70.7   71.5   71.4
   21    71.0   71.0   71.6   72.5   72.4
   22    72.0   72.0   72.4   73.4   73.3
   23    73.0   73.0   73.3   74.3   74.2
   24    74.0   74.0   74.2   75.1   75.1
   25    75.0   75.0   75.0   76.0   76.0
   26    76.0   76.0   75.8   76.8   76.9
   27    77.0   77.0   76.6   77.6   77.7
   28    78.0   78.0   77.4   78.4   78.6
   29    79.0   79.0   78.1   79.2   79.4
   30    80.0   80.0   78.9   79.9   80.2
   31    81.0   81.0   79.6   80.7   81.0
   32    82.0   82.0   80.3   81.4   81.7
   33    83.0   83.0   81.0   82.0   82.5
   34    84.0   84.0   81.7   82.7   83.2
   35    85.0   85.0   82.3   83.4   83.9
   36    86.0   86.0   82.9   84.0   84.6
   37    87.0   87.0   83.6   84.6   85.2
   38    88.0   88.0   84.2   85.2   85.9
   39    89.0   89.0   84.7   85.8   86.5
   40    90.0   90.0   85.3   86.3   87.1
   41    90.0   91.0   85.8   86.9   87.7
   42    90.0   92.0   86.4   87.4   88.3
   43    90.0   93.0   86.9   87.9   88.8
   44    90.0   94.0   87.4   88.4   89.3
   45    90.0   95.0   87.8   88.8   89.8
   46    90.0   96.0   88.3   89.3   90.3
   47    90.0   97.0   88.7   89.7   90.8
   48    90.0   98.0   89.2   90.1   91.3
   49    90.0   99.0   89.6   90.5   91.7
   50    90.0  100.0   90.0   90.9   92.1
   51    90.0  100.0   90.4   91.3   92.5
   52    90.0  100.0   90.8   91.6   92.9
   53    90.0  100.0   91.1   92.0   93.3
   54    90.0  100.0   91.5   92.3   93.7
   55    90.0  100.0   91.8   92.6   94.0
   56    90.0  100.0   92.1   92.9   94.3
   57    90.0  100.0   92.4   93.2   94.7
   58    90.0  100.0   92.7   93.5   95.0
   59    90.0  100.0   93.0   93.8   95.2
   60    90.0  100.0   93.3   94.1   95.5
   61    90.0  100.0   93.6   94.3   95.8
   62    90.0  100.0   93.8   94.6   96.0
   63    90.0  100.0   94.1   94.8   96.3
   64    90.0  100.0   94.3   95.0   96.5
   65    90.0  100.0   94.6   95.2   96.7
   66    90.0  100.0   94.8   95.4   96.9
   67    90.0  100.0   95.0   95.6   97.1
   68    90.0  100.0   95.2   95.8   97.3
   69    90.0  100.0   95.4   96.0   97.5
   70    90.0  100.0   95.6   96.2   97.6
   71    90.0  100.0   95.8   96.3   97.8
   72    90.0  100.0   95.9   96.5   97.9
   73    90.0  100.0   96.1   96.6   98.1
   74    90.0  100.0   96.3   96.8   98.2
   75    90.0  100.0   96.4   96.9   98.3
   76    90.0  100.0   96.6   97.1   98.4
   77    90.0  100.0   96.7   97.2   98.5
   78    90.0  100.0   96.9   97.3   98.6
   79    90.0  100.0   97.0   97.4   98.7
   80    90.0  100.0   97.1   97.5   98.8
   81    90.0  100.0   97.2   97.7   98.9
   82    90.0  100.0   97.3   97.8   99.0
   83    90.0  100.0   97.5   97.9   99.1
   84    90.0  100.0   97.6   98.0   99.1
   85    90.0  100.0   97.7   98.0   99.2
   86    90.0  100.0   97.8   98.1   99.3
   87    90.0  100.0   97.9   98.2   99.3
   88    90.0  100.0   98.0   98.3   99.4
   89    90.0  100.0   98.0   98.4   99.4
   90    90.0  100.0   98.1   98.4   99.5
   91    90.0  100.0   98.2   98.5   99.5
   92    90.0  100.0   98.3   98.6   99.5
   93    90.0  100.0   98.3   98.6   99.6
   94    90.0  100.0   98.4   98.7   99.6
   95    90.0  100.0   98.5   98.8   99.6
   96    90.0  100.0   98.5   98.8   99.7
   97    90.0  100.0   98.6   98.9   99.7
   98    90.0  100.0   98.7   98.9   99.7
   99    90.0  100.0   98.7   99.0   99.7
  100    90.0  100.0   98.8   99.0   99.8
  101    90.0  100.0   98.8   99.1   99.8
  102    90.0  100.0   98.9   99.1   99.8
  103    90.0  100.0   98.9   99.1   99.8
  104    90.0  100.0   99.0   99.2   99.8
  105    90.0  100.0   99.0   99.2   99.9
  106    90.0  100.0   99.1   99.2   99.9
  107    90.0  100.0   99.1   99.3   99.9
  108    90.0  100.0   99.1   99.3   99.9
  109    90.0  100.0   99.2   99.3   99.9
  110    90.0  100.0   99.2   99.4   99.9
  111    90.0  100.0   99.2   99.4   99.9
  112    90.0  100.0   99.3   99.4   99.9
  113    90.0  100.0   99.3   99.5   99.9
  114    90.0  100.0   99.3   99.5   99.9
  115    90.0  100.0   99.4   99.5   99.9
  116    90.0  100.0   99.4   99.5   99.9
  117    90.0  100.0   99.4   99.5  100.0
  118    90.0  100.0   99.4   99.6  100.0
  119    90.0  100.0   99.5   99.6  100.0
  120    90.0  100.0   99.5   99.6  100.0
-------------------------------------------

Table 1: Relationship between expected performance 'p' and grade difference 'd' as defined in GS (green line), CGS, AGS and AGS2 (blue line), ÉGS, ÉGS2, ÉGS3 and ÉGS4 (red line), ÉGS5 and ÉGS6 (yellow line), and the normal relationship 'p = 100*(1 + Erf[d/50])/2' as originally defined by Élo (brown line above yellow), where the error function Erf[z] is the integral of the Gaussian distribution. Expected performance 'p' is a function of grade difference 'd', i.e., 'p = f(d)'. Note that both FIDE and USCF switched from the normal (brown line) to the logistic (yellow line) relationship 'p = f(d)', which they found provides a better fit for the actual results achieved.
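Rows of Table 1 can be regenerated with the sketch below. It assumes the definitions given in the captions: ECF linear 'p = 50 + d' with the 40 point rule (green) and the 50 point rule (blue), the logistic curve with 'g = 52.3975...' (red) and 'g = 50' (yellow), and the normal curve 'p = 100*(1 + Erf[d/50])/2' (brown). All function names are my own.

```python
import math

G_RED = 25.0 * math.log(10) / math.log(3)   # 52.3975...


def linear(d, cap):
    """ECF linear p = 50 + d, with the grade difference capped at +/- cap."""
    d = max(-cap, min(cap, d))
    return 50.0 + d


def logistic(d, g):
    """Logistic p = 100/(1 + 10^(-d/g))."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


def normal(d, g=50.0):
    """Normal p = 100*(1 + erf(d/g))/2."""
    return 100.0 * (1.0 + math.erf(d / g)) / 2.0


def table_row(d):
    """Columns in table order: green, blue, red, yellow, brown."""
    return (linear(d, 40), linear(d, 50), logistic(d, G_RED),
            logistic(d, 50.0), normal(d))


print("    d  green   blue    red yellow  brown")
for d in (0, 10, 25, 30, 50):
    print(f"{d:5d}" + "".join(f"{value:7.1f}" for value in table_row(d)))
```

The printed rows should agree with the corresponding rows of the table to one decimal place.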

Which 'p = f(d)'...

It is impossible to measure chess abilities independently of chess performances (there is no device one can put on the heads of chess players to get a measure of their chess abilities). If that were possible, one would be able to plot 'p' against 'd' and find the best fit for 'p = f(d)'. Nevertheless, assuming that for small differences in chess abilities (say '|d|<=30') the relationship between chess performance and difference in chess abilities is linear, one can assume that grades for '|d|<=30' are in fact chess abilities. Then, taking into account game records where '|d|>30', one can plot 'p' against 'd' (black dots in figure 3 below) and find that 'p = f(d)' for '|d|>30' follows one of the sigmoid curves (yellow, brown and red lines in figure 3) more closely than the linear approximations (green and blue lines in figure 3).

Figure 3
Figure 3: Mr Welch's finding. The '(|d|>30, q)' discrete experimental points match one of the sigmoid curves (yellow, brown and red lines) better than the linear approximations (green and blue lines). Note that both FIDE and USCF switched from the normal (brown line above red) to the logistic (yellow line) relationship 'p = f(d)', which they found provides a better fit for the actual results achieved. Please note that the discrete points shown are for illustration purposes only: they are not a result of an actual analysis of the experimental data, and are shown to best fit the yellow line (blue line: ECF linear with 50 point rule; green line: ECF linear with 40 point rule; brown line: Élo's normal, 'p = 100*(1 + Erf[d/g])/2', 'g = 50', where the error function Erf[z] is the integral of the Gaussian distribution; red line: Élo's logistic with 'g = 52.3975...', 'p = 100/(1 + 10^(-d/g))', 'g = (25*Log[10])/Log[3] = 52.3975...'; yellow line: Élo's logistic with 'g = 50', 'p = 100/(1 + 10^(-d/g))', 'g = 50').

Variable factor 'k'...

Figure 4
Figure 4: Factor 'ka' (used in AGS2, ÉGS2, ÉGS3 and ÉGS6) as a function of 'na' and 'nb' ('na' and 'nb' are the numbers of games players A and B played in the last season). Factor 'ka' is used in the formulae which correct player A's grade based on the difference between actual and expected performance against player B ('a2 = a + ka*(q - p)'). The idea is to make less trusted or established grades (based on frequency of play) change more rapidly. Note that if player A's opponent is ungraded (i.e., 'nb = 0') and player A is not ungraded ('na > 0', i.e., 'na >= 1') then 'ka = 0' and consequently player A's grade is not affected by the games he or she played against player B (i.e., 'a2' remains unchanged, 'a2 = a + ka*(q - p) = a + 0*(q - p) = a'). For systems which neither stretch nor shrink the grades, 'ka + kb = 1' always holds.
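The variable 'k' factors can be written as a small function. The sketch below mirrors the definitions of 'ka' and 'kb' in the ÉGS2 Mathematica snippet of the proof section ('ka = nb/(na + nb)', 'kb = na/(na + nb)', both 1/2 when 'na + nb = 0'). The Python function name 'k_factors' is my own.

```python
def k_factors(na, nb):
    """Return (ka, kb) from the numbers of games na and nb played last season.

    The less active player gets the larger factor, and ka + kb is always 1.
    """
    if na + nb > 0:
        return nb / (na + nb), na / (na + nb)
    return 0.5, 0.5     # both players ungraded: split the correction evenly


print(k_factors(30, 0))    # (0.0, 1.0): a graded player is unaffected by an ungraded opponent
print(k_factors(10, 30))   # (0.75, 0.25): the less active player A changes faster
print(k_factors(0, 0))     # (0.5, 0.5)
```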

ECF grade vs FIDE rating scale...

Élo (originally) suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of approximately 0.75.

(In order to keep present ECF grade scale) one should suggest scaling grades so that a difference of 25 (not 200) grading points in chess would mean that the stronger player has an expected score of approximately 0.75 (i.e. 75%).

To me the ECF grade scale makes more sense than the FIDE rating scale, as for "small" grade differences (approximately '|d| <= 30') the ECF grade difference is approximately half of the expected performance difference (in percent). Say, if the grade difference between two players is 10 grading points the stronger player is expected to score approximately 60%, i.e., '50 + 10 = 60%' vs '50 - 10 = 40%', so '10' is approximately half of the expected performance difference (in percent). In the case of FIDE ratings a player would have to be approximately 80 rating points stronger in order to score 60%, and '80' is approximately half of the expected performance difference (in percent) multiplied by 8 (why by 8, I do not know).

For "larger" grade differences (approximately '|d| > 30') the ECF grade difference is (or should be) larger than half of the expected performance difference (in percent). Say the grade difference between two players is 140 grading points: according to ÉGS6, the stronger player is expected to score approximately 99.84%, i.e., '50 + 49.84 = 99.84%' vs '50 - 49.84 = 0.16%', so the expected performance difference is approximately 99.68 and '140' is larger than half of it (49.84).
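The arithmetic behind these two examples can be sketched as follows. Two things here are assumptions of mine rather than statements from the text above: that the FIDE logistic curve uses a constant of 400 rating points, and that one ECF grading point corresponds to 8 FIDE rating points (the factor 8 mentioned above), which is what makes 'g = 50' the matching ECF constant.

```python
def logistic(d, g):
    """Expected performance in percent for difference d and constant g."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


# Small difference: 10 grading points, or 8 * 10 = 80 rating points on the wider scale.
ecf_small = logistic(10, 50.0)
fide_small = logistic(80, 400.0)      # assumed FIDE constant of 400

# Large difference: 140 grading points under the g = 50 logistic (EGS5 and EGS6).
p_strong = logistic(140, 50.0)
p_weak = 100.0 - p_strong

print(round(ecf_small, 2), round(fide_small, 2))    # 61.31 61.31
print(round(p_strong, 2), round(p_weak, 2))         # 99.84 0.16
print(round(p_strong - p_weak, 2))                  # 99.68
```

Half of the expected performance difference at 'd = 140' is about 49.84, far below 140, whereas at 'd = 10' it is about 11, close to 10.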

Rule approach...

Rule 1a: For a win you score your opponent's grade plus 50; for a draw, your opponent's grade; and for a loss, your opponent's grade minus 50. Note that, if your opponent's grade differs from yours by more than 40 points, it is taken to be exactly 40 points above (or below) yours. At the end of the season an average of points-per-game is taken, and that is your new grade.

Rule 2b: For a win you score average grade plus 25; for a draw, average grade; and for a loss, average grade minus 25. Average grade is half of the sum of your and your opponent's grade. Note that, if your opponent's grade differs from yours by more than 40 points, it is taken to be exactly 40 points above (or below) yours. At the end of the season an average of points-per-game is taken, and that is your new grade.
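Both rules can be sketched in a few lines of Python and checked against the examples used earlier on this page. The function names and the encoding of a result as +1 (win), 0 (draw) and -1 (loss) are my own choices. The sketch is a literal reading of the two rules, not the ECF's actual grading program.

```python
def capped_opponent(own, opp, cap=40.0):
    """Opponent's grade, taken to be at most 'cap' points above or below one's own."""
    return max(own - cap, min(own + cap, opp))


def rule_1a(own, opp, result):
    """Points for one game: opponent's grade plus 50, plus 0, or minus 50."""
    return capped_opponent(own, opp) + 50.0 * result


def rule_2b(own, opp, result):
    """Points for one game: average grade plus 25, plus 0, or minus 25."""
    average = (own + capped_opponent(own, opp)) / 2.0
    return average + 25.0 * result


def season_grade(own, games, rule):
    """New grade: the average of points-per-game over (opponent_grade, result) pairs."""
    points = [rule(own, opp, result) for opp, result in games]
    return sum(points) / len(points)


# A 130 player draws every game against 160 opponents (a 50% score).
draws = [(160.0, 0)] * 10
print(season_grade(130.0, draws, rule_1a))   # 160.0
print(season_grade(130.0, draws, rule_2b))   # 145.0

# A 100 player scores 80% (8 wins, 2 losses) against 100 opponents.
mixed = [(100.0, 1)] * 8 + [(100.0, -1)] * 2
print(season_grade(100.0, mixed, rule_1a))   # 130.0
print(season_grade(100.0, mixed, rule_2b))   # 115.0
```

Rule 1a reproduces the 'k = 1' results (160 and 130) and rule 2b the 'k = 1/2' results (145 and 115).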

Rule 1a...

In order neither to stretch nor to shrink (drift) the grades one should apply rule 1a for only one of the players in the game. Say players A and B have played a game: when grading it, if one applies rule 1a to correct the grade of player A, one should not apply it to correct the grade of player B (one should omit grading of player B in that game), and vice versa. The reason is that rule 1a sets the maximum possible correction for a player's grade, and if one applied it to correct the grades of both players the grades would drift apart (as one would apply too much correction).

There is nothing intrinsically wrong in grading a game in a way to change one player's grade for a maximum amount and leave other player's grade unchanged. The problem is though that the number of games in which such a grade correction distribution applies is rather small, in most games either it does not apply or one does not know if it does apply.

Let us assume that there are two player pools, pool 1 with an average grade of 130 and pool 2 with an average grade of 160. Let us assume that player A, graded 130, is a member of pool 1 and that player B, graded 160, is a member of pool 2. Let us assume that player A played all players in pool 2 and that player B played all players in pool 1, and let us assume that the game between players A and B ended in a draw. We are facing the problem of correcting the grades of players A and B for the game they drew.

One can argue: as player A played a pool of players with an average grade of 160, I will increase the grade of player A by the maximum possible amount and leave the grade of player B unchanged. This is equivalent to saying that the players drew because player A improved (say he was lucky enough to be coached by Garry Kasparov in the summer break) and player B neither improved nor worsened. Fine, one applies rule 1a for the game, grading only player A.

But one can also argue: as player B played a pool of players with an average grade of 130, I will decrease the grade of player B by the maximum possible amount and leave the grade of player A unchanged. This is equivalent to saying that the players drew because player B worsened (say he had fallen in love in the summer break and all he thinks about is his girlfriend) and player A neither improved nor worsened. One applies rule 1a for the game, grading only player B.

Which argument is correct? It could be one or the other, or neither; there are infinitely many possibilities. One could try to estimate whether one of the players improved or worsened and by how much, but in order to make such an assessment one would need a thorough analysis of the players' lives, maybe of the game itself, etc. So the best guess is to change both players' grades by half of the maximum amount: one increases the grade of player A and decreases the grade of player B by half of the maximum amount.

Every player eventually plays a pool of players with some average grade, and this pool should have no bearing on the decision of how to distribute grade corrections when grading individual games.

The main flaw in GS is that it applies rule 1a for changing the grades of both players in the game (should apply it to change the grade of one of the players only). This causes grade stretching (or grade drifting).

Rule 2b...

In order neither to stretch nor to shrink (drift) the grades one should apply rule 2b for both players in the game. The reason is that rule 2b sets half of the maximum possible correction for a player's grade, and if one applied it to correct the grade of only one of the players the grades would drift towards each other (as one would apply too little correction).

So one has no problem in deciding how to distribute the grade correction in each individual game: by applying rule 2b, one increases the grade of one player and decreases the grade of the other player by half of the maximum amount.

Replacing GS...

In my opinion three candidates for replacing GS are AGS3, ÉGS5 and ÉGS6, with ÉGS6 as the best and ÉGS5 as the second best.

AGS3 is GS with the 'k' factors equal to '1/2' (the ECF's linear approximation is used for 'p = f(d)'; does not stretch the grades).

ÉGS5 is a sort of ECF equivalent of FIDE's Élo (a logistic curve is used for 'p = f(d)', which is regarded as more accurate than the ECF's linear approximation; the grades are ECF grades, not FIDE ratings, i.e., a strong grandmaster is about 270, not 2800; grading is done every season rather than after every tournament; does not stretch the grades).

ÉGS6 has a similar improvement (taken in a simple form) over ÉGS5 as Glicko has over FIDE's Élo: it accounts for grade trust (or establishment) based on frequency of play (i.e., less trusted or established grades change more rapidly than more trusted or established grades, and consequently, in the extreme case, ungraded players do not affect the grades of graded players; uses a logistic curve for 'p = f(d)'; does not stretch the grades).

--------------------------------------------------------------
grading   stretches  uses FIDE   changes less     preserves
system    grades     'p = f(d)'  trusted grades   total system
          ('k')      (yellow)    more rapidly     grade
--------------------------------------------------------------
GS        yes        no          no               yes
AGS3      no         no          no               yes
ÉGS5      no         yes         no               yes
ÉGS6      no         yes         yes              no
--------------------------------------------------------------

Table 2: Main differences between GS (current Grading System), AGS3 (Amended Grading System three), ÉGS5 (Élo Grading System five) and ÉGS6 (Élo Grading System six).

"Junior problem"...

The so-called "junior problem", or in general the problem of players whose chess abilities change rapidly (which has been addressed in Glicko 2), has not been addressed in any of the mentioned systems (so neither in ÉGS6 nor in ÉGS5).

One simple approach to the "junior problem" could be the following: after calculating the grades (in the normal way) using GS, AGS3, ÉGS5 or ÉGS6 (I am advocating ÉGS6), one calculates the average of the old and the new (just calculated) grades, then repeats the calculation (in the normal way) using GS, AGS3, ÉGS5 or ÉGS6, but this time starting from the average grades. This should address (to some extent) the problem of players whose chess abilities change rapidly (say, fast improving juniors). This idea still needs to be checked (the approach may well cause grade stretching or shrinking).
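One possible reading of this two-pass idea can be sketched in Python as follows. This is only a rough sketch under my own assumptions: the names 'one_pass' and 'two_pass' are hypothetical, the per-game rule is the ÉGS5 one (logistic 'p = f(d)', 'ka = kb = 1/2'), and every game is given as a triple (player A, player B, 'q').

```python
def expected(d, g=50):
    """Logistic expected performance p = f(d), in percent."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


def one_pass(grades, games, k=0.5):
    """Season grades: the average of the per-game new grades.

    grades: dict name -> grade; games: list of (name_a, name_b, q).
    """
    totals = {name: 0.0 for name in grades}
    counts = {name: 0 for name in grades}
    for name_a, name_b, q in games:
        a, b = grades[name_a], grades[name_b]
        p = expected(a - b)
        totals[name_a] += a + k * (q - p)
        totals[name_b] += b + k * ((100 - q) - (100 - p))
        counts[name_a] += 1
        counts[name_b] += 1
    # A player without games keeps the old grade.
    return {n: (totals[n] / counts[n] if counts[n] else grades[n])
            for n in grades}


def two_pass(grades, games, k=0.5):
    """Assumed reading: grade once, average old and new, grade again."""
    first = one_pass(grades, games, k)
    averaged = {n: (grades[n] + first[n]) / 2 for n in grades}
    return one_pass(averaged, games, k)


if __name__ == "__main__":
    grades = {"junior": 100, "adult1": 120, "adult2": 120}
    games = [("junior", "adult1", 100), ("junior", "adult2", 100)]
    print(one_pass(grades, games))
    print(two_pass(grades, games))
```

Whether the second pass helps or harms (stretching or shrinking) is exactly what would still need to be checked on real data.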

Another idea for resolving the "junior problem" could be to make the 'k' factors for juniors (i.e., players whose chess abilities change rapidly) higher than for other players, while keeping the sum of the two factors 'ka' and 'kb' at 1 (which would guarantee that the grades are neither stretched nor shrunk). The measure of change of one's chess ability could be the grade change over the last two seasons (players who have played in fewer than two seasons so far may be assumed to have rapidly changing abilities). The idea is to trust the grade of a rapidly improving junior (or of any other player whose chess abilities change rapidly) less than that of an ordinary adult player (or any other player) whose grade is more or less constant. With this approach junior grades (or grades of players whose chess abilities change rapidly) would change faster and would have less effect on the grades of the players they have played. The problem would remain of deciding how much to correct the 'k' factor for change in chess ability and how much for grade trust (establishment) based on frequency of play.
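A minimal Python sketch of such a split of the 'k' factors is given below. The function 'split_k' and its "trust" weights are purely hypothetical (how to measure trust is the open question mentioned above); the only property taken from the text is 'ka + kb = 1', under which the grade difference changes by exactly 'q - p'.

```python
def split_k(trust_a, trust_b):
    """Hypothetical split: the less trusted grade gets the larger factor.

    Returns (ka, kb) with ka + kb = 1.
    """
    total = trust_a + trust_b
    if total <= 0:
        return 0.5, 0.5
    return trust_b / total, trust_a / total


def update(a, b, q, p, ka, kb):
    """General update rule shared by all the systems discussed here."""
    a2 = a + ka * (q - p)
    b2 = b + kb * ((100 - q) - (100 - p))
    return a2, b2


if __name__ == "__main__":
    # Junior 'A' (trust weight 1) beats established adult 'B' (trust weight 3).
    a, b, q = 100, 120, 100
    p = 50 + (a - b)          # ECF linear approximation, valid for |d| <= 40
    ka, kb = split_k(1, 3)
    a2, b2 = update(a, b, q, p, ka, kb)
    print(ka, kb, a2, b2)
    # Change of the grade difference equals q - p when ka + kb = 1.
    print((a2 - b2) - (a - b), q - p)
```

Note that with 'ka' different from 'kb' the sum 'a2 + b2' is no longer equal to 'a + b', which is why such a scheme does not preserve the total system grade.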

Estimated effect on stretching...

Using a system where the sum of the 'k' factors is always equal to 1 (i.e., addressing the grade stretching problem) would affect the grades significantly in the long run, as in GS the grade stretching happens all the time (the effect increases with performance difference) and accumulates over time. Élo's logistic curve (present in both ÉGS5 and ÉGS6) wouldn't affect the grades significantly, as its effect is relatively small for relatively small grade differences (the effect increases with grade difference), but it may affect the grades noticeably in cases where the grade difference is large. Professor Glickman's idea about grade establishment wouldn't affect the grades significantly in general, as its effect is normally relatively small, but it may affect the grades significantly in cases where ungraded (or less active) players play graded (or more active) players.
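To get a feeling for the size of the logistic-curve effect, the two versions of 'p = f(d)' can be compared directly. The following is a small Python sketch (the function names 'p_linear' and 'p_logistic' are mine; the constants 'g = 50' and 's = 40' are those used in the formulae in this document).

```python
def p_linear(d, g=50, s=40):
    """ECF linear approximation: p = g + d, held at 10 and 90 for |d| > s."""
    if d > s:
        return 90.0
    if d < -s:
        return 10.0
    return float(g + d)


def p_logistic(d, g=50):
    """Logistic curve used in EGS5 and EGS6."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


if __name__ == "__main__":
    for d in (0, 10, 20, 40, 80):
        lin, log_ = p_linear(d), p_logistic(d)
        print(d, round(lin, 1), round(log_, 1), round(log_ - lin, 1))
```

For small 'd' the two curves differ by a point or two, while for large 'd' (where the linear rule is held at 90) the gap grows to several points.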

The formulae...

Let 'a' and 'b' be the grades of players 'A' and 'B', 'p' the expected performance of player 'A' (the expected performance of player 'B' is then '100 - p'), 'q' the actual performance of player 'A' (the actual performance of player 'B' is then '100 - q'), 'd = a - b' the grade difference, and 'na' and 'nb' the numbers of games players 'A' and 'B' played in the last season for which grades 'a' and 'b' were calculated. Then the new grades of players 'A' and 'B', 'a2' and 'b2', are calculated as follows:

GS (current Grading System) formulae:

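(* GS *)
ClearAll[a, b, a2, b2, d, g, s, ka, kb, p, q];
a = 120; b = 120;    (* current grades of players A and B *)
q = 50;              (* actual performance of player A, 0 to 100 *)
g = 50; s = 40;      (* constants of the linear 'p = f(d)'; s caps |d| *)
d = a - b;           (* grade difference *)
ka = 1; kb = 1;      (* 'k' factors; ka + kb = 2 *)
(* expected performance of A: p = g + d, held at 90 or 10 beyond the cap *)
If[d >= 0, If[d > s, p = 90, p = g*(1 + d/g)],
    If[d < -s, p = 10, p = g*(1 + d/g)]];
a2 = a + ka*(q - p);                    (* new grade of A *)
b2 = b + kb*((100 - q) - (100 - p));    (* new grade of B *)
Round[N[a2]]
Print[];
Round[N[b2]]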

AGS3 (Amended Grading System three) formulae:

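(* AGS3 *)
ClearAll[a, b, a2, b2, d, g, s, ka, kb, p, q];
a = 120; b = 120;    (* current grades of players A and B *)
q = 50;              (* actual performance of player A, 0 to 100 *)
g = 50; s = 40;      (* constants of the linear 'p = f(d)'; s caps |d| *)
d = a - b;           (* grade difference *)
ka = 1/2; kb = 1/2;  (* 'k' factors; ka + kb = 1 *)
(* expected performance of A: p = g + d, held at 90 or 10 beyond the cap *)
If[d >= 0, If[d > s, p = 90, p = g*(1 + d/g)],
    If[d < -s, p = 10, p = g*(1 + d/g)]];
a2 = a + ka*(q - p);                    (* new grade of A *)
b2 = b + kb*((100 - q) - (100 - p));    (* new grade of B *)
Round[N[a2]]
Print[];
Round[N[b2]]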

ÉGS5 (Élo Grading System five) formulae:

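(* EGS5 *)
ClearAll[a, b, a2, b2, d, g, ka, kb, p, q];
a = 120; b = 120;    (* current grades of players A and B *)
q = 50;              (* actual performance of player A, 0 to 100 *)
d = a - b;           (* grade difference *)
g = 50;              (* scale of the logistic curve *)
ka = 1/2; kb = 1/2;  (* 'k' factors; ka + kb = 1 *)
p = 100/(1 + 10^(-d/g));                (* expected performance of A, logistic 'p = f(d)' *)
a2 = a + ka*(q - p);                    (* new grade of A *)
b2 = b + kb*((100 - q) - (100 - p));    (* new grade of B *)
Round[N[a2]]
Print[];
Round[N[b2]]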

ÉGS6 (Élo Grading System six) formulae:

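(* EGS6 *)
ClearAll[a, b, a2, b2, d, g, ka, kb, p, q, na, nb];
na = 30; nb = 30;    (* games played by A and B in the last graded season *)
a = 120; b = 120;    (* current grades of players A and B *)
q = 50;              (* actual performance of player A, 0 to 100 *)
d = a - b;           (* grade difference *)
g = 50;              (* scale of the logistic curve *)
(* 'k' factors: the less active player's grade changes more; ka + kb = 1 *)
ka = If[na + nb > 0, nb/(na + nb), 1/2];
kb = If[na + nb > 0, na/(na + nb), 1/2];
p = 100/(1 + 10^(-d/g));                (* expected performance of A, logistic 'p = f(d)' *)
a2 = a + ka*(q - p);                    (* new grade of A *)
b2 = b + kb*((100 - q) - (100 - p));    (* new grade of B *)
Round[N[a2]]
Print[];
Round[N[b2]]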

(If players 'A' and 'B' played only one game in the season, 'q' is 100, 0 or 50; if they played more than one game, it can be any number between 0 and 100 inclusive. 'Log[z]' gives the natural logarithm of 'z' (logarithm to base 'e') and 'x^y' gives 'x' to the power 'y'. The input parameters for GS, AGS3 and ÉGS5 are 'a', 'b' and 'q'; for ÉGS6 they are 'a', 'b', 'na', 'nb' and 'q'. The output parameters are 'a2' and 'b2'.)

Note: The formulae are used to calculate a new grade of player 'A' for every opponent 'B' he or she played in the season. At the end of the season the average of the calculated grades (over all opponents 'B') is taken, and this average is player 'A''s new grade for the season. (For GS, AGS3 and ÉGS5, if a player has not played enough games in the season, games from the previous season or seasons are included in the calculation; for ÉGS6 no games from previous seasons need to be taken into account.)
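For readers without Mathematica, the four sets of formulae and the end-of-season averaging from the note can be sketched in Python roughly as follows. This is my own port, not an official implementation; the function names are hypothetical, and the rule about borrowing games from previous seasons is not modelled.

```python
def expected_linear(d, g=50, s=40):
    """ECF linear 'p = f(d)': p = g + d, held at 90 or 10 for |d| > s."""
    if d > s:
        return 90.0
    if d < -s:
        return 10.0
    return float(g + d)


def expected_logistic(d, g=50):
    """Logistic 'p = f(d)' used in EGS5 and EGS6."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


def update(a, b, q, system="EGS6", na=30, nb=30):
    """Return (a2, b2) for one opponent under the named system."""
    d = a - b
    if system == "GS":
        p, ka, kb = expected_linear(d), 1.0, 1.0
    elif system == "AGS3":
        p, ka, kb = expected_linear(d), 0.5, 0.5
    elif system == "EGS5":
        p, ka, kb = expected_logistic(d), 0.5, 0.5
    elif system == "EGS6":
        p = expected_logistic(d)
        n = na + nb
        ka = nb / n if n > 0 else 0.5
        kb = na / n if n > 0 else 0.5
    else:
        raise ValueError("unknown system: " + system)
    a2 = a + ka * (q - p)
    b2 = b + kb * ((100 - q) - (100 - p))
    return a2, b2


def season_grade(a, results, system="EGS6", na=30):
    """Average of player A's per-opponent new grades.

    results: list of (b, q, nb), one entry per opponent.
    """
    per_opponent = [update(a, b, q, system, na, nb)[0]
                    for b, q, nb in results]
    return sum(per_opponent) / len(per_opponent)


if __name__ == "__main__":
    for name in ("GS", "AGS3", "EGS5", "EGS6"):
        print(name, update(120, 100, 100, name))
```

The example at the bottom shows a 120 player beating a 100 player under each system, so the difference between 'ka = kb = 1' and 'ka = kb = 1/2' can be seen side by side.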

Ungraded players...

Rule 1b: For a win you score your opponent's grade plus 50; for a draw, your opponent's grade; and for a loss, your opponent's grade minus 50. Note that, if your opponent's grade differs from yours by more than 50 points, it is taken to be exactly 50 points above (or below) yours. At the end of the season an average of points-per-game is taken, and that is your new grade.

Let 'a' and 'b' be the grades of players 'A' and 'B', 'p' the expected performance of player 'A' (the expected performance of player 'B' is then '100 - p'), 'q' the actual performance of player 'A' (the actual performance of player 'B' is then '100 - q'), and 'a2' and 'b2' the new grades of players 'A' and 'B'.

'a2' and 'b2' are calculated using the following formulae (this holds for any grading system mentioned here, including the current one):

a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));

Rule 1b is equivalent to:

g = 50; s = 50;
ka = 1; kb = 1;
d = a - b;
If[d >= 0, If[d > s, p = 100, p = g*(1 + d/g)],
    If[d < -s, p = 0, p = g*(1 + d/g)]];
a2 = a + ka*(q - p);
b2 = b + kb*((100 - q) - (100 - p));

where player A is you and player B is your opponent.

'a2' can be expressed in terms of 'b': substituting 'a = b + d', 'ka = 1' and 'p = 50 + d' (valid for '-50 <= d <= 50'), one gets 'a2 = a + ka*(q - p) = b + d + ka*(-50 - d + q) = -50 + b + q'.

In general 'a2' is a function of 'b', 'q' and 'd'; for example, AGS3's 'a2' is 'a2 = (-50 + 2*b + d + q)/2'. The 'a2' of CGS (the current grading system, called GS above), however, is a function of 'b' and 'q' only, which means that you need to know only your opponent's grade 'b' and your actual performance 'q' in order to calculate your new grade 'a2'.

This happy coincidence can be utilized in calculating the grades of ungraded players. We could apply rule 1b (which can be used if one changes the grade of only one of the players in the game) to all games where ungraded players play graded players, using it to calculate the grades of the ungraded players only (note that the grades of ungraded players need not be estimated, as they are not needed in the calculation). Applying rule 1b for the ungraded player but omitting grading of the game for the graded player is probably the best one can do (this is one of the rare occasions where one knows that one wants to credit or penalize only one player in the game by the fully allowed amount).

According to ÉGS6, the new grade 'a2' of an ungraded player (who played a graded player) is 'a2 = -100/(1 + 10^(-d/50)) + b + d + q'. Nevertheless, the new grade 'a2' of an ungraded player could be calculated using CGS's formula 'a2 = -50 + b + q', taking advantage of the fact that this calculation is independent of 'd' (so that the estimated grade would not enter the calculation). Taking CGS's formula for 'a2' instead of ÉGS6's is equivalent to stating that '50 + d' is approximately equal to '100/(1 + 10^(-d/50))', which is true for '-35 < d < 35', so it would seem better to use CGS's rather than ÉGS6's formula for calculating the grades of ungraded players (the error due to an error in the grade estimate would most likely be larger than the error due to taking CGS's rather than ÉGS6's formula).
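The two ways of grading an ungraded player can be put side by side in a short Python sketch. The function names are hypothetical, and averaging over the games follows the end-of-season rule described earlier; the second function needs an estimated grade for the ungraded player, the first does not.

```python
def ungraded_grade_cgs(results):
    """CGS formula a2 = -50 + b + q, averaged over the games.

    results: list of (opponent_grade, q) with q in 0..100.
    No estimate of the ungraded player's grade is needed.
    """
    per_game = [b + q - 50 for b, q in results]
    return sum(per_game) / len(per_game)


def ungraded_grade_egs6(results, estimate):
    """EGS6 formula a2 = -100/(1 + 10^(-d/50)) + b + d + q, averaged.

    'estimate' is the assumed grade of the ungraded player, d = estimate - b.
    """
    per_game = []
    for b, q in results:
        d = estimate - b
        p = 100.0 / (1.0 + 10.0 ** (-d / 50))
        per_game.append(b + d + q - p)
    return sum(per_game) / len(per_game)


if __name__ == "__main__":
    results = [(120, 100), (100, 50)]    # a win against 120, a draw against 100
    print(ungraded_grade_cgs(results))
    for estimate in (90, 110, 130):
        print(estimate, round(ungraded_grade_egs6(results, estimate), 1))
```

Running the second function with several estimates shows how much the result depends on the guess, which is the error the CGS formula avoids.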

Figure 5
Figure 5: Relationship between expected performance 'p' and grade difference 'd' as defined in CGS (blue line) and ÉGS6 (yellow line). Expected performance 'p' is a function of grade difference 'd', i.e., 'p = f(d)'. Note that '50 + d' (part of blue line) is approximately equal to '100/(1 + 10^(-d/50))' (yellow line) for '-35 < d < 35'.

Single game outcome probability...

---------------------------------------
        chance in percent
        loss   draw   win   (stronger)
    d   win    draw   loss  (weaker)
---------------------------------------
    0   25.0   50.0   25.0
    1   24.4   48.9   26.7
    2   23.8   47.7   28.5
    3   23.3   46.5   30.2
    4   22.7   45.4   31.9
    5   22.1   44.3   33.6
    6   21.6   43.1   35.3
    7   21.0   42.0   37.0
    8   20.4   40.9   38.7
    9   19.9   39.8   40.3
   10   19.3   38.7   42.0
   11   18.8   37.6   43.6
   12   18.3   36.5   45.2
   13   17.7   35.5   46.8
   14   17.2   34.4   48.4
   15   16.7   33.4   49.9
   16   16.2   32.4   51.4
   17   15.7   31.4   52.9
   18   15.2   30.4   54.4
   19   14.7   29.4   55.9
   20   14.2   28.5   57.3
   21   13.8   27.5   58.7
   22   13.3   26.7   60.0
   23   12.9   25.7   61.4
   24   12.4   24.9   62.7
   25   12.0   24.0   64.0
   26   11.6   23.2   65.2
   27   11.2   22.4   66.4
   28   10.8   21.6   67.6
   29   10.4   20.8   68.8
   30   10.0   20.1   69.9
   31    9.7   19.3   71.0
   32    9.3   18.7   72.0
   33    9.0   17.9   73.1
   34    8.6   17.3   74.1
   35    8.3   16.7   75.0
   36    8.0   16.0   76.0
   37    7.7   15.4   76.9
   38    7.4   14.8   77.8
   39    7.1   14.3   78.6
   40    6.8   13.7   79.5
   41    6.6   13.1   80.3
   42    6.3   12.6   81.1
   43    6.1   12.1   81.8
   44    5.8   11.7   82.5
   45    5.6   11.2   83.2
   46    5.4   10.7   83.9
   47    5.1   10.3   84.6
   48    4.9    9.9   85.2
   49    4.7    9.5   85.8
   50    4.5    9.1   86.4
   51    4.4    8.7   86.9
   52    4.2    8.3   87.5
   53    4.0    8.0   88.0
   54    3.8    7.7   88.5
   55    3.7    7.3   89.0
   56    3.5    7.1   89.4
   57    3.4    6.7   89.9
   58    3.2    6.5   90.3
   59    3.1    6.2   90.7
   60    3.0    5.9   91.1
   61    2.8    5.7   91.5
   62    2.7    5.5   91.8
   63    2.6    5.2   92.2
   64    2.5    5.0   92.5
   65    2.4    4.8   92.8
   66    2.3    4.6   93.1
   67    2.2    4.4   93.4
   68    2.1    4.2   93.7
   69    2.0    4.0   94.0
   70    1.9    3.8   94.3
   71    1.8    3.7   94.5
   72    1.8    3.5   94.7
   73    1.7    3.3   95.0
   74    1.6    3.2   95.2
   75    1.5    3.1   95.4
   76    1.5    2.9   95.6
   77    1.4    2.8   95.8
   78    1.3    2.7   96.0
   79    1.3    2.5   96.2
   80    1.2    2.5   96.3
   81    1.2    2.3   96.5
   82    1.1    2.3   96.6
   83    1.1    2.1   96.8
   84    1.0    2.1   96.9
   85    1.0    1.9   97.1
   86    0.9    1.9   97.2
   87    0.9    1.8   97.3
   88    0.9    1.7   97.4
   89    0.8    1.6   97.6
   90    0.8    1.5   97.7
   91    0.7    1.5   97.8
   92    0.7    1.4   97.9
   93    0.7    1.3   98.0
   94    0.7    1.3   98.0
   95    0.6    1.3   98.1
   96    0.6    1.2   98.2
   97    0.6    1.1   98.3
   98    0.5    1.1   98.4
   99    0.5    1.1   98.4
  100    0.5    1.0   98.5
  101    0.5    0.9   98.6
  102    0.5    0.9   98.6
  103    0.4    0.9   98.7
  104    0.4    0.8   98.8
  105    0.4    0.8   98.8
  106    0.4    0.7   98.9
  107    0.4    0.7   98.9
  108    0.3    0.7   99.0
  109    0.3    0.7   99.0
  110    0.3    0.6   99.1
  111    0.3    0.6   99.1
  112    0.3    0.6   99.1
  113    0.3    0.5   99.2
  114    0.3    0.5   99.2
  115    0.2    0.5   99.3
  116    0.2    0.5   99.3
  117    0.2    0.5   99.3
  118    0.2    0.5   99.3
  119    0.2    0.4   99.4
  120    0.2    0.4   99.4
  121    0.2    0.4   99.4
  122    0.2    0.3   99.5
  123    0.2    0.3   99.5
  124    0.2    0.3   99.5
  125    0.2    0.3   99.5
  126    0.2    0.3   99.5
  127    0.1    0.3   99.6
  128    0.1    0.3   99.6
  129    0.1    0.3   99.6
  130    0.1    0.3   99.6
  131    0.1    0.3   99.6
  132    0.1    0.2   99.7
  133    0.1    0.2   99.7
  134    0.1    0.2   99.7
  135    0.1    0.2   99.7
  136    0.1    0.2   99.7
  137    0.1    0.2   99.7
  138    0.1    0.2   99.7
  139    0.1    0.1   99.8
  140    0.1    0.1   99.8
  141    0.1    0.1   99.8
  142    0.1    0.1   99.8
  143    0.1    0.1   99.8
  144    0.1    0.1   99.8
  145    0.1    0.1   99.8
  146    0.1    0.1   99.8
  147    0.1    0.1   99.8
  148    0.1    0.1   99.8
  149    0.1    0.1   99.8
  150    0.0    0.1   99.9
  151    0.0    0.1   99.9
  152    0.0    0.1   99.9
  153    0.0    0.1   99.9
  154    0.0    0.1   99.9
  155    0.0    0.1   99.9
  156    0.0    0.1   99.9
  157    0.0    0.1   99.9
  158    0.0    0.1   99.9
  159    0.0    0.1   99.9
  160    0.0    0.1   99.9
  161    0.0    0.1   99.9
  162    0.0    0.1   99.9
  163    0.0    0.1   99.9
  164    0.0    0.1   99.9
  165    0.0    0.1   99.9
  166    0.0    0.1   99.9
  167    0.0    0.1   99.9
  168    0.0    0.1   99.9
  169    0.0    0.1   99.9
  170    0.0    0.1   99.9
  171    0.0    0.1   99.9
  172    0.0    0.1   99.9
  173    0.0    0.1   99.9
  174    0.0    0.0  100.0
  175    0.0    0.0  100.0
  176    0.0    0.0  100.0
  177    0.0    0.0  100.0
  178    0.0    0.0  100.0
  179    0.0    0.0  100.0
  180    0.0    0.0  100.0
  181    0.0    0.0  100.0
  182    0.0    0.0  100.0
  183    0.0    0.0  100.0
  184    0.0    0.0  100.0
  185    0.0    0.0  100.0
  186    0.0    0.0  100.0
  187    0.0    0.0  100.0
  188    0.0    0.0  100.0
  189    0.0    0.0  100.0
  190    0.0    0.0  100.0
  191    0.0    0.0  100.0
  192    0.0    0.0  100.0
  193    0.0    0.0  100.0
  194    0.0    0.0  100.0
  195    0.0    0.0  100.0
  196    0.0    0.0  100.0
  197    0.0    0.0  100.0
  198    0.0    0.0  100.0
  199    0.0    0.0  100.0
  200    0.0    0.0  100.0
  201    0.0    0.0  100.0
  202    0.0    0.0  100.0
  203    0.0    0.0  100.0
  204    0.0    0.0  100.0
  205    0.0    0.0  100.0
  206    0.0    0.0  100.0
  207    0.0    0.0  100.0
  208    0.0    0.0  100.0
  209    0.0    0.0  100.0
  210    0.0    0.0  100.0
  211    0.0    0.0  100.0
  212    0.0    0.0  100.0
  213    0.0    0.0  100.0
  214    0.0    0.0  100.0
  215    0.0    0.0  100.0
  216    0.0    0.0  100.0
  217    0.0    0.0  100.0
  218    0.0    0.0  100.0
  219    0.0    0.0  100.0
  220    0.0    0.0  100.0
---------------------------------------

Table 3: Chance in percent of the stronger player losing (the weaker player winning) '50 - 1/2*f(d)', of the players drawing '100 - f(d)', and of the stronger player winning (the weaker player losing) '3/2*f(d) - 50', as a function of grade difference 'd'. The calculation assumes the logistic 'p = f(d)' (yellow line in figure 2) and that the most probable number of draws is the average of the maximum '200 - 2*f(d)' and the minimum '0'. White is approximately 5 grading points stronger than Black.
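The rows of Table 3 can be reproduced from the three expressions in the caption. A short Python sketch (the function name 'outcome_chances' is mine):

```python
def f(d, g=50):
    """Logistic expected performance of the stronger player, in percent."""
    return 100.0 / (1.0 + 10.0 ** (-d / g))


def outcome_chances(d):
    """Return (loss, draw, win) chances in percent for the stronger player."""
    p = f(d)
    loss = 50.0 - p / 2.0
    draw = 100.0 - p
    win = 1.5 * p - 50.0
    return loss, draw, win


if __name__ == "__main__":
    for d in (0, 10, 25, 50, 100):
        loss, draw, win = outcome_chances(d)
        print(d, round(loss, 1), round(draw, 1), round(win, 1))
```

As a consistency check, 'win + draw/2' gives back 'f(d)', i.e., the expected performance of the stronger player.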


References...

[ECF]