Scores of high range tests’ candidates on supervised tests

hriqtests.com, November 2018

In March 2018, I published a short report on the scores that high range test candidates obtained on supervised tests. Since then the available data have been substantially updated, and some data acquired years earlier have been recovered, so it seemed time to present a fuller picture.
My aim is to present the findings in a way that anyone can follow, as a contribution to the general discussion.
Such work is not only a matter of general interest, nor is it done for its own sake; it helps, more or less, to set some standards for high range tests. One long-standing and controversial question, for example, is whether a fixed mean performance actually exists. Another is whether the sample of high range test candidates fits a normal distribution; in plain words, whether high range test candidates can be treated as a special kind of “general population” with respect to their performance on high range tests. Many people, testees and designers alike, think that performances on a specific high range test should form a bell curve and, if they do not, question the test’s quality. To argue this way, one must first be sure that one is dealing with some kind of “general population”. If not, the absence of a bell curve is exactly what should be expected, and there is no problem with it at all. Furthermore, others complain that norms are not set strictly according to the calculated standard deviation (“for every 7.2 raw score points, IQ should rise by 15 points; why doesn’t this happen?”). If the distribution is not normal, the standard deviation cannot be relied on that heavily.
A total of 158 scores on supervised tests were reported between 2010 and 2018 (most were reported directly; about a dozen were gathered from score lists). The good news is that, beyond about 120 scores, the mean performance and the shape of the curve remained impressively stable. That is a good start. In addition, as shown below, removing ceiling scores (scores reported as “greater than x”) changed practically nothing, which again points to stability.
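The rule behind the quoted complaint can be written out as a small sketch. It is only the linear (z-score) scaling that the complaint assumes, not the norming method used for any actual test; the function name `linear_iq` and the raw mean of 20 are hypothetical, and only the 7.2 raw-score standard deviation comes from the quote.

```python
def linear_iq(raw, raw_mean, raw_sd, iq_mean=100.0, iq_sd=15.0):
    """Map a raw score to an IQ-style score by linear (z-score) scaling.

    This is only meaningful if raw scores are roughly normally
    distributed in the population the norm refers to.
    """
    if raw_sd <= 0:
        raise ValueError("raw_sd must be positive")
    z = (raw - raw_mean) / raw_sd      # distance from the mean in SD units
    return iq_mean + iq_sd * z         # rescale to M=100, SD=15


# Hypothetical raw mean of 20; 7.2 is the raw-score SD from the quoted complaint.
for raw in (20.0, 27.2, 34.4):
    print(raw, round(linear_iq(raw, raw_mean=20.0, raw_sd=7.2), 1))
```

Under this rule every 7.2 raw points add exactly 15 IQ points, which is what the complaint expects. When the raw scores do not follow a normal distribution, equal raw-score steps need not correspond to equal IQ steps, so norms can legitimately depart from this straight line.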

Tests used: WAIS (R, III, IV; 61 scores), RAPM (22), CCFIT III (B, A+B; 14 scores), MAT (8), FRT-A (8), FRT-B (7), SBIS (4), IST70R (3), IBF-S (3), Wonderlic (3), OLSAT (3), CFT20R (2), TOGRA (1), BLS2-4T (1), RIAS (1), unknown Mensa entrance tests (in most cases probably RAPM or FRT; 17 scores). The lowest reported score is 85 and the highest is 185 (both on a scale with M=100, SD=15).

Histograms and further discussion follow.

1. Scores on supervised tests (with ceiling results; “greater than x” is counted as x).

Supervised_ceiling

2. Scores on supervised tests (without ceiling results)

Supervised_wo_ceiling

One may easily notice that, despite the removal of 26 scores, neither the mean score nor the shape of the curve differs significantly; in practice there is no difference at all. This matches the observation made earlier: beyond roughly 120 scores, no significant changes were seen.
Among people who reported more than one score, correlations between tests are high, as expected (number of score pairs in brackets).
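The comparison between the two histograms can be sketched as follows. This is a minimal illustration, assuming ceiling results are recorded as text such as ">145"; the helper names `parse_score` and `summarize` and the sample values are hypothetical and are not the 158 reported scores.

```python
from statistics import mean


def parse_score(text):
    """Return (value, is_ceiling) for a reported score such as '132' or '>145'."""
    text = text.strip()
    is_ceiling = text.startswith(">")
    value = float(text.lstrip(">").strip())
    return value, is_ceiling


def summarize(reports):
    """Return (mean with ceiling scores counted as x, mean without them)."""
    parsed = [parse_score(r) for r in reports]
    with_ceiling = [value for value, _ in parsed]
    without_ceiling = [value for value, is_ceiling in parsed if not is_ceiling]
    return mean(with_ceiling), mean(without_ceiling)


# Hypothetical reports, for illustration only.
reports = ["128", "135", ">145", "140", "122", ">150"]
m_with, m_without = summarize(reports)
print(f"with ceiling: {m_with:.1f}, without ceiling: {m_without:.1f}")
```

Counting “greater than x” as x understates the true score, so a large gap between the two means would suggest that ceiling results are distorting the picture; a small gap is what the stability described above looks like.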

Test      CCFIT III     RAPM
WAIS      0.89 (4)      0.86 (10)
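For readers who want to compute such a coefficient from their own score pairs, here is a minimal sketch. It assumes the Pearson product-moment correlation, which the report does not state explicitly; the function name `pearson_r` and the score pairs are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical pairs: each position is one person's score on two tests.
wais = [130, 142, 125, 150, 138]
rapm = [128, 145, 130, 147, 135]
print(round(pearson_r(wais, rapm), 2))
```

With only a handful of pairs the coefficient can shift noticeably when a single pair is added or removed, which is why the number of pairs is given next to each value.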

So, as seen above, the scores approach a normal distribution, but the shape is not quite clear. Nothing more than a tendency can be observed. With a larger sample (for example N=300), and if ALL scores were reported (many people do not report scores that are not “as high as they wanted”), things might become clearer.
On the other hand, a mean performance clearly exists, and in my opinion it should be strictly taken into account when norming high range tests. That is probably the IQ level at which people tend to realize that they think significantly differently from the average Joe or ordinary Jane, but that is another, much larger conversation.
Regarding test design, after 8 years of creating several tests, one thing can be said for sure: there exist specific patterns (in other words, kinds of items), different from one another and spanning several difficulty levels, that tend to behave steadily and “healthily”; that is, they keep their difficulty regardless of the environment (whether they appear in one test or another) and show a high discrimination index. It is only after 8 years of studying and experimenting with tests that I can now say I have created a set of decent tests. At this point, one should realize that the attractiveness or originality of a test does not necessarily guarantee its quality. Of course, combining all of the above is the ultimate goal.
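One simple way to put a number on “approaching a normal distribution” is to look at skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below uses the plain moment-based (population) formulas; it is not the analysis behind the histograms, and the function name `shape_stats` and the sample scores are hypothetical.

```python
def shape_stats(scores):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least 2 scores")
    m = sum(scores) / n
    m2 = sum((s - m) ** 2 for s in scores) / n   # variance
    m3 = sum((s - m) ** 3 for s in scores) / n   # third central moment
    m4 = sum((s - m) ** 4 for s in scores) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("all scores are identical")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical scores, for illustration only.
scores = [118, 125, 128, 130, 132, 134, 135, 138, 142, 150]
skew, kurt = shape_stats(scores)
print(f"skewness: {skew:.2f}, excess kurtosis: {kurt:.2f}")
```

Both statistics are noisy in small samples, so values moderately far from 0 do not rule out normality, and values near 0 do not prove it. This is consistent with the point that a larger and more complete sample would make the shape clearer.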
