On review scoring
It's one of the most-observed truths of videogame reviewing that the entire concept of scoring is, as practised almost universally in all forms of current print, broadcast and online media, fundamentally broken.
Everyone knows that the marks awarded in game reviews – whether out of five stars, ten points or 100% – are not in fact sequential numbers as we were taught them in arithmetic lessons, but abstract ciphers whose true value is heavily encoded. In videogame reviewing, 4 isn't any bigger than 2, 6=7, and 10 is more than twice as many as 9.
And therefore – since the sole and entire point of scoring is to attach an instantly comprehensible numerical summary of the reviewer's opinion to the text – videogame review scores are functionally almost meaningless.
There are certain phenomena that we know to be almost universally true:
—————————————————————————-
1. In scoring systems marking out of 10, the vast majority of scores clump around the scores 6, 7 and 8. (For the sake of clarity, this feature will generally use marks out of 10 as the default reference.) Similarly, with percentage-marking systems, most games score between 51% and 80%. (Which – rather than 60% to 80% – is the true equivalent range.)
2. In scoring systems marking out of five (or less), maximum scores (ie 5/5) are commonplace. Yet in systems marking out of 10 or 100, they're almost unheard of. This isn't very rational. Marking out of 10 only doubles the fineness of division, so there should be roughly half as many games scoring 10/10 in a 10-point system as there are scoring 5/5 in a five-point system (with the other half getting 9/10).
Yet something like Edge, which has a declared policy of cherry-picking only the best games each month, still only awards a 10 roughly once per 250 games reviewed. There seems to be a disproportionately huge invisible glass barrier between 9 and 10 in a way that there isn't between 4 and 5, and another between 99% and 100% that's so vast as to be almost infinite.
(Edge has reviewed somewhere in the vicinity of 3000 games in its 17-year life. If we assume that their cherry-picking policy selects only, say, the best 40% of games to start with, those 3000 reviews are drawn from a pool of roughly 7500 releases; and if a 10 simply means "one of the best tenth of all games", approximately 750 games should have had a 10, all of which would fall within Edge's chosen 40%. The actual number is 12. A worked version of this sum follows the list.)
3. Despite the above, there are always far more games clustered around the 9 (or equivalent) mark than there are in the entire range from 5 down to 1. It seems to be far, far easier to break into the 9/10 club than it is to score 5 or less in a 10-point system, yet astonishingly hard to take that one extra step.
4. A similar marginal level of distinction doesn't apply in the bottom half of the scoring range. To most modern reviewers (and readers), the scores 5/10 and 1/10 are basically interchangeable. In other words, any system marking out of 10 or more is in fact really marking out of three:
"Definitely buy this" (encompassing scores of 9 and 10. Strangely, in a percentage system the cut-off here is 90% and above, even though the actual equivalent should be 81% and above).
"Definitely don't buy this" (encompassing anything from 1 to 5).
"Um, it's sort of in the middle. Buy this if this is the sort of thing you usually like buying", (encompassing scores from 6 to 8, and basically not a review at all).
(What this means is that systems marking out of more than five are more often than not actually LESS precise than those marking out of five, in which the full range is more widely employed.)
—————————————————————————-
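A quick worked version of the sums in items 1 and 2 above, as a minimal Python sketch (the 3000-game, 12-ten and 40% figures are the ones quoted; the function name and the "top tenth" reading of a 10/10 are just for illustration):

```python
# Item 1: the percentage band that genuinely corresponds to each mark out of 10,
# if each mark is taken to cover one tenth of the percentage scale.
def equivalent_band(mark_out_of_10):
    """A mark of n/10 covers the band from 10*(n-1)+1 percent to 10*n percent."""
    return 10 * (mark_out_of_10 - 1) + 1, 10 * mark_out_of_10

for mark in (6, 7, 8):
    low, high = equivalent_band(mark)
    print(f"{mark}/10 is really {low}%-{high}%")   # 6/10 -> 51%-60%, 8/10 -> 71%-80%

# Item 2: how many 10/10s Edge "should" have awarded, if a 10 simply means
# "one of the best tenth of all games" and the scale is spread evenly.
reviewed = 3000           # approximate number of games Edge has reviewed
cherry_pick_rate = 0.40   # illustrative assumption: only the best 40% get reviewed at all
total_pool = reviewed / cherry_pick_rate    # ~7500 games released over the same period
expected_tens = 0.10 * total_pool           # the top tenth of that pool: ~750 games,
                                            # all of which sit inside the best 40%
print(f"Expected 10/10s: ~{expected_tens:.0f}; actual: 12")
```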
Odd, isn't it? There are two core reasons for these phenomena, and we'll very quickly deal with the less-interesting one first: corruption.
Corruption takes two forms: hard and soft. The "hard" variety is much rarer, and involves direct bribery of some sort – either in the form of a reviewer or editor being offered money or other incentives by the game's publisher, or the magazine being offered advertising placement (or removal), or other editorial benefits (eg exclusive access) conditional on the game's score. All of these things go on depressingly regularly, but they're much less common than the second form of corruption.
"Soft" corruption is when reviewers self-censor, not as a result of any direct instruction or threat but simply as a means to make their own lives easier by not causing what they regard as needless bad feeling.
That is, if you have to review Game X and it's absolutely terrible, but you know that Game X's publisher is about to announce Game Y which you want privileged access to, you might find yourself giving Game X a score of 5/10 – safe in the knowledge that that will be enough to put your readers off buying it, while much less likely to make the publisher apoplectic with rage than the 1 or 2 score that it really deserves.
But neither of those is what I want to talk about. The second reason that review scores are so hopelessly debased is that reviewers have absolutely no grasp of the fundamental purpose of marking.
The purpose of having a marking scale is to compare and sort games against each other, so that you can give your readers advice on which to buy. It's the only criterion that makes any sense, because even if we leave aside the vagaries of personal taste there is no such thing as an absolute empirical measure of game quality – it's a constantly changing baseline.
(If you judged every game against the standard of the entire history of videogames, for example, you'd have to give everything 10/10, because even the crappiest 400-point XBLA game is immeasurably superior to, say, Video Pinball. And the baseline changes even if you stay within a single format – the Spectrum games of 1992 were incomparable to the ones of 1982.)
Now, very stupid people sometimes point out that it's appropriate to cluster things uselessly around the middle, because of the bell curve. But this is a disastrously wrong-headed notion, because the bell curve method of grading is an artificial process specifically designed to clump its constituent elements around the middle, which is exactly what review scoring ISN'T supposed to do.
(Because they're very stupid, incidentally, what these people are almost certainly doing is confusing bell-curve grading with the other form of the bell curve, the normal-distribution graph. This is equally inapplicable because it refers to naturally-occurring phenomena such as human height, whereas there is no such thing as a naturally-mediocre videogame. You can't change your height, but you can employ more QA testers or make your fricking cutscenes skippable.)
The bell curve is completely the wrong model for reviewing – any idiot can say "most things are sort of average with only a few examples at either extreme", because everyone already knows that. But a reviewer's SOLE AND ENTIRE REASON FOR EXISTING is to separate out that clump of things and tell people which ones are most worthy of their limited time and money.
If people invest precious moments of their lives reading your reviews, the absolute minimum they should be able to expect is that they should come out at the other end knowing more than they did when they started. If all they've gleaned from your 1500-word review is "most things are in the middle with only a few examples at either extreme", then you've wasted their time, because they ALREADY KNEW THAT.
So a review scale should be spread as evenly as possible in order to achieve the greatest possible amount of distinction between games. That is, for a scoring system to have ANY worthwhile meaning at all, there should be roughly the same number of games in each division on the scale at any given point in time. If instead you've just clumped everything in the middle and left readers to judge for themselves which of 900 different 8-rated games they should buy, you've failed.
(As a theoretical ideal, reviewing would take the form of a single chart comprising the total number of games reviewed in the publication's entire lifespan, with each new review being assigned a unique position on the chart. If 576 games had been reviewed and a new game was deemed the best ever, its "score" would be 1/577. Sadly this approach is slightly impractical, at least in print magazines, due to the arguable need to categorise games by genre. You'd end up with a hefty slab of pages every month occupied by numerous charts of what was basically reprint.)
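As a very rough sketch of how that single all-time chart would work in practice (the game names and the review function here are invented for illustration, not anything the article proposes concretely):

```python
# One big all-time chart: every reviewed game occupies a unique position, and a new
# review is just an insertion at whatever slot the reviewer thinks it deserves.
chart = []  # position 0 = the best game ever reviewed

def review(game, position):
    """File a newly reviewed game at 1-based `position` and return its 'score'."""
    chart.insert(position - 1, game)
    return f"{position}/{len(chart)}"

# 576 games already reviewed; the new one is judged the best ever:
chart.extend(f"previously reviewed game #{i}" for i in range(1, 577))
print(review("Brand New Masterpiece", 1))   # -> "1/577", exactly as in the example above
```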
The thing that sparked these thoughts off today was some Sunday morning idle time, which I whiled away by examining the reviews of an iPod gaming site. The site marked out of four, which is just about the least useful grading system you could possibly imagine, but the implementation made it even worse. Out of 1063 games "reviewed", the marks broke down like this:
1 ("Avoid): 55
2 ("Caution"): 273
3 ("Good"): 471
4 ("Must Have"): 264
Now, to establish exactly how useless this system is, first we have to look at the games that get 2/4 ("Games in this category should not be avoided in all cases. Some players will find value in them", which is an American translation of the classic "If you like this sort of thing, this is the sort of thing you'll like").
How exactly is someone supposed to exercise "caution" over buying an iPod game? If the game has a Lite version, we don't need some muppet stating the bleeding obvious by telling us that we should probably check that out first if we're not sure. And if it doesn't have a Lite, then there's no way of being cautious. We can either buy the game or not buy it – there's no "buy it cautiously" option available on the App Store.
(What that score is in fact telling us, more or less explicitly, is "Go and check out some other people's reviews, because ours just wasted your time.")
So we can immediately bin the 2/4 scores, because they offer us no help whatsoever in making a buying decision. That leaves us with 790 reviews, of which 735 – or 93% – come with a buying recommendation. ("This rating is our seal of approval", says the site of its 3/4 "Good" mark.)
Well, thanks. You've really filtered the wheat from the chaff for us there, guys. Who knew that only 13 out of every 14 iPod games were worth buying? Where would we have been without you?
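For the record, here are the sums behind the last couple of paragraphs, using the site's own figures quoted above:

```python
# The site's breakdown of its 1063 "reviews", as listed above (mark out of 4 -> count).
scores = {1: 55, 2: 273, 3: 471, 4: 264}
total = sum(scores.values())              # 1063

# Bin the 2/4 "Caution" reviews, which tell us nothing either way...
decisive = total - scores[2]              # 790 reviews that actually take a position
# ...and count how many of those carry a buying recommendation (3/4 or 4/4).
recommended = scores[3] + scores[4]       # 735

print(f"{recommended} of {decisive} = {recommended / decisive:.0%} recommended")
# -> 735 of 790 = 93%: roughly 13 out of every 14 games get the nod
```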
As a concept, giving reviews scores is brilliant. It's informative and a great time-saver in today's busy world, and it makes things like Metacritic (another terrific idea in theory) possible. Writers bemoan them at every turn, because they seem to make all the writer's finely-crafted words an afterthought to a number, but the scores aren't the problem. It's the idiots awarding them.
I still think Pinball Scoring is the only way to go. With New High Score occasionally.
I like your game ranking idea: "if you're going to buy games, buy them in this order." Also, discerning whether game Y is better or worse than game X makes a lot more sense than saying game Y is 1% better than game X. This reminds me of ACE magazine (I think), which had scores out of 1000.
It poses the question: can the ranking system work if your site has multiple reviewers?
I gave your post 4/5, by the way. It didn't quite break 'the invisible wall' 😉
If this article is the sort of thing you'll like, then you'll like this. 73%.
Sounds a bit like the Amiga Action Power League (was that what it was called?) to me.
The sad thing is that magazine reviews seem to have gone backwards since the Amiga Power/Zzap/Crash days of yore. (Of course even Zzap, which I was a big fan of, failed its readers sometimes, most notably by not pointing out that Thalamus was run by Newsfield, and in the Operation Thunderbolt C64 review debacle.)
Nowadays the only highly critical stuff in print media for games can be found in Retro Gamer because, let's face it, people are much more frank once things are 10-20 years dead in the market.
Amiga Power was the last mag where I felt that the journos were on the consumer's side. (Arcade had its moments though, especially big write-ups of games they loved.)
"the Spectrum games of 1992 were incomparable to the ones of 1982"
Which were better? I can't think off-hand of any Speccy classics that were released in 1992, but likewise there was a hell of a lot of supremely crude "I MADE A GAEM" crap in the early years.
Excellent article, I was hoping for it to run a little longer and come up with a solution.
As far as I can see you've come up with two workable solutions that mags tend to use anyway. Take PC Gamer: near the back they always have a list of the best games for each genre, then two or more that are just beneath it. PC Gamer also has a regular top 100, which helps to order the games.
Is that a good enough solution, or should there be something more like GamesTM, where at the end of each review there's a better-than and worse-than comparison (or at least there used to be)? Or, indeed, should the whole review be about comparing it to other games within the genre, how it handles itself compared to them?
Is the widespread use of top 100s an antidote to badly applied review scores or does the reviewing itself have to change?
@tssk: Actually, I don't think Retro Gamer's entirely off the hook. It's pretty lenient when it comes to ratings, and then every now and again it'll absolutely slam something (such as PMCE for iPod); the homebrew section's also scored insanely highly throughout, and the writer admitted on the RG forum that he tends to think of 70% as 'average'.
@Tom: A solution is to award games the rating they deserve and to use the full range. If something is unmitigated shit, give it 1/10. If it's brilliant, give it 10. Sadly, most publishers get scared of the former – I wrote a couple of reviews for a publication that shall remain nameless, but the editor went mental when I tried to give a rubbish game a 2. His argument: he'd read my review, bought the game himself and it worked fine. Therefore, it was worth "at least a 4". No matter that it was utter bollocks.
RG is also the magazine that gave the "Ultimate" Sega collection 98% despite you not being able to play Sonic 3 + Knuckles lock on with it.
Sega of course released said item for XBLA very shortly afterwards.
I'm pretty sure RG uses a scoring system that starts at 84%.
On the 'try cautiously' mark, some US scoring systems have the baffling concept of a 'rent it' recommendation, with hilarious consequences when applied to downloadables.
Personally, I still mentally work on the Zzap/Crash model. Games are one of:
– Stay up way past bedtime.
– Looking forward all day to playing later.
– Something to try when there's nothing else on.
– Possibly get round to if I have a long, but not too debilitating, illness.
– Real Crap.
I find it hard to believe that someone managed to mention ("Michael Jackson" -Ed) in these comments in a vaguely positive way given the approach they took to marking games.
I don't believe Craig ever wrote for them but I've heard a very very similar story to his with regard to ("Michael Jackson", – ed) so consider that redressing the balance.
Is it not the case that rampant payola and the threatened withdrawal of ad revenue and exclusives make any scoring method in any mainstream print or web publication so skewed as to be fairly useless?
link to escapistmagazine.com
Reading the above review made me think of this article. Points awarded for:
– if you like this sort of thing, you'll like this.
– if you don't like this sort of thing, think about how you don't like this sort of thing when considering whether to buy it.
– you may want to rent it instead.
– making the assertion that the game is 'more of' its prequel fully five times across its 750-odd words.
– the use of the word 'gameplay' twice in the same sentence.
Anyone remember Simon Kirrane giving Micro Machines 2 100% for playability?
How could you justify giving anything 100%? Seriously..?
One reason most review scores cluster around 51%-80% is that in the main, reviewers are still scoring for competence as well as quality.
The equivalent would be a film review by Barry Norman in the Radio Times that says "The actors keep forgetting their lines and I swear I saw a boom mike in shot in one scene. One star."
Or more accurately, "the actors DON'T forget their lines and I DIDN'T see any boom mikes, therefore it gets at least 3/5 straight away".
Our scores start at 84%? Tosh.
The thing with Retro Gamer is that we have a very limited amount of space, so I mainly tend to cover the good stuff I enjoy playing. Most of our readers also know what we like so will typically change a score because they know what our personal preferences are, just like in the days of old.
I'll admit we sometimes get things wrong (I awarded Pac-Man Championship 4/5 recently for a bookazine) but hey, we're only human and sometimes make mistakes. We do take a harsher viewpoint now, but as we tend to review the best titles it's not really noticeable 😉
Are you familiar with monitor gamma values? (Bear with me here.) The brightness of a pixel on your monitor is the RGB value to the power of 2.5 (ish). It's done for dumb historical reasons, but one benefit is that we concentrate the 255 available divisions around the dark colours where the human eye is better at distinguishing tones.
Your anti-bell-curve system, where 1% of games get 1%, 1% get 2% and so on means the divisions are clustered around 50% — because "most things are sort of average with only a few examples at either extreme". So, the difference between a 50th percentile game and a 51st percentile game is tiny compared to the difference between 1st and 2nd percentile games. It's all very noble to separate out the clump, but it's of no use to the reader, because he can't afford to buy the best 1% of all games. He can afford, at a push, the best .1% of all games. It would be far more useful to score the top 0.5% from 1-10 and give everything else zero.
The theoretical answer to this is to set gamma less than one, so the average game gets maybe 25%, 0-1% is a huge change and 99-100% is a tiny change. But people won't intuitively understand that. People intuitively understand the bell-curve, and it does cluster marks around 100% (albeit at the cost of also clustering them around 0%).
Or, instead of a score, give each game a salary — "Game X is worth buying if you earn more than $53,000/year ($47,000 for fans of the genre)".
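For what it's worth, here's a minimal sketch of the gamma idea described above, assuming each game's percentile rank is already known and picking gamma = 0.5 purely for illustration (neither input is specified in the comment):

```python
# Gamma-style scoring: percentile rank (0.0 = worst ever, 1.0 = best ever) mapped to a
# percentage score via percentile = score ** gamma, i.e. score = percentile ** (1/gamma).
# With gamma < 1 the top of the scale gets most of the resolution and the average
# game lands around 25%, as suggested above.
GAMMA = 0.5  # illustrative only; gamma = 0.5 means score = percentile squared

def gamma_score(percentile, gamma=GAMMA):
    return 100 * percentile ** (1 / gamma)

for pct in (0.01, 0.10, 0.50, 0.90, 0.99, 1.00):
    print(f"percentile {pct:.2f} -> score {gamma_score(pct):5.1f}%")
# percentile 0.50 -> 25.0% (the average game)
# percentile 0.99 -> 98.0%, 1.00 -> 100.0%: the best 1% of games get two whole points
# to themselves, while the entire bottom half is squashed into the 0%-25% range
```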
“It’s all very noble to separate out the clump, but it’s of no use to the reader, because he can’t afford to buy the best 1% of all games. He can afford, at a push, the best .1% of all games.”
Then he buys the tenth of the best 1% that he’s interested in. If he’s into arcade games he’s not going to care about even a 99%-rated flight sim.
I have a comment on the criticism that 93% of that site's ratings are high – I feel that this is an inevitable trend in any review site or magazine, and one that isn't necessarily a bad thing. Ultimately, a review outlet is going to have limited resources. They can only review a limited number of the games that are out there. It's perfectly rational for them to focus their reviews on games that appeal to them and their audience, and those games are far more likely to score highly than games the reviewer has little interest in to begin with.
In a perfect world the outlet would review every iPhone game available, and then theoretically there would be as many 1/4 reviews as 4/4 reviews, but that's never going to happen. Instead they focus on games that they are interested in and are likely to enjoy, and thus reviews end up clumped together at the high end of the scale.
Overall though, this is a fantastic read, and you raise a lot of fantastic points.
I don't want the best 1% of all arcade games, I want the best arcade game, the best fighting game, the best racing game, and the top two FPS games.
In any case, my point is that it's dumb to say "we can immediately bin the 2/4 scores, because they offer us no help whatsoever in making a buying decision" and then say "a reviewer's SOLE AND ENTIRE REASON FOR EXISTING is to separate out that clump of things and tell people which ones are most worthy of their limited time and money" because separating the clump offers us no help whatsoever in making a buying decision either. The whole clump is "no". Nobody cares that Block Puzzle II (50%) is one point better than The Averageventures of Tim the Person (51%) or one point worse than Banal Rally February 2011 (49%) because all three games are far too dull to ever consider buying, whereas the difference between a 99% game and a 98.5% game might well affect a purchasing decision.