Wings Over Scotland


On review scoring

Posted on August 01, 2010 by

It's one of the most-observed truths of videogame reviewing that the entire concept of scoring is, as practised almost universally in all forms of current print, broadcast and online media, fundamentally broken.

Everyone knows that the marks awarded in game reviews – whether out of five stars, ten points or 100% – are not in fact sequential numbers as we were taught them in arithmetic lessons, but abstract ciphers whose true value is heavily encoded. In videogame reviewing, 4 isn't any bigger than 2, 6=7, and 10 is more than twice as many as 9.

And therefore – since the sole and entire point of scoring is to attach an instantly comprehensible numerical summary of the reviewer's opinion to the text – videogame review scores are functionally almost meaningless.

There are certain phenomena that we know to be almost universally true:

—————————————————————————-

1. In scoring systems marking out of 10, the vast majority of scores clump around 6, 7 and 8. (For the sake of clarity, this feature will generally use marks out of 10 as the default reference.) Similarly, with percentage-marking systems, most games score between 51% and 80%. (Which – rather than 60% to 80% – is the true equivalent range, since a 6/10 covers everything from 51% to 60%.)

2. In scoring systems marking out of five (or less), maximum scores (ie 5/5) are commonplace. Yet in systems marking out of 10 or 100, they're almost unheard of. This isn't very rational. Marking out of 10 only doubles the fineness of division, so there should be roughly half as many games scoring 10/10 in a 10-point system as there are scoring 5/5 in a five-point system (with the other half getting 9/10).

Yet something like Edge, which has a declared policy of cherry-picking only the best games each month, still only awards a 10 roughly once per 250 games reviewed. There seems to be a disproportionately-huge invisible glass barrier between 9 and 10 in a way that there isn't between 4 and 5, and another between 99% and 100% that's so vast as to be almost infinite.

(Edge has reviewed somewhere in the vicinity of 3000 games in its 17-year life. If we assume that their cherry-picking policy selects only, say, the best 40% of games to start with, that would suggest that approximately 750 games should have had a 10. The actual number is 12.)

3. Despite the above, there are always far more games clustered around the 9 (or equivalent) mark than there are in the entire range from 5 down to 1. It seems to be far, far easier to break into the 9/10 club than it is to score 5 or less in a 10-point system, yet astonishingly hard to take that one extra step.

4. A similar marginal level of distinction doesn't apply in the bottom half of the scoring range. To most modern reviewers (and readers), the scores 5/10 and 1/10 are basically interchangeable. In other words, any system marking out of 10 or more is in fact really marking out of three:

"Definitely buy this" (encompassing scores of 9 and 10. Strangely, in a percentage system the cut-off here is 90% and above, even though the actual equivalent should be 81% and above).

"Definitely don't buy this" (encompassing anything from 1 to 5).

"Um, it's sort of in the middle. Buy this if this is the sort of thing you usually like buying" (encompassing scores from 6 to 8, and basically not a review at all).

(What this means is that systems marking out of more than five are more often than not actually LESS precise than those marking out of five, in which the full range is more widely employed.)
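
The Edge figures in point 2 are easy to sanity-check. The following Python sketch just redoes the sums under the stated assumptions: roughly 3000 games reviewed, a cherry-picking policy that (hypothetically) admits only the best 40% of games, and an even spread of marks, which is this article's premise and not anything Edge has declared. The variable names are purely illustrative.

```python
# All figures are the rough ones quoted in the text; none is an official Edge statistic.
GAMES_REVIEWED = 3000      # approximate lifetime total
TOP_SHARE_PERCENT = 40     # assumed cherry-picking cut-off ("say, the best 40%")
ACTUAL_TENS = 12           # number of 10/10 scores actually awarded
SCALE = 10                 # marking out of 10

# With an even spread, each mark out of 10 covers 10% of all games,
# so the best 40% would fill the top four marks: 7, 8, 9 and 10.
marks_in_play = TOP_SHARE_PERCENT * SCALE // 100

# If the reviewed games were shared equally between those marks:
expected_tens = GAMES_REVIEWED // marks_in_play

# How often a 10 has actually been handed out:
games_per_actual_ten = GAMES_REVIEWED // ACTUAL_TENS

print(f"Expected 10s: {expected_tens}")
print(f"Actual 10s: {ACTUAL_TENS}, or one per {games_per_actual_ten} games")
```

On those assumptions the gap is 750 expected tens against 12 awarded.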

—————————————————————————-

Odd, isn't it? There are two core reasons for these phenomena, and we'll very quickly deal with the less-interesting one first: corruption.

Corruption takes two forms: hard and soft. The "hard" variety is much rarer, and involves direct bribery of some sort – either in the form of a reviewer or editor being offered money or other incentives by the game's publisher, or the magazine being offered advertising placement (or removal), or other editorial benefits (eg exclusive access) conditional on the game's score. All of these things go on depressingly regularly, but they're much less common than the second form of corruption.

"Soft" corruption is when reviewers self-censor, not as a result of any direct instruction or threat but simply as a means to make their own lives easier by not causing what they regard as needless bad feeling.

That is, if you have to review Game X and it's absolutely terrible, but you know that Game X's publisher is about to announce Game Y, which you want privileged access to, you might find yourself giving Game X a score of 5/10 – safe in the knowledge that that will be enough to put your readers off buying it, while much less likely to make the publisher apoplectic with rage than the 1 or 2 score that it really deserves.

 

But neither of those is what I want to talk about. The second reason that review scores are so hopelessly debased is that reviewers have absolutely no grasp of the fundamental purpose of marking.

The purpose of having a marking scale is to compare and sort games against each other, so that you can give your readers advice on which to buy. It's the only criterion that makes any sense, because even if we leave aside the vagaries of personal taste there is no such thing as an absolute empirical measure of game quality – it's a constantly changing baseline.

(If you judged every game against the standard of the entire history of videogames, for example, you'd have to give everything 10/10, because even the crappiest 400-point XBLA game is immeasurably superior to, say, Video Pinball. And the baseline changes even if you stay within a single format – the Spectrum games of 1992 were incomparable to the ones of 1982.)

Now, very stupid people sometimes point out that it's appropriate to cluster things uselessly around the middle, because of the bell curve. But this is a disastrously wrong-headed notion, because the bell curve method of grading is an artificial process specifically designed to clump its constituent elements around the middle, which is exactly what review scoring ISN'T supposed to do.

(Because they're very stupid, incidentally, what these people are almost certainly doing is confusing bell-curve grading with the other form of the bell curve, the normal-distribution graph. This is equally inapplicable because it refers to naturally-occurring phenomena such as human height, whereas there is no such thing as a naturally-mediocre videogame. You can't change your height, but you can employ more QA testers or make your fricking cutscenes skippable.)

The bell curve is completely the wrong model for reviewing – any idiot can say "most things are sort of average with only a few examples at either extreme", because everyone already knows that. But a reviewer's SOLE AND ENTIRE REASON FOR EXISTING is to separate out that clump of things and tell people which ones are most worthy of their limited time and money.

If people invest precious moments of their lives reading your reviews, the absolute minimum they should be able to expect is that they should come out at the other end knowing more than they did when they started. If all they've gleaned from your 1500-word review is "most things are in the middle with only a few examples at either extreme", then you've wasted their time, because they ALREADY KNEW THAT.

So a review scale should be spread as evenly as possible in order to achieve the greatest possible amount of distinction between games. That is, for a scoring system to have ANY worthwhile meaning at all, there should be roughly the same number of games in each division on the scale at any given point in time. If instead you've just clumped everything in the middle and left readers to judge for themselves which of 900 different 8-rated games they should buy, you've failed.
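
One way to picture an evenly spread scale is to take the games in order of merit and cut the list into equal bands, one per mark. The Python below is a minimal sketch of that idea, not a description of how any magazine works: the function name `even_spread_scores` and the sample game titles are made up for illustration, and it assumes the reviewer has already put the games in order from worst to best.

```python
from collections import Counter


def even_spread_scores(games_worst_to_best, scale=10):
    """Give each game a mark from 1 to `scale` so every mark is used about equally often."""
    total = len(games_worst_to_best)
    scores = {}
    for position, game in enumerate(games_worst_to_best):
        # position * scale // total maps the ordered list onto equal-sized
        # bands numbered 0 to scale - 1; adding 1 turns the band into a mark.
        scores[game] = position * scale // total + 1
    return scores


# Twenty placeholder games, already ordered from worst to best.
games = [f"Game {n}" for n in range(1, 21)]
scores = even_spread_scores(games)

# Count how many games received each mark.
distribution = Counter(scores.values())
print(sorted(distribution.items()))
```

With 20 games on a 10-point scale, every mark from 1 to 10 is awarded exactly twice. When the number of games doesn't divide evenly, some bands end up one game larger than others, hence "roughly the same number".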

(As a theoretical ideal, reviewing would take the form of a single chart comprising the total number of games reviewed in the publication's entire lifespan, with each new review being assigned a unique position on the chart. If 576 games had been reviewed and a new game was deemed the best ever, its "score" would be 1/577. Sadly this approach is slightly impractical, at least in print magazines, due to the arguable need to categorise games by genre. You'd end up with a hefty slab of pages every month occupied by numerous charts of what was basically reprint.)
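
The single-chart idea can be sketched in a few lines of Python. This is an illustration only: `place_in_chart` and the `is_better` callback, which stands in for the reviewer's judgement, are hypothetical names, and the 576 placeholder titles simply reproduce the example in the text.

```python
def place_in_chart(chart, new_game, is_better):
    """Insert new_game into chart (ordered best first) and return 'position/total'.

    is_better(a, b) is a stand-in for the reviewer's verdict: it should
    return True when game a deserves to rank above game b.
    """
    position = 0
    # Walk down from the top until we find a game the newcomer beats.
    while position < len(chart) and not is_better(new_game, chart[position]):
        position += 1
    chart.insert(position, new_game)
    return f"{position + 1}/{len(chart)}"


# 576 games already reviewed, listed best first (placeholder titles).
chart = [f"Game {n}" for n in range(1, 577)]

# A new game judged better than everything before it goes straight to the top.
best_ever = place_in_chart(chart, "New Game", lambda new, old: True)
print(best_ever)  # 1/577
```

Every later review shifts the total, so a game's "score" is only ever its position at the time of reading.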

The thing that sparked these thoughts off today was some Sunday morning idle time, which I whiled away by examining the reviews of an iPod gaming site. The site marked out of four, which is just about the least useful grading system you could possibly imagine, but the implementation made it even worse. Out of 1063 games "reviewed", the marks broke down like this:

1 ("Avoid"): 55
2 ("Caution"): 273
3 ("Good"): 471
4 ("Must Have"): 264

Now, to establish exactly how useless this system is, first we have to look at the games that get 2/4 ("Games in this category should not be avoided in all cases. Some players will find value in them", which is an American translation of the classic "If you like this sort of thing, this is the sort of thing you'll like").

How exactly is someone supposed to exercise "caution" over buying an iPod game? If the game has a Lite version, we don't need some muppet stating the bleeding obvious by telling us that we should probably check that out first if we're not sure. And if it doesn't have a Lite, then there's no way of being cautious. We can either buy the game or not buy it – there's no "buy it cautiously" option available on the App Store.

(What that score is in fact telling us, more or less explicitly, is "Go and check out some other people's reviews, because ours just wasted your time.")

So we can immediately bin the 2/4 scores, because they offer us no help whatsoever in making a buying decision. That leaves us with 790 reviews, of which 735 – or 93% – come with a buying recommendation. ("This rating is our seal of approval", says the site of its 3/4 "Good" mark.)

Well, thanks. You've really filtered the wheat from the chaff for us there, guys. Who knew that only 13 out of every 14 iPod games were worth buying? Where would we have been without you?
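
For anyone who wants to check the sums, the figures above work out as follows. This short Python sketch uses only the four tallies quoted from the site; the variable names are illustrative.

```python
# Tallies quoted above for the 1063 reviews on the site.
marks = {"Avoid": 55, "Caution": 273, "Good": 471, "Must Have": 264}

total = sum(marks.values())                        # all reviews
decisive = total - marks["Caution"]                # reviews left after binning the 2/4s
recommended = marks["Good"] + marks["Must Have"]   # 3/4 and 4/4 both count as "buy"
not_recommended = decisive - recommended           # the 1/4 "Avoid" reviews

share = recommended / decisive
print(f"{recommended} of {decisive} = {share:.0%}")

# Roughly one review in every `one_in` advises against buying.
one_in = round(decisive / not_recommended)
print(f"about {one_in - 1} out of every {one_in} are recommended")
```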

As a concept, giving reviews scores is brilliant. It's informative and a great time-saver in today's busy world, and it makes things like Metacritic (another terrific idea in theory) possible. Writers bemoan them at every turn, because they seem to make all the writer's finely-crafted words an afterthought to a number, but the scores aren't the problem. It's the idiots awarding them.

Responses to “On review scoring”

  1. Derek says:

    I still think Pinball Scoring is the only way to go. With New High Score occasionally.

  2. Marc says:

    I like your game ranking idea: "if you're going to buy games, buy them in this order." Also, discerning whether game Y is better or worse than game X makes a lot more sense than game Y is 1% better than game X. This reminds me of ACE magazine (I think) that had scores out of 1000.
     
    It poses the question: can the ranking system work if your site has multiple reviewers?
     
    I gave your post 4/5, by the way. It didn't quite break 'the invisible wall' 😉

  3. Marco Gazpacho says:

    If this article is the sort of thing you'll like, then you'll like this. 73%.

  4. CheapSheep says:

    Sounds a bit like the Amiga Action Power League (was that what it was called?) to me.

  5. tssk says:

    The sad thing is that magazine reviews seem to have gone backwards since the Amiga Power/Zzap/Crash days of yore. (Of course even Zzap, which I was a big fan of, failed its readers sometimes, most notably by not pointing out that Thalamus was run by Newsfield, and in the Operation Thunderbolt C64 review debacle.)
    Nowadays the only highly critical stuff in print media for games can be found in Retro Gamer because, let's face it, people are much more frank once things are 10-20 years dead in the market.
    Amiga Power was the last mag where I felt that the journos were on the consumer's side. (Arcade had its moments though, especially big write-ups of games they loved.)

  6. VLII says:

    "the Spectrum games of 1992 were incomparable to the ones of 1982"
    Which were better? I can't think off-hand of any Speccy classics that were released in 1992, but likewise there was a hell of a lot of supremely crude "I MADE A GAEM" crap in the early years.

  7. Tom Camfield says:

    Excellent article, I was hoping for it to run a little longer and come up with a solution.
    As far as I can see you've come up with two workable solutions that mags tend to use anyway. Take PC Gamer, near the back they always have a list of the best games for each genre then two or more that are just beneath it. PC Gamer also has a regular top 100 which helps to order the games.
    Is that a good enough solution, or should there be something more like GamesTM, where at the end of each review there's a "better than" and "worse than" comparison (or at least there used to be)? Or, indeed, should the whole review be about comparing it to other games within the genre, and how it handles itself compared to them?
    Is the widespread use of top 100s an antidote to badly applied review scores or does the reviewing itself have to change?

  8. @tssk: Actually, I don't think Retro Gamer's entirely off the hook. It's pretty lenient when it comes to ratings, and then every now and again it'll absolutely slam something (such as PMCE for iPod); the homebrew section's also scored insanely highly throughout, and the writer admitted on the RG forum he tends to think of 70% as an 'average'.
    @Tom: A solution is to award games the rating they deserve and to use the full range. If something is unmitigated shit, give it 1/10. If it's brilliant, give it 10. Sadly, most publishers get scared of the former – I wrote a couple of reviews for a publication that shall remain nameless, but the editor went mental when I tried to give a rubbish game a 2. His argument: he'd read my review, bought the game himself and it worked fine. Therefore, it was worth "at least a 4". No matter that it was utter bollocks.

  9. DG says:

    RG is also the magazine that gave the "Ultimate" Sega collection 98% despite you not being able to play Sonic 3 + Knuckles lock on with it.
    Sega of course released said item for XBLA very shortly afterwards.

  10. Rev. Stuart Campbell says:

    I'm pretty sure RG uses a scoring system that starts at 84%.

  11. CdrJameson says:

    On the 'try cautiously' mark, some US scoring systems have the baffling concept of a 'rent it' recommendation, with hilarious consequences when applied to downloadables.
    Personally, I still mentally work on the Zzap/Crash model. Games are one of:
    – Stay up way past bedtime.
    – Looking forward all day to playing later.
    – Something to try when there's nothing else on
    – Possibly get round to if I have a long, but not too debilitating illness
    – Real Crap

  12. The Owl says:

    I find it hard to believe that someone managed to mention ("Michael Jackson" -Ed) in these comments in a vaguely positive way given the approach they took to marking games.

  13. DG says:

    I don't believe Craig ever wrote for them but I've heard a very very similar story to his with regard to ("Michael Jackson", – ed)  so consider that redressing the balance.

  14. Irish Al says:

    Is it not the case that rampant payola and the threatened withdrawal of ad revenue and exclusives make any scoring method in any mainstream print or web publication so skewed as to be fairly useless?

  15. asdasdasd says:

    link to escapistmagazine.com

    Reading the above review made me think of this article. Points awarded for:
    – if you like this sort of thing, you'll like this.
    – if you don't like this sort of thing, think about how you don't like this sort of thing when considering whether to buy it. You may want to rent it instead.
    – making the assertion that the game is 'more of' its prequel fully five times across its 750-odd words.
    – the use of the word 'gameplay' twice in the same sentence.

  16. bedroomcoder says:

    Anyone remember Simon Kirrane giving Micro Machines 2 100% for playability?
    How could you justify giving anything 100%? Seriously..?

  17. Mr Lizard says:

    One reason most review scores cluster around 51%-80% is that in the main, reviewers are still scoring for competence as well as quality.
    The equivalent would be a film review by Barry Norman in the Radio Times that says "The actors keep forgetting their lines and I swear I saw a boom mike in shot in one scene. One star."

  18. Rev. Stuart Campbell says:

    Or more accurately, "the actors DON'T forget their lines and I DIDN'T see any boom mikes, therefore it gets at least 3/5 straight away".

  19. Darran says:

    Our scores start at 84%? Tosh.
    The thing with Retro Gamer is that we have a very limited amount of space, so I mainly tend to cover the good stuff I enjoy playing. Most of our readers also know what we like so will typically change a score because they know what our personal preferences are, just like in the days of old.
    I'll admit we sometimes get things wrong (I awarded Pac-Man Championship 4/5 recently for a bookazine) but hey, we're only human and sometimes make mistakes. We do take a harsher viewpoint now, but as we tend to review the best titles it's not really noticeable 😉

  20. Andrew says:

    Are you familiar with monitor gamma values? (Bear with me here.) The brightness of a pixel on your monitor is the RGB value to the power of 2.5 (ish). It's done for dumb historical reasons, but one benefit is that we concentrate the 255 available divisions around the dark colours where the human eye is better at distinguishing tones.
    Your anti-bell-curve system, where 1% of games get 1%, 1% get 2% and so on means the divisions are clustered around 50% — because "most things are sort of average with only a few examples at either extreme". So, the difference between a 50th percentile game and a 51st percentile game is tiny compared to the difference between 1st and 2nd percentile games. It's all very noble to separate out the clump, but it's of no use to the reader, because he can't afford to buy the best 1% of all games. He can afford, at a push, the best .1% of all games. It would be far more useful to score the top 0.5% from 1-10 and give everything else zero.
    The theoretical answer to this is to set gamma less than one, so the average game gets maybe 25%, 0-1% is a huge change and 99-100% is a tiny change. But people won't intuitively understand that. People intuitively understand the bell-curve, and it does cluster marks around 100% (albeit at the cost of also clustering them around 0%).
    Or, instead of a score, give each game a salary — "Game X is worth buying if you earn more than $53,000/year ($47,000 for fans of the genre)".

    • Rev. Stuart Campbell says:

      “It’s all very noble to separate out the clump, but it’s of no use to the reader, because he can’t afford to buy the best 1% of all games. He can afford, at a push, the best .1% of all games.”

      Then he buys the tenth of the best 1% that he’s interested in. If he’s into arcade games he’s not going to care about even a 99%-rated flight sim.

  21. MattyFTM says:

    I have a comment on the criticism that 93% of that site’s ratings are high – I feel that this is an inevitable trend in any review site or magazine, and one that isn’t necessarily a bad thing. Ultimately, a review outlet is going to have limited resources. They can only review a limited number of the games that are out there. It’s perfectly rational for them to focus their reviews on games that appeal to them and their audience, and those games are far more likely to score highly than games in which the reviewer has little interest to begin with.

    In a perfect world the outlet would review every iPhone game available, and then theoretically there would be as many 1/4 reviews as there are 4/4 reviews, but that’s never going to happen. Instead they focus on games that they are interested in and are likely to enjoy, and thus reviews end up clumped together at the high end of the scale.

    Overall though, this is a fantastic read, and you raise a lot of fantastic points.

  22. Andrew says:

    I don't want the best 1% of all arcade games, I want the best arcade game, the best fighting game, the best racing game, and the top two FPS games.
     
    In any case, my point is that it's dumb to say "we can immediately bin the 2/4 scores, because they offer us no help whatsoever in making a buying decision" and then say "a reviewer's SOLE AND ENTIRE REASON FOR EXISTING is to separate out that clump of things and tell people which ones are most worthy of their limited time and money" because separating the clump offers us no help whatsoever in making a buying decision either. The whole clump is "no". Nobody cares that Block Puzzle II (50%) is one point worse than The Averageventures of Tim the Person (51%) or one point better than Banal Rally February 2011 (49%) because all three games are far too dull to ever consider buying, whereas the difference between a 99% game and a 98.5% game might well affect a purchasing decision.


