software-store-developers team mailing list archive

Re: Specification for "Top Rated" sections

Aaron Peachey wrote on 23/06/11 04:49:
>...
> My new pair of scripts (http://db.tt/szCPDog) now implements the
> ci_lower_bound in both ruby (verbatim as supplied) and python
> (re-write of ruby version) to print the output of 3 test cases fed
> into ci_lower_bound.

Nice work!

> (This uses 0.1 as the power, which should correspond to a 90%
> confidence interval - not sure why the Wilson score link says 0.1=95%
> but I'm happy to be corrected as it's a long time since I've done
> statistics!)

As I understand it, it's because we're working with a "two-tailed"
confidence interval: with a power of 0.1 we get 90% confidence that the
average, from an infinite number of ratings, would be within an interval
that goes both above *and* below the current average rating. (Or to put
it another way, as well as the "expected lowest rating", there is an
"expected highest rating"; we just don't use the latter in USC.) And
because we only ever use the lower end of that interval, we can be 95%
confident that the true average is above it -- which is where the link's
"0.1 = 95%" comes from.

That's why the algorithm includes "power/2", not just "power",
corresponding to "𝛼/2" in the original formula.

So I think
<http://www.evanmiller.org/how-not-to-sort-by-average-rating.html> is
correct, and for 90% confidence in the lower bound we want a power of
0.2, not 0.1. (Most statistical tests in scientific papers use 95%
confidence. <http://xkcd.com/882/> But for "Top Rated" our requirements
are not so stringent, and a lower confidence level moves us a bit closer
to the raw average rating.)
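
Here's a quick sketch of the z values involved, using Python 3.8's
statistics.NormalDist as a stand-in for pnormaldist (this is just an
illustration, not part of Aaron's scripts):

    from statistics import NormalDist

    # z as used in ci_lower_bound: z = pnormaldist(1 - power/2)
    for power in (0.05, 0.1, 0.2):
        z = NormalDist().inv_cdf(1 - power / 2)
        print("power=%.2f  z=%.3f  two-sided CI=%.0f%%  lower bound alone=%.1f%%"
              % (power, z, (1 - power) * 100, (1 - power / 2) * 100))

    # power=0.05  z=1.960  two-sided CI=95%  lower bound alone=97.5%
    # power=0.10  z=1.645  two-sided CI=90%  lower bound alone=95.0%
    # power=0.20  z=1.282  two-sided CI=80%  lower bound alone=90.0%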

> In both the python and ruby script, the output is the lower bound
> Wilson score for a case of:
> 2 ratings out of 2 (i.e. 2x5 from 2 ratings in spec test case 1),
> 1 rating out of 3 (1x2 from 3 ratings in spec test case 2), and
> 2 ratings out of 3 (2x5 from 3 ratings in spec test case 2)
> 
> Then in python, I've written a calc_erl function which takes a list of
> 5 integers corresponding to the number of ratings at each level of
> [1,2,3,4,5] to produce the ERL as per the spec.
>
> So the python script also prints out the resulting ERL for the two
> test cases in the spec.

Thank you for that. The code demonstrated that I made a silly mistake.

With the formula I suggested (sketched in code after the next two
tables), as an item gets more ratings that are all 5 stars, the ELR
approaches 5, as we'd expect.

     ★     ★★    ★★★   ★★★★  ★★★★★  Raw average  ELR
     0      0      0      0      1         5.00  1.89
     0      0      0      0      2         5.00  2.75
     0      0      0      0      4         5.00  3.54
     0      0      0      0      8         5.00  4.15
     0      0      0      0     16         5.00  4.53

But look what happens if we do the same with 1-star ratings:

     ★     ★★    ★★★   ★★★★  ★★★★★  Raw average  ELR
     1      0      0      0      0         1.00  0.38
     2      0      0      0      0         1.00  0.55
     4      0      0      0      0         1.00  0.71
     8      0      0      0      0         1.00  0.83
    16      0      0      0      0         1.00  0.91
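
For anyone following along, the ELR here is each star value times the
Wilson lower bound for that level's count out of all ratings, summed
over the five levels. A minimal sketch (using Python 3.8's
statistics.NormalDist rather than the pnormaldist port) that reproduces
both columns above:

    from math import sqrt
    from statistics import NormalDist

    def ci_lower_bound(pos, n, power=0.2):
        # Wilson score lower bound for pos positive ratings out of n.
        if n == 0:
            return 0.0
        z = NormalDist().inv_cdf(1 - power / 2)
        phat = pos / n
        return ((phat + z * z / (2 * n)
                 - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

    def elr(ratings):
        # Star value x Wilson lower bound, summed over the five levels.
        n = sum(ratings)
        return sum((i + 1) * ci_lower_bound(count, n)
                   for i, count in enumerate(ratings))

    for k in (1, 2, 4, 8, 16):
        print("%2d x 5-star: ELR %.2f    %2d x 1-star: ELR %.2f"
              % (k, elr([0, 0, 0, 0, k]), k, elr([k, 0, 0, 0, 0])))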

Two simple ways to tell that the formula is wrong:

*   It gives us ELR < 1, when that average would never be possible in
    reality.

*   Something that has 16 ratings, all 1-star, is more definitely bad
    than something that has 1 rating of 1 star, so its ELR should be
    lower -- but it's higher.

Another way of thinking about it: Imagine two applications, one with 1
rating of 1 star, and another with no ratings at all. If you had to bet
which was worse, you'd bet the former. But the current formula tells us
that the latter is probably worse!
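
Continuing the sketch above (same elr as before), the no-ratings case
really does come out below a single 1-star rating, because
ci_lower_bound returns 0 when there are no ratings at all:

    print(elr([0, 0, 0, 0, 0]))   # 0.0
    print(elr([1, 0, 0, 0, 0]))   # ~0.38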

So it seems to me that one test case for our final formula should be
that if something has no ratings at all, *or* if it has only 3-star
ratings, its ELR should be 3.00. This means it's not really an "Expected
Lowest Rating" (ELR), but more like a "Dampened Rating" (DR), modelling
the pattern that the average rating usually dampens towards 3.00 as more
ratings come in.
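
Sketched as test code (assuming the attached script is saved as, say,
dampened_rating.py; the module name is just for illustration):

    from dampened_rating import calc_dr   # hypothetical module name

    # No ratings at all, and only 3-star ratings, should both give 3.00
    assert abs(calc_dr([0, 0, 0, 0, 0]) - 3.0) < 1e-9
    assert abs(calc_dr([0, 0, 7, 0, 0]) - 3.0) < 1e-9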

So I tweaked your Python code to subtract 3 from each star rating before
multiplying it by the corresponding Wilson score, and then to add 3 to
the final result. It produced this:

     ★     ★★    ★★★   ★★★★  ★★★★★      DR
     0      0      0      0      0      3.00
    16      0      0      0      0      1.19
     8      0      0      0      0      1.34
     4      0      0      0      0      1.58
     2      0      0      0      0      1.90
     1      0      0      0      0      2.24
     0      1      0      0      0      2.62
     0      0      1      0      0      3.00
     0      0      0      1      0      3.38
     0      0      0      0      1      3.76
     0      0      0      0      2      4.10
     0      0      0      0      4      4.42
     0      0      0      0      8      4.66
     0      0      0      0     16      4.81
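
And a quick worked check of one row above (a single 5-star rating),
using statistics.NormalDist again; when pos == n the Wilson lower bound
simplifies to 1 / (1 + z*z/n):

    from statistics import NormalDist

    z = NormalDist().inv_cdf(1 - 0.2 / 2)   # ~1.2816 for power 0.2
    w5 = 1 / (1 + z * z / 1)                # ~0.3785, one rating in total
    dr = 3 + (5 - 3) * w5                   # subtract 3 per star, add 3 back
    print('%.2f' % dr)                      # 3.76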

That's about what I expected. What do you think? Revised code attached.

Thanks
--
mpt

import math

def pnormaldist(qn):
    '''Inverse of the standard normal CDF, a polynomial approximation
    ported from Ruby's statistics2.pnormaldist.'''
    b = [1.570796288, 0.03706987906, -0.8364353589e-3,
         -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
         -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
         0.3657763036e-10, 0.6936233982e-12]

    if qn < 0 or qn > 1:
        raise ValueError("qn must be between 0.0 and 1.0")
    if qn == 0.5:
        return 0.0

    w1 = qn
    if qn > 0.5:
        w1 = 1.0 - w1
    w3 = -math.log(4.0 * w1 * (1.0 - w1))
    w1 = b[0]
    for i in range(1, 11):
        w1 = w1 + (b[i] * math.pow(w3, i))

    if qn > 0.5:
        return math.sqrt(w1 * w3)
    else:
        return -math.sqrt(w1 * w3)

def ci_lower_bound(pos, n, power):
    '''Lower bound of the Wilson score interval for pos positive ratings
    out of n, with z taken at the (1 - power/2) quantile.'''
    if n == 0:
        return 0
    z = pnormaldist(1 - power / 2)
    phat = 1.0 * pos / n
    return ((phat + z * z / (2 * n)
             - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
            / (1 + z * z / n))

def calc_dr(ratings):
    '''Dampened Rating: 3 plus the sum over the five star levels of
    (star - 3) times the Wilson lower bound for that level's share.'''
    if len(ratings) != 5:
        raise ValueError('ratings argument must be a list of 5 integers')

    tot_ratings = sum(ratings)

    sum_scores = 0.0
    for i in range(0, 5):
        # Wilson lower bound for "ratings[i] out of all ratings", 90%
        # confident in the lower bound (power 0.2).
        wilson_score = ci_lower_bound(ratings[i], tot_ratings, 0.2)
        sum_scores = sum_scores + float((i + 1) - 3) * wilson_score

    return sum_scores + 3

if __name__ == '__main__':
    print(' 0 0 0 0 0  %s' % calc_dr([0, 0, 0, 0, 0]))
    print('16 0 0 0 0  %s' % calc_dr([16, 0, 0, 0, 0]))
    print(' 8 0 0 0 0  %s' % calc_dr([8, 0, 0, 0, 0]))
    print(' 4 0 0 0 0  %s' % calc_dr([4, 0, 0, 0, 0]))
    print(' 2 0 0 0 0  %s' % calc_dr([2, 0, 0, 0, 0]))
    print(' 1 0 0 0 0  %s' % calc_dr([1, 0, 0, 0, 0]))
    print(' 0 1 0 0 0  %s' % calc_dr([0, 1, 0, 0, 0]))
    print(' 0 0 1 0 0  %s' % calc_dr([0, 0, 1, 0, 0]))
    print(' 0 0 0 1 0  %s' % calc_dr([0, 0, 0, 1, 0]))
    print(' 0 0 0 0 1  %s' % calc_dr([0, 0, 0, 0, 1]))
    print(' 0 0 0 0 2  %s' % calc_dr([0, 0, 0, 0, 2]))
    print(' 0 0 0 0 4  %s' % calc_dr([0, 0, 0, 0, 4]))
    print(' 0 0 0 0 8  %s' % calc_dr([0, 0, 0, 0, 8]))
    print(' 0 0 0 0 16 %s' % calc_dr([0, 0, 0, 0, 16]))
