desktop-packages team mailing list archive
Message #65175
[Bug 894468] Re: Statistics algorithm for sorting ratings looks fishy
** Tags added: client-server
--
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to software-center in Ubuntu.
https://bugs.launchpad.net/bugs/894468
Title:
Statistics algorithm for sorting ratings looks fishy
Status in “software-center” package in Ubuntu:
New
Bug description:
Here's the current code snippet for sorting the Software Center
Ratings:
  import math

  def wilson_score(pos, n, power=0.2):
      # Lower bound of the Wilson score confidence interval for pos successes
      # out of n trials; pnormaldist (inverse normal CDF) is defined elsewhere
      if n == 0:
          return 0
      z = pnormaldist(1 - power/2)
      phat = 1.0 * pos / n
      return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

  def calc_dr(ratings, power=0.1):
      '''Calculate the dampened rating for an app given its collective ratings'''
      if not len(ratings) == 5:
          raise AttributeError('ratings argument must be a list of 5 integers')
      tot_ratings = 0
      for i in range(0, 5):
          tot_ratings = ratings[i] + tot_ratings
      sum_scores = 0.0
      for i in range(0, 5):
          ws = wilson_score(ratings[i], tot_ratings, power)
          sum_scores = sum_scores + float((i+1) - 3) * ws
      return sum_scores + 3
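The snippet calls pnormaldist without defining it; in software-center it comes from elsewhere. Presumably it is the inverse CDF (quantile function) of the standard normal distribution, which can be sketched with Python's statistics module (Python 3.8+):

```python
from statistics import NormalDist

def pnormaldist(p):
    # Assumed meaning: inverse CDF (quantile function) of the standard normal
    return NormalDist().inv_cdf(p)

# Sanity check: the 97.5th percentile of the standard normal is about 1.96,
# the familiar z value for a 95% confidence interval
print(round(pnormaldist(0.975), 2))
```

Note that with the default power=0.2 the snippet uses z = pnormaldist(0.9), roughly 1.28, i.e. only an 80% confidence interval.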
This looks very fishy to me: we are calculating five separate Wilson
scores per app (one per star level) and summing them, which is slow
and probably wrong.
I'm not 100% sure what the right method is, but I did find this
question on Math Overflow:
http://mathoverflow.net/questions/20727/generalizing-the-wilson-score-confidence-interval-to-other-distributions
The current answer there suggests using a standard normal distribution
for large samples and a t-distribution for small ones (we do neither).
This website suggests a slightly different Wilson algorithm:
http://www.goproblems.com/test/wilson/wilson.php?v1=0&v2=0&v3=3&v4=2&v5=4
I will go further and assert that we are making a conceptual error in trying to estimate a mean rating in the first place: ratings are fundamentally ordinal data, so a mean makes little sense, for the same reason that "excellent" plus "terrible" does not average out to "mediocre". Medians and percentiles, however, are entirely valid measures for ordinal data.
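A toy illustration of that point (my own example, not from the bug report):

```python
import statistics

# A polarizing app and a mediocre app with nearly identical means:
polarizing = [5, 1, 5, 1, 5]   # mean 3.4
mediocre   = [3, 3, 4, 3, 3]   # mean 3.2

# The mean barely separates them, but the median captures the difference
print(statistics.median(polarizing))  # 5
print(statistics.median(mediocre))    # 3
```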
I will research this question a bit more, and probably post a
question on the beta stats stackexchange site for advice.
Intuitively, though, I think we may want a ratings algorithm that
sorts primarily by median, and then, for the many cases where two apps
share the same median (since ratings only take five values), breaks
the tie by computing a Wilson score lower bound on the probability
that a rater of the app would rate at or above the median rather than
below it.
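That idea could be sketched roughly as follows. This is hypothetical code, not anything in software-center; median_rating, wilson_lower_bound, and sort_key are names I've made up for illustration, and the rating histograms are invented:

```python
import math

def wilson_lower_bound(pos, n, z=1.96):
    # Lower bound of the Wilson score interval for the proportion pos/n
    if n == 0:
        return 0.0
    phat = pos / n
    return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

def median_rating(hist):
    # hist[i] = number of (i+1)-star ratings; return the median star value
    total = sum(hist)
    if total == 0:
        return 0
    cum = 0
    for i, count in enumerate(hist):
        cum += count
        if cum * 2 >= total:
            return i + 1
    return 0

def sort_key(hist):
    # Sort primarily by median; break ties with the Wilson lower bound of
    # the probability that a rating lands at or above the median
    med = median_rating(hist)
    at_or_above = sum(hist[med - 1:])
    return (med, wilson_lower_bound(at_or_above, sum(hist)))

apps = {'A': [0, 1, 2, 10, 25], 'B': [5, 5, 5, 5, 5], 'C': [25, 10, 2, 1, 0]}
print(sorted(apps, key=lambda a: sort_key(apps[a]), reverse=True))  # ['A', 'B', 'C']
```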
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/software-center/+bug/894468/+subscriptions