desktop-packages team mailing list archive
Message #65175
[Bug 894468] Re: Statistics algorithm for sorting ratings looks fishy
** Tags added: client-server
--
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to software-center in Ubuntu.
https://bugs.launchpad.net/bugs/894468
Title:
Statistics algorithm for sorting ratings looks fishy
Status in “software-center” package in Ubuntu:
New
Bug description:
Here's the current code snippet for sorting the Software Center
Ratings:
  import math

  def wilson_score(pos, n, power=0.2):
      # Lower bound of the Wilson score confidence interval for pos successes
      # out of n trials; pnormaldist (inverse normal CDF) is defined elsewhere
      if n == 0:
          return 0
      z = pnormaldist(1 - power/2)
      phat = 1.0 * pos / n
      return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

  def calc_dr(ratings, power=0.1):
      '''Calculate the dampened rating for an app given its collective ratings'''
      if not len(ratings) == 5:
          raise AttributeError('ratings argument must be a list of 5 integers')
      tot_ratings = 0
      for i in range(0, 5):
          tot_ratings = ratings[i] + tot_ratings
      sum_scores = 0.0
      for i in range(0, 5):
          ws = wilson_score(ratings[i], tot_ratings, power)
          sum_scores = sum_scores + float((i+1) - 3) * ws
      return sum_scores + 3
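The snippet calls pnormaldist without defining it; in software-center it comes from elsewhere. Presumably it is the inverse CDF (quantile function) of the standard normal distribution, which can be sketched with Python's statistics module (Python 3.8+):

```python
from statistics import NormalDist

def pnormaldist(p):
    # Assumed meaning: inverse CDF (quantile function) of the standard normal
    return NormalDist().inv_cdf(p)

# Sanity check: the 97.5th percentile of the standard normal is about 1.96,
# the familiar z value for a 95% confidence interval
print(round(pnormaldist(0.975), 2))
```

Note that with the default power=0.2 the snippet uses z = pnormaldist(0.9), roughly 1.28, i.e. only an 80% confidence interval.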
This looks very fishy to me: we are calculating five separate Wilson
scores per app (one per star level) and summing them, which is slow
and probably wrong.
I'm not 100% sure what the right method is, but I did find this
question on Math Overflow:
http://mathoverflow.net/questions/20727/generalizing-the-wilson-score-confidence-interval-to-other-distributions
The current answer there suggests using a standard normal distribution
for large samples and a t-distribution for small ones (we do neither).
This website suggests a slightly different Wilson algorithm:
http://www.goproblems.com/test/wilson/wilson.php?v1=0&v2=0&v3=3&v4=2&v5=4
I will go further and assert that we are making a conceptual error in trying to estimate a mean rating in the first place: ratings are fundamentally ordinal data, so a mean makes little sense, for the same reason that "excellent" plus "terrible" does not average out to "mediocre". Medians and percentiles, however, are entirely valid measures for ordinal data.
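A toy illustration of that point (my own example, not from the bug report):

```python
import statistics

# A polarizing app and a mediocre app with nearly identical means:
polarizing = [5, 1, 5, 1, 5]   # mean 3.4
mediocre   = [3, 3, 4, 3, 3]   # mean 3.2

# The mean barely separates them, but the median captures the difference
print(statistics.median(polarizing))  # 5
print(statistics.median(mediocre))    # 3
```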
I will research this question a bit more, and probably post a
question on the beta stats stackexchange site for advice.
Intuitively, though, I think we may want a ratings algorithm that
sorts primarily by median, and then, for the many cases where two apps
share the same median (since ratings only take five values), breaks
the tie by computing a Wilson score lower bound on the probability
that a rater of the app would rate at or above the median rather than
below it.
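That idea could be sketched roughly as follows. This is hypothetical code, not anything in software-center; median_rating, wilson_lower_bound, and sort_key are names I've made up for illustration, and the rating histograms are invented:

```python
import math

def wilson_lower_bound(pos, n, z=1.96):
    # Lower bound of the Wilson score interval for the proportion pos/n
    if n == 0:
        return 0.0
    phat = pos / n
    return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

def median_rating(hist):
    # hist[i] = number of (i+1)-star ratings; return the median star value
    total = sum(hist)
    if total == 0:
        return 0
    cum = 0
    for i, count in enumerate(hist):
        cum += count
        if cum * 2 >= total:
            return i + 1
    return 0

def sort_key(hist):
    # Sort primarily by median; break ties with the Wilson lower bound of
    # the probability that a rating lands at or above the median
    med = median_rating(hist)
    at_or_above = sum(hist[med - 1:])
    return (med, wilson_lower_bound(at_or_above, sum(hist)))

apps = {'A': [0, 1, 2, 10, 25], 'B': [5, 5, 5, 5, 5], 'C': [25, 10, 2, 1, 0]}
print(sorted(apps, key=lambda a: sort_key(apps[a]), reverse=True))  # ['A', 'B', 'C']
```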
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/software-center/+bug/894468/+subscriptions