[ACCEPTED]-Computing similarity between two lists-information-retrieval
It gives the full gain for a relevant document if it is ranked first, and the gain decreases as rank decreases.
Using DCG/nDCG to evaluate the system compared to the SOA baseline:
Note: if you set all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art if they received the same ranks using DCG/nDCG.
Thus, a possible evaluation could be:
To further enhance it, you can give a relevance grade [relevance will not be binary], determined according to how each document was ranked in the state of the art. For example:
rel_i = 1/log2(1+i) for each document in the state of the art system.
If the value received by this evaluation function is close to 1, your system is very similar to the baseline.
mySystem = [1, 2, 5, 4, 6, 7]
stateOfTheArt = [1, 2, 4, 5, 6, 9]
First you give a score to each document, according to the state of the art system [using the formula from above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt), using the relevance as stated above [note relevance is not binary here], and get:

DCG(stateOfTheArt) = 2.1100933062283396
Next, calculate it for your system using the same relevance weights and get:
DCG(mySystem) = 1.9784040064803783
Thus, the evaluation is:
DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
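The walkthrough above can be reproduced in a few lines of Python (a sketch assuming log base 2, which matches the values shown):

```python
import math

state_of_the_art = [1, 2, 4, 5, 6, 9]
my_system = [1, 2, 5, 4, 6, 7]

# Grade each document by its rank in the state-of-the-art list:
# rel_i = 1 / log2(1 + i); documents it never returned get relevance 0.
rel = {doc: 1.0 / math.log2(1 + rank)
       for rank, doc in enumerate(state_of_the_art, start=1)}

def dcg(ranking):
    # Sum the graded relevance of each result, discounted by its position.
    return sum(rel.get(doc, 0.0) / math.log2(1 + pos)
               for pos, doc in enumerate(ranking, start=1))

print(dcg(state_of_the_art))                   # ≈ 2.11009
print(dcg(my_system))                          # ≈ 1.97840
print(dcg(my_system) / dcg(state_of_the_art))  # ≈ 0.93759
```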
Kendall's tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's footrule does the same, but measures distance rather than inversions. They are both designed for the task at hand: measuring the difference between two rank-ordered lists.
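For illustration, here is a minimal pure-Python sketch of both statistics (the document names and rankings are made up; `scipy.stats.kendalltau` is a ready-made alternative):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (concordant - discordant) pairs, normalized."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs

def footrule(rank_a, rank_b):
    """Spearman's footrule: total rank displacement between the lists."""
    return sum(abs(rank_a[i] - rank_b[i]) for i in rank_a)

# Hypothetical example: two systems ordering the same five documents.
sys1 = ["d1", "d2", "d3", "d4", "d5"]
sys2 = ["d1", "d3", "d2", "d4", "d5"]      # d2 and d3 swapped
r1 = {d: i for i, d in enumerate(sys1, 1)}
r2 = {d: i for i, d in enumerate(sys2, 1)}
print(kendall_tau(r1, r2))  # 0.8  (one discordant pair out of ten)
print(footrule(r1, r2))     # 2    (d2 and d3 each moved one place)
```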
Is the list of documents exhaustive? That is, is every document rank-ordered by system 1 also rank-ordered by system 2? If so, Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.
As you said, you want to compute how similar one list is to the other. I think, simplistically, you can start by counting the number of inversions. There's an O(N log N) divide-and-conquer approach to this. It is a very simple way to measure the "similarity" between two lists.
E.g., if you want to compare how 'similar' the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions. The lower the count, the more 'similar' their taste is.
Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the 'similarity' of your ranking. Of course this is just a starting approach, but you can build on it, e.g. in how strict you want to be about the "inversion gap", etc.
     D1  D2  D3  D4  D5  D6
    ------------------------
R1:   1,  7,  4,  5,  8,  9   [Rankings from the 'state of the art' system]
R2:   1,  7,  5,  4,  9,  6   [Your rankings]
Since the rankings are in document order, you can write your own comparator function based on R1 (the ranking of the "state of the art system") and then count the inversions relative to that comparator.
You can "penalize" 'similarity' for each inversion found: i < j but R2[i] >' R2[j], where >' is your own comparator.
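A sketch of that idea: order the documents as the benchmark R1 would, then count inversions in R2's ranks with a merge-sort based O(N log N) routine (data taken from the table above):

```python
def count_inversions(a):
    """Return (sorted a, number of pairs i < j with a[i] > a[j])."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    inv += inv_r
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every remaining left element
            inv += len(left) - i
            merged.append(right[j])
            j += 1
    merged += left[i:] + right[j:]
    return merged, inv

r1 = [1, 7, 4, 5, 8, 9]   # benchmark ranks of D1..D6
r2 = [1, 7, 5, 4, 9, 6]   # your ranks of the same documents

# Sort the documents by the benchmark ranking, then count how often
# your ranking disagrees with that order.
docs_by_r1 = sorted(range(len(r1)), key=lambda d: r1[d])
_, inversions = count_inversions([r2[d] for d in docs_by_r1])
print(inversions)  # 3
```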
I actually know four different measures for that purpose.
Three have already been mentioned:
- Kendall's Tau
- Spearman's Rho
But if you have more than two ranks that have to be compared, use Kendall's W.
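A minimal sketch of Kendall's W for several complete rankings of the same items (the judge data below is made up):

```python
def kendalls_w(rankings):
    """Kendall's W for m complete rankings of the same n items.

    rankings: list of m lists, each a permutation of the same items
    (list order = rank order). Returns W in [0, 1]; 1 = full agreement.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Total rank received by each item across all m rankings.
    totals = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            totals[item] = totals.get(item, 0) + rank
    mean = m * (n + 1) / 2
    s = sum((t - mean) ** 2 for t in totals.values())
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical judges ranking four documents:
judges = [["a", "b", "c", "d"],
          ["a", "c", "b", "d"],
          ["b", "a", "c", "d"]]
print(kendalls_w(judges))  # ≈ 0.778
```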
In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al., A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as the above-mentioned Kendall's tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable to varying lengths of result lists and when not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.
The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
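To give the flavour, here is a minimal sketch of the truncated RBO prefix sum only (the paper's extrapolated variants, which handle the unseen tails of indefinite rankings, are omitted; `p` and `depth` are illustrative parameters):

```python
def rbo(list1, list2, p=0.9, depth=50):
    """Truncated Rank-Biased Overlap:
    (1 - p) * sum over depths d of p**(d-1) * |prefix overlap at d| / d.
    """
    score = 0.0
    for d in range(1, depth + 1):
        # How many items the two depth-d prefixes have in common.
        overlap = len(set(list1[:d]) & set(list2[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score

print(rbo([1, 2, 5, 4, 6, 7], [1, 2, 4, 5, 6, 9]))
```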
Another simple measure is cosine similarity: the cosine between two vectors with dimensions corresponding to items and inverse ranks as weights. However, it doesn't gracefully handle items that occur in only one of the lists (see the implementation in the link above).
- For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
- Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
- Return v12 / sqrt(v11 * v22)
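The three steps translate directly into Python; a sketch (the helper name `rank_cosine` is mine, and the toy lists are made up):

```python
import math

def rank_cosine(list1, list2):
    """Cosine similarity with inverse ranks as weights (steps above)."""
    h1 = {doc: 1.0 / rank for rank, doc in enumerate(list1, start=1)}
    h2 = {doc: 1.0 / rank for rank, doc in enumerate(list2, start=1)}
    docs = set(h1) | set(h2)          # items absent from a list get weight 0
    v12 = sum(h1.get(d, 0.0) * h2.get(d, 0.0) for d in docs)
    v11 = sum(w * w for w in h1.values())
    v22 = sum(w * w for w in h2.values())
    return v12 / math.sqrt(v11 * v22)

print(rank_cosine(["a", "b", "c"], ["a", "c", "d"]))
```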
For your example, this gives a value of 0.7252747.
Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) than a similarity measure; a new ranking will sometimes be better and sometimes worse than the baseline, and you want to know whether the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does that mean really similar or significantly different? Similarity measures are more helpful in saying that, e.g., new ranking method 1 is closer to production than new ranking method 2.
I suppose you are talking about comparing two Information Retrieval systems, which, trust me, is not trivial. It is a complex Computer Science problem.
For measuring relevance or doing a kind of A/B testing you need to have a couple of things:
A competitor to measure relevance against. As you have two systems, this prerequisite is met.
You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and then for the holes (i.e. query/url pairs not rated) you can have some dynamic ranking function by using a "Learning to Rank" algorithm: http://en.wikipedia.org/wiki/Learning_to_rank. Don't be surprised by that, but it's true (please read below for an example from Google/Bing).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So for each query/url pair, generally the top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of the effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through it.