
# Kendall tau rank correlation coefficient

In statistics, the **Kendall rank correlation coefficient**, commonly referred to as **Kendall's tau (τ) coefficient**, is a statistic used to measure the association between two measured quantities. A **tau test** is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.

It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,^{[1]} though Gustav Fechner had proposed a similar measure in the context of time series in 1897.^{[2]}


## Definition

Let (*x*_{1}, *y*_{1}), (*x*_{2}, *y*_{2}), …, (*x*_{n}, *y*_{n}) be a set of observations of the joint random variables *X* and *Y*, such that all the values of (*x*_{i}) and (*y*_{i}) are unique. Any pair of observations (*x*_{i}, *y*_{i}) and (*x*_{j}, *y*_{j}) are said to be *concordant* if the ranks for both elements agree: that is, if both *x*_{i} > *x*_{j} and *y*_{i} > *y*_{j} or if both *x*_{i} < *x*_{j} and *y*_{i} < *y*_{j}. They are said to be *discordant* if *x*_{i} > *x*_{j} and *y*_{i} < *y*_{j} or if *x*_{i} < *x*_{j} and *y*_{i} > *y*_{j}. If *x*_{i} = *x*_{j} or *y*_{i} = *y*_{j}, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

- <math>\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1) } .</math>^{[3]}

### Properties

The denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ *τ* ≤ 1.

- If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
- If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
- If *X* and *Y* are independent, then we would expect the coefficient to be approximately zero.
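The pair-counting definition above can be sketched directly in Python; the following `kendall_tau` helper is a hypothetical illustration (not from any of the cited sources), assuming all values are unique as the definition requires:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for paired samples with no ties (naive O(n^2) pair count)."""
    n = len(x)
    concordant = discordant = 0
    # Examine every unordered pair of observations exactly once.
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:          # both orderings agree
            concordant += 1
        elif s < 0:        # orderings disagree
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give 1, exactly reversed rankings give −1, matching the properties listed above.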

## Hypothesis test

The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of *X* or *Y* or the distribution of (*X*,*Y*).

Under the null hypothesis of independence of *X* and *Y*, the sampling distribution of *τ* has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance

- <math>\frac{2(2n+5)}{9n (n-1)}</math>.^{[4]}
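As an illustrative sketch (the helper name is hypothetical), the normal approximation yields a z-score by dividing the observed <math>\tau</math> by the square root of this variance:

```python
import math

def tau_z_score(tau, n):
    """z-score for tau under the null of independence,
    using the normal approximation with variance 2(2n+5)/(9n(n-1))."""
    var = 2 * (2 * n + 5) / (9 * n * (n - 1))
    return tau / math.sqrt(var)
```

For example, with *n* = 10 and an observed *τ* of 0.5, the z-score is roughly 2.01, just past the conventional 1.96 two-sided cutoff.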

## Accounting for ties


A pair {(*x*_{i}, *y*_{i}), (*x*_{j}, *y*_{j})} is said to be *tied* if *x*_{i} = *x*_{j} or *y*_{i} = *y*_{j}; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1]:

### Tau-a

The Tau-a statistic tests the strength of association of the cross tabulations. Both variables have to be ordinal. Tau-a makes no adjustment for ties. It is defined as:

- <math>\tau_A = \frac{n_c-n_d}{n_0}</math>

### Tau-b

The Tau-b statistic, unlike Tau-a, makes adjustments for ties.^{[5]} Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

The Kendall Tau-b coefficient is defined as:

- <math>\tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}}</math>

where

- <math>\begin{array}{ccl}
n_0 & = & n(n-1)/2 \\
n_1 & = & \sum_i t_i (t_i-1)/2 \\
n_2 & = & \sum_j u_j (u_j-1)/2 \\
n_c & = & \mbox{Number of concordant pairs} \\
n_d & = & \mbox{Number of discordant pairs} \\
t_i & = & \mbox{Number of tied values in the } i^{th} \mbox{ group of ties for the first quantity} \\
u_j & = & \mbox{Number of tied values in the } j^{th} \mbox{ group of ties for the second quantity}
\end{array}</math>
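A minimal Python sketch of the Tau-b formula (the helper name is hypothetical; the O(n²) pair count is chosen for clarity, not speed). Tie-group sizes are read off from value counts:

```python
from collections import Counter
from itertools import combinations

def kendall_tau_b(x, y):
    """Tie-adjusted Kendall correlation (Tau-b), naive O(n^2) pair count."""
    n = len(x)
    nc = nd = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            nc += 1
        elif s < 0:
            nd += 1          # pairs with s == 0 are tied: counted in neither
    n0 = n * (n - 1) // 2
    # n1, n2: tied pairs within each variable, from the tie-group sizes t_i, u_j
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())
    return (nc - nd) / ((n0 - n1) * (n0 - n2)) ** 0.5
```

With no ties, n₁ = n₂ = 0 and Tau-b reduces to the plain coefficient.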

### Tau-c

Tau-c differs from Tau-b in that it is more suitable for rectangular tables than for square tables.

## Significance tests

When two quantities are statistically independent, the distribution of <math>\tau</math> is not easily characterizable in terms of known distributions. However, for <math>\tau_A</math> the following statistic, <math>z_A</math>, is approximately distributed as a standard normal when the variables are statistically independent:

- <math>z_A = {3 (n_c - n_d) \over \sqrt{n(n-1)(2n+5)/2} }</math>

Thus, to test whether two variables are statistically dependent, one computes <math>z_A</math>, and finds the cumulative probability for a standard normal distribution at <math>-|z_A|</math>. For a 2-tailed test, multiply that number by two to obtain the *p*-value. If the *p*-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.
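The test just described can be sketched as follows; `tau_a_pvalue` is a hypothetical helper, and the standard normal CDF is obtained from the complementary error function:

```python
import math

def tau_a_pvalue(nc, nd, n):
    """Two-sided p-value for the tau-a independence test
    via the normal approximation z_A."""
    z = 3 * (nc - nd) / math.sqrt(n * (n - 1) * (2 * n + 5) / 2)
    # Phi(-|z|) = 0.5 * erfc(|z| / sqrt(2)); doubled for the two-tailed test
    return 2 * 0.5 * math.erfc(abs(z) / math.sqrt(2))
```

When nc equals nd, z_A is zero and the p-value is 1; a heavily one-sided count drives the p-value toward zero.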

When ties are present, several adjustments must be applied to <math>z_A</math>. The following statistic, <math>z_B</math>, has the same distribution as <math>\tau_B</math>, and is again approximately standard normal when the quantities are statistically independent:

- <math>z_B = {n_c - n_d \over \sqrt{ v } }</math>

where

- <math>\begin{array}{ccl}
v & = & (v_0 - v_t - v_u)/18 + v_1 + v_2 \\
v_0 & = & n (n-1) (2n+5) \\
v_t & = & \sum_i t_i (t_i-1) (2 t_i+5) \\
v_u & = & \sum_j u_j (u_j-1)(2 u_j+5) \\
v_1 & = & \sum_i t_i (t_i-1) \sum_j u_j (u_j-1) / (2n(n-1)) \\
v_2 & = & \sum_i t_i (t_i-1) (t_i-2) \sum_j u_j (u_j-1) (u_j-2) / (9 n (n-1) (n-2))
\end{array}</math>
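A sketch of the variance term *v* (hypothetical helper name), taking the sizes of the tie groups of each variable as input; empty lists mean no ties, in which case *v* reduces to *v*₀/18:

```python
def tau_b_variance(n, t_counts, u_counts):
    """Variance v used in z_B; t_counts/u_counts are the sizes
    of the tie groups for the first and second quantity."""
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(t * (t - 1) * (2 * t + 5) for t in t_counts)
    vu = sum(u * (u - 1) * (2 * u + 5) for u in u_counts)
    st = sum(t * (t - 1) for t in t_counts)
    su = sum(u * (u - 1) for u in u_counts)
    st2 = sum(t * (t - 1) * (t - 2) for t in t_counts)
    su2 = sum(u * (u - 1) * (u - 2) for u in u_counts)
    v1 = st * su / (2 * n * (n - 1))
    v2 = st2 * su2 / (9 * n * (n - 1) * (n - 2))
    return (v0 - vt - vu) / 18 + v1 + v2
```

Note the no-tie case matches the untied variance of <math>n_c - n_d</math> implied by the <math>z_A</math> formula above.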

pvrank^{[6]} is an R package that computes rank correlations and their p-values, with various options for tied ranks. It can compute exact p-values for the Kendall coefficient test for *n* ≤ 60.

## Algorithms

The direct computation of the numerator <math>n_c - n_d</math> involves two nested iterations, as characterized by the following pseudocode:

```
numer := 0
for i := 2..N do
    for j := 1..(i-1) do
        numer := numer + sign(x[i] - x[j]) * sign(y[i] - y[j])
return numer
```

Although quick to implement, this algorithm is <math>O(n^2)</math> in complexity and becomes very slow on large samples. A more sophisticated algorithm^{[7]} built upon the Merge Sort algorithm can be used to compute the numerator in <math>O(n \cdot \log{n})</math> time.

Begin by ordering your data points, sorting by the first quantity, <math>x</math>, and secondarily (among ties in <math>x</math>) by the second quantity, <math>y</math>. With this initial ordering, <math>y</math> is not sorted, and the core of the algorithm consists of computing how many steps a Bubble Sort would take to sort this initial <math>y</math>. An enhanced Merge Sort algorithm, with <math>O(n \log n)</math> complexity, can be applied to compute the number of swaps, <math>S(y)</math>, that would be required by a Bubble Sort to sort <math>y</math>. Then the numerator for <math>\tau</math> is computed as:

- <math>n_c-n_d = n_0 - n_1 - n_2 + n_3 - 2 S(y)</math>,

where <math>n_3</math> is computed like <math>n_1</math> and <math>n_2</math>, but with respect to the joint ties in <math>x</math> and <math>y</math>.

A Merge Sort partitions the data to be sorted, <math>y</math>, into two roughly equal halves, <math>y_\mathrm{left}</math> and <math>y_\mathrm{right}</math>, then sorts each half recursively, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:

- <math>S(y) = S(y_\mathrm{left}) + S(y_\mathrm{right}) + M(Y_\mathrm{left},Y_\mathrm{right})</math>

where <math>Y_\mathrm{left}</math> and <math>Y_\mathrm{right}</math> are the sorted versions of <math>y_\mathrm{left}</math> and <math>y_\mathrm{right}</math>, and <math>M(\cdot,\cdot)</math> characterizes the Bubble Sort swap-equivalent for a merge operation. <math>M(\cdot,\cdot)</math> is computed as depicted in the following pseudo-code:

```
function M(L[1..n], R[1..m])
    i := 1
    j := 1
    nSwaps := 0
    while i <= n and j <= m do
        if R[j] < L[i] then
            nSwaps := nSwaps + n - i + 1
            j := j + 1
        else
            i := i + 1
    return nSwaps
```
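The full <math>S(y)</math> computation can be sketched in Python by pairing the merge step with the recursive split; the function names are hypothetical, and the merge also builds the sorted output so the recursion is self-contained:

```python
def merge_swaps(left, right):
    """Merge two sorted lists; count the Bubble Sort swaps this merge stands for."""
    i = j = swaps = 0
    merged = []
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            swaps += len(left) - i   # right[j] jumps past all remaining left elements
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged += left[i:] + right[j:]
    return merged, swaps

def sort_and_count(y):
    """S(y): number of Bubble Sort swaps needed to sort y, via Merge Sort."""
    if len(y) <= 1:
        return list(y), 0
    mid = len(y) // 2
    left, s_left = sort_and_count(y[:mid])
    right, s_right = sort_and_count(y[mid:])
    merged, s_merge = merge_swaps(left, right)
    return merged, s_left + s_right + s_merge
```

An already-sorted vector needs zero swaps; a fully reversed vector of length *n* needs *n*(*n* − 1)/2.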

A side effect of the above steps is that you end up with both a sorted version of <math>x</math> and a sorted version of <math>y</math>. With these, the factors <math>t_i</math> and <math>u_j</math> used to compute <math>\tau_B</math> are easily obtained in a single linear-time pass through the sorted arrays.

A second algorithm with <math>O(n \cdot \log{n})</math> time complexity, based on AVL trees, was devised by David Christensen.^{[8]} Yet another <math>O(n \cdot \log{n})</math> algorithm was proposed more recently.^{[9]}

## See also

- Correlation
- Kendall tau distance
- Kendall's W
- Spearman's rank correlation coefficient
- Goodman and Kruskal's gamma
- Theil–Sen estimator

## References

1. **^** Kendall, M. (1938). "A New Measure of Rank Correlation". *Biometrika* **30** (1–2): 81–89. JSTOR 2332226. doi:10.1093/biomet/30.1-2.81.
2. **^** Kruskal, W.H. (1958). "Ordinal Measures of Association". *Journal of the American Statistical Association* **53** (284): 814–861. JSTOR 2281954. MR 100941. doi:10.2307/2281954.
3. **^** Nelsen, R.B. (2001). "Kendall tau metric". In Hazewinkel, Michiel. *Encyclopedia of Mathematics*. Springer. ISBN 978-1-55608-010-4.
4. **^** Prokhorov, A.V. (2001). "Kendall coefficient of rank correlation". In Hazewinkel, Michiel. *Encyclopedia of Mathematics*. Springer. ISBN 978-1-55608-010-4.
5. **^** Agresti, A. (2010). *Analysis of Ordinal Categorical Data* (Second ed.). New York: John Wiley & Sons.
6. **^** Amerise, I.L.; Marozzi, M.; Tarsitano, A. "R package pvrank".
7. **^** Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". *Journal of the American Statistical Association* **61** (314): 436–439. JSTOR 2282833. doi:10.2307/2282833.
8. **^** Christensen, David (2005). "Fast algorithms for the calculation of Kendall's τ". *Computational Statistics* **20** (1): 51–62. doi:10.1007/BF02736122.
9. **^** Campello, R.J.G.B.; Hruschka, E.R. (29 March 2009). "On comparing two sequences of numbers and its applications to clustering analysis". *Information Sciences* **179** (8): 1025–1039. doi:10.1016/j.ins.2008.11.028.

- Abdi, H. (2007). "Kendall rank correlation" (PDF). In Salkind, N.J. *Encyclopedia of Measurement and Statistics*. Thousand Oaks (CA): Sage.
- Kendall, M. (1948). *Rank Correlation Methods*. Charles Griffin & Company Limited.
- Bonett, D.G.; Wright, T.A. (2000). "Sample size requirements for Pearson, Kendall, and Spearman correlations". *Psychometrika* **65**: 23–28.

## External links

- Tied rank calculation
- Why Kendall tau?
- Software for computing Kendall's tau on very large datasets
- Online software: computes Kendall's tau rank correlation
- The CORR Procedure: Statistical Computations - McDonough School of Business