| 1 | """ |
|---|
| 2 | chi2p.py |
|---|
| 3 | |
|---|
| 4 | |
|---|
| 5 | Copyright (c) 2004 Transpose, LLC |
|---|
| 6 | Contained code is licensed under the MIT license: |
|---|
| 7 | |
|---|
| 8 | http://www.opensource.org/licenses/mit-license.php |
|---|
| 9 | |
|---|
| 10 | Portions are Copyright (c) 2002-2003 Python Software Foundation, and |
|---|
| 11 | used under the PSF license: |
|---|
| 12 | |
|---|
| 13 | http://cvs.sf.net/viewcvs.py/spambayes/spambayes/LICENSE.txt |
|---|
| 14 | |
|---|
| 15 | |
|---|
| 16 | |
|---|
| 17 | This inverse chi-square method has some unusual characteristics. |
|---|
| 18 | |
|---|
| 19 | PART 1: Effective Size Factor |
|---|
| 20 | |
|---|
| 21 | Most importantly, it can accept an "effective size factor" or ESF. |
|---|
| 22 | |
|---|
| 23 | The concept of ESF is useful when the chi square random variable is |
|---|
| 24 | comprised of the product of multiple nonindependent chi square RV's. |
|---|
| 25 | |
|---|
| 26 | The idea is that if the RV's are nonindependent, then it is "as if" there |
|---|
| 27 | are fewer RV's than there really are. For instance, say there are 10 RV's, |
|---|
| 28 | but they are paired into 5 pairs such that each member of a pair |
|---|
| 29 | always has the same values as the other member. Then, in effect, there |
|---|
| 30 | are really only 5 truly random RV's. |
|---|
| 31 | |
|---|
| 32 | In that case, in order to calculate the real p-value associated with a |
|---|
| 33 | particular product, we make the following changes. |
|---|
| 34 | a) the effective degrees of freedom is 5 |
|---|
| 35 | rather than 10, and b) we should use the square root of the product |
|---|
| 36 | of the RV's rather than the product itself. |
|---|
| 37 | |
|---|
| 38 | When fed into an inverse-chi-square calculating algorithm, those |
|---|
| 39 | adjustments will result in exactly the true p-value for that case. |
|---|
| 40 | |
|---|
| 41 | What we are saying is that the "effective size" of this group of RV's |
|---|
| 42 | is 5 rather than 10, because of the redundancy, and that redundancy |
|---|
| 43 | can be perfectly taken into account by halving the degrees of freedom |
|---|
| 44 | and taking the square root of the product. This is denoted by saying |
|---|
| 45 | that the ESF is .5. |
|---|
| 46 | |
|---|
| 47 | Now consider a case that is less well-defined; that is, there is some |
|---|
| 48 | interaction between the RV's, but it isn't that half of them are |
|---|
| 49 | duplicates of the other half. Intuively it seems that it might be |
|---|
| 50 | useful to try to find an ESF that compensates for that interaction |
|---|
| 51 | the way ESF=.5 compensates for the interaction in our example. |
|---|
| 52 | |
|---|
| 53 | In practice, it has been found that that assumption appears to be correct. |
|---|
| 54 | For an example, see: |
|---|
| 55 | http://portal.acm.org/citation.cfm?id=299444&jmp=cit&dl=GUIDE&dl=ACM |
|---|
| 56 | in which this concept is successfully used for protein classification. |
|---|
| 57 | |
|---|
| 58 | In addition, Gary Robinson's informal testing of using ESF in the |
|---|
| 59 | context of spam classification, where the chi-square RV's are |
|---|
| 60 | generated from word probabilities, indicates that it is |
|---|
| 61 | helpful in that context as well. (A pointer to formal test |
|---|
| 62 | results will be placed here when they are available.) |
|---|
| 63 | |
|---|
| 64 | So, the inverse chi-square function given here, chi2p(), accepts |
|---|
| 65 | an ESF parameter and makes the corresponding adjustments. |
|---|
| 66 | |
|---|
| 67 | PART 2: The algorithms |
|---|
| 68 | |
|---|
| 69 | This code uses two different approaches to protein classification, depending |
|---|
| 70 | on the degrees of freedom. |
|---|
| 71 | |
|---|
| 72 | For small degrees of freedom (< 25) we use the an algorithm from the |
|---|
| 73 | protein classification paper cited above. This formula takes |
|---|
| 74 | the ESF into account by means of interpolation, and so gives |
|---|
| 75 | an approximate result even for non-integer effective sizes. It |
|---|
| 76 | is also quite fast for small DF. However it becomes very slow |
|---|
| 77 | for large DF. |
|---|
| 78 | |
|---|
| 79 | For larger degrees of freedom we use Tim Peter's approach from the |
|---|
| 80 | SpamBayes spam filter. Tim's algorithm assumes that the degrees of freedom |
|---|
| 81 | is even (a correct assumption for the "traditional" use of chi-square in |
|---|
| 82 | spam filtering). The current version of the code does not |
|---|
| 83 | interpolate to approximate the exact effective size, but rather rounds |
|---|
| 84 | the effective size to the nearest even degrees of freedom. Because |
|---|
| 85 | the effective sizes are found empirically in practical applications |
|---|
| 86 | and are therefore not really exact, it is felt that this is acceptable. |
|---|
| 87 | |
|---|
| 88 | """ |
|---|
| 89 | |
|---|
| 90 | |
|---|
| 91 | import math |
|---|
| 92 | |
|---|
| 93 | |
|---|
| 94 | |
|---|
| 95 | def _chi2pManyTokens(fChi, iDF, fESF=1.0): |
|---|
| 96 | """ |
|---|
| 97 | Use instead of _chi2pFewTokens |
|---|
| 98 | for large values of iDF*fESF. Suggested |
|---|
| 99 | cutoff is 25.0, but certainly should cutoff |
|---|
| 100 | by 100.0. |
|---|
| 101 | |
|---|
| 102 | Except for the code involving fESF, and some renaming |
|---|
| 103 | of variables, this is almost exactly the |
|---|
| 104 | same as Tim Peters' SpamBayes chi function. |
|---|
| 105 | """ |
|---|
| 106 | MAX_ALLOWABLE_M = 700.0 |
|---|
| 107 | |
|---|
| 108 | def makeAdjustments(): |
|---|
| 109 | global fM |
|---|
| 110 | global iAdjustedDF |
|---|
| 111 | iHalfDF = iDF / 2 |
|---|
| 112 | iAdjustedHalfDF = max(1,int(fESF * iHalfDF + .5)) |
|---|
| 113 | fAdjustedProp = float(iAdjustedHalfDF) / iHalfDF |
|---|
| 114 | fAdjustedChi = fChi * fAdjustedProp |
|---|
| 115 | iAdjustedDF = iAdjustedHalfDF * 2 |
|---|
| 116 | # assert (iDF & 1) == 0 |
|---|
| 117 | # If chi is very large, exp(-m) will underflow to 0; Tim says the |
|---|
| 118 | # results are meaningless in this case; otherwise they should be good. |
|---|
| 119 | fM = fAdjustedChi / 2.0 |
|---|
| 120 | |
|---|
| 121 | makeAdjustments() |
|---|
| 122 | |
|---|
| 123 | if fM > MAX_ALLOWABLE_M: |
|---|
| 124 | fESF = fESF * (700.0 / fM) |
|---|
| 125 | makeAdjustments() |
|---|
| 126 | |
|---|
| 127 | fSum = fTerm = math.exp(-fM) |
|---|
| 128 | assert fSum > 0.0 |
|---|
| 129 | for i in range(1, iAdjustedDF/2): |
|---|
| 130 | fTerm *= fM / i |
|---|
| 131 | fSum += fTerm |
|---|
| 132 | # With small chi and large df, accumulated roundoff error, plus error in |
|---|
| 133 | # the platform exp(), can cause this to spill a few ULP above 1.0. For |
|---|
| 134 | # example, chi2p(100, 300) on my box has sum == 1.0 + 2.0**-52 at this |
|---|
| 135 | # point. Returning a value even a teensy bit over 1.0 is no good. |
|---|
| 136 | return min(fSum, 1.0) |
|---|
| 137 | |
|---|
| 138 | def factorial(i): |
|---|
| 139 | if not hasattr(factorial, 'lstFactorial'): |
|---|
| 140 | factorial.lstFactorial = [None] * 1000 |
|---|
| 141 | if factorial.lstFactorial[i] is None: |
|---|
| 142 | iProduct = 1 |
|---|
| 143 | for iFactor in xrange(1, i+1): |
|---|
| 144 | iProduct *= iFactor |
|---|
| 145 | factorial.lstFactorial[i] = iProduct |
|---|
| 146 | return factorial.lstFactorial[i] |
|---|
| 147 | |
|---|
| 148 | def _chi2pFewTokens(fChi, iChiDF, fESF=1.0): |
|---|
| 149 | """ |
|---|
| 150 | This is more efficient than _chi2pManyTokens for |
|---|
| 151 | small values of iChiDF*fESF, and is more |
|---|
| 152 | accurate. However it can't handle values of iChiDF*fESF |
|---|
| 153 | > some amount I don't recall at the time of writing this |
|---|
| 154 | docstring. It works up to at least iChiDF*fESF==100.0, |
|---|
| 155 | though _chi2pManyTokens is significantly faster at that point. |
|---|
| 156 | """ |
|---|
| 157 | |
|---|
| 158 | fAdjustedProduct = math.exp((fESF * (-fChi)/2.0)) |
|---|
| 159 | iActualSize = iChiDF / 2 |
|---|
| 160 | |
|---|
| 161 | fEffectiveSize = float(iActualSize) * fESF |
|---|
| 162 | |
|---|
| 163 | assert fAdjustedProduct != 0.0, 'df is: %i, fChi is %f, fESF is %f, iActualSize is %i, fEffectiveSize is %f' \ |
|---|
| 164 | % (iChiDF, fChi, fESF, iActualSize, fEffectiveSize, ) |
|---|
| 165 | |
|---|
| 166 | fSum = 0.0 |
|---|
| 167 | for i in xrange(int(fEffectiveSize)): |
|---|
| 168 | fSum += (-math.log(fAdjustedProduct))**i / factorial(i) |
|---|
| 169 | fFirstTerm = fAdjustedProduct * fSum |
|---|
| 170 | fSecondTerm = fAdjustedProduct * \ |
|---|
| 171 | (fEffectiveSize - int(fEffectiveSize)) * \ |
|---|
| 172 | ((-math.log(fAdjustedProduct))**int(fEffectiveSize)) / \ |
|---|
| 173 | factorial(int(fEffectiveSize)) |
|---|
| 174 | fResult = fFirstTerm + fSecondTerm |
|---|
| 175 | return fResult |
|---|
| 176 | |
|---|
| 177 | def chi2p(fChi, iChiDF, fESF=1.0, fUseFactorialForUnder=25.0): |
|---|
| 178 | |
|---|
| 179 | if (float(iChiDF) * fESF) < 25.0: |
|---|
| 180 | fResult = _chi2pFewTokens(fChi, iChiDF, fESF) |
|---|
| 181 | else: |
|---|
| 182 | fResult = _chi2pManyTokens(fChi, iChiDF, fESF) |
|---|
| 183 | return fResult |
|---|
| 184 | |
|---|
| 185 | if __name__ == '__main__': |
|---|
| 186 | import unittest |
|---|
| 187 | |
|---|
| 188 | class TestChi(unittest.TestCase): |
|---|
| 189 | def testRightSmallDFResult(self): |
|---|
| 190 | self.assertEqual(0.28240645151038829, chi2p(-2*(math.log(.3**10)), 2*10, .5)) |
|---|
| 191 | |
|---|
| 192 | def testRightBigDFResult(self): |
|---|
| 193 | self.assertEqual(0.1336918921570289, chi2p(-2*(math.log(.3**60)), 2*60, .5)) |
|---|
| 194 | |
|---|
| 195 | |
|---|
| 196 | unittest.main() |
|---|