root/lang/python/spam/chi2p.py

Revision 31400, 7.2 kB (checked in by rezoo, 3 years ago)

Robinson方式に対応。

Line 
1"""
2chi2p.py
3
4
5    Copyright (c) 2004 Transpose, LLC
6    Contained code is licensed under the MIT license:
7
8        http://www.opensource.org/licenses/mit-license.php
9
10    Portions are Copyright (c) 2002-2003 Python Software Foundation, and
11    used under the PSF license:
12
13        http://cvs.sf.net/viewcvs.py/spambayes/spambayes/LICENSE.txt
14
15
16
17This inverse chi-square method has some unusual characteristics.
18
19PART 1: Effective Size Factor
20
21Most importantly, it can accept an "effective size factor" or ESF.
22
23The concept of ESF is useful when the chi square random variable is
24comprised of the product of multiple nonindependent chi square RV's.
25
26The idea is that if the RV's are nonindependent, then it is "as if" there
27are fewer RV's than there really are. For instance, say there are 10 RV's,
28but they are paired into 5 pairs such that each member of a pair
29always has the same values as the other member. Then, in effect, there
30are really only 5 truly random RV's.
31
32In that case, in order to calculate the real p-value associated with a
33particular product, we make the following changes.
34a) the effective degrees of freedom is 5
35rather than 10, and b) we should use the square root of the product
36of the RV's rather than the product itself.
37
38When fed into an inverse-chi-square calculating algorithm, those
39adjustments will result in exactly the true p-value for that case.
40
41What we are saying is that the "effective size" of this group of RV's
42is 5 rather than 10, because of the redundancy, and that redundancy
43can be perfectly taken into account by halving the degrees of freedom
44and taking the square root of the product. This is denoted by saying
45that the ESF is .5.
46
47Now consider a case that is less well-defined; that is, there is some
48interaction between the RV's, but it isn't that half of them are
49duplicates of the other half. Intuively it seems that it might be
50useful to try to find an ESF that compensates for that interaction
51the way ESF=.5 compensates for the interaction in our example.
52
53In practice, it has been found that that assumption appears to be correct.
54For an example, see:
55http://portal.acm.org/citation.cfm?id=299444&jmp=cit&dl=GUIDE&dl=ACM
56in which this concept is successfully used for protein classification.
57
58In addition, Gary Robinson's informal testing of using ESF in the
59context of spam classification, where the chi-square RV's are
60generated from word probabilities, indicates that it is
61helpful in that context as well. (A pointer to formal test
62results will be placed here when they are available.)
63
64So, the inverse chi-square function given here, chi2p(), accepts
65an ESF parameter and makes the corresponding adjustments.
66
67PART 2: The algorithms
68
69This code uses two different approaches to protein classification, depending
70on the degrees of freedom.
71
72For small degrees of freedom (< 25) we use the an algorithm from the
73protein classification paper cited above. This formula takes
74the ESF into account by means of interpolation, and so gives
75an approximate result even for non-integer effective sizes. It
76is also quite fast for small DF. However it becomes very slow
77for large DF.
78
79For larger degrees of freedom we use Tim Peter's approach from the
80SpamBayes spam filter. Tim's algorithm assumes that the degrees of freedom
81is even (a correct assumption for the "traditional" use of chi-square in
82spam filtering). The current version of the code does not
83interpolate to approximate the exact effective size, but rather rounds
84the effective size to the nearest even degrees of freedom. Because
85the effective sizes are found empirically in practical applications
86and are therefore not really exact, it is felt that this is acceptable.
87
88"""
89
90
91import math
92
93
94       
95def _chi2pManyTokens(fChi, iDF, fESF=1.0):
96    """
97    Use instead of _chi2pFewTokens
98    for large values of iDF*fESF. Suggested
99    cutoff is 25.0, but certainly should cutoff
100    by 100.0.
101   
102    Except for the code involving fESF, and some renaming
103    of variables, this is almost exactly the
104    same as Tim Peters' SpamBayes chi function.
105    """
106    MAX_ALLOWABLE_M = 700.0
107   
108    def makeAdjustments():
109        global fM
110        global iAdjustedDF
111        iHalfDF = iDF / 2   
112        iAdjustedHalfDF = max(1,int(fESF * iHalfDF + .5))
113        fAdjustedProp =  float(iAdjustedHalfDF) / iHalfDF
114        fAdjustedChi = fChi * fAdjustedProp
115        iAdjustedDF = iAdjustedHalfDF * 2
116        # assert (iDF & 1) == 0
117        # If chi is very large, exp(-m) will underflow to 0; Tim says the
118        # results are meaningless in this case; otherwise they should be good.
119        fM = fAdjustedChi / 2.0
120     
121    makeAdjustments()
122       
123    if fM > MAX_ALLOWABLE_M:
124        fESF = fESF * (700.0 / fM)
125        makeAdjustments()
126   
127    fSum = fTerm = math.exp(-fM)
128    assert fSum > 0.0
129    for i in range(1, iAdjustedDF/2):
130        fTerm *= fM / i
131        fSum += fTerm
132    # With small chi and large df, accumulated roundoff error, plus error in
133    # the platform exp(), can cause this to spill a few ULP above 1.0.  For
134    # example, chi2p(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
135    # point.  Returning a value even a teensy bit over 1.0 is no good.
136    return min(fSum, 1.0)
137
138def factorial(i):
139    if not hasattr(factorial, 'lstFactorial'):
140        factorial.lstFactorial = [None] * 1000
141    if factorial.lstFactorial[i] is None:
142        iProduct = 1
143        for iFactor in xrange(1, i+1):
144            iProduct *= iFactor
145        factorial.lstFactorial[i] = iProduct
146    return factorial.lstFactorial[i]
147
148def _chi2pFewTokens(fChi, iChiDF, fESF=1.0):
149    """
150    This is more efficient than _chi2pManyTokens for
151    small values of iChiDF*fESF, and is more
152    accurate. However it can't handle values of iChiDF*fESF
153    > some amount I don't recall at the time of writing this
154    docstring. It works up to at least iChiDF*fESF==100.0,
155    though _chi2pManyTokens is significantly faster at that point.   
156    """
157       
158    fAdjustedProduct = math.exp((fESF * (-fChi)/2.0))
159    iActualSize = iChiDF / 2
160                   
161    fEffectiveSize = float(iActualSize) * fESF
162
163    assert fAdjustedProduct != 0.0, 'df is: %i, fChi is %f, fESF is %f, iActualSize is %i, fEffectiveSize is %f' \
164                                % (iChiDF, fChi, fESF, iActualSize, fEffectiveSize, )
165
166    fSum = 0.0
167    for i in xrange(int(fEffectiveSize)):
168        fSum += (-math.log(fAdjustedProduct))**i / factorial(i)
169    fFirstTerm = fAdjustedProduct * fSum
170    fSecondTerm = fAdjustedProduct * \
171                  (fEffectiveSize - int(fEffectiveSize)) * \
172                  ((-math.log(fAdjustedProduct))**int(fEffectiveSize)) / \
173                  factorial(int(fEffectiveSize))
174    fResult = fFirstTerm + fSecondTerm
175    return fResult
176
177def chi2p(fChi, iChiDF, fESF=1.0, fUseFactorialForUnder=25.0):
178   
179    if (float(iChiDF) * fESF) < 25.0:
180        fResult = _chi2pFewTokens(fChi, iChiDF, fESF)
181    else:
182        fResult = _chi2pManyTokens(fChi, iChiDF, fESF)
183    return fResult
184
185if __name__ == '__main__':
186    import unittest
187   
188    class TestChi(unittest.TestCase):
189        def testRightSmallDFResult(self):
190            self.assertEqual(0.28240645151038829, chi2p(-2*(math.log(.3**10)), 2*10, .5))
191           
192        def testRightBigDFResult(self):
193            self.assertEqual(0.1336918921570289, chi2p(-2*(math.log(.3**60)), 2*60, .5))
194
195           
196    unittest.main()
Note: See TracBrowser for help on using the browser.