It’s Sunday morning and I get some time to myself, so I’m listening to some blues and catching up on my newsfeeds when I come across this interesting article on calculating what size group of people would be necessary to have a 50/50 chance of two of them sharing the same birthday.
The primary focus of the article is on this equation to calculate the probability of uniqueness given a sample size of r from a group of N things to choose.
$$p = \frac {N!} {N^r (N-r)!}$$
In the article, he is concerned with the possibility of overflow given the size of the factorials involved, and since scipy doesn’t have a log-factorial function, he implemented his solution with scipy’s log-gamma (see the article above for his code). So I say to myself, “Wonder if I can do this with just straight Python?”
A quick google on log factorial found this approximation of log factorial by Srinivasa Ramanujan on math.stackexchange. If you have not heard of Ramanujan before, stop and google him immediately. Wow!
$$\log n! \approx n\log n-n+\frac{\log(n(1+4n(1+2n)))}{6}+\frac{\log(\pi)}{2}$$
This works out to the following Python code:

```python
from math import exp, log, pi

def logn_factorial(n):
    """return an approximation of log n! using Ramanujan's equation."""
    return n * log(n) - n + (log(n * (1 + 4*n * (1 + 2*n))) / 6) + (log(pi) / 2)

p = exp(logn_factorial(N) - logn_factorial(N - r) - r * log(N))
```
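To sanity-check the approximation, here is a quick worked example of my own, plugging in the classic birthday numbers (N = 365 days, r = 23 people; these illustrative values are mine, not from the article):

```python
from math import exp, log, pi

def logn_factorial(n):
    # Ramanujan's approximation of log n!
    return n * log(n) - n + (log(n * (1 + 4*n * (1 + 2*n))) / 6) + (log(pi) / 2)

N, r = 365, 23  # days in a year, people in the room
p = exp(logn_factorial(N) - logn_factorial(N - r) - r * log(N))
print(p)  # roughly 0.4927 -- just under the 50/50 threshold for uniqueness
```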
Ok, so no scipy needed to follow along with the article, plus I get to use some very cool math. I like Sunday morning fun time. However, I then start thinking, “overflow”, hmmmm. What is the upper bound of math.factorial anyway? Since it’s the birthday problem, let’s see what happens with 365:
```
>>> import math
>>> math.factorial(365)
25104128675558732292929443748812027705165520269876079766872595193901106138220937419666018009000254169376172314360982328660708071123369979853445367910653872383599704355532740937678091491429440864316046925074510134847025546014098005907965541041195496105311886173373435145517193282760847755882291690213539123479186274701519396808504940722607033001246328398800550487427999876690416973437861078185344667966871511049653888130136836199010529180056125844549488648617682915826347564148990984138067809999604687488146734837340699359838791124995957584538873616661533093253551256845056046388738129702951381151861413688922986510005440943943014699244112555755279140760492764253740250410391056421979003289600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
```
Ok, let’s dial it up by a factor of 1000.
```
>>> math.factorial(365000)
```
Well, it took a bit, but it ran, printing over 40 screens of digits. Python is still going strong, so what is the upper bound of math.factorial? A search brought me to http://bugs.python.org/issue8692, and specifically this message, which says that the result cannot exceed sys.maxsize - 1 digits, or 2**63 - 1 digits on a 64-bit platform. Thanks to some dedicated individuals, who seemed to be having as much fun as I was, math.factorial is up to the task.
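To get a feel for how far 365000! is from that ceiling, here is a sketch of my own (not from the bug report) that counts its decimal digits using log-gamma arithmetic instead of computing the whole factorial again:

```python
import sys
from math import lgamma, log

n = 365000
# lgamma(n + 1) is log(n!); dividing by log(10) gives its length in decimal digits.
digits = int(lgamma(n + 1) / log(10)) + 1
print(digits)  # about 1.87 million digits
print(digits < sys.maxsize)  # nowhere near the 2**63 - 1 digit ceiling
```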
The take-away from this article is not the cool math, or the approximations of log n! — it is…
Don’t underestimate the power of Python.
Try straight Python before you move on to something more complex. The approximations of $\log n!$ were unnecessary. All that was needed was plain Python, and we get the following implementation of the probability of uniqueness:
```python
from math import factorial

p = factorial(N) / ((N**r) * factorial(N - r))
```
A simple and straightforward implementation of the equation.
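Plugging in the same birthday numbers (again, N = 365 and r = 23 are example values of my own) shows Python's arbitrary-precision integers handling the huge intermediate factorials without any fuss:

```python
from math import factorial

N, r = 365, 23  # days in a year, people in the group
# Exact big-integer arithmetic throughout; only the final division produces a float.
p = factorial(N) / ((N**r) * factorial(N - r))
print(p)  # roughly 0.4927, so the chance of a shared birthday is 1 - p, just over 0.5
```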
* A quick note: I’m usually a stickler for good variable names; however, when implementing math equations I stick as close as possible to the notation of the equation I am implementing.
* Also, if you are working on big enough numbers, and I mean huge numbers, then approximations might be needed, but by then you are going to be working at the limits of what a 64-bit platform can do.
Premature optimization and all that: when was the last time you fell into the trap of implementing something to handle a misconceived belief about a shortcoming in Python or one of the standard libs?