In 1938 a physicist called Frank Benford published a paper about a pattern he had noticed in collections of numbers. The data sets described real-life situations and were surprisingly diverse, ranging from bills, to populations, to death rates. Being a physicist, Benford captured the law mathematically.
But it is best understood by considering numbers in base 10 (digits 1 through 9) and plotting the probability of finding each digit as the first digit of a number in one of the real-life data sets that Benford considered. The result is:

P(d) = log10(1 + 1/d), for d = 1, 2, ..., 9
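The first-digit probabilities are easy to compute directly from the formula P(d) = log10(1 + 1/d); a minimal sketch in Python:

```python
import math

# Expected probability that d (1-9) appears as the leading digit
# of a number drawn from a Benford-conforming data set.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.3f}")
```

Running this shows the characteristic skew: digit 1 gets roughly 0.301 of the probability mass, falling away to under 0.05 for digit 9, and the nine probabilities sum to exactly 1.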
Hence, if you study one of the relevant data sets you find that 1 is the first digit approximately 30% of the time and 9 only about 5% of the time. The phenomenon had been noticed some years earlier, in 1881, by the Canadian-American astronomer and mathematician Simon Newcomb, but it was Benford who did the extensive work showing that the law held good across a wide variety of data types, and so the law bears his name.
Whilst Benford's law was empirically derived, it can be mathematically proved that it applies exactly to a whole range of naturally occurring sequences of numbers, such as the Fibonacci numbers, the factorials, the powers of 2, and the powers of almost any other number for that matter.
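One can check this empirically for any of those sequences. A small sketch, tallying the leading digits of the first 1000 powers of 2 against the Benford prediction:

```python
import math
from collections import Counter

def leading_digit(n: int) -> int:
    # First character of the decimal representation.
    return int(str(n)[0])

# Leading digits of 2^1 ... 2^1000.
counts = Counter(leading_digit(2 ** k) for k in range(1, 1001))

for d in range(1, 10):
    observed = counts[d] / 1000
    expected = math.log10(1 + 1 / d)
    print(f"digit {d}: observed {observed:.3f}, Benford {expected:.3f}")
```

The observed frequencies land within about a percentage point of the predicted values, with roughly 30% of the powers starting with a 1.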
Plus it can be shown to apply to numbers expressed in other bases, such as base 16. As late as 1995 a mathematician called Theodore Hill published a paper showing that the law could be generalized to apply not just to the first digit but to any digit within a number.
Hence, the probability that d (d = 0, 1, ..., 9) is encountered as the n-th (n > 1) digit is:

P(d) = sum over k from 10^(n-2) to 10^(n-1) - 1 of log10(1 + 1/(10k + d))
Very interesting if you are a student of mathematics and its use in modeling the world in which we live, but one might think that is as far as its usefulness goes. Not so.
As far back as 1972 an economist called Hal Varian (now working for Google) suggested that Benford's law could be used to test whether socio-economic data had been manufactured or derived from real-life measurements.
And suddenly the light bulb went on. Why could Benford's law not be applied to detect fraud in a range of data sets such as tax returns or electoral fraud? Extensive tests showed that Benford's law could, within certain limitations, give a reliable indication of fraud within data sets. To this day evidence based upon Benford's law is admissible in most US courts.
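To give a flavour of how such a test works, here is a minimal sketch that screens a data set by comparing its leading-digit frequencies against Benford's distribution with a chi-square goodness-of-fit statistic. The choice of statistic and the 5% critical value (~15.5 at 8 degrees of freedom) are illustrative assumptions, not the specific methodology used by forensic accountants or accepted in court:

```python
import math
from collections import Counter

EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x: float) -> int:
    # Scientific notation puts the first significant digit up front,
    # e.g. f"{0.0314:e}" -> "3.140000e-02".
    return int(f"{abs(x):e}"[0])

def benford_chi_square(values) -> float:
    """Chi-square statistic of the observed leading digits against
    Benford's law (8 degrees of freedom; larger = worse fit)."""
    data = [v for v in values if v != 0]
    counts = Counter(leading_digit(v) for v in data)
    n = len(data)
    return sum((counts[d] - n * p) ** 2 / (n * p)
               for d, p in EXPECTED.items())

# Powers of 2 conform closely; a flat run of integers does not.
conforming = [2.0 ** k for k in range(1, 501)]
suspect = list(range(1, 501))
print(benford_chi_square(conforming))  # small: consistent with Benford
print(benford_chi_square(suspect))     # large: flagged for closer scrutiny
```

In practice such a statistic is only a screening tool: a large value flags a data set for closer human scrutiny rather than proving fraud, and the "within certain limitations" caveat above matters, since many legitimate data sets (uniformly assigned IDs, bounded measurements) are not expected to follow Benford's law in the first place.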
All of which brings us to the present day when we are presented with the ever increasing volumes of data that enter our lives electronically. The Internet now holds over a zettabyte of data and we are constantly having to make judgments about whether to trust that data.
It might be as simple as deciding whether an image has been altered, right through to whether large statistical data sets should be used to make a critical business decision. Which makes me ask if there is not some way to apply Benford's law, and its generalized forms, to help us decide whether or not we can trust some electronic data we may be about to rely upon.
Trust is such a fundamental aspect of how we use the Internet that at the very least this is an area that is worthy of more research.
Cross-posted from Professor Alan Woodward