Skip to main content

Estimates on primes and fingerprinting

Let π(x)\pi(x) be the number of primes less than or equal to xx.

Theorem 21 (Prime number theorem - PNT)

π(x)xlnx\pi(x) \sim \frac{x }{\ln x}

The theorem was independently discovered by Hadamard and de la Vallée Poussin in 1896, postulating the density of primes, i.e., out of xx numbers between 11 and xx, there is (roughly) a 1/lnx1/\ln x fraction that are primes (for sufficiently large xx). This is a nontrivial statement that we will not be proving here. For instance, if we want to calculate the number of primes that can be represented by 2020 digits in decimal system, we can use x=1020x= 10^{20} and therefore π(x)\pi(x) is roughly 1020/(20ln10)2.17×101810^{20}/(20 \ln 10) \approx 2.17 \times 10^{18} (this is like a huge number of primes).

In these notes, we will see an application of PNT in computer science. Note that we will be talking about probability without actually covering them in detail yet. Suppose Amelia and Ariel are working on the course lecture notes from two different locations. Amelia has file xx and Ariel has file yy (encoded as bit strings of length at most \ell). Amelia is confident that her file is the most up-to-date, while Ariel (having been out surfing on a California's beach) may not have one. How can Amelia effectively check whether Ariel has the correct file?

Of course, Amelia can ask Ariel to send his file yy to her, and this would consume \ell bits of the network's bandwidth. Can they do better than this?1

  1. Amelia picks a random prime pp between 11 and 2\ell^2 and sends it to Ariel along with xmodpx\mod p.
  2. Upon receiving pp, Ariel computes ymodpy \mod p and checks whether xy(modp).x \equiv y \pmod{p}.

This communication protocol uses very little bandwidth, in total 4log4\lceil \log \ell \rceil bits (instead of \ell bits). The verification process is correct with high probability (as mentioned, we will cover probability more intensively later).

Lemma 11

The probability that this algorithm makes an error is at most 4(ln)/4(\ln \ell)/\ell.

Proof:

There are two cases. In the first case when Ariel has the correct file, the algorithm never makes an error since x=yx = y implies that xy(modp)x \equiv y \pmod{p} for every pp. In the other case when xyx \neq y, we need to analyze the probability that p(xy)p \mid (x-y). Consider z=xyz = |x-y| and note that zz is (at most) an (+1)(\ell+1)-bit number. Let kk be the number of primes that divide zz. So, if we p1,,pkp_1,\ldots, p_k are the primes that divide zz, we must have that zp1p2pkz \geq p_1\cdot p_2 \cdot \ldots \cdot p_k, and since each prime is at least 22, this implies z2kz \geq 2^k. Taking a logarithm gives us k(+1)k\leq (\ell+1).

The algorithm would make an error when Amelia happens to pick one of these primes {p1,,pk}\{p_1,\ldots, p_k\}. From the Prime Number Theorem, there are 2/2ln\ell^2/2\ln \ell primes of value at most 2\ell^2, so the probability that pp is one of those kk primes is at most k(2/2ln)4ln/\frac{k}{(\ell^2/2 \ln \ell)} \leq 4 \ln \ell/\ell.

Notice that the above calculation simply replaces π(2)\pi(\ell^2) by 2/ln2\ell^2/ \ln \ell^2. Certain things should be cautioned. In particular, in the above example, when \ell is small (say, 55), the calculation can be off by a significant margin. But in general, once we take sufficiently large \ell (like 100100 or so), this calculation is quite accurate.

Exercise 55

Use PNT to prove that limnpn+1pn=1\lim_{n \rightarrow \infty} \frac{p_{n+1}}{p_n} = 1 where pnp_n is the nn-th prime number.

Being able to explain the English meaning of statements such as above is crucial for computer scientists. The above statement implies that two consecutive primes are arbitrarily close to each other infinitely many times. For instance, the predicate "pn+11.00001pnp_{n+1} \leq 1.00001 p_n" is true for infinitely many nn.

Exercise 56

Use PNT to prove that pnnlnnp_n \sim n \ln n.

Again, the English interpretation of the above statement is that pn[0.999nlnn,1.001nlnn]p_n \in [0.999 n \ln n, 1.001 n \ln n] for infinitely many nn. Such a statement is in fact implicit in many computer science results, e.g., when we say that a problem does not admit efficient algorithm, it means "every efficient algorithm must fail to solve the problem for infinitely many input" (therefore, hard computational problems may still be solved for real-world input).

Footnotes

  1. It might sound a bit silly that Amelia is trying to be smart in this situation. But, imagine a more real-world situation where Amelia is actually running a company, like Dropbox. In such a case, she has to deal with millions of users, not just our own Ariel, so saving the network's bandwidth would be extremely important.