Estimates on primes and fingerprinting

Let $\pi(x)$ be the number of primes less than or equal to $x$.

Theorem 21 (Prime number theorem - PNT)

$\pi(x) \sim \frac{x }{\ln x}$

The theorem was proved independently by Hadamard and de la Vallée Poussin in 1896. It describes the density of primes: among the $x$ numbers between $1$ and $x$, roughly a $1/\ln x$ fraction are prime (for sufficiently large $x$). This is a nontrivial statement that we will not prove here. For instance, to estimate the number of primes that can be written with at most $20$ decimal digits, we take $x = 10^{20}$, so $\pi(x)$ is roughly $10^{20}/(20 \ln 10) \approx 2.17 \times 10^{18}$, a huge number of primes.

In these notes, we will see an application of PNT in computer science. Note that we will be talking about probability without having covered it in detail yet. Suppose Amelia and Ariel are working on the course lecture notes from two different locations. Amelia has file $x$ and Ariel has file $y$ (both encoded as bit strings of length at most $\ell$). Amelia is confident that her file is the most up-to-date, while Ariel (having been out surfing on a California beach) may not have the latest version. How can Amelia efficiently check whether Ariel has the correct file?

Of course, Amelia can ask Ariel to send his file $y$ to her, and this would consume $\ell$ bits of the network's bandwidth. Can they do better than this?¹

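Amelia picks a random prime $p$ between $1$ and $\ell^2$ and sends it to Ariel along with $x \bmod p$.
Upon receiving $p$ and $x \bmod p$, Ariel computes $y \bmod p$ and checks whether it equals the received value, i.e., whether $x \equiv y \pmod{p}$. If it does, he concludes that the two files agree.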

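This communication protocol uses very little bandwidth: at most $4\lceil \log_2 \ell \rceil$ bits in total (instead of $\ell$ bits), since both $p$ and $x \bmod p$ are smaller than $\ell^2$ and therefore fit in $2\lceil \log_2 \ell \rceil$ bits each. The verification process is correct with high probability (as mentioned, we will cover probability more intensively later).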
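The two steps of the protocol can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: files are modeled as non-negative integers, and the names `is_prime`, `random_prime`, `amelia_message`, and `ariel_check` are hypothetical helpers introduced here, not part of the notes. The random prime is drawn by rejection sampling with trial division, which is cheap because candidates never exceed $\ell^2$.

```python
import random


def is_prime(n):
    """Trial division. Candidates are at most l^2, so divisors only go up to l."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def random_prime(limit, rng):
    """Uniformly random prime in [2, limit], by rejection sampling."""
    while True:
        candidate = rng.randint(2, limit)
        if is_prime(candidate):
            return candidate


def amelia_message(x, ell, rng):
    """Amelia's side: pick a random prime p <= l^2 and send (p, x mod p)."""
    p = random_prime(ell * ell, rng)
    return p, x % p


def ariel_check(y, message):
    """Ariel's side: accept iff y mod p equals the fingerprint he received."""
    p, fingerprint = message
    return y % p == fingerprint


rng = random.Random(0)  # fixed seed, only to make the demo repeatable
x = int.from_bytes(b"lecture notes v2", "big")  # Amelia's file as an integer
ell = 8 * 16  # the file is 16 bytes, i.e. at most 128 bits
print(ariel_check(x, amelia_message(x, ell, rng)))      # same file
print(ariel_check(x + 1, amelia_message(x, ell, rng)))  # different file
```

When the files are equal the check always succeeds. In the second call the difference is $1$, which no prime divides, so that particular mismatch is always detected; other mismatches can slip through with small probability.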

Lemma 11

The probability that this algorithm makes an error is at most $4(\ln \ell)/\ell$.

Proof:

There are two cases. In the first case, when Ariel has the correct file, the algorithm never makes an error since $x = y$ implies that $x \equiv y \pmod{p}$ for every $p$. In the other case, when $x \neq y$, we need to analyze the probability that $p \mid (x-y)$. Consider $z = |x-y|$ and note that $z$ is a positive number of at most $\ell+1$ bits, so $z < 2^{\ell+1}$. Let $k$ be the number of distinct primes that divide $z$. If $p_1,\ldots, p_k$ are the primes that divide $z$, we must have $z \geq p_1\cdot p_2 \cdot \ldots \cdot p_k$, and since each prime is at least $2$, this implies $z \geq 2^k$. Taking logarithms (base $2$) gives us $k \leq \ell+1$.

The algorithm makes an error exactly when Amelia happens to pick one of these primes $\{p_1,\ldots, p_k\}$. By the Prime Number Theorem, there are roughly $\ell^2/\ln(\ell^2) = \ell^2/(2\ln \ell)$ primes of value at most $\ell^2$, so the probability that $p$ is one of those $k$ primes is at most $\frac{k}{\ell^2/(2 \ln \ell)} = \frac{2k \ln \ell}{\ell^2} \leq \frac{2(\ell+1)\ln \ell}{\ell^2} \leq \frac{4 \ln \ell}{\ell}$, where the last step uses $\ell + 1 \leq 2\ell$.
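To see the bound in action, the sketch below computes the exact error probability for one deliberately bad pair of files and compares it with $4 \ln \ell/\ell$. The setup is an assumption made for illustration: files are integers, $\ell = 32$, and the difference $z$ is chosen as the product of the first nine primes so that it has as many small prime divisors as a number of that size can. The helper names `primes_up_to` and `error_probability` are hypothetical.

```python
import math


def primes_up_to(n):
    """All primes <= n via the sieve of Eratosthenes (assumes n >= 2)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            # Cross out the multiples of i, starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if sieve[i]]


def error_probability(x, y, ell):
    """Exact probability that a uniformly random prime p <= l^2 divides |x - y|.

    Meaningful as an error probability only when x != y.
    """
    z = abs(x - y)
    primes = primes_up_to(ell * ell)
    bad = sum(1 for p in primes if z % p == 0)
    return bad / len(primes)


ell = 32
# Adversarial difference: 2*3*5*7*11*13*17*19*23, which still fits in 32 bits.
z = math.prod([2, 3, 5, 7, 11, 13, 17, 19, 23])
x, y = z, 0  # any pair of files with |x - y| = z would do

print("exact error probability:", error_probability(x, y, ell))
print("bound 4 ln(l)/l        :", 4 * math.log(ell) / ell)
```

Even for this adversarial choice only $9$ of the $172$ primes up to $32^2 = 1024$ are bad, so the exact error probability stays well below the bound from the lemma.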

Notice that the above calculation simply replaces $\pi(\ell^2)$ by $\ell^2/\ln(\ell^2)$. Some caution is needed here: PNT is an asymptotic statement, so when $\ell$ is small (say, $5$), the calculation can be off by a significant margin. Once we take sufficiently large $\ell$ (like $100$ or so), the estimate becomes more reliable, although its relative error shrinks only slowly as $\ell$ grows.
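A quick numerical comparison makes the quality of the approximation concrete. The following sketch counts primes exactly with a sieve and prints the count next to the PNT estimate; the function names `prime_count` and `pnt_estimate` are hypothetical, and the values of $\ell$ are just sample choices.

```python
import math


def prime_count(n):
    """Return pi(n), the number of primes <= n, via the sieve of Eratosthenes."""
    if n < 2:
        return 0
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    for i in range(2, math.isqrt(n) + 1):
        if is_prime[i]:
            # Cross out the multiples of i, starting at i*i.
            is_prime[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return sum(is_prime)


def pnt_estimate(n):
    """The PNT approximation n / ln(n) to pi(n)."""
    return n / math.log(n)


for ell in (5, 100, 1000):
    n = ell * ell
    exact, approx = prime_count(n), pnt_estimate(n)
    print(f"l={ell:5d}  pi(l^2)={exact:6d}  "
          f"l^2/ln(l^2)={approx:10.1f}  ratio={exact / approx:.3f}")
```

For example, $\pi(25) = 9$ while $25/\ln 25 \approx 7.77$, and $\pi(10^4) = 1229$ while $10^4/\ln 10^4 \approx 1085.7$. In these cases the estimate undercounts the primes, which only makes the error bound in the lemma more conservative.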

Exercise 55

Use PNT to prove that $\lim_{n \rightarrow \infty} \frac{p_{n+1}}{p_n} = 1$, where $p_n$ is the $n$-th prime number.

Being able to explain the English meaning of statements such as the one above is crucial for computer scientists. The statement says that, in relative terms, consecutive primes eventually become arbitrarily close to each other. For instance, the predicate "$p_{n+1} \leq 1.00001 p_n$" is true for all sufficiently large $n$, and in particular for infinitely many $n$.

Exercise 56

Use PNT to prove that $p_n \sim n \ln n$ .

Again, the English interpretation of the above statement is that $p_n \in [0.999 n \ln n, 1.001 n \ln n]$ for all sufficiently large $n$, and in particular for infinitely many $n$. Quantification over infinitely many instances is in fact implicit in many computer science results, e.g., when we say that a problem does not admit an efficient algorithm, it means "every efficient algorithm must fail to solve the problem for infinitely many inputs" (therefore, hard computational problems may still be solved on real-world inputs).

¹ It might sound a bit silly that Amelia is trying to be smart in this situation. But imagine a more real-world situation where Amelia is actually running a company like Dropbox. In such a case, she has to deal with millions of users, not just our own Ariel, so saving network bandwidth would be extremely important.
