Finding anagrams using Godel numbering and Randomization

This post is about a neat application of Gödel numbering. This post is inspired by a question my friend was asked during an interview for Microsoft.

Gödel numbering was invented by Kurt Gödel for the proof of his incompleteness theorems. Consider the alphabet \(\Sigma =\{ a,b,c,d,...,z\}\). Gödel numbering creates a one to one encoding scheme \(G: \Sigma^* \to N\), by assigning to each symbol in the alphabet a positive integer \(a=1, b=2,...,z=26\) and encodes strings by the product of successive primes raised to these powers.

\[G(x_1x_2...x_n) = 2^{x_1} \cdot 3^{x_2} \cdots p_n^{x_n}\]

So in theory, every string can be expressed as a natural number and every natural number can be broken down into a string by prime factorization. There are some practical limitations here:

The numbers get very large very fast, representing them would be difficult. For example: \(G(abcd) = 2 \cdot 9 \cdot 125 \cdot 2401 = 5402250\)
Integer Factorization is very difficult for large numbers. There is no efficient, non-quantum integer factorization algorithm for integer factorization. Moreover, it is not known exactly which complexity classes contain the decision version of the integer factorization problem.

Now the problem:

You are given a list of very large strings of equal length, where each string consists only of lower case alphabets. Group these strings into sets of anagrams. For example:

Input : \([abcd,abde,acdb,ebad,bade]\)

Output: \([abcd,acdb]\) and \([abde,ebad,bade]\)

Solution: We want strings which are anagrams to have the same \(G\) value. Applying the \(G\) function directly wont work as it takes into consideration the location of the symbols as well. Ex:\(G(ab)=18\) and \(G(ba)=12\).

Consider this example: \(aaaaabbbbb,ababababab, aabbaabbab\). These three strings are anagrams, all such anagrams could be represented as the multi-set \(\{a:5,b:5\}\) . We need to design a function which would map from the set of multi-sets of \(\{a,b,c,..z\}\) to \(N\)

Instead of assigning the values \(1,2,3,4,...,26\), lets assign the values \(2,3,5,7,...,p_{26}\) to \(a,b,c,d,...,z\) respectively. So \(\{a:5,b:5\}=2^5\cdot3^5=7776\)

The new \(G\) function we are using is this:

\[G(x) = 2^{ \#a} \cdot 3^{ \#b} \cdot 5^{ \#c} \cdots p_{26}^{\#z}\]

Where \(\#a\) represents the number of \(a\)s in \(x\).

Given a string \(x_1x_2...x_n\), \(G(x_1x_2...x_n)\) can be calculated in a single pass through the string as:

\[G(x_1x_2...x_n) = V[x_1]\cdot V[x_2] \cdots V[x_n]\]

Where \(V[x]\) is the value assigned to the alphabet \(x\), in this case \(p_x\).

Considering the original example, \([abcd,abde,acdb,ebad,bade]\). applying the \(G\) function to each string in this array we get \([120,144,120,144,120]\). The strings having the same \(G\) function value will be anagrams and they can be put in the respective sets in linear time.

A concise implementation in python would look like this:

from operator import mul
from collections import defaultdict
primes=[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
alphabets='abcdefghijklmnopqrstuvwxyz'

Values= dict(zip(alphabets, primes))

def G(s):
	L = [Values[i] for i in s]
	return reduce(mul, L, 1)

L=['abcd','abde','acdb','ebad','bade']
V=[G(s) for s in L]


d=defaultdict(list)
for v,l in zip(V,L):
	d[v].append(l)

print d.items()
# [(210, ['abcd', 'acdb']), (462, ['abde', 'ebad', 'bade'])]

We still have to deal with large numbers, but there is a clever way of using Randomization to overcome this.

Let the size of the largest string in the list be \(s\). So the largest possible value of the \(G\) function could be \(101^s\). The number of bits needed to represent this number in binary is \(\log_{2}101^s \le 7s\). Let \(z=7s\).

Choose a prime \(P\) uniformly at random from the interval \([103,z^d]\) and use this number as a modulus for calculating the \(G\) value of the strings. We will decide a suitable value of \(d\) later.

If \(X\) and \(Y\) are anagrams, then,\(G(X) \pmod P = G(Y) \pmod P \quad \forall P \in Primes[103,z^d]\)

In python this can be done by using this function instead of the one above:

def mod_mul(P):
	return lambda x,y:((x%P)*(y%P))%P
	
def G(s,P):
	L = [Values[i] for i in s]
	return reduce(mod_mul(P), L, 1)

This may not always work. If \(X\) and \(Y\) are not anagrams, then there might be some \(P\) for which \(G(X) \pmod P = G(Y) \pmod P\). In this case \(P\) divides \(\mid G(X)-G(Y)\mid\). Since there is some probability of going wrong on a false instance, this is a one-sided Monte Carlo algorithm.

Lets find an upper bound for failing to find the right anagram sets. Consider a list of \(l\) string, each of length \(s\) and none of which are anagrams. Let \(X\) and \(Y\) be two strings in the list.

It can be shown that there are at most \(z-1\) primes such that \(P\) divides \(\mid G(X)-G(Y)\mid\). From the Prime number theorem, we have that the number of primes in the interval \([2,m]\) approaches \(m/ \ln m\) as \(m\) approaches infinity. So the probability of choosing a prime \(P\) which will group \(X\) and \(Y\) as anagrams is :

\[\frac{z-1}{z^d/\ln z^d} \le \frac{d\ln z}{z^{d-1}}\]

As the size of the list is \(l\), there exist \({l \choose 2}\) such pairs of \(X\) and \(Y\). Using a simple union bound, we can upper bound the probability of failure:

\[Prob(failure) \le {l \choose 2} \frac{d\ln z}{z^{d-1}} \le l^2 \frac{d\ln z}{2z^{d-1}}\]

Hence we have \(Prob(failure) \le 1/2\) for

\[l \le \sqrt{\frac{z^{d-1}}{d\ln z}}\]

We need to choose a \(d\) such that:

\[\ln(l^2 \ln z) \le (d-1)\ln z - \ln d\\ \ln(l^2 \ln z) \le (d-1) \ln z\\ \frac{\ln(l^2 \ln z)}{\ln z} + 1\le d\]

We can run this algorithm \(k\) times, each time choosing a random prime between 103 and \(z^d\) and amplify the success probability.

Reference: Design and Analysis of Randomized Algorithms

Written on July 4, 2016