Jupyter notebook text-to-integer-conversion/text-conversion.ipynb

Kernel: Python 2

In this Jupyter Notebook we solve the problem of converting integers to base- $b$ arrays, and vice versa. By a base- $b$ array, we mean the array of base- $b$ digits that defines the integer when expressed in base- $b$ notation.

Converting integer to base- $b$

Here is a Python function that converts a given nonnegative integer $x$ into a base- $b$ array $[d_N, d_{N-1}, \dots, d_2, d_1, d_0]$ such that $x = \sum_{k=0}^N d_k b^k,$ where $0 \le d_k < b$ for each $k = 0, 1, \dots, N$ . For example, if $b=2$ then the array gives the binary representation of $x$ .

In [1]:

def int_to_base_array(b,x):
    ans = []
    while x > 0:
        ans = [x % b] + ans  # prepend digit to ans
        x = x//b
    return ans

Let's test the code.

In [2]:

base_array = int_to_base_array # shorter alias for typing convenience
base_array(2,7)

Out[2]:

[1, 1, 1]

In [3]:

base_array(10,357)

Out[3]:

[3, 5, 7]

In [4]:

base_array(5,357)

Out[4]:

[2, 4, 1, 2]

In [5]:

# better CHECK the last result
2*5**3 + 4*5**2 + 1*5**1 + 2*5**0

Out[5]:

357

Okay, our code seems to be working.
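As an extra sanity check, here is a small sketch that compares the base-2 output against Python's builtin bin function and looks at the edge case $x = 0$. The function is repeated so that the cell runs on its own; the comparison loop is an illustration added here, not part of the original notebook.

```python
def int_to_base_array(b, x):
    ans = []
    while x > 0:
        ans = [x % b] + ans  # prepend digit to ans
        x = x // b
    return ans

# bin(x) returns a string such as '0b111'; drop the '0b' prefix and compare.
for x in range(1, 200):
    digits = int_to_base_array(2, x)
    assert ''.join(str(d) for d in digits) == bin(x)[2:]

# Edge case: for x = 0 the while loop never runs, so the result is empty.
print(int_to_base_array(2, 0))  # prints []
```

Note that zero is represented by the empty array rather than by [0]. This is harmless for the round trip back to an integer, since an empty array converts back to 0.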

Converting from base- $b$ to integer

Now we want to code a function that reverses the above. Given an array of base- $b$ digits, we want to convert back to the original integer $x$ .

In [6]:

def base_array_to_int(b,digits):
    ans = 0
    for r in digits:
        ans = ans*b + r
    return ans
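It may help to see why this loop works. Each pass multiplies the running total by $b$ and adds the next digit, which amounts to evaluating the sum in nested form (this is usually called Horner's rule):

$$x = \sum_{k=0}^{N} d_k b^k = \bigl(\cdots\bigl((d_N b + d_{N-1})\,b + d_{N-2}\bigr)\,b + \cdots + d_1\bigr)\,b + d_0.$$

For the digits $[2, 4, 1, 2]$ in base 5, the running total takes the values $2, 14, 71, 357$:

$$((2\cdot 5 + 4)\cdot 5 + 1)\cdot 5 + 2 = (14\cdot 5 + 1)\cdot 5 + 2 = 71\cdot 5 + 2 = 357.$$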

As usual, we had better test it.

In [7]:

to_int = base_array_to_int # shorter alias for typing convenience
to_int(5, [2,4,1,2])

Out[7]:

357

In [8]:

to_int(10,[3, 5, 7])

Out[8]:

357

In [9]:

to_int(2, [1, 1, 1])

Out[9]:

7

Everything checks. This seems to be working.
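Three hand-picked examples are encouraging, but a loop can check many more. The following sketch verifies the round trip integer → digits → integer for several bases. Both functions are repeated so the cell is self-contained; the choice of bases and the range of $x$ are arbitrary.

```python
def int_to_base_array(b, x):
    ans = []
    while x > 0:
        ans = [x % b] + ans  # prepend digit to ans
        x = x // b
    return ans

def base_array_to_int(b, digits):
    ans = 0
    for r in digits:
        ans = ans * b + r
    return ans

# Converting to digits and back should return the original integer.
for b in (2, 5, 10, 26, 256):
    for x in range(0, 1000):
        assert base_array_to_int(b, int_to_base_array(b, x)) == x

print("round trip ok")
```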

Converting a text block to an integer (part one)

Next, I want to apply the functions defined above to convert a block of text to an integer, and from the integer back into text again. Such encodings are of crucial importance in cryptography. First we look at a childish solution, which assumes that all punctuation and white space have already been stripped from the text and that only lowercase letters of the English alphabet remain. Since there are 26 letters in the alphabet, we can work in base-26.

In [10]:

def text2int(text):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    base26array = [alphabet.index(char) for char in text]
    return base_array_to_int(26, base26array)

For example, let's convert the string helloworld to an integer using base-26 encoding.

In [11]:

text2int('helloworld')

Out[11]:

38933758647189

Next we define the inverse function to go backwards again.

In [12]:

def int2text(x):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    base26array = int_to_base_array(26, x)
    text = ''
    for digit in base26array:
        text = text + alphabet[digit]
    return text

This should convert the integer back into text.

In [13]:

int2text(38933758647189)

Out[13]:

'helloworld'
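One caveat is worth seeing in action: the letter a is the digit 0 in this encoding, so leading a's behave like leading zeros and are lost on the way back. The sketch below demonstrates this. The helper int2text_padded and its length parameter are hypothetical names introduced here to show one possible workaround, namely remembering the original length.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def int_to_base_array(b, x):
    ans = []
    while x > 0:
        ans = [x % b] + ans  # prepend digit to ans
        x = x // b
    return ans

def base_array_to_int(b, digits):
    ans = 0
    for r in digits:
        ans = ans * b + r
    return ans

def text2int(text):
    return base_array_to_int(26, [ALPHABET.index(ch) for ch in text])

def int2text(x):
    return ''.join(ALPHABET[d] for d in int_to_base_array(26, x))

# Hypothetical helper: pad with leading 'a' (digit 0) up to a known length.
def int2text_padded(x, length):
    text = int2text(x)
    return 'a' * (length - len(text)) + text

print(text2int('abc'))                      # prints 28
print(int2text(text2int('abc')))            # prints bc
print(int2text_padded(text2int('abc'), 3))  # prints abc
```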

Converting a text block to an integer (part two)

But the previous solution is childish, since there is no good reason to avoid punctuation and white space. Also, we ought to be able to distinguish between upper and lower case letters, and we should be able to handle special characters. In other words, we are now looking for a robust, industrial-strength solution. There are many good ways to solve this problem. Here I will use the ASCII encodings of characters used in modern digital computers. If you don't know what ASCII means, then look it up!

All we really need to know about ASCII is that it assigns a numerical code to every keyboard character. Strictly speaking, ASCII defines only 128 characters (codes 0 through 127), but each code is stored in one byte, so we can treat the text as a string over an alphabet of 256 byte values and work in base-256. The number $256 = 2^8$ appears here precisely because our computers are designed to work with bytes, which are bit strings of length 8, and there are 256 possible bit strings of length 8. The second thing we need to know is that Python has a builtin bytearray function that, in Python 2, converts text to an array of its numerical ASCII codes, character by character.
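To make the character codes concrete, here is a tiny sketch using the builtin ord function together with bytearray. The sample string "Hi!" is an arbitrary choice, and a bytes literal is used so that the snippet behaves the same way under Python 2 and Python 3.

```python
# ord gives the numerical code of a single character.
codes = [ord(ch) for ch in "Hi!"]
print(codes)  # prints [72, 105, 33]

# bytearray yields the same numbers, one per byte.
print(list(bytearray(b"Hi!")) == codes)  # prints True

# Every code fits in one byte, so each is a valid base-256 digit.
print(all(0 <= c < 256 for c in codes))  # prints True
```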

In [14]:

def text2int(text):
    barray = list(bytearray(text))
    return base_array_to_int(256, barray)

In [15]:

text2int("Now is the time to worry! Indeed, 'tis!")

Out[15]:

2556412079982006913653956174919156864365338629148352603563830428006298034728105396100932399905L

In [16]:

def int2text(x):
    base256array = int_to_base_array(256, x)
    barray = bytearray(base256array) # coerce to a bytearray
    return str(barray)

In [17]:

int2text(2556412079982006913653956174919156864365338629148352603563830428006298034728105396100932399905L)

Out[17]:

"Now is the time to worry! Indeed, 'tis!"
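The two cells above rely on Python 2 behaviour: under Python 3, bytearray(text) raises a TypeError for a str without an encoding, and str(barray) returns the repr rather than the text. The following is a sketch of one way to adapt the functions, assuming the text is pure ASCII. The explicit encode and decode calls are my adaptation, not the notebook's original code.

```python
def int_to_base_array(b, x):
    ans = []
    while x > 0:
        ans = [x % b] + ans  # prepend digit to ans
        x = x // b
    return ans

def base_array_to_int(b, digits):
    ans = 0
    for r in digits:
        ans = ans * b + r
    return ans

def text2int(text):
    data = text.encode('ascii')  # fails on non-ASCII input (assumption: ASCII only)
    return base_array_to_int(256, list(bytearray(data)))

def int2text(x):
    base256array = int_to_base_array(256, x)
    return bytes(bytearray(base256array)).decode('ascii')

message = "Now is the time to worry! Indeed, 'tis!"
n = text2int(message)
assert int2text(n) == message

# On Python 3, the builtin int.from_bytes computes the same integer.
if hasattr(int, 'from_bytes'):
    assert n == int.from_bytes(message.encode('ascii'), 'big')
```

As with the base-26 version, a leading zero digit (here a NUL byte at the start of the text) would be lost in the round trip.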

We can understand how the function definitions work by running them, one step at a time, on some test data. For example, let's suppose we have the text string "6 = half Dozen!". Let's run the text2int definition on this text string, one step at a time, showing all the intermediate steps.

In [18]:

# text2int simulation
text = "6 = half Dozen!"
b = bytearray(text); b

Out[18]:

bytearray(b'6 = half Dozen!')

In [19]:

barray = list(b); barray

Out[19]:

[54, 32, 61, 32, 104, 97, 108, 102, 32, 68, 111, 122, 101, 110, 33]
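The remaining step of text2int feeds this list to base_array_to_int with $b = 256$. Here is a sketch of that step written out as an explicit loop, followed by the reverse direction as in int2text. The variable names ans, digits and x are chosen here for illustration, and the final decode call is written so that it also works under Python 3.

```python
barray = [54, 32, 61, 32, 104, 97, 108, 102, 32, 68, 111, 122, 101, 110, 33]

# text2int, last step: accumulate the base-256 digits into one integer.
ans = 0
for r in barray:
    ans = ans * 256 + r

# int2text simulation: peel off base-256 digits, least significant first.
digits = []
x = ans
while x > 0:
    digits = [x % 256] + digits  # prepend digit
    x = x // 256

assert digits == barray
print(bytes(bytearray(digits)).decode('ascii'))  # prints 6 = half Dozen!
```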
