Robert a.k.a. 00011 a.k.a. 1100010100000010011
Compression
import pandas as pd
import math
Fun with compressing names, which may or may not be interesting to you as a reader :) It does connect to data engineering and data science:
- Columnar storage holds very similar data; check out Roaring Bitmaps
- An old idea for comparing music is to compress a song together with a catalogue of music from a genre; the genre whose catalogue compresses the song best is the best match.
- How well data compresses tells you something about the intrinsic complexity/richness of the data. Dataset size is a very crude measure; compression (and compression-like ideas) tells you more.
- Encoders/decoders kind of work like lossy compression.
This article will be followed up by one on compressing time series.
def parse_perc(s: str) -> float:
    return float(s.strip('%')) / 100

def parse_count(s: str) -> int:
    return int(s.replace(',', ''))
df = pd.read_csv("names.csv", sep=";",
                 converters={
                     'Frequency (%) M': parse_perc,
                     'Frequency (%) F': parse_perc,
                     'Count M': parse_count,
                     'Count F': parse_count
                 })
df.tail()
The idea is that we construct character 2-grams and list the top 4 letters following each letter. For example, Jefferey would result in [je, ef, ff, fe, er, re, ey]. The letter e is the start of three 2-grams: ef, er and ey. By summing up the occurrences of each 2-gram we can find the most likely characters following a given letter.
Below we select the top 4:
top = 4
def two_grams(s: str):
    return list(zip(s[:-1], s[1:]))
# Construct a flattened list of weighted 2-grams and keep the top 4 per letter
weighted_two_grams = (
    pd.DataFrame(
        df.apply(
            lambda row: [(a, b, row['Frequency (%) M']) for a, b in two_grams(row['Name M'])],
            axis=1
        ).apply(pd.Series).stack().to_list(),
        columns=['first', 'second', 'freq']
    )
    .groupby(by=['first', 'second']).sum()    # total frequency per 2-gram
    .sort_values(by='freq', ascending=False)  # most frequent first
    .groupby(by='first').head(top)            # keep the top 4 per first letter
    .sort_index()
)
weighted_two_grams.head()
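As a quick sanity check (assuming the names in the CSV are uppercase, which is what the ord(c)-65 trick further down relies on), we can look at the 2-grams of a single name and the learned top 4 for one letter:
two_grams("JEFFEREY")
weighted_two_grams.loc[('E',)]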
To simplify we assume we only have 26 letters. To encode names we go through the name and output a (binary) code for each letter. We distinguish two types of codes:
- The position of the letter in the alphabet (A=0 ... Z=25)
- Its position in the top 4 followers of the preceding letter
Let's define some functions that do this for us!
def encode_char(c, prev):
    if prev is not None and c in weighted_two_grams.loc[(prev,)].index:
        # Type 0: position of c in the top 4 followers of the previous letter
        return (0, weighted_two_grams.loc[(prev,)].index.get_loc(c))
    else:
        # Type 1: position of c in the alphabet (A=0 ... Z=25)
        return (1, ord(c) - 65)
def encode(name):
    encoded = []
    prev = None
    for c in name:
        code = encode_char(c, prev)
        prev = c
        encoded.append(code)
    return encoded
def decode_char(code, prev):
    t, nr = code
    if prev is not None and t == 0:
        # Look up the nr-th most frequent follower of the previous letter
        return weighted_two_grams.loc[(prev,)].iloc[nr].name
    else:
        return chr(nr + 65)
def decode(codes):
    decoded = []
    prev = None
    for code in codes:
        prev = decode_char(code, prev)
        decoded.append(prev)
    return ''.join(decoded)
As an example, let's see how ROBERT is encoded and decoded:
encode("ROBERT")
decode(encode("ROBERT"))
How many bits do we need?
We could encode the two types of codes with a 0 followed by 2 bits, or a 1 followed by 5 bits. The two bits give the position in the top 4, and the 5 bits give the position in the 26-letter alphabet (2^5 = 32). There might be some room to use even fewer bits with some trickery.
The 2-gram lookup table requires 26 * 4 * 5 = 520 bits: sequentially store the top 4 followers for a...z; if a letter has fewer than four followers, pad with zeros.
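To make the arithmetic explicit (the variable names below are only for illustration):
code_bits_top4 = 1 + 2       # type bit + position in the top 4
code_bits_letter = 1 + 5     # type bit + position in the alphabet
table_bits = 26 * top * 5    # top 4 followers per letter, 5 bits each -> 520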
Suppose we encounter the top 1000 names proportionally to their frequency; then on average we need 21.92 bits per name:
def nr_bits(codes):
    # 1 type bit, plus log2(top) = 2 bits for a top-4 code or 5 bits for a full letter
    return sum([1 + (math.log2(top) if t == 0 else 5) for t, nr in codes])
df['enc'] = df['Name M'].map(lambda x: nr_bits(encode(x)))
(df['enc'] * df['Frequency (%) M']).sum()
Robert would be encoded as:
def to_bin(encoding):
    # Type bit followed by a fixed-width payload: 2 bits for a top-4 position, 5 bits for an alphabet position
    return ''.join([str(t) + format(nr, '02b' if t == 0 else '05b') for t, nr in encoding])
to_bin(encode('ROBERT'))
Can we do better?
If the top 1000 first names really cover 90% of all occurrences, we can also store those 1000 names in a fixed list. We then encode a name as follows:
- A single bit: 0 if the name is not in the list, 1 if it is
- If in the list: a 10-bit number < 1000 giving its position in the list
- If not in the list: 5 bits per letter
The table of names takes 5677 * 5 = 28385 bits to store (5 bits per letter). On average this takes roughly 10 bits per name.
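As a rough sketch of this scheme (assuming a hypothetical list top_1000 holding the 1000 most frequent names in order), encoding a single name could look like this:
def encode_with_list(name, top_1000):
    # Hypothetical sketch of the list-based scheme described above
    if name in top_1000:
        # flag bit 1 + 10-bit position in the list
        return '1' + format(top_1000.index(name), '010b')
    # flag bit 0 + 5 bits per letter, as before
    return '0' + ''.join(format(ord(c) - 65, '05b') for c in name)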
The list costs 28385 bits but saves the 520-bit 2-gram table, and each encoded name saves roughly 21.92 - 10 ≈ 12 bits, so the break-even point is at encoding roughly 2300 names from the top 1000:
(28385-520)/12
Probably the best idea is to mix the frequency-table approach with encoding only the top 200 names, as these already cover 72% of all names. The frequency table is very small and also works for names outside the top 1000.
df.sort_values(by='Frequency (%) M', ascending=False).head(200)['Frequency (%) M'].sum()
This is left as an exercise for the reader.