Monoids in python

Sometimes mathematicians come up with very obscure topics that seem like they will never be useful. Usually we realize, years later, that one of the mystical toys mathematiacians leave lying around is actually a perfect shaped tool for the problem at hand. I want to introduce you to one of these tools, that has been gaining traction in recent years, as a way to solve problems with big data.

A monoid is a combination of an object (a,b,c) and an operation ($+$) that meets the following conditions

  • the operation on two of the objects produces a new object of the same kind. e.g. $int + int = int$
  • the operation must be associative: $(a + b) + c = a + (b + c)$
  • a null object $e$ must exist, such that $e + a = a + e = a$

$(0,int,+)$ is a monoid:

  • $(a + b) + c = a + (b + c)$
  • $0+a=a+0=a$

$(1,int,\times)$ is a monoid:

  • $(a \times b)\times c = a\times(b\times c)$
  • $1\times a=a \times 1=a$

It turns out that if you have these properties, it becomes trivial to run these operations on massive datasets very quickly while using very little memory. Also splitting up these operations and running them in parallel becomes trivial. A couple caveats before we begin. We are doing this in python.

a duck typed language - if it moves like a duck and quacks like a duck, then its a duck.

But for something that cares so much about the types of objects, a duck typed language like python is not a very natural choice. These concepts fit much better in strongly typed languages like Scala or Haskell.

However, python is plenty to gain intuition for the topic. We will just need to do an extra step of typifying python objects into monoid compatable objects, and we will have to be careful that we don’t feed out monoids any objects of the wrong type.

class Monoid:
    def __init__(self, null, typify, op):
        self.null = null
        self.typeify = typify
        self.op   = op

    def __call__(self, *args):
        result = self.null
        for arg in args:
            arg = self.typeify(arg)
            result = self.op(result, arg)
        return result

Multiplication

String Concat

Counting

Other applications

We have seen some simple examples of monoids, but it turns out that they can get much more complicated.

  • Matrix Multiplication
  • Cartesian Products
  • Statistical Moments
  • Bloomfilters
  • HyperLogLogs
  • Stochastic Gradient Descent

Cartesian Products

A cartesian product

A cartesian product is just a fancy way of saying every combination of two lists of elements. There are many problems that can be solved by brute force, just by trying every combination. If you had a key ring of 10 keys and 2 locks, you could find a solution by checking every key against every lock.

The Cartesian Product of two sets of real numbers can be thought of as a cartesian grid. You could use the Cartesian product to generate all playing cards from [♥,♦,♣,♠] and [A,2,3,4,5,6,7,8,9,10,J,Q,K]. Cartiesian Products with an added filter are the conceptual basis for joins in SQL.

Lets check Cartesian products against our Monoids checklist?

  1. A cartesian it takes in two lists of elements and produces a new list. So ✓
  2. The order of the cartesian product doesn’t matter as long as the order of the pairs doesn’t matter. (it doesn’t matter if we try 10 keys against 2 locks, or 2 locks against 10 keys) ✓
  3. What about a null element?

We need some kind of list, that when combined with any other list, gives us that list. Depending on the use case, we can just create this by wrapping a list around an empty element. [’’]. This way when we generate a new combination:

$[’’] * [A,B] => [’‘+A, ‘‘+B] => [A,B]$

So if we can make a null element work for application then ✓

How do we print out every combination of 4 basepairs? When I give this problem to students, they often get stuck. They are easily able to make 2 for loops to generate every combination of 2 lists, but they start to feel dirty after adding the third for loop, and almost always start looking for a new solution by the fourth for loop.

If we use the Monoids concept, this problem becomes really easy.

def cartesian_prod(a_list,b_list):
    c = []
    for a in a_list:
        for b in b_list:
            c.append(a+b)
    return c

cart_prod_monoid  = Monoid([''],  lambda x: x, cartesian_prod)

base = ['A','T','C','G']
print cart_prod_monoid(base,base,base,base)

This will print out every possible 4 base pair strand of DNA, and the concept makes our code very simple

Another example usage of this. Lets say we wanted to find every possible word that your phone number spells out. One way to solve this problem would be to find every combination of letters and check if the word is in a list of english words. Here is my solution with Monoids.

phone_keys = {'2': "abc",
              '3': "def",
              '4': "ghi",
              '5': "jkl",
              '6': "mno",
              '7': "pqrs",
              '8': "tuv",
              '9': "wxyz",
              '0':' ',
              '1':' '}
#wget https://github.com/dolph/dictionary/blob/master/enable1.txt?raw=true
english_words = set([word.strip() for word in open('enable1.txt').readlines()])

def phone_words(n):
    possible_letters = [phone_keys[digit] for digit in str(n)]
    return [word for word in cart_prod_monoid(*possible_letters) if word in english_words]
    
    
#english_words    
phone_words(228)

Statistical Moments

I often joke, that counting things is pretty much all I do as a data scientist. But when you have millions or billions of objects to count, the real trick is dividing the task up between lots of computers.

Keeping a running sum of how many objects you have seen thus far is fairly trivial. If we want to divide up the task of counting between Alice and Bob. We can easily get the total count by adding together thier totals

just keep a running tally and update it as more objects pass. We have already seen how counting is a monoid.

Now what if we want Alice and Bob to keep track of the average mass of the objects as well as the count? At the end of the day we can combine their averages as long as we know the number of objects each saw. We just look at the difference between the averages and weight them the relative number they saw

It turns out there is a similar formula for computing variance as we go:

variance: $s^2_n = \frac{M_2}{n-1}$

Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1979), Updating Formulae and a Pairwise Algorithm for Computing Sample Variances. provides a formulation of the statistical moments that can be calculated like monoids

And Skewness

skewness: $g_1 = \frac{\sqrt{n} M_3}{M_2^{3/2}}$

And Kurtosis:

kurtosis: $g_2 = \frac{n M_4}{M_2^2}-3$

So here is my python implementation of that. I created a new class called StatisticalMoments, which only stores a few numbers worth of data in memory, but is able to summarize the shape of distribution of many many objects. I then override the addition operator for this class, which will handle the logic for combining these objects.

import math
from collections import namedtuple
from __future__ import division
import numpy as np
from scipy import stats

class StatisticalMoments(namedtuple('moments', ['n','mean','M2','M3','M4'])):
    def __add__(a,b):
        if a.n == 0:
            return b
        if b.n == 0:
            return a
        n = a.n + b.n
        delta = b.mean - a.mean
        delta2 = delta * delta
        delta3 = delta2 * delta
        delta4 = delta3 * delta
        mean = a.mean + delta*b.n/n
        M2 = a.M2 + b.M2 + delta2 * a.n * b.n / n 
        M3 = (a.M3 + b.M3 + delta3 * a.n * b.n * (a.n - b.n) / ( n * n ) +
                3*delta*(a.n*b.M2 - b.n*a.M2)/n)
        M4 = (a.M4 + b.M4 + delta4 * (a.n * b.n * (a.n * a.n - a.n * b.n + b.n * b.n)) / (n ** 3) +
                6 * delta2 * (a.n * a.n * b.M2 + b.n * b.n * a.M2) / (n ** 2) +
                4 * delta * (a.n * b.M3 - b.n * a.M3) / n )
        return StatisticalMoments(n,mean,M2,M3,M4)  
    
    def var(self):
        return self.M2/(self.n)
    
    def skew(self):
        return math.sqrt(self.n)*self.M3/math.sqrt(self.M2*self.M2*self.M2)
    
    def kurt(self):
        return self.n*self.M4/(self.M2**2) - 3
    
    def __repr__(self):
        s = "(n={}, mean={}".format(self.n,self.mean)
        if (self.n > 1):
            var = self.var()
            s += ",var={}".format(var)
        if (self.M2 > 0):
            skew = self.skew()
            s += ",skew={}".format(skew)
        if (self.M2 > 0):
            kurt = self.kurt()
            s += ",kurt={}".format(kurt)
        s += ')'
        return s

Then I just define a monoid using this new object type.

variance_monoid = Monoid( StatisticalMoments(0,0,0,0,0),lambda x: StatisticalMoments(1,x,0,0,0), lambda a,b: a+b)

and presto, we are can now keep track of all the statistical moments of data flying by, with almost no memory useage, and splitting this problem up between computers becomes trvial.

l = [2,2,7,2,3,4,4.5,6,6,6]

variance_monoid(*l)

Advantages:

  • Easy Parallelization
  • Incremental Aggregation for streaming applications
  • Simple Code
  • Unit testable

the Monoid framework is easy and can handle massive amounts of data.

Written on September 14, 2016