10 min read

I love Python, and the Data Analysis library pandas in particular. It’s a fantastic library for working with tabular data, contains connectors for many common data formats out of the box, has an excellent performance profile (as it sits on top of NumPy), and has many many common data operations built in.

As a data analyst, I use pandas every day, for most of the day. Sometimes I think of ways I could make more use out of it, for example this gist for getting results from Neo4J into a DataFrame. Sometimes, like today, I think about how I could misuse it for a purely procrastinatory purpose.

Want to learn Python? This week our Python Fundamentals course is free inside Mapt. It’s an accessible introduction that’s comprehensive enough to give you the confidence you need to explore Python further. Click here, log in, and go straight to it! 

FizzBuzz

FizzBuzz is a popular, simple test of a programmer’s knowledge and ability, often employed in job interviews to test a candidate’s programming/problem solving ability.

The basic problem is usually stated as follows:

"Write a program that prints the numbers from 1 to 100. But for multiples of three print 'Fizz' instead of the number and for the multiples of five print 'Buzz'. For numbers which are multiples of both three and five print 'FizzBuzz'."

There is so much discussion of FizzBuzz on the internet, that it almost seems pointless to discuss it anymore. Regardless of your opinion of it as a tool to test programmer knowledge/ability, it is certainly an interesting, if somewhat basic challenge, and is still approached (with varying degrees of irreverence) to this day.

The Challenge

Partially inspired by Joel Grus’ tongue in cheek article above, and partially out of pure interest, I recently started to ponder how one might go about implementing a FizzBuzz solution using pandas. The intention would be to rely on as few non-pandas operations as possible, whilst still producing legible code.

First of all, here’s a fairly standard vanilla Python solution to the problem:

def fizzbuzz(x):
    '''returns the fizzbuzz output for an integer x'''
    output = ''
    if x % 3 == 0:
        output += 'Fizz'    
    if x % 5 ==0:
        output += 'Buzz'
    if (x % 3) > 0 and (x % 5) > 0:
        output += str(x)
    return output

for i in range(1,101):
    print fizzbuzz(i)

Now, the most simple way to apply this with pandas would be to just use the apply function on a series of integers:

import pandas as pd
pd.Series(range(1,100)).apply(fizzbuzz)

which is simple, terse and readable, but the logic is still being done outside of pandas. What I’m really after is a way to express that fizzbuzz logic entirely with pandas operations.

My first crack at this is displayed below.

#create a DataFrame containing all the values we need, all the integers,
#and a series of Fizzes and Buzzes we can hack together to make our output
values = pd.DataFrame(
    {
        'n':range(1,101),
        'fizz':'Fizz',
        'buzz':'Buzz'
    }
)

#get columns for each of the seperate output types
fizzes = values[values['n'] % 3 == 0]['fizz']
buzzes = values[values['n'] % 5 == 0]['buzz']
ints = values[(values['n'] % 3 > 0) & (values['n'] % 5 > 0)].n.apply(str)

#put the columns together as one dataframe again
outputs = pd.concat([fizzes,buzzes,ints], axis=1)

#for each row, concatenate the non-null values together
outputs.apply(lambda x: x[~pd.isnull(x)].str.cat(),axis=1)

First, we’re taking advantage of pandas’ quite clever constructor functions to create a DataFrame with all the values we need.

Secondly, we use pandas’ expressive filtering syntax to create three separate columns containing ONLY the values we need.

The third part is to concatenate these columns together to give us one more dataframe containing only the values we need. This takes advantage of pandas’ powerful and extensive indexing capabilities, which returns us a dataframe with a nice contiguous index, and all our values in order.

Finally, we use apply again to turn each row into a single string. When you supply axis = 1 to the apply method, it feeds each row to your operation in the form of a Series, which makes it easier to work with the row in question.

I was fairly happy with this as a first pass. All the logic for deciding what to print is done with pandas operations, and the flow is fairly clear. It’s still pretty long though. Yes it could be condensed down to two (very long and ugly) lines of code, but it still doesn’t feel like an ‘optimal’ solution to this (very silly) problem.

We can condense this logic down further, and reach one-liner-nirvana by making even more impractical use of pandas’ DataFrame constructor:

pd.DataFrame(
    {
        'f':pd.Series('Fizz',index=filter(lambda x: x% 3 == 0, range(1,100))),
        'b':pd.Series('Buzz',index=filter(lambda x: x% 5 == 0, range(1,100))),
        'n':pd.Series(
                filter(lambda x: x%3>0 and x%5>0,range(1,100)),
                index=filter(lambda x: x%3>0 and x%5>0,range(1,100))
            ).apply(str)
    }
).apply(
    lambda x: x[~pd.isnull(x)].str.cat(),
    axis=1
)

This really feels like progress. The flow is essentially the same as before, but now we construct the Fizz, Buzz and output Series within the DataFrame constructor.

This makes the code more succinct, and also serves the crucial purpose of saving some precious bytes of memory, reducing the evident strain on my 16GB, i5 work PC.

We can still do better however. You’ll note that the above solution contains a filter() function for the fizzbuzz logic, which is a step backwards (within this already backwards problem), Also, the FizzBuzz lines actually say BuzzFizz, which is very annoying and not actually fixable.

Version 0.18 of pandas introduced some really neat changes including an expanded set of arithmetic operations for Timedeltas.

Another neat little addition was the addition of the ability to filter Series objects using a callable condition. You can also supply an ‘other’ argument to return when the conditions aren’t met. This allows us to bring that filter logic back into pandas, and create our best/worst solution yet:

pd.concat(
    [
        pd.Series('Fizz', index=range(1, 100)).where(lambda x: x.index % 3 ==0, other=''),
        pd.Series('Buzz', index=range(1, 100)).where(lambda x: x.index % 5 ==0, other=''),
        pd.Series(range(1,100),index=range(1,100)).where(lambda x:(x % 5 > 0) & (x % 3 > 0), '').apply(str),
    ],
    axis=1
).apply(lambda x: x.str.cat(), axis=1)

Ok, Ok, I admit it, I’ve gone too far. I looked at this daft code, and realised my quest for pandas based purity had driven me to creating this abomination. But hey, I learnt a bit about my favorite library in the process, so what’s the harm?

If I relax my condition for pandas purity slightly, this can be coaxed into something almost readable:

pd.concat(
    [
        pd.Series('Fizz', range(0, 100, 3)),
        pd.Series('Buzz', range(0, 100, 5)),
        pd.Series(range(100)).where(lambda x:(x % 5 > 0) & (x % 3 > 0), '').apply(str),
    ],
    axis=1
).apply(lambda x: x.str.cat(), axis=1)[1:]

And that’s where I decided to stop. It’s not going to get any better/worse than that, and besides, I have actual work to do. It’s been an entertaining bit of work, and I’m satisfied this solution is about as small as I’m going to get within the ridiculous constraints I set myself.

Also, I can hear the ardent code golfers screaming at me that NONE of my pandas solutions are valid, because they all print an index column as well as the fizzbuzz values. I don’t think there is a way of overcoming this purely in pandas, so you’ll have to settle for just wrapping any of them with

'n'.join(...)

Performance

So, we have several methods for fizzbuzzing with pandas. The next obvious question is about scalability and performance. What if you’re interviewing to be a Big Data Engineer at Google, and they ask you for the fastest, most scalable, general fizzbuzz solution in your arsenal?

Let’s take our functions above for a test drive. We’ll encapsulate each as a function, then see how they scale as we ask for ever larger ranges. I’m using a quick and dirty method of timing this as I don’t actually care that much.

def fizzbuzz(x):
    '''returns the fizzbuzz output for an integer x'''
    output = ''
    if x % 3 == 0:
        output += 'Fizz'    
    if x % 5 ==0:
        output += 'Buzz'
    if (x % 3) > 0 and (x % 5) > 0:
        output += str(x)
    return output





#our vanilla solution
def fb_vanilla(rng):
    return map(fizzbuzz, range(1, rng+1))




#our trivial pandas solution
def fb_pandas_vanilla(rng):
    return pd.Series(range(1, rng+1)).apply(fizzbuzz)





#I'm going to skip the first big pandas solution, this is pretty much identical

#our second pandas solution, down to one line
def fb_pandas_long(rng):
    return pd.DataFrame(
        {
            'f':pd.Series('Fizz',index=filter(lambda x: x% 3 == 0, range(1,rng+1))),
            'b':pd.Series('Buzz',index=filter(lambda x: x% 5 == 0, range(1,rng+1))),
            'n':pd.Series(
                    filter(lambda x: x%3>0 and x%5>0,range(1,rng+1)),
                    index=filter(lambda x: x%3>0 and x%5>0,range(1,rng+1))
                ).apply(str)
        }
    ).apply(
        lambda x: x[~pd.isnull(x)].str.cat(),
        axis=1
    )




#our more succinct, pandas only solution.
def fb_pandas_shorter(rng):
    return pd.concat(
        [
            pd.Series('Fizz', index=range(1, rng+1)).where(lambda x: x.index % 3 ==0, other=''),
            pd.Series('Buzz', index=range(1, 100)).where(lambda x: x.index % 5 ==0, other=''),
            pd.Series(range(1,rng+1),index=range(1,rng+1)).where(lambda x:(x % 5 > 0) & (x % 3 > 0), '').apply(str),
        ],
        axis=1
    ).apply(lambda x: x.str.cat(), axis=1)




#our shortest solution, relying on some non-pandas stuff
def fb_pandas_shortest(rng):
    return pd.concat(
        [
            pd.Series('Fizz', range(0, rng+1, 3)),
            pd.Series('Buzz', range(0, rng+1, 5)),
            pd.Series(range(rng+1)).where(lambda x:(x % 5 > 0) & (x % 3 > 0), '').apply(str),
        ],
        axis=1
    ).apply(lambda x: x.str.cat(), axis=1)[1:]





#Let's do some testing!

functions = [
    fb_vanilla,
    fb_pandas_vanilla,
    fb_pandas_long,
    fb_pandas_shorter,
    fb_pandas_shortest
]

times = {x.__name__:[] for x in functions}

tests = range(1,1000,100)

from time import time

for i in tests:
    for x in functions:
        t1 = time()
        _ = x(i)
        t2 = time()
        times[x.__name__].append(t2-t1)

results = pd.DataFrame(times, index = tests)

Well, so far so terrible. The first, longest solution actually scales O(n), which I find hilarious for some reason. The rest of my attempts fare a little better, but nothing compares to the vanilla python solution, which doesn’t even show up on this chart, because it completes in <1ms.

Let’s discard that long solution, and try an even more strenuous test, up to 100,000.

That seems pretty conclusive. Get me Google on the phone, I have solved this problem.

When should this be used?

NEVER. Like, seriously, those results are beyond terrible. I was expecting the pandas solutions to fall down compared to the vanilla python solutions, but this is ridiculous. If you’re asked to implement fizzbuzz in your next interview, the only possible reason you would have for using any of these is because you’re a flippant contrarian like me. Having said that, it’s important to note that I still love pandas, and don’t blame these results on the fantastic devs there. This is thoroughly not what the library is designed for, and I’m only doing this to test my own fluency with the tool.

Overall I think I acheived my goal with this. I made myself chuckle a few times, and I learnt a few things about pandas in the process.

If you have any suggestions for optimisations to this code, reach out in the comments or on Twitter. Equally, if you have any thoughts on why specifically these pandas functions are SOOO much slower than the vanilla solution, I’d be interested to hear them.

LEAVE A REPLY

Please enter your comment!
Please enter your name here