Beyond spreadsheets

Posted 2012-11-12 15:21 under data, python, programming

I frequently work with teams that analyse fairly large datasets with nothing but Excel. While you can get quite far with Excel, there are other tools that allow you to extract meaning from data more quickly and to automate repeated steps in the process. In this blog entry, I'd like to discuss some of the tools that I use, and explain where they are appropriate.

First of all, spreadsheets are very useful tools. They are familiar, versatile, interactive, and widely available. However, here are some signs that spreadsheets may not be the best tool for a particular job:

In these cases, there is an easier way.

Introducing scripting languages

Scripting languages are open-source programs that allow you to write "scripts", or sequences of commands, that read and analyse data from different sources, perform calculations, and produce output such as reports, graphics, and new data.

They work as follows:

The basic difference is that while spreadsheets require you to type numbers and calculations into the same sheet, scripting languages separate the data from the processing rules. This gives you several advantages:

Of course, Excel includes VBA (Visual Basic for Applications) to provide some of this flexibility, but for power users I am convinced that a good scripting language provides a superior balance of power, flexibility, and ease of use.
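To make the separation of data and processing rules concrete, here is a minimal sketch in Python using only the standard library. The column names (`region`, `amount`), the sample figures, and the function names are all hypothetical, invented purely for illustration. The data is embedded as a string so the example runs on its own; in practice it would live in a separate file.

```python
import csv
import io

# Hypothetical sample data. In real use this would be a separate file,
# opened with something like open("sales.csv", newline="").
SAMPLE_CSV = """region,amount
North,120.50
South,80.00
North,99.50
East,42.25
"""


def load_rows(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def total_by_region(rows):
    """Processing rule: sum the 'amount' column for each region."""
    totals = {}
    for row in rows:
        region = row["region"]
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


# The data (SAMPLE_CSV) and the rule (total_by_region) are kept apart,
# so the same rule can be re-run unchanged on next month's file.
rows = load_rows(io.StringIO(SAMPLE_CSV))
totals = total_by_region(rows)

for region, total in sorted(totals.items()):
    print(f"{region}: {total:.2f}")
```

Because the rule is written once and kept apart from the numbers, applying it to a new dataset means changing the input, not copying formulas into a new sheet.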

The case for Python

There are many scripting languages, and among others I have used Perl, Ruby, and shell scripts. In addition, I frequently use R for data analysis and visualization.

The one I have found by far the most useful, however, is Python. The software and documentation (including a great tutorial) can be downloaded free, and it is easy to get started.

Python is great for data analysis and visualization because:

Getting Started

A good way to start exploring Python is to download the software (available for Windows, Mac, and Linux) and work through the tutorial. For data analysis, look at file input and output, and check out matplotlib for basic charts and data visualization.
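As a first taste of file input and output, the sketch below writes a small CSV file, reads it back, and computes a few summary figures. It uses only the standard library; the file name, the `reading` column, and the sample values are assumptions made up for this example.

```python
import csv
import os
import statistics
import tempfile


def write_measurements(path, values):
    """Write one value per row under a 'reading' header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["reading"])
        for value in values:
            writer.writerow([value])


def read_measurements(path):
    """Read the 'reading' column back as a list of floats."""
    with open(path, newline="") as f:
        return [float(row["reading"]) for row in csv.DictReader(f)]


def summarise(values):
    """Return a few basic summary figures for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


# Work in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readings.csv")  # hypothetical file name
    write_measurements(path, [3.0, 4.5, 6.0, 2.5])
    readings = read_measurements(path)
    summary = summarise(readings)

print(summary)
```

Once the values are in a list like `readings`, a natural next step would be to pass them to matplotlib (for example `matplotlib.pyplot.plot`) to draw a basic chart; that part is left out here so the example runs without extra installs.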

For books, I can recommend the following titles, all available in print or as e-books from O'Reilly:

Remember that O'Reilly gives you 50% off two or more e-books if you register with the site and log in, and they have a daily deal for half price.

Python Course

Finally, I'm developing a new course in Python for Data Analysis, designed to be delivered over six weekly lunch-hour sessions. For a limited time, I'm offering the first module for free to organizations based in and around London. This first module covers the basic whys and hows of scripting languages, and gives an interactive introduction to basic Python syntax, with a focus on the elements required for computation and data handling.

Please get in touch if you would be interested.
