I frequently work with teams that analyse fairly large datasets with nothing but Excel. While you can get quite far with Excel, there are other tools that allow you to extract meaning from data more quickly, and automate repeated steps in the process. In this blog entry, I'd like to discuss some the tools that I use, and explain where they are appropriate.
First of all, spreadsheets are very useful tools. They are familiar, versatile, interactive, and widely available. However, here are some signs that spreadsheets may not be the best tool for a particular job:
- The data set is huge, and makes your computer creak and groan
- There are lots of transformations to the data before the real analysis can be done
- It takes a really long time to recalculate, reducing your productivity
- The data gets updated frequently, requiring you to import new data and make sure all the formulas still line up
- You need to create visualizations beyond basic graphs (e.g., time lines, network graphs, customized maps, etc.)
- You generally feel that you are "pushing Excel beyond its limits"
In these cases, there is an easier way.
Introducing scripting languages
Scripting languages are open-source programs that allow you to write "scripts", or sequences of commands, that read and analyse data from different sources, perform calculations, and produce output such as reports, graphics, and new data.
They work as follows:
- You download and install the relevant software (typically small and usually free)
- You run the "interpreter", and can then type commands to see how it works or test things, or to perform impromptu exploration
- For anything non-trivial, you can write "scripts" or programs, that run a series of commands, including looping operations and conditional branching
- Scripts can make use of libraries of pre-written code, which can be downloaded to extend the scripting language by adding capabilities such as connecting to databases, helping you to create graphics, turning your script into a web site, etc.
The basic difference is that while spreadsheets require you to type numbers and calculations into the same sheet, scripting languages separate the data from the processing rules. This gives you several advantages:
- Scripts can be re-used on different data sets
- Scripts are more transparent, i.e., it is easier to read the program and see what it is doing
- Scripts can do more sophisticated processing than is possible with spreadsheeet formulas, because they can go beyond the contraints of a cell
Of course, Excel includes VBA (Visual Basic for Applications) to provide some of this flexibility, but for power users I am convinced that a good scripting language provides a superior balance of power, flexibility, and ease of use.
The case for Python
The one I have found by far the most useful, however is Python. The software and documentation (including a great tutorial) can be downloaded free, and it is easy to get started.
Python is great for data analysis and visualization because:
- It is good at handling text, and most data includes text, not just numbers
- It includes a superb selection of
libraries (packages of reusable code that extend the basic capabilities of the language), for doing things like graphics, web access, database interfacing, and statistics
- It is very versatile, and equally useful for quick small scripts all the way up to full-fledged web applications
- It is an easy language to learn and explain, with an easy and straight-forward syntax
- It has built-in support for data handling constructs such as lists and dictionaries, allowing you to quickly define and manipulate rich data structures
- It interfaces with the real world, by allowing you to easily read and write data from/to a wide variety of sources, including different file formats, web sources, and databases
- It is fast enough for most purposes, and critical functions can be re-written in C if they need to be really fast (and this is almost never necessary)
- There is a huge community of people using it, with online forums, conferences, etc., so help is usually only a search away
- Python has excellent documentation, and a wide variety of good books is available
- The language is mature and reliable
A good way to start exploring Python is to download the software (available for Windows, Mac, and Linux), and work through the tutoral. For data analysis, look at file input and output, and check out matplotlib for basic charts and data visualization.
For books, I can recommend the following titles, all available in print or as e-books from O'Reilly:
- Think Python by Allen Downey for a gentle introduction to programming using Python (also available for free viewing online here
- Learning Python by Mark Lutz for a more comprehensive introduction to Python and its core libraries
- Python for Data Analysis by Wes McKinney for an overview of data-related capabilities of the language
Remember that O'Reilly gives you 50% of two or more e-books if you register with the site and log-in, and they have a daily deal for half price.
Finally, I'm developing a new course in Python for Data Analysis, designed to be delivered over six weekly lunch hour breaks. For a limited time, I'm offering the first module for free to organizations based in and around London. This first module covers the basic whys and hows of scripting language, and an interactive introduction to basic Python syntax, with a focus of elements required for computation and data handling.
Please get in touch if you would be interested.