Translate

Thursday, April 16, 2015

Two of the most useful spreadsheet tricks ever

Before I get going on making some sweet looking data visualizations, there are a couple of data-cleaning steps which make a world of difference. Both can be done relatively quickly (and painlessly) using OpenRefine (also known as Google Refine)"a powerful tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases."

There are so many useful options for working with data, what OpenRefine calls "facets" and "filters". Also, a simple interface tracks every step you make, allowing easy undo's/redo's anytime along the way. I'm going to go over two functions which I find incredibly useful in preparing data for later visualization: lengthening a wide dataset, and merging two datasets with a common column.

First, giving your data a growth spurt by trimming the fat. What the heck do I mean by that?

Well, lots of datasets which are downloadable from the internet come in a wide format. For example, World Bank data has its datasets sorted by year across columns. I'm going to use the data for carbon dioxide emissions to illustrate this function, and you can download the data here to follow along as well.

Create a project and upload the .csv file, leaving the default settings as is. You should get something that looks like this:


As you can see, individual years run across the columns making the dataset wide. What we want is to get all the yearly data into two columns, one with the year and the other with CO2 emissions. Here comes the easy part. Left click on the downward arrow next to the column heading 1960, hover over Transpose and select Transpose cells across columns into rows.


Fill in the dialogue box like below (from column 1960 to the last column it will create 2 new columns). There are some missing data values for certain countries so we don't want to Ignore blank cells (uncheck) but it's important to Fill down in other columns (check).


Now we have our data nice and tight width-wise, and 13640 rows long.


Second, adding data from another source can help add valuable details and more categorizations/filters for your original data. You can download the second dataset here. This has information about the region and income group for every country, which we're going to merge with our recently lengthened CO2 emissions dataset.

Like before, create a new project and import the nations.csv file, keeping the default import settings. Return back to the CO2 emissions browser tab. Left click on the downward arrow next to country, hover over Edit columns and select Add column based on this column. (note: you could also do this using the iso_a3 column since it is also featured in the nations.csv dataset)


Now we're going to use the GREL (Google Refine Expression Language) command: cell.cross("string projectName", "string commonColumn").cells["string columnName"].value[0]

Keep the quotation marks, but replace string projectName with nations csv, string commonColumn with country, and string columnName with region.


This will add a new column to the CO2 emissions dataset with the appropriate regions for the corresponding countries. Repeat the GREL cell.cross command to add a column containing income_group.


Voila! This dataset is now ready for a nice visualization makeover. I used it to make this treemap bar chart using tableau.

For other OpenRefine functions see https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

No comments:

Post a Comment