Hands-on: 5 reasons to switch to OpenRefine to clean data
It has a pretty steep learning curve but it’s definitely worth the effort to learn OpenRefine if you need to clean large amounts of very messy data.
For years I’ve used Google Sheets to work with my data, mostly because I believed that, besides Microsoft Excel, it was my only option. I mostly built spreadsheets using Google Sheets, so it made perfect sense to use the same programme to edit, filter and compare my data.
That was until I learnt about OpenRefine. The OpenRefine website describes the application as ‘a powerful tool for working with messy data: cleaning it; transforming it from one format into another, and extending it with web services and external data’. And that is exactly what it does, leaping past some of the basic limitations of both Excel and Google Sheets.
Here are my top 5 reasons to start using OpenRefine. You’ll never look back:
1. Consolidating different spreadsheets
Your data isn’t always in the same spreadsheet. I was working on a project about grade 12 school results and had split the data into different spreadsheets, according to the provinces the schools were in. If I wanted to consolidate the dataset in Google Sheets, I would probably have had to use the tedious copy-and-paste method, which leaves room for plenty of errors. My dataset was over 50,000 rows long and I knew that might cause problems.
I was pleasantly surprised when I realised this could simply be done in OpenRefine, as long as the column headings were identical. I did this by:
- Downloading the different datasets in the same format, saving them all into one zipped folder.
- Creating a new project by importing the zipped folder into OpenRefine, where I found all my consolidated data in one place. (I’m not sure if OpenRefine sets a limit on the number of spreadsheets you can add to a zipped folder, but I have zipped together and uploaded as many as nine spreadsheets at once without any issues.)
2. It’s a data clean-up master
OpenRefine’s ‘Cluster and edit’ tool is one of my favourites! To use it, click the arrow next to your heading of choice, select ‘Edit cells’ from the dropdown menu and then select ‘Cluster and edit’. This tool groups certain rows that likely have the same name, but have minor differences in spelling, spacing or capitalisation.
The application suggests options of what the spelling could look like. You can choose the one you prefer, or you can change one cluster’s spelling by typing in your own option.
3. Mistakes aren’t set in stone
OpenRefine keeps track of every move you make, much like the Google Sheets ‘version history’ tool. When you’re working with a big dataset in Google Sheets and want to restore an earlier version of your document, it can take several minutes to load, and may even cause your system to crash. In OpenRefine, however, each edit is tracked in the left-hand panel under the ‘Undo / Redo’ section and you can easily and quickly switch to an older version of your work.
4. Filtering is a breeze
Unlike Excel and Google Sheets, you do not have to enable the filtering feature when you use OpenRefine. The filtering option is automatically on when you create your project.
OpenRefine also offers different methods of filtering your data:
- The text facet option automatically groups all similar values under a specific column and gives you the number of entries under each value
- The text filter allows you to filter your data for specific words, numbers or phrases that you can type in yourself
All the filtering occurs comfortably on the left-hand side of the screen, where you can easily select and deselect values. This also makes it easy to filter for multiple columns or variables at once.
5. OpenRefine works offline
If you usually use Google Sheets, then you’ll be familiar with the frustration that comes with a weak internet connection. Internet problems mean Google Sheets problems because the quality of your connection directly affects the ability to work. This is another reason why OpenRefine is a great option. When you download OpenRefine it works like a desktop application – although it opens in a web browser, you can work even when your internet connection is offline.
Excel and Google Sheets are among the most popular options for working with data, but there is no need to feel chained to these programmes when better options exist.
- Download OpenRefine here
- Watch: OpenRefine’s introductory video series
- If you are looking for a good breakdown of the basics, check out this tutorial from Open Data Literacy
- The Outlier Learning is now offering training programmes. Find out more here
- Sign up to the DataBites newsletter for more on data journalism training, tips and stories