dinsdag 26 november 2013


This short tutorial will show you how to use Refine for cleaning your data. The data being  used are not real, but created in such a way that it represents a practical situation (information downloaded from a database) and at the same time shows the most important possibilities of cleaning data with Refine.
See also and republished: Paul Bradshaw Online Journalism Blog

1. Install Refine from the following site

2. Open Refine by clicking on the icon. First a black DOS screen will open and next your browser(Firefox or Chrome) opens. From here you can start a new project.

3. Download the spreadsheet Refine.xls from Google drive; import the this file in Refine. Once the project is created the spreadsheet will open in Refine. To see all rows set number of rows to 50.

The spreadsheet is an example and show information which could have been download from a database. The first column show company name, second column an abbreviation of a product and the number of products. Then we have columns for address city and country. In the last call the name of the persons is shown who placed the order.

There are several problems with the spreadsheet:
  • first the spelling of company names is not consistent and incorrect
  • there is no column showing the names of the products
  • in the last column with the names alphabetical ordering is not possible, only when we have the last names in separate column. Also gender is not listed, but that can be derived from the column names: dhr=mr= male and mevr=mrs=female.
  • finally if we want to geocode the data – place the addresses of the persons  in a map, all address information must be in one column.

4. Cleaning column 1: company names

Click on the drop down menu of header of column 1, company, and choose text facet.

On the left you see the companies listed. Click on cluster and choose 'key collision' and 'finger print' for clustering; that is correcting the names. Choose 'merge selected & re cluster'. 

Continue clustering with a new algorithm: ngram-fingerprint. Continue till you have the following names:

Van Houten5

Now we have the correct spelling of the company names; let's turn them into uppercase. Choose from column company in the drop down menu edit cells, common transformations, into uppercase.

5.  Split column 2 product name and number.  Go to edit column in the drop down menu and choose split into several columns. Use – as the separator and choose 2 columns. Give column 3 a new name: choose edit column and rename: number. Change the name of column 2 in product, by choosing in the drop down menu: edit column and rename column.

6. In column 2 we have a code for the product but not the name of the product. Here is the conversion:
Go in column 2 to drop down menu, choose edit cells and transform. Now we have to write some code in order to the transformation:
Use the following expression:

The left column show the original and the right the result of the transformation.
Do this for all different products.

NOTE: we can UNDO and REDO our operation in the left of the screen!

7. To see the result of our work: do the following:
create a text facet for company; highlight Philips and you see all the product, numbers etc for Philips.
Create a text facet for product and you see the orders for the products.

8. Close the facets. Now lets look at the last column, names. Let's put in gender in the following way:
choose edit cells and transform:

value.replace(“dhr”, “m”)

value.replace(“mevr, “f”)
Split the column name into 2.  using space as separator. Rename column name 1 into gender. Next split column name 2 into 2 columns; using -  . space -as separator. Rename the columns: initial and name.Transform column name into title case: edit cells, common transformation, title case.

9. In order to geocode we have to bring all address information into one column.
First address and city. Go to address choose edit column, add column based on this column:
use the following transformation:

cells["address"].value + ", " + cells["city"].value 
name this column geocode
Next add the country to a new column columns and call the column geocode 2

cells["geocode"].value + ", " + cells["country"].value

In geocode 2 we have all information needed for geocoding; delete the columns city, country,  address and geocode.

10. The column geocode2 will be used to find longitude and latitude.
In column geocode 2 choose add column based on fetching url  using :

 "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=" + escape(value,'url')


Call this new column JSON because the output is javascript.
This will take a couple of minutes.
Next we have to get the latitude and  longitude out of the JSON.
Create a new column based on column JSON and use the following transformation:

with(value.parseJson().results[0].geometry.location, pair, pair.lat +", " + pair.lng)

11. We can delete the column JSON.
Finally export the result as refineresult.xls
Open in a spreadsheet and start making calculations in pivot tables.

Import the file refineresult.xls in Google Fusion Tables to make a map. Don't forget to set column geocode2 to text and lat/lng to location.
Here is the map:

Geen opmerkingen:

Een reactie plaatsen

Opmerking: alleen leden van deze blog kunnen een reactie plaatsen.