## Friday, 9 February 2018

### HOW TO ANALYZE: FROM FLOURISH INTO R

Flourish is an excellent tool for creating charts. Its output is almost art, and that could move data journalism away from its original goal: being a kind of 'sociology done on deadline', aiming at 'improving reporting by using the tools of science'. A chart can be made quickly, easily and beautifully, but the question remains: what does it show, and what does it mean?
Below I show how to use R and RStudio to analyze the same dataset.

```
setwd("/home/peter/Desktop/rdata")
# Load the data set into a data frame h
# Show the structure of the data set
str(h)
'data.frame':   165 obs. of  8 variables:
 $ year                : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ country             : Factor w/ 11 levels "Angola","Botswana",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ life.expec          : num  46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap             : num  606 574 776 850 1136 ...
 $ code                : Factor w/ 11 levels "AGO","BWA","CMR",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Total.as.percGDP    : num  2.79 5.38 3.63 4.41 4.71 4.1 4.54 3.38 3.84 4.37 ...
 $ govperc.total.exp   : num  60.2 52.2 46.4 46.4 51.1 ...
 $ privat.perc.of.total: num  39.8 47.8 53.6 53.6 48.9 ...
```
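The command that fills `h` is not shown above. As a minimal sketch, assuming the data is stored as CSV, `read.csv()` would give a data frame with this kind of structure. The example reads inline text instead of a file so that it runs anywhere; `h_demo` and `csv_text` are hypothetical names, and the five rows are the first Angola values as they appear (rounded) in the `str()` output.

```r
# Hypothetical stand-in for the real file: the first five Angola rows,
# with values as rounded in the str() output above.
csv_text <- "year,country,life.expec,gdp.cap
2000,Angola,46.6,606
2001,Angola,47.4,574
2002,Angola,48.1,776
2003,Angola,48.8,850
2004,Angola,49.4,1136"

# stringsAsFactors = TRUE reproduces the Factor columns seen above
# (it was the default before R 4.0).
h_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(h_demo)
```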

Making a chart of the relationship between GDP per capita and life expectancy, using the lattice library.

`library("latticeExtra", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")`

`xyplot(life.expec ~ gdp.cap | factor(country), data = h, type = c("p", "r"))`

This grid of scatter plots doesn't differ much from the Flourish chart.
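In `xyplot()`, the `type` argument controls what each panel draws: `"p"` plots the points and `"r"` adds a least-squares regression line. A small sketch on made-up data (the data frame `toy` and its values are invented for illustration and are not taken from the data set):

```r
library(lattice)

# Invented data: two groups with opposite trends.
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  x     = rep(1:5, times = 2),
  y     = c(2, 4, 5, 8, 10, 9, 7, 6, 4, 1)
)

# One panel per group; points ("p") plus a regression line ("r").
p <- xyplot(y ~ x | group, data = toy, type = c("p", "r"))
print(p)
```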
But the relationships can be interpreted better using correlations.

```
library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")
```

Now the original data set h is split into subgroups (one per country), and a correlation is calculated in each group.

```
df <- data.frame(group = h$country, var1 = h$life.expec, var2 = h$gdp.cap)

ddply(df, .(group), summarise, "corr" = cor(var1, var2, method = "pearson"))
          group      corr
1        Angola 0.9602576
2      Botswana 0.9281507
3      Cameroon 0.8627927
4         Ghana 0.9674682
5         Kenya 0.9885749
6        Malawi 0.7567167
7       Namibia 0.9463229
8  South Africa 0.2292121
9      Tanzania 0.9846184
10       Uganda 0.9635527
11       Zambia 0.9828192

```

Note about method: The difference between the Pearson correlation and the Spearman correlation is that the Pearson is most appropriate for measurements taken from an interval scale, while the Spearman is more appropriate for measurements taken from ordinal scales. Examples of interval scales include "temperature in Fahrenheit" and "length in inches", in which the individual units (1 deg F, 1 in) are meaningful. Things like "satisfaction scores" tend to be of the ordinal type, since while it is clear that "5 happiness" is happier than "3 happiness", it is not clear whether you could give a meaningful interpretation of "1 unit of happiness". But when you add up many measurements of the ordinal type, which is what you have in your case, you end up with a measurement which is really neither ordinal nor interval, and is difficult to interpret.
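To see how the choice of `method` can matter in practice, the sketch below compares both coefficients on invented data in which `y` always rises with `x`, but not along a straight line. The vectors are made up for illustration; only `cor()` and its `method` argument come from the analysis above.

```r
# Invented data: strictly increasing but strongly non-linear.
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # strength of linear association
spearman <- cor(x, y, method = "spearman")  # strength of monotone (rank) association

pearson   # noticeably below 1, because the relationship is curved
spearman  # 1, because the ranks of x and y agree perfectly
```

Applied to the data set above, the same comparison per country could be written, for example, as `sapply(split(h, h$country), function(d) cor(d$life.expec, d$gdp.cap, method = "spearman"))`.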

We can also start the analysis at a more detailed level, for example a single country. Let's take Angola.

```
angola <- h[h$country == 'Angola', c("year", "life.expec", "gdp.cap")]
str(angola)
'data.frame':   15 obs. of  3 variables:
 $ year      : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ life.expec: num  46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap   : num  606 574 776 850 1136 ...

cor(angola$life.expec, angola$gdp.cap)
[1] 0.9602576

library("ggplot2", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

# Refer to columns by name inside aes(); the data argument supplies them
ggplot(angola, aes(x = gdp.cap, y = life.expec)) + geom_point() + stat_smooth()
```

Now let's finally look at the year 2014 for all countries.
```
year <- h[h$year == 2014, c("country", "life.expec", "gdp.cap")]
str(year)
'data.frame':   11 obs. of  3 variables:
 $ country   : Factor w/ 11 levels "Angola","Botswana",..: 1 2 3 4 5 6 7 9 10 8 ...
 $ life.expec: num  53.8 66.8 56.7 62.3 63.4 ...
 $ gdp.cap   : num  5233 7153 1407 1442 1368 ...

summary(year$life.expec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  53.80   59.79   62.31   62.00   65.08   67.34
```

```
plot(year$country, year$life.expec)
# col is an argument of abline(), not of mean()
abline(h = mean(year$life.expec), col = "red")
```

`ggplot(year, aes(x = country, y = life.expec)) + geom_point()`

Are the differences in life expectancy real?

```
tan <- h[h$country == 'Tanzania', c("life.expec")]
ken <- h[h$country == 'Kenya', c("life.expec")]

t.test(tan, ken)

	Welch Two Sample t-test

data:  tan and ken
t = 1.1652, df = 26.951, p-value = 0.2541
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.557958  5.652358
sample estimates:
mean of x mean of y
  58.8036   56.7564
```
The null hypothesis H0 is not rejected: p > 0.05, and the 95% confidence interval for the difference in means includes 0.
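The same decision can be read from the object that `t.test()` returns, rather than from the printed output. A minimal sketch with invented samples (`a` and `b` are made up for illustration; with the real data they would be `tan` and `ken`):

```r
# Invented samples for illustration only.
a <- c(58.1, 59.4, 57.8, 60.2, 58.9, 59.7)
b <- c(56.0, 57.3, 55.8, 58.1, 56.6, 57.9)

res <- t.test(a, b)   # Welch two-sample t-test is the default

res$p.value           # p-value of the two-sided test
res$conf.int          # 95% confidence interval for the difference in means

# H0 (equal means) is rejected at the 5% level when p < 0.05, which is
# equivalent to the 95% interval not containing 0.
reject_h0        <- res$p.value < 0.05
ci_excludes_zero <- res$conf.int[1] > 0 || res$conf.int[2] < 0
```

Both logical values always agree for this test, so either the p-value or the confidence interval can be used to state the conclusion.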

```
an <- h[h$country == 'Angola', c("life.expec")]
t.test(an, ken)

	Welch Two Sample t-test

data:  an and ken
t = -4.8908, df = 20.964, p-value = 7.795e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.730796 -3.520804
sample estimates:
mean of x mean of y
  50.6306   56.7564
```
Here H0 is rejected in favour of the alternative hypothesis H1 (p < 0.05): the difference in life expectancy between Angola and Kenya is statistically significant.
Translating the scientific notation:
```
paste("p", format(7.795e-05, scientific = FALSE))
[1] "p 0.00007795"
```
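Two other base R ways to print a small p-value, offered as alternatives rather than as the method used above: `sprintf()` with a fixed number of decimals, and `format.pval()`, which reports values below a chosen threshold as an inequality. The variable names and the threshold of 0.001 are choices made for this sketch.

```r
p <- 7.795e-05

fixed  <- sprintf("%.8f", p)            # fixed notation with 8 decimals
capped <- format.pval(p, eps = 0.001)   # values below eps are shown as "<eps"

fixed
capped
```

The capped form is often easier to read in running text, since the exact size of a very small p-value rarely changes the conclusion.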
