Friday, 9 February 2018

HOW TO ANALYZE: FROM FLOURISH INTO R



Flourish is an awesome tool for creating charts. Its output is almost art, and that could move data journalism away from its original goal: being a kind of 'sociology done on deadline', aiming at 'improving reporting by using the tools of science'. A chart can be made quickly, easily and beautifully, but the question remains: what does it show, and what does it mean?
Below I show how to use R and RStudio to analyze the same dataset.

setwd("/home/peter/Desktop/rdata")
Loading the data set into a data frame h
h<-read.csv("health2.csv")
Showing the structure of the data set
str(h)
'data.frame':   165 obs. of  8 variables:
 $ year                : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ country             : Factor w/ 11 levels "Angola","Botswana",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ life.expec          : num  46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap             : num  606 574 776 850 1136 ...
 $ code                : Factor w/ 11 levels "AGO","BWA","CMR",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Total.as.percGDP    : num  2.79 5.38 3.63 4.41 4.71 4.1 4.54 3.38 3.84 4.37 ...
 $ govperc.total.exp   : num  60.2 52.2 46.4 46.4 51.1 ...
 $ privat.perc.of.total: num  39.8 47.8 53.6 53.6 48.9 ...


Making a chart of the relationship between GDP per capita and life expectancy, using the lattice library.

library("latticeExtra", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

xyplot(life.expec ~ gdp.cap | factor(country), data = h, type = c("p", "r"))




This grid of scatter diagrams doesn't differ very much from the Flourish chart.
But the relationships can be interpreted better using correlations.

library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

Now the original data set h is split up into subgroups (one per country), and a correlation is 
calculated within each group.

df <- data.frame(group = h$country,var1 = h$life.expec,var2 = h$gdp.cap)

ddply(df, .(group), summarise, "corr" = cor(var1, var2, method = "pearson"))
          group      corr
1        Angola 0.9602576
2      Botswana 0.9281507
3      Cameroon 0.8627927
4         Ghana 0.9674682
5         Kenya 0.9885749
6        Malawi 0.7567167
7       Namibia 0.9463229
8  South Africa 0.2292121
9      Tanzania 0.9846184
10       Uganda 0.9635527
11       Zambia 0.9828192
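
The same split-and-calculate idea can be sketched in base R, without plyr. This is only a minimal illustration: the data frame toy and its values are made up, and stand in for the real df above.

```r
# Hypothetical toy data standing in for df: two groups with known relationships
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  var1  = c(1, 2, 3, 4, 5,  5, 4, 3, 2, 1),
  var2  = c(2, 4, 6, 8, 10, 2, 4, 6, 8, 10)
)

# Split the rows by group, then compute Pearson's r within each piece
pieces <- split(toy, toy$group)
corr_by_group <- sapply(pieces, function(d) cor(d$var1, d$var2, method = "pearson"))

# Group A is perfectly increasing, group B perfectly decreasing
print(corr_by_group)
```

split() returns one data frame per group and sapply() collects the correlations in a named vector, which is roughly what ddply() with summarise does in one step.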

Note about method: The difference between the Pearson correlation and the Spearman correlation is that the Pearson 
is most appropriate for measurements taken from an interval scale, while the Spearman is more appropriate for 
measurements taken from ordinal scales. Examples of interval scales include "temperature in Fahrenheit" and "length in 
inches", in which the individual units (1 deg F, 1 in) are meaningful. Things like "satisfaction scores" tend to be of the 
ordinal type: while it is clear that "5 happiness" is happier than "3 happiness", it is not clear whether you could give a 
meaningful interpretation of "1 unit of happiness". When you add up many measurements of the ordinal type, you end up 
with a measurement which is really neither ordinal nor interval, and is difficult to interpret. Life expectancy in years and 
GDP per capita both have meaningful units, so the Pearson correlation is used here.
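
To make the difference between the two methods concrete, here is a small sketch on made-up numbers (x and y are not from the health data set). The relationship is strictly increasing but strongly curved, so the rank-based Spearman coefficient is at its maximum while Pearson, which measures linear association, is clearly lower.

```r
# Made-up data: y always rises with x, but not along a straight line
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # strength of the linear relationship
spearman <- cor(x, y, method = "spearman")  # strength of the monotonic (rank) relationship

print(c(pearson = pearson, spearman = spearman))
```

If the two coefficients differ a lot on real data, that is a hint to look at the scatter plot again before trusting a single number.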

We can also start the analysis at a more detailed level, for example a single country. Let's take Angola.

angola<-h[h$country=='Angola',c("year","life.expec","gdp.cap")]
str(angola)
'data.frame':   15 obs. of  3 variables:
 $ year      : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ life.expec: num  46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap   : num  606 574 776 850 1136 ...

cor(angola$life.expec,angola$gdp.cap)
[1] 0.9602576

library("ggplot2", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

ggplot(angola, aes(x = gdp.cap, y = life.expec)) + geom_point() + stat_smooth()

Now let's finally look at the year 2014 for all countries.
year<-h[h$year==2014,c("country","life.expec","gdp.cap")]
str(year)
'data.frame':   11 obs. of  3 variables:
 $ country   : Factor w/ 11 levels "Angola","Botswana",..: 1 2 3 4 5 6 7 9 10 8 ...
 $ life.expec: num  53.8 66.8 56.7 62.3 63.4 ...
 $ gdp.cap   : num  5233 7153 1407 1442 1368 ...

summary(year$life.expec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  53.80   59.79   62.31   62.00   65.08   67.34 


plot(year$country,year$life.expec)
abline(h=mean(year$life.expec), col="red")




ggplot(year, aes(x = country, y = life.expec)) + geom_point()





Are the differences in life expectancy between countries real?

tan<-h[h$country=='Tanzania',c("life.expec")]
ken<-h[h$country=='Kenya',c("life.expec")]

t.test(tan,ken)

        Welch Two Sample t-test

data:  tan and ken
t = 1.1652, df = 26.951, p-value = 0.2541
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.557958  5.652358
sample estimates:
mean of x mean of y 
  58.8036   56.7564 
The null hypothesis H0 is not rejected: p > 0.05, and the 95% confidence interval for the difference in means includes 0.
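
The p-value and the confidence interval do not have to be read off the printed output; they can be taken from the object that t.test() returns. A minimal sketch, with two made-up samples a and b instead of the real country data:

```r
# Made-up samples with nearly equal means (not taken from health2.csv)
a <- c(58, 59, 60, 61, 62)
b <- c(57, 59, 60, 62, 63)

res <- t.test(a, b)  # Welch two-sample t-test is the default

p_value <- res$p.value   # probability of a difference this large under H0
ci      <- res$conf.int  # 95% confidence interval for mean(a) - mean(b)

# H0 is not rejected at the 5% level when the interval contains 0
ci_includes_zero <- ci[1] < 0 && ci[2] > 0

print(p_value)
print(ci)
print(ci_includes_zero)
```

For a two-sided test, "p greater than 0.05" and "the 95% interval contains 0" are two ways of saying the same thing.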

an<-h[h$country=='Angola',c("life.expec")]
t.test(an,ken)

        Welch Two Sample t-test

data:  an and ken
t = -4.8908, df = 20.964, p-value = 7.795e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.730796 -3.520804
sample estimates:
mean of x mean of y 
  50.6306   56.7564 
But here the null hypothesis H0 is rejected in favour of the alternative H1: p < 0.05, the confidence interval does not include 0, and the difference in mean life expectancy between Angola and Kenya is significant.
Translate scientific notation:
paste("p",format(7.795e-05,scientific = FALSE))
[1] "p 0.00007795"
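
Two other base R ways to print the same p-value, sketched here as alternatives: sprintf() with a fixed number of decimals, and format.pval(), which reports very small values as "less than a threshold". The threshold 0.001 is my own choice for illustration.

```r
p <- 7.795e-05

# Fixed notation with 8 decimals
fixed <- sprintf("%.8f", p)

# Compact reporting: anything below eps is shown as "<eps"
compact <- format.pval(p, digits = 2, eps = 0.001)

print(fixed)
print(compact)
```

The compact form is often enough in a news story, where the exact value of a tiny p-value adds little.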




