Flourish is an awesome tool for creating charts. Its output is almost art, and that could
move data journalism away from its original goal: being a kind of
'sociology done on deadline', aiming at 'improving reporting by using
the tools of science'. A chart can be made quickly, easily and
beautifully, but the question remains: what does it show, and what does
it mean?
Below I show how to use R and RStudio to analyse the same dataset.
setwd("/home/peter/Desktop/rdata")

Loading the data set into a data frame h:

h <- read.csv("health2.csv")

Showing the structure of the data set:

str(h)
'data.frame': 165 obs. of 8 variables:
 $ year                : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ country             : Factor w/ 11 levels "Angola","Botswana",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ life.expec          : num 46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap             : num 606 574 776 850 1136 ...
 $ code                : Factor w/ 11 levels "AGO","BWA","CMR",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Total.as.percGDP    : num 2.79 5.38 3.63 4.41 4.71 4.1 4.54 3.38 3.84 4.37 ...
 $ govperc.total.exp   : num 60.2 52.2 46.4 46.4 51.1 ...
 $ privat.perc.of.total: num 39.8 47.8 53.6 53.6 48.9 ...
Making a chart of the relationship between GDP per capita and life
expectancy, using the lattice library:
library("latticeExtra", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")
# One panel per country; type "p" draws the points, "r" adds a regression line
xyplot(life.expec ~ gdp.cap | factor(country), data = h, type = c("p", "r"))
This grid of scatter plots does not differ much from the Flourish chart.
But the relationships can be interpreted better using correlations.

library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

Now the original data set h is split up into subgroups (one per country), and a correlation is calculated in each group.

df <- data.frame(group = h$country, var1 = h$life.expec, var2 = h$gdp.cap)
ddply(df, .(group), summarise, "corr" = cor(var1, var2, method = "pearson"))
          group      corr
1        Angola 0.9602576
2      Botswana 0.9281507
3      Cameroon 0.8627927
4         Ghana 0.9674682
5         Kenya 0.9885749
6        Malawi 0.7567167
7       Namibia 0.9463229
8  South Africa 0.2292121
9      Tanzania 0.9846184
10       Uganda 0.9635527
11       Zambia 0.9828192

Note about method: The difference between the Pearson correlation and the Spearman correlation is that the Pearson
is most appropriate for measurements taken from an interval scale, while the Spearman is more appropriate for
measurements taken from ordinal scales. Examples of interval scales include "temperature in Fahrenheit" and "length in
inches", in which the individual units (1 deg F, 1 in) are meaningful. Things like "satisfaction scores" tend to be of the ordinal
type: while it is clear that "5 happiness" is happier than "3 happiness", it is not clear whether you could give a
meaningful interpretation of "1 unit of happiness". But when you add up many measurements of the ordinal type, which
is what you have in your case, you end up with a measurement which is really neither ordinal nor interval, and is difficult
to interpret.
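To see how the choice of method can change the result, here is a minimal sketch in base R. The data frame `toy` and its values are invented for illustration and are not taken from health2.csv; the helper `cor_by_group()` is hypothetical and uses `split()` and `sapply()` in place of the plyr call.

```r
# Hypothetical toy data (not from health2.csv): two groups, five points each.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = c(1, 2, 3, 4, 5,   1, 2, 3, 4, 5),
  y     = c(1, 4, 9, 16, 25, 5, 3, 4, 1, 2)
)

# Hypothetical helper: correlation of x and y within each group.
cor_by_group <- function(df, method) {
  sapply(split(df, df$group), function(g) cor(g$x, g$y, method = method))
}

pearson  <- cor_by_group(toy, "pearson")
spearman <- cor_by_group(toy, "spearman")

# One row per group, one column per method.
print(round(cbind(pearson, spearman), 3))
```

In group A the relationship is perfectly monotonic but not linear, so Spearman gives 1 while Pearson stays slightly below 1. In group B both values are whole ranks, so the two methods agree.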
We can also start the analysis on a more detailed level, for example a single country. Let's take Angola.

angola <- h[h$country == 'Angola', c("year", "life.expec", "gdp.cap")]
str(angola)
'data.frame': 15 obs. of 3 variables:
 $ year      : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ life.expec: num 46.6 47.4 48.1 48.8 49.4 ...
 $ gdp.cap   : num 606 574 776 850 1136 ...

cor(angola$life.expec, angola$gdp.cap)
[1] 0.9602576

library("ggplot2", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")
# Refer to columns by name inside aes(); the data frame is already given
ggplot(angola, aes(x = gdp.cap, y = life.expec)) + geom_point() + stat_smooth()

Now let's finally look at the year 2014 for all countries.
year <- h[h$year == 2014, c("country", "life.expec", "gdp.cap")]
str(year)
'data.frame': 11 obs. of 3 variables:
 $ country   : Factor w/ 11 levels "Angola","Botswana",..: 1 2 3 4 5 6 7 9 10 8 ...
 $ life.expec: num 53.8 66.8 56.7 62.3 63.4 ...
 $ gdp.cap   : num 5233 7153 1407 1442 1368 ...

summary(year$life.expec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  53.80   59.79   62.31   62.00   65.08   67.34
plot(year$country, year$life.expec)
# col belongs to abline(), not to mean(): draw the mean as a red horizontal line
abline(h = mean(year$life.expec), col = "red")
ggplot(year, aes(x = country, y = life.expec)) + geom_point()
Are the differences in life expectancy real?
tan <- h[h$country == 'Tanzania', c("life.expec")]
ken <- h[h$country == 'Kenya', c("life.expec")]
t.test(tan, ken)

	Welch Two Sample t-test

data: tan and ken
t = 1.1652, df = 26.951, p-value = 0.2541
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.557958 5.652358
sample estimates:
mean of x mean of y
  58.8036   56.7564
The null hypothesis H0 is not rejected: p > 0.05, and the 95% confidence
interval for the difference in means includes 0.
an <- h[h$country == 'Angola', c("life.expec")]
t.test(an, ken)

	Welch Two Sample t-test

data: an and ken
t = -4.8908, df = 20.964, p-value = 7.795e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.730796 -3.520804
sample estimates:
mean of x mean of y
  50.6306   56.7564

Here the null hypothesis is rejected in favour of the alternative hypothesis H1 (p < 0.05): the difference in life expectancy between Angola and Kenya is significant.
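The numbers behind these two conclusions can also be read from the object that t.test() returns, rather than from the printed output. The sketch below uses invented vectors `a` and `b` (not the life-expectancy series) and a hypothetical helper `summarise_test()`; `p.value` and `conf.int` are components of the `htest` object that t.test() returns.

```r
# Hypothetical samples, invented for illustration only.
a <- c(50.1, 51.3, 49.8, 52.0, 50.7, 51.1)
b <- c(56.2, 57.0, 55.8, 56.9, 57.4, 56.1)

# Hypothetical helper: run Welch's t-test and report the decision at level alpha.
summarise_test <- function(x, y, alpha = 0.05) {
  res <- t.test(x, y, conf.level = 1 - alpha)  # Welch two-sample test (default)
  ci  <- as.numeric(res$conf.int)              # CI for the difference in means
  list(
    p.value     = res$p.value,
    conf.int    = ci,
    reject.H0   = res$p.value < alpha,
    ci.has.zero = ci[1] <= 0 && ci[2] >= 0
  )
}

out <- summarise_test(a, b)
print(out)
```

For a two-sided test these two checks agree: H0 is rejected at level alpha exactly when the matching confidence interval for the difference in means excludes 0.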
Translate scientific notation:
paste("p", format(7.795e-05, scientific = FALSE))
[1] "p 0.00007795"
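Base R offers a few other ways to print a small p-value without scientific notation. A short sketch, using the p-value from the Angola and Kenya test; the variable names are arbitrary and which form reads best is a matter of taste.

```r
p <- 7.795e-05

fixed   <- format(p, scientific = FALSE)  # all significant digits, no exponent
rounded <- sprintf("%.5f", p)             # fixed number of decimals: "0.00008"
short   <- format.pval(p, eps = 0.001)    # report as below a threshold: "<0.001"

print(c(fixed = fixed, rounded = rounded, short = short))
```

The threshold form is common in published tables, where the exact size of a very small p-value adds little.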