R for data visualisation

Matt Green

1 Anscombe’s Quartet

‘Don’t jump straight to summary statistics’

The following four datasets have essentially the same means and the same correlation coefficient, and the correlations are equally significant (they have essentially the same \(p\) value).

These datasets are all the same, right?

No …
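A quick way to check the claim yourself, assuming the datasets are the classic Anscombe data that ships with R as the built-in anscombe data frame (the slides don't name the object, so this is an assumption). The names quartet and stats are only illustrative.

```r
# anscombe is built into R (datasets package): columns x1..x4 and y1..y4
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})

# one row of summary statistics per dataset
stats <- do.call(rbind, lapply(quartet, function(d) {
  test <- cor.test(d$x, d$y)          # Pearson correlation and its p value
  data.frame(mean_x = mean(d$x),
             mean_y = mean(d$y),
             r      = unname(test$estimate),
             p      = test$p.value)
}))
round(stats, 3)
```

The four rows come out nearly identical, which is exactly why the summary statistics alone can't tell the datasets apart.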

LOESS: Locally Estimated Scatterplot Smoothing

LOESS with points
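A minimal sketch of a LOESS curve drawn over the raw points. It uses the built-in cars dataset as a stand-in, since the data behind the original figure isn't shown; the variable name p is just illustrative.

```r
library(ggplot2)

# cars is a built-in dataset, used here only as a stand-in
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +                                   # the raw observations
  geom_smooth(method = "loess", formula = y ~ x)   # local smoother with its confidence band
p
```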

Distribution

Correlation

Evolution

3 Flowchart

Start by identifying what kind of data you have, then use the flowchart to choose a plot that suits it.

Visit https://www.data-to-viz.com/

4 Build a plot in layers

library(dataviz)
library(ggplot2)
library(knitr)
library(magrittr)
data_ala
# A tibble: 273 × 3
   Participant Valence    RT
         <dbl> <fct>   <dbl>
 1           1 Happy   0.871
 2           1 Neutral 0.880
 3           1 Sad     0.849
 4           2 Happy   0.891
 5           2 Neutral 0.925
 6           2 Sad     0.958
 7           3 Happy   0.588
 8           3 Neutral 0.682
 9           3 Sad     0.828
10           4 Happy   0.804
# ℹ 263 more rows

Notice that this establishes the axes but doesn’t show any data

ggplot(data_ala, aes(y=RT, x=Valence))

Ask for a boxplot by adding a layer

ggplot(data_ala, aes(y=RT, x=Valence)) +
  geom_boxplot()

Ask for a violin plot instead by changing the geom

ggplot(data_ala, aes(y=RT, x=Valence)) +
  geom_violin()

Sometimes it’s easier to compare things in the other orientation. Add a layer that asks for the coordinates to be flipped.

ggplot(data_ala, aes(y=RT, x=Valence)) +
  geom_violin() +
  coord_flip()

There’s usually more than one way to achieve the same thing. Here we swap x and y instead of saying coord_flip

ggplot(data_ala, aes(y=Valence, x=RT)) +
  geom_violin() 

A violin shows the same density curve twice, mirrored about its centre line. Compare with a density plot of the same data.

ggplot(data_ala, aes(x=RT)) +
  geom_density() +
  facet_wrap(~Valence, nrow=3)
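To make the "doubled-up" idea concrete, here is a rough sketch that builds a violin outline by hand from a density estimate. It uses the built-in faithful data rather than data_ala, and the names d and violin are made up for the example; geom_violin() does its own scaling and trimming, so treat this as an illustration of the idea rather than a reimplementation.

```r
library(ggplot2)

# density estimate for one variable (built-in data, for illustration only)
d <- density(faithful$eruptions)

# a violin outline is the density curve plus its mirror image
violin <- data.frame(
  value = c(d$x, rev(d$x)),
  width = c(d$y, -rev(d$y))
)

p <- ggplot(violin, aes(x = width, y = value)) +
  geom_polygon(fill = "grey80", colour = "black")
p
```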

Colour helps

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density() +
  facet_wrap(~Valence, nrow=3)

Some distributions are easier to tell apart when overlaid

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density()

Transparency is controlled by alpha

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density(alpha=0.5)

Ask for a different colour palette

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density(alpha=0.5)+
  scale_fill_brewer(palette = "Set1")

There are lots of palettes

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density(alpha=0.5)+
  scale_fill_hue()
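The palettes available to scale_fill_brewer() come from the RColorBrewer package, which is normally installed alongside ggplot2 as a dependency. A small sketch for listing them; the variable name palettes is just illustrative.

```r
# table of all Brewer palettes: name, maximum number of colours, category
palettes <- RColorBrewer::brewer.pal.info
head(palettes)

# qualitative palettes suit unordered categories such as Valence
rownames(palettes)[palettes$category == "qual"]
```

Any of these names can be passed as the palette argument, for example scale_fill_brewer(palette = "Dark2").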

Fine control is possible: here l sets the luminance (lightness) of the colours

ggplot(data_ala, aes(x=RT, fill=Valence)) +
  geom_density(alpha=0.5)+
  scale_fill_hue(l=40)

But, beware premature optimisation

  • It’s all too easy to get lost down a rabbit-hole in ggplot

  • R is pretty good at choosing sensible values for you

  • Stick to the defaults at first

  • Go broad before going deep

5 A tidier way of building a plot in layers

  • assignment yields no output
myaxes = ggplot(data_ala, aes(y=RT,  x=Valence, color=Valence))
  • stating the name (myaxes) yields just the axes
myaxes
  • adding a layer with + yields the violins
myaxes + geom_violin()
  • adding another layer
myaxes + geom_violin() + geom_boxplot()

So you can write the plot like this, which some people find much easier to write

# the next 3 lines are all assignments so they don't produce output
myaxes = ggplot(data_ala, aes(y=RT,  x=Valence, color=Valence))
myplot = myaxes + geom_violin()
myplot = myplot + geom_boxplot()
# the next line produces the whole plot as output
myplot

6 Multi-factorial data and facets

  • data_ldt from the Lexical Decision Task is provided by the dataviz package, so you don’t have to read it in: it’s already there
  • eyeball the first ten rows of a dataset stored as a tibble by typing its name (a plain data frame would print every row)
data_ldt
# A tibble: 200 × 6
   id      age language    condition    rt   acc
   <fct> <dbl> <fct>       <fct>     <dbl> <dbl>
 1 S001     22 monolingual word       379.    99
 2 S001     22 monolingual nonword    517.    90
 3 S002     33 monolingual word       312.    94
 4 S002     33 monolingual nonword    435.    82
 5 S003     23 monolingual word       405.    96
 6 S003     23 monolingual nonword    459.    87
 7 S004     28 monolingual word       298.    92
 8 S004     28 monolingual nonword    336.    76
 9 S005     26 monolingual word       316.    91
10 S005     26 monolingual nonword    401.    83
# ℹ 190 more rows
  • you can ask for a summary of any dataset
summary(data_ldt)
       id           age               language     condition         rt       
 S001   :  2   Min.   :18.00   monolingual:110   word   :100   Min.   :256.3  
 S002   :  2   1st Qu.:24.00   bilingual  : 90   nonword:100   1st Qu.:351.9  
 S003   :  2   Median :28.50                                   Median :412.6  
 S004   :  2   Mean   :29.75                                   Mean   :434.7  
 S005   :  2   3rd Qu.:33.25                                   3rd Qu.:510.0  
 S006   :  2   Max.   :58.00                                   Max.   :706.2  
 (Other):188                                                                  
      acc        
 Min.   : 76.00  
 1st Qu.: 85.00  
 Median : 91.00  
 Mean   : 89.95  
 3rd Qu.: 95.00  
 Max.   :100.00  
                 
  • how many participants in each cell of the design?
with(data_ldt, table(language, condition))
             condition
language      word nonword
  monolingual   55      55
  bilingual     45      45
  • facets are really good for data with more than one factor
  • “monolingual” is one facet; “bilingual” is the other facet
ggplot(data_ldt, aes(y=rt, x=condition))+
  facet_wrap(~language)

Quick and dirty first

ggplot(data_ldt, aes(y=rt, x=condition))+
  facet_wrap(~language)+
  geom_boxplot()

Add colour

ggplot(data_ldt, aes(y=rt, x=condition, colour=condition))+
  facet_wrap(~language)+
  geom_boxplot()

Change title of the legend for colour

ggplot(data_ldt, aes(y=rt, x=condition, colour=condition))+
  facet_wrap(~language)+
  geom_boxplot()+
  labs(colour="Word Type")

Maybe violins

ggplot(data_ldt, aes(y=rt, x=condition, colour=condition))+
  facet_wrap(~language)+
  geom_violin(aes(fill=condition), alpha=0.5)

I like to see the mean overlaid on violins

ggplot(data_ldt, aes(y=rt, x=condition, colour=condition))+
  facet_wrap(~language)+
  geom_violin(aes(fill=condition), alpha=0.5)+
  stat_summary(fun=mean, geom='point')

Adding a line joining the means can help make interactions clearer

ggplot(data_ldt, aes(y=rt, x=condition, colour=condition))+
  facet_wrap(~language)+
  geom_violin(aes(fill=condition), alpha=0.5, show.legend = FALSE)+
  stat_summary(fun=mean, geom='point', show.legend = FALSE)+
  stat_summary(geom='line', fun=mean, aes(group=1), colour='black')
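It can help to look at the numbers behind the mean points and the joining line. The sketch below computes the cell means of a two-factor design with aggregate(). It uses the built-in ToothGrowth data as a stand-in so that it runs without the dataviz package; for the slides' data the equivalent call would presumably be aggregate(rt ~ language + condition, data = data_ldt, FUN = mean).

```r
# ToothGrowth is built in; supp and dose stand in for the two factors
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
cell_means

# base R's quick version of a plot with lines joining the cell means
with(ToothGrowth, interaction.plot(x.factor = dose, trace.factor = supp, response = len))
```

Lines that are not parallel suggest an interaction between the two factors.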

I want to see how the distributions overlap

ggplot(data_ldt, 
       aes(x=rt, group=interaction(condition, language), fill=condition))+
  facet_wrap(~language)+
  geom_density()

I want to see how the distributions overlap in the same facet

  • so I comment out the request for different facets using hash (#) at the start of line 4: now R ignores that line

  • output in next slide

ggplot(data_ldt, 
       aes(x=rt, 
           fill=interaction(condition, language)))+
  #facet_wrap(~language)+
  geom_density(alpha=0.5)+
  scale_fill_discrete(name='condition')+
  theme_bw()+theme(panel.grid = element_blank())


7 When measures change over time

summary(data_foraging)
                       pid           date                       
 5d15b34e0b1f87000115ec96: 10   Min.   :2023-07-10 09:56:15.00  
 64089281b1d589f65731874e: 10   1st Qu.:2023-07-10 18:05:22.00  
 5fc460087eebcc8af045e624: 10   Median :2023-07-12 12:05:02.00  
 61034829cd0ce25db9519e37: 10   Mean   :2023-07-12 01:04:45.19  
 5d287be62a33c100167c93c6: 10   3rd Qu.:2023-07-12 15:23:33.00  
 5f75f47376d11b000bc1bfc5: 10   Max.   :2023-07-13 20:31:52.00  
 (Other)                 :710                                   
                            condition   condition_number          phase    
 random fruit : different forests:190   Min.   :1.000    experimental:770  
 random fruit : same forest      :250   1st Qu.:3.000                      
 fixed fruit : same forest       :330   Median :4.000                      
                                        Mean   :4.169                      
                                        3rd Qu.:6.000                      
                                        Max.   :7.000                      
                                                                           
 session_duration       complete     approvals           age       
 Min.   : 533.6   ok        :750   Min.   :   2.0   Min.   :19.00  
 1st Qu.: 662.2   incomplete: 20   1st Qu.: 199.0   1st Qu.:25.00  
 Median : 786.6                    Median : 387.0   Median :33.00  
 Mean   :1010.6                    Mean   : 507.2   Mean   :35.49  
 3rd Qu.: 992.4                    3rd Qu.: 691.0   3rd Qu.:44.00  
 Max.   :5422.8                    Max.   :1856.0   Max.   :71.00  
                                                    NA's   :20     
     sex        ethnic               nation               OS     
 Female:390   Asian: 80   United Kingdom:450   Linux x86_64: 20  
 Male  :380   White:570   United States :100   MacIntel    :180  
              Black: 90   Canada        : 70   Win32       :570  
              Mixed: 20   South Africa  : 60                     
              Other: 10   Australia     : 30                     
                          Nigeria       : 20                     
                          (Other)       : 40                     
   frameRate                     trees              fruits    n_low_consumed  
 Min.   :  3.753   different forests:190   random fruit:440   Min.   : 0.000  
 1st Qu.: 59.773   same forest      :580   fixed fruit :330   1st Qu.: 3.000  
 Median : 59.981                                              Median : 6.000  
 Mean   : 67.606                                              Mean   : 5.677  
 3rd Qu.: 60.143                                              3rd Qu.: 8.000  
 Max.   :362.450                                              Max.   :15.000  
                                                                              
 n_hi_consumed                               textbox.text   trialCount  
 Min.   : 7.000   Type here ...                    : 60   Min.   : 1.0  
 1st Qu.: 8.000   yesType here ...                 : 20   1st Qu.: 3.0  
 Median : 9.000   YES THY WERE VERY CLEAR          : 10   Median : 5.5  
 Mean   : 8.825   The instructions were very clear.: 10   Mean   : 5.5  
 3rd Qu.:10.000   The instructions were very clear!: 10   3rd Qu.: 8.0  
 Max.   :10.000   N/A                              : 10   Max.   :10.0  
                  (Other)                          :650                 
          use_master_forest redistribute_fruit  mouse.time         high_fruit 
 FALSE             :190     FALSE:330          Length:770         apple :370  
 TRUE              :430     TRUE :440          Class :character   banana:400  
 pp gets own forest:150                        Mode  :character               
                                                                              
                                                                              
                                                                              
                                                                              
                instructions
 naive participants   :620  
 INFORMED participants:150  
                            
                            
                            
                            
                            

Quick and dirty first: slight improvement over trials before we distinguish between the conditions

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount))+
  geom_smooth()

Distinguish between fixed and random fruit: performance improves more over trials for fixed fruit

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount, colour=fruits))+
  geom_smooth()

Distinguish between naive and informed participants: performance is better for fixed fruit when participants are informed that the trees will be in the same place across trials (this primes them to remember where the trees are)

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount, colour=fruits))+
  geom_smooth()+
  facet_wrap(~instructions)

Cosmetic adjustments

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount, color=fruits))+
  facet_grid(~instructions)+
  geom_smooth()+
  scale_x_continuous(breaks=c(1,5,10))+
  ylab("Number of high fruit consumed")+
  labs(title="Mean number of high value fruit consumed per trial", 
       subtitle = "max possible is 10")+
  theme_bw() + 
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())+
  scale_color_manual(values = c("red", "blue"))

Switch from LOESS regression to linear regression using method='lm'

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount, color=fruits))+
  facet_grid(~instructions)+
  geom_smooth(method='lm')+
  scale_x_continuous(breaks=c(1,5,10))+
  ylab("Number of high fruit consumed")+
  labs(title="Mean number of high value fruit consumed per trial", 
       subtitle = "max possible is 10")+
  theme_bw() + 
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())+
  scale_color_manual(values = c("red", "blue"))
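With method='lm', geom_smooth() draws an ordinary least-squares straight line, so the same fit can be inspected numerically with lm(). A minimal sketch using the built-in cars data as a stand-in for data_foraging; the names fit and p are just illustrative.

```r
library(ggplot2)

# the straight line that geom_smooth(method = "lm") would draw
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope

p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p
```

When the plot maps colour to a factor or uses facets, geom_smooth() fits a separate line per group and panel, so a matching lm() call would need to be run on each subset.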

An alternative using points and lines

ggplot(data_foraging, aes(y=n_hi_consumed, x=trialCount, colour=fruits))+
  stat_summary(fun=mean, geom='point',position=position_dodge(width=0.2))+
  stat_summary(fun=mean, geom='line',position=position_dodge(width=0.2))+
  facet_wrap(~instructions)+
  theme_bw()

8 I want to export my plot

  • I made my plot in R - now I want to save it as a png file so I can include it in my manuscript.
  • First run the code that generates the plot
# the next 3 lines are all assignments so they don't produce output
myaxes = ggplot(data_ldt, aes(y=rt,  x=condition, fill=language))
myplot = myaxes + geom_violin()
myplot = myplot + facet_wrap(~language)
# the next line produces the whole plot as output
myplot
  • now call ggsave(), which by default saves the most recently displayed plot to a file.
ggsave(filename = "Figure1.png", width=7, height=4)
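Relying on "the most recent plot" can go wrong if another plot has been drawn in between, so it may be safer to pass the plot object explicitly. The sketch below does that and also sets the resolution. It uses built-in data and a temporary file so that it runs on its own; the name outfile is made up for the example, and 300 dpi is only a common choice for print, not a requirement stated in the slides.

```r
library(ggplot2)

# built-in data stands in for the plot built above
myplot <- ggplot(faithful, aes(x = eruptions)) + geom_density()

# a temporary file avoids overwriting anything; use your own path in practice
outfile <- tempfile(fileext = ".png")

# plot = names the object to save; width and height are in inches here
ggsave(filename = outfile, plot = myplot,
       width = 7, height = 4, units = "in", dpi = 300)

file.exists(outfile)
```

The file type is taken from the extension, so changing the filename to end in .pdf or .svg would ask for that format instead (svg needs the svglite package).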
