‘Don’t jump straight to summary statistics’
The following datasets all have the same mean and the same correlation coefficient, and the correlations are equally significant (they have the same \(p\) value).
These datasets are all the same, right?
No …
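A minimal sketch of the same point. The datasets from the slide are not reproduced here, so this assumes Anscombe's quartet (the anscombe data frame that ships with base R) as a stand-in. The summary statistics agree almost exactly across the four sets, yet the scatterplots look nothing alike.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 in the built-in `anscombe` data frame.
# It is a stand-in here, not the dataset shown on the slide.
sets <- 1:4
get_x <- function(i) anscombe[[paste0("x", i)]]
get_y <- function(i) anscombe[[paste0("y", i)]]

quartet_summary <- data.frame(
  dataset = sets,
  mean_x  = sapply(sets, function(i) mean(get_x(i))),
  mean_y  = sapply(sets, function(i) mean(get_y(i))),
  r       = sapply(sets, function(i) cor(get_x(i), get_y(i))),
  p       = sapply(sets, function(i) cor.test(get_x(i), get_y(i))$p.value)
)
print(round(quartet_summary, 3))

# Now plot them: one panel per dataset
old_par <- par(mfrow = c(2, 2))
for (i in sets) {
  plot(get_x(i), get_y(i), xlab = "x", ylab = "y", main = paste("Dataset", i))
}
par(old_par)
```

The table alone would suggest four interchangeable datasets; the plots show why you look first.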
LOESS: Locally Estimated Scatterplot Smoothing
LOESS with points
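A rough sketch of a LOESS curve drawn over the raw points. The slide's own data is not shown here, so the built-in mtcars dataset is an assumption used purely for illustration.

```r
library(ggplot2)

# mtcars is a stand-in dataset, not the one from the slide
p_loess <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                   # the raw data
  geom_smooth(method = "loess", formula = y ~ x)   # LOESS curve with its confidence band
p_loess
```

Leaving out geom_point() gives the smoother on its own.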
Visit https://r-graph-gallery.com/
Browse graphs to choose one you like
Distribution
Correlation
Evolution
Start by saying what kind of data you have, and use the flowchart to choose a plot that suits the type of data you have
Notice that this establishes the axes but doesn’t show any data
Ask for a boxplot by adding a layer
Ask for a violin plot instead by changing the geom
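The three steps above might look something like this. The dataset is an assumption: mpg ships with ggplot2 and stands in for whatever data the slides used.

```r
library(ggplot2)

# Step 1: data and aesthetics only. This draws the axes but no data.
p_axes <- ggplot(mpg, aes(x = drv, y = hwy))

# Step 2: add a layer to ask for a boxplot
p_box <- p_axes + geom_boxplot()

# Step 3: same plot, different geom
p_violin <- p_axes + geom_violin()

p_axes
p_box
p_violin
```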
Sometimes it’s easier to compare things in the other orientation. Add a layer that asks for the coordinates to be flipped.
There’s usually more than one way to achieve the same thing. Here we swap x and y instead of saying coord_flip
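A sketch of both routes to a horizontal boxplot, again assuming the mpg data from ggplot2 as a stand-in. Swapping x and y relies on ggplot2 working out the orientation for itself, which it does from version 3.3.0 onwards.

```r
library(ggplot2)

# Route 1: keep the aesthetics, add a layer that flips the coordinates
p_flip <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  coord_flip()

# Route 2: swap x and y in aes() instead
p_swap <- ggplot(mpg, aes(x = hwy, y = drv)) +
  geom_boxplot()

p_flip
p_swap
```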
Violins represent the same data twice. Compare with a density plot. Violins are doubled-up density plots.
Colour helps
Some distributions are easier to tell apart when overlaid
See-through is controlled by alpha
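Roughly what an overlaid, see-through density plot looks like in code. The mpg data is an assumed stand-in, and 0.5 is just one reasonable alpha value.

```r
library(ggplot2)

# fill = drv gives each group its own colour;
# alpha = 0.5 makes the fills half transparent so overlapping densities stay visible
p_density <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)
p_density
```

alpha runs from 0 (invisible) to 1 (opaque).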
Ask for a different colour palette
There are lots of palettes
Fine control is possible
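A sketch of three ways to change the fill colours, from a ready-made palette to hand-picked values. The data (mpg) and the specific palette and colour names are illustrative choices, not the ones from the slides.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

# A ColorBrewer palette
p_brewer <- p_base + scale_fill_brewer(palette = "Dark2")

# The viridis palette for discrete data
p_viridis <- p_base + scale_fill_viridis_d()

# Fine control: name a colour for each level of drv
p_manual <- p_base +
  scale_fill_manual(values = c("4" = "orange", "f" = "purple", "r" = "darkgreen"))

p_brewer
p_viridis
p_manual
```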
But, beware premature optimisation
It’s all too easy to get lost down a rabbit-hole in ggplot
R is pretty good at choosing sensible values for you
Stick to the defaults at first
Go broad before going deep
So you can write the plot like this, which some people find much easier to write
data_ldt, from the Lexical Decision Task, is provided by the dataviz package, so you don’t have to read it in; it’s already there
# A tibble: 200 × 6
id age language condition rt acc
<fct> <dbl> <fct> <fct> <dbl> <dbl>
1 S001 22 monolingual word 379. 99
2 S001 22 monolingual nonword 517. 90
3 S002 33 monolingual word 312. 94
4 S002 33 monolingual nonword 435. 82
5 S003 23 monolingual word 405. 96
6 S003 23 monolingual nonword 459. 87
7 S004 28 monolingual word 298. 92
8 S004 28 monolingual nonword 336. 76
9 S005 26 monolingual word 316. 91
10 S005 26 monolingual nonword 401. 83
# ℹ 190 more rows
Use summary to get an overview of any dataset
id age language condition rt
S001 : 2 Min. :18.00 monolingual:110 word :100 Min. :256.3
S002 : 2 1st Qu.:24.00 bilingual : 90 nonword:100 1st Qu.:351.9
S003 : 2 Median :28.50 Median :412.6
S004 : 2 Mean :29.75 Mean :434.7
S005 : 2 3rd Qu.:33.25 3rd Qu.:510.0
S006 : 2 Max. :58.00 Max. :706.2
(Other):188
acc
Min. : 76.00
1st Qu.: 85.00
Median : 91.00
Mean : 89.95
3rd Qu.: 95.00
Max. :100.00
facets are really good for data with more than one factor
Quick and dirty first
Add colour
Change title of the legend for colour
Maybe violins
I like to see the mean overlaid on violins
Adding a line joining the means can help make interactions clearer
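One way the steps above could be put together. To keep the sketch self-contained it uses ldt_demo, a simulated stand-in with the same column names as data_ldt (the values are made up). The legend title "Stimulus type" is also just an example.

```r
library(ggplot2)

# Hypothetical stand-in for data_ldt: same column names, simulated values
set.seed(1)
ldt_demo <- data.frame(
  id        = factor(rep(sprintf("S%03d", 1:40), each = 2)),
  language  = factor(rep(c("monolingual", "bilingual"), each = 40)),
  condition = factor(rep(c("word", "nonword"), times = 40),
                     levels = c("word", "nonword"))
)
ldt_demo$rt <- rnorm(80, mean = ifelse(ldt_demo$condition == "word", 380, 480), sd = 60)

p_means <- ggplot(ldt_demo, aes(x = condition, y = rt, colour = condition)) +
  facet_wrap(~language) +                           # one panel per language group
  geom_violin() +
  stat_summary(fun = mean, geom = "point") +        # overlay the mean of each violin
  stat_summary(fun = mean, geom = "line",
               aes(group = 1), colour = "black") +  # join the means within each panel
  labs(colour = "Stimulus type")                    # change the legend title for colour
p_means
```

With the means joined, lines with different slopes across the panels hint at an interaction.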
I want to see how the distributions overlap
I want to see how the distributions overlap in the same facet
so I comment out the request for different facets using hash (#) at the start of line 4: now R ignores that line
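A sketch of the commenting-out trick. The slide's code is not reproduced here, so the line number will differ, and ldt_demo is a simulated stand-in for data_ldt. The idea is the same: a line starting with # is skipped, so the facets disappear and the densities share one panel.

```r
library(ggplot2)

# Hypothetical stand-in for data_ldt (simulated reaction times)
set.seed(2)
ldt_demo <- data.frame(
  condition = rep(c("word", "nonword"), each = 50),
  rt        = c(rnorm(50, mean = 380, sd = 60), rnorm(50, mean = 480, sd = 60))
)

p_overlap <- ggplot(ldt_demo, aes(x = rt, fill = condition)) +
  # facet_wrap(~condition) +    # commented out: R ignores this line
  geom_density(alpha = 0.5)
p_overlap
```

Delete the # to bring the facets back.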
pid date
5d15b34e0b1f87000115ec96: 10 Min. :2023-07-10 09:56:15.00
64089281b1d589f65731874e: 10 1st Qu.:2023-07-10 18:05:22.00
5fc460087eebcc8af045e624: 10 Median :2023-07-12 12:05:02.00
61034829cd0ce25db9519e37: 10 Mean :2023-07-12 01:04:45.19
5d287be62a33c100167c93c6: 10 3rd Qu.:2023-07-12 15:23:33.00
5f75f47376d11b000bc1bfc5: 10 Max. :2023-07-13 20:31:52.00
(Other) :710
condition condition_number phase
random fruit : different forests:190 Min. :1.000 experimental:770
random fruit : same forest :250 1st Qu.:3.000
fixed fruit : same forest :330 Median :4.000
Mean :4.169
3rd Qu.:6.000
Max. :7.000
session_duration complete approvals age
Min. : 533.6 ok :750 Min. : 2.0 Min. :19.00
1st Qu.: 662.2 incomplete: 20 1st Qu.: 199.0 1st Qu.:25.00
Median : 786.6 Median : 387.0 Median :33.00
Mean :1010.6 Mean : 507.2 Mean :35.49
3rd Qu.: 992.4 3rd Qu.: 691.0 3rd Qu.:44.00
Max. :5422.8 Max. :1856.0 Max. :71.00
NA's :20
sex ethnic nation OS
Female:390 Asian: 80 United Kingdom:450 Linux x86_64: 20
Male :380 White:570 United States :100 MacIntel :180
Black: 90 Canada : 70 Win32 :570
Mixed: 20 South Africa : 60
Other: 10 Australia : 30
Nigeria : 20
(Other) : 40
frameRate trees fruits n_low_consumed
Min. : 3.753 different forests:190 random fruit:440 Min. : 0.000
1st Qu.: 59.773 same forest :580 fixed fruit :330 1st Qu.: 3.000
Median : 59.981 Median : 6.000
Mean : 67.606 Mean : 5.677
3rd Qu.: 60.143 3rd Qu.: 8.000
Max. :362.450 Max. :15.000
n_hi_consumed textbox.text trialCount
Min. : 7.000 Type here ... : 60 Min. : 1.0
1st Qu.: 8.000 yesType here ... : 20 1st Qu.: 3.0
Median : 9.000 YES THY WERE VERY CLEAR : 10 Median : 5.5
Mean : 8.825 The instructions were very clear.: 10 Mean : 5.5
3rd Qu.:10.000 The instructions were very clear!: 10 3rd Qu.: 8.0
Max. :10.000 N/A : 10 Max. :10.0
(Other) :650
use_master_forest redistribute_fruit mouse.time high_fruit
FALSE :190 FALSE:330 Length:770 apple :370
TRUE :430 TRUE :440 Class :character banana:400
pp gets own forest:150 Mode :character
instructions
naive participants :620
INFORMED participants:150
Quick and dirty first: slight improvement over trials before we distinguish between the conditions
Distinguish between fixed and random fruit: performance improves more over trials for fixed fruit
Distinguish between naive and informed participants: performance is better for fixed fruit when participants are informed that the trees will be in the same place across trials (this primes them to remember where the trees are)
Cosmetic adjustments
ggplot(data_foraging, aes(y = n_hi_consumed, x = trialCount, color = fruits)) +
  facet_grid(~instructions) +
  geom_smooth() +
  scale_x_continuous(breaks = c(1, 5, 10)) +
  ylab("Number of high fruit consumed") +
  labs(title = "Mean number of high value fruit consumed per trial",
       subtitle = "max possible is 10") +
  theme_bw() +
  # optionally add aspect.ratio = 1 inside theme()
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()) +
  scale_color_manual(values = c("red", "blue"))
Switch from LOESS regression to linear regression using method='lm'
ggplot(data_foraging, aes(y = n_hi_consumed, x = trialCount, color = fruits)) +
  facet_grid(~instructions) +
  geom_smooth(method = 'lm') +
  scale_x_continuous(breaks = c(1, 5, 10)) +
  ylab("Number of high fruit consumed") +
  labs(title = "Mean number of high value fruit consumed per trial",
       subtitle = "max possible is 10") +
  theme_bw() +
  # optionally add aspect.ratio = 1 inside theme()
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()) +
  scale_color_manual(values = c("red", "blue"))
An alternative using points and lines
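The slide's code for this alternative is not shown here, so the following is one plausible reading: plot the mean per trial as points and join them with lines, instead of fitting a smoother. foraging_demo is a simulated stand-in that borrows the column names of data_foraging; its values are made up.

```r
library(ggplot2)

# Hypothetical stand-in for data_foraging: 12 simulated participants per cell
set.seed(3)
foraging_demo <- expand.grid(
  trialCount   = 1:10,
  pid          = 1:12,
  fruits       = c("random fruit", "fixed fruit"),
  instructions = c("naive participants", "INFORMED participants")
)
raw_score <- rnorm(nrow(foraging_demo), mean = 8 + 0.1 * foraging_demo$trialCount, sd = 1)
foraging_demo$n_hi_consumed <- pmin(10, pmax(0, round(raw_score)))  # keep scores in 0..10

p_points_lines <- ggplot(foraging_demo,
                         aes(x = trialCount, y = n_hi_consumed, colour = fruits)) +
  facet_grid(~instructions) +
  stat_summary(fun = mean, geom = "point") +   # mean per trial as a point
  stat_summary(fun = mean, geom = "line") +    # join the means, one line per colour
  scale_x_continuous(breaks = c(1, 5, 10)) +
  ylab("Number of high fruit consumed") +
  theme_bw()
p_points_lines
```

Unlike geom_smooth, this shows the actual trial means with no model in between.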
ggsave saves the most recently generated plot to a file.
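A small sketch of saving a plot. The file name, size and the mtcars data are illustrative assumptions. By default ggsave saves the last plot that was displayed; passing plot = explicitly, as here, avoids relying on that.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output location; the file extension decides the format
out_file <- file.path(tempdir(), "scatter.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)  # width and height are in inches by default
```

Changing the extension to .png or .svg changes the format, provided a suitable graphics device is available.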