See also https://github.com/cttobin/ggthemr
Think about how sampling influences data structure
Consider how we summarize our data
A little bit o’ Boolean
The split-apply-combine philosophy
Weight: 3.09 kg
3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76
Pair up with someone and come up with all of the information you can think of that would summarize this population.
If you only chose one color, you would only get one range of sizes.
Spatial gradient in size
Oh, I’ll just grab those individuals closest to me…
Sample over a known gradient, aka cluster sampling
Can incorporate multiple gradients
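To see why grabbing the nearest individuals can mislead, here is a minimal sketch in base R. Everything in it is hypothetical: the population of 100 fish, the linear increase of mass with position, and the choice of ten equal stretches are assumptions made up for illustration, not the salmon data. It compares a convenience sample with one spread evenly over the gradient (often called stratified sampling).

```r
# Hypothetical population: 100 fish along a river, with mass (kg)
# increasing steadily with position. These numbers are invented.
position <- 1:100
mass <- 1 + 0.02 * position

# Convenience sample: the 10 individuals closest to the observer.
nearest <- mass[position <= 10]

# Sample over the gradient: cut the river into 10 equal stretches
# and draw one random individual from each stretch.
set.seed(607)
stretch <- cut(position, breaks = seq(0, 100, by = 10))
stratified <- tapply(mass, stretch, function(m) m[sample.int(length(m), 1)])

# Compare each sample mean with the true population mean.
c(population = mean(mass),
  nearest    = mean(nearest),
  stratified = mean(stratified))
```

The convenience sample only ever sees the small end of the gradient, so its mean sits well below the population mean, while the sample spread over the gradient lands close to it.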
How is your population defined?
What is the scale of your inference?
What might influence the inclusion of a replicate?
How important are external factors you know about?
How important are external factors you cannot assess?
Consider each scenario and, in pairs, design a sampling scheme:
1. Population = All salmon across all rivers
2. Population = Salmon in one river
We assume a sample is representative of a population
Therefore, sample statistics are estimates of population statistics
Larger Samples = Better Estimators
\(\large \bar{Y}\) - The average value of a sample
\(Y_{i}\) - The value of a measurement for a single individual
n - The number of individuals in a sample
\(\mu\) - The average value of a population
(Greek = population, Latin = sample)
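With this notation the sample mean is \(\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i\). As a small sketch, the ten weights listed at the start of the deck can be averaged by hand and checked against R's built-in `mean()`. The variable names `mass`, `n`, and `y_bar` are chosen here for illustration.

```r
# The ten salmon weights (kg) listed at the start of the deck.
mass <- c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76)

n <- length(mass)        # number of individuals in the sample
y_bar <- sum(mass) / n   # sample mean computed from its definition

y_bar
mean(mass)               # the built-in function gives the same value
```

For these ten fish the sum is 24.70 kg, so the sample mean is 2.47 kg.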
[1] 1.56 1.61 1.76 2.16 2.69 2.88 2.91 2.98 3.06 3.09
[1] 1.855
mean, median
What is the range of 2/3 of the population?
How variable was that population?
\[\large s^2= \frac{\displaystyle \sum_{i=1}^{n}{(Y_i - \bar{Y})^2}} {n-1}\]
\[ \large s = \sqrt{s^2}\]
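As a sketch, the two formulas can be applied step by step to the ten weights listed at the start of the deck and compared with R's built-in `var()` and `sd()`. The last line assumes the weights are roughly normally distributed, in which case about two thirds (68%) of values fall within one standard deviation of the mean. That assumption is mine and may not hold for these data.

```r
# The ten salmon weights (kg) listed at the start of the deck.
mass <- c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76)

n <- length(mass)
y_bar <- mean(mass)

# Sample variance: sum of squared deviations divided by n - 1.
s2 <- sum((mass - y_bar)^2) / (n - 1)

# Sample standard deviation: square root of the variance.
s <- sqrt(s2)

c(variance = s2, sd = s)
c(var(mass), sd(mass))   # built-in functions give the same values

# Rough range holding about 2/3 of values, IF the data are near normal.
y_bar + c(-1, 1) * s
```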
[1] 1.56 1.61 1.76 2.16 2.69 2.88 2.91 2.98 3.06 3.09
Quantiles:
5% 10% 50% 90% 95%
1.4270 1.5300 1.8550 2.9430 3.0865
Quartiles (quarter-quantiles):
0% 25% 50% 75% 100%
1.1800 1.6400 1.8550 2.2675 3.5300
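The output above appears to come from the full data set. As a sketch of the calls involved, the same summaries can be computed for just the ten weights listed at the start of the deck, so the numbers will differ from those printed above. `quantile()` uses R's default interpolation method (type 7) here.

```r
# The ten salmon weights (kg) listed at the start of the deck.
mass <- c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76)

sort(mass)      # order the values from smallest to largest
median(mass)    # middle of the sorted values (mean of the 5th and 6th)

# Arbitrary quantiles: pass the probabilities you want.
quantile(mass, probs = c(0.05, 0.10, 0.50, 0.90, 0.95))

# Quartiles are the default: 0%, 25%, 50%, 75%, 100%.
quantile(mass)
```

With an even number of observations the median is the average of the two middle values, here 2.69 and 2.88.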
# A tibble: 228 × 3
mass river mass_class
<dbl> <chr> <fct>
1 3.09 a (2.75,3.14]
2 2.91 b (2.75,3.14]
3 3.06 c (2.75,3.14]
4 2.69 d (2.35,2.75]
5 2.88 e (2.75,3.14]
6 2.98 f (2.75,3.14]
7 1.61 a (1.57,1.96]
8 2.16 b (1.96,2.35]
9 1.56 c [1.18,1.57]
10 1.76 d (1.57,1.96]
# … with 218 more rows
# A tibble: 74 × 3
mass river mass_class
<dbl> <chr> <fct>
1 3.09 a (2.75,3.14]
2 2.91 b (2.75,3.14]
3 3.06 c (2.75,3.14]
4 2.69 d (2.35,2.75]
5 2.88 e (2.75,3.14]
6 2.98 f (2.75,3.14]
7 2.16 b (1.96,2.35]
8 3.3 f (3.14,3.53]
9 3.25 e (3.14,3.53]
10 2.18 f (1.96,2.35]
# … with 64 more rows
# A tibble: 38 × 3
mass river mass_class
<dbl> <chr> <fct>
1 3.09 a (2.75,3.14]
2 1.61 a (1.57,1.96]
3 1.91 a (1.57,1.96]
4 2.13 a (1.96,2.35]
5 1.53 a [1.18,1.57]
6 1.75 a (1.57,1.96]
7 1.76 a (1.57,1.96]
8 1.72 a (1.57,1.96]
9 2.29 a (1.96,2.35]
10 1.74 a (1.57,1.96]
# … with 28 more rows
# A tibble: 4 × 3
mass river mass_class
<dbl> <chr> <fct>
1 3.09 a (2.75,3.14]
2 3.04 a (2.75,3.14]
3 3.11 a (2.75,3.14]
4 3.05 a (2.75,3.14]
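The tables above look like the result of keeping only the rows where some Boolean condition is true. A minimal sketch of that idea in base R follows, using only the first ten rows printed above. The data frame name `salmon_head` and the cutoff of 3 are assumptions for illustration, and base R indexing is used to avoid package dependencies, although the tibble output suggests the slides themselves use dplyr.

```r
# First ten rows of the data as printed above (mass in kg, river label).
salmon_head <- data.frame(
  mass  = c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76),
  river = c("a", "b", "c", "d", "e", "f", "a", "b", "c", "d")
)

# A comparison returns one TRUE/FALSE per row.
heavy <- salmon_head$mass > 3       # illustrative cutoff
in_a  <- salmon_head$river == "a"

salmon_head[heavy, ]          # rows where mass > 3
salmon_head[in_a, ]           # rows from river a
salmon_head[in_a & heavy, ]   # AND: both conditions must hold
salmon_head[in_a | heavy, ]   # OR: at least one condition holds
```

Combining conditions with `&` can only shrink the result, while `|` can only grow it.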
# A tibble: 1 × 1
n
<int>
1 4
# A tibble: 1 × 1
n
<int>
1 4
[1] 9
# A tibble: 1 × 1
n
<int>
1 228
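Counting rows that satisfy a condition follows directly from Boolean values, because R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic. The sketch below again uses only the first ten rows of the data as printed earlier, with the hypothetical name `salmon_head` and an illustrative cutoff of 3, so its counts will not match the full-data counts above.

```r
# First ten rows of the data as printed earlier (mass in kg, river label).
salmon_head <- data.frame(
  mass  = c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76),
  river = c("a", "b", "c", "d", "e", "f", "a", "b", "c", "d")
)

is_heavy <- salmon_head$mass > 3   # logical vector, one value per row

n_heavy    <- sum(is_heavy)    # number of TRUEs
prop_heavy <- mean(is_heavy)   # proportion of TRUEs
n_heavy_a  <- sum(is_heavy & salmon_head$river == "a")
n_total    <- nrow(salmon_head)

c(heavy = n_heavy, heavy_in_a = n_heavy_a, total = n_total)
prop_heavy
```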
Filtering and working with one chunk of the data is not enough
We often want to summarize information about many groups
What are things you want to know about different rivers in the salmon data?
What are things you want to know about different size classes in the salmon data?
# A tibble: 6 × 3
river mean_mass sd_mass
<chr> <dbl> <dbl>
1 d 1.89 0.450
2 e 1.98 0.491
3 c 2.04 0.548
4 f 2.04 0.612
5 b 2.10 0.608
6 a 2.11 0.511
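A per-river table like the one above can be built in three explicit steps: split the data by group, apply a summary to each piece, and combine the pieces. The sketch below spells those steps out in base R on thirteen rows taken from the printed output earlier in the deck, so its numbers differ from the full-data table. The name `salmon_rows` is hypothetical, and the slides themselves appear to use dplyr rather than this approach.

```r
# Thirteen rows copied from the printed output earlier in the deck.
salmon_rows <- data.frame(
  mass  = c(3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61,
            2.16, 1.56, 1.76, 3.30, 3.25, 2.18),
  river = c("a", "b", "c", "d", "e", "f", "a",
            "b", "c", "d", "f", "e", "f")
)

# Split: one vector of masses per river.
by_river <- split(salmon_rows$mass, salmon_rows$river)

# Apply: summarize each river separately.
summaries <- lapply(by_river, function(m) {
  data.frame(mean_mass = mean(m), sd_mass = sd(m))
})

# Combine: stack the one-row summaries into a single table.
result <- do.call(rbind, summaries)
result <- cbind(river = rownames(result), result)

# Sort by mean mass, as in the table above.
result[order(result$mean_mass), ]
```

Each river's summary is computed independently of the others, which is what makes the same pattern work for any grouping variable, such as size class.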