09_summarizing_data.knit

Making Data Tell its Story

First, some fun!

#devtools::install_github("gadenbuie/ggpomological")
library(ggpomological)

ggplot(iris, aes(x=Petal.Length, y = Petal.Width, color=Species)) +
  geom_point(size=3) +
  theme_pomological(base_family = 'Homemade Apple', base_size=16) +
  scale_color_pomological()

First, some fun!

Where are we going?

Objectives

Think about how sampling influences data structure
Consider how we summarize our data
A little bit o’ Boolean
The split-apply-combine philosophy

How much does one salmon weigh?

Weight: 3.09kg

How much do these salmon weigh?

3.09, 2.91, 3.06, 2.69, 2.88, 2.98, 1.61, 2.16, 1.56, 1.76

What can we say about the weights of all of these salmon?

Pair up with someone and come up all of the information you can think of that would summarize this population.

Population versus Sample

What is a population?

Population = All Individuals

What is a sample?

A sample of individuals in a randomly distributed population.

What are the properties of a good sample?

Sample is representative

Replicates have an equal chance of being sampled

Replicates are independent

Bias from Unequal Representation

If you only chose one color, you would only get one range of sizes.

Bias from Unequal Change of Sampling

Spatial gradient in size

Bias from Unequal Change of Sampling

Oh, I’ll just grab those individuals closest to me…

Good Sampling Schemes: Stratified Sampling

Sample over a known gradient, aka cluster sampling

Can incorporate multiple gradients

Good Sampling Schemes: Random Sampling

Two sampling schemes:

Random - samples chosen using random numbers
Haphazard - samples chosen without any system (careful!)

Stratified or Random?

How is your population defined?
What is the scale of your inference?
What might influence the inclusion of a replicate?
How important are external factors you know about?
How important are external factors you cannot assess?

Different Sampling for Different Populations

Consider each scenario and in pairs design a sampling schema:
1. Population = All salmon across all rivers
2. Population = Salmon in one river

Objectives

Think about how sampling influences data structure
Consider how we summarize our data
A little bit o’ Boolean
The split-apply-combine philosophy

Our Data

Sample versus Population Summarization

We assume a sample is representative of a population
Therefore, sample statistics are estimates of population statistics
Larger Samples = Better Estimators

Accuracy versus Precision

https://bobchoat.files.wordpress.com/2014/01/accurate-precise1.jpg

Central Tendancy: Mean

\(\bar{Y} = \frac{ \displaystyle \sum_{i=1}^{n}{y_{i}} }{n}\)

\(\large \bar{Y}\) - The average value of a sample
\(y_{i}\) - The value of a measurement for a single individual
n - The number of individuals in a sample

\(\mu\) - The average value of a population
(Greek = population, Latin = Sample)

Median - Dead Center

 [1] 1.56 1.61 1.76 2.16 2.69 2.88 2.91 2.98 3.06 3.09

Middle Value = 50th Percentile = 0.5 Quantile = Median
good for non-normal data

[1] 1.855

Central Tendancies

mean, median

What about population-level variability?

What is the range of 2/3 of the population?

What about population-level variability?

What is the range of 2/3 of the population?

Sample Properties: Variance

How variable was that population? \[\large s^2= \frac{\displaystyle \sum_{i=1}^{n}{(Y_i - \bar{Y})^2}} {n-1}\]

Sums of Squares over n-1
n-1 corrects for both sample size and sample bias
\(\sigma^2\) if describing the population
Units in square of measurement…

Sample Properties: Standard Deviation

\[ \large s = \sqrt{s^2}\]

Units the same as the measurement
If distribution is normal, 67% of data within 1 SD
95% within 2 SD
\(\sigma\) if describing the population

What about population-level variability?

Variability: Quantiles and Quartiles

 [1] 1.56 1.61 1.76 2.16 2.69 2.88 2.91 2.98 3.06 3.09

Quantiles:

    5%    10%    50%    90%    95% 
1.4270 1.5300 1.8550 2.9430 3.0865

Quartiles (quarter-quantiles):

    0%    25%    50%    75%   100% 
1.1800 1.6400 1.8550 2.2675 3.5300

Boxplots

Boxplot of One Population

Boxplot of Many Populations

Meh, I still like ridgelines

Objectives

Think about how sampling influences data structure
Consider how we summarize our data
A little bit o’ Boolean
The split-apply-combine philosophy

What do we want to see?

# A tibble: 228 × 3
    mass river mass_class 
   <dbl> <chr> <fct>      
 1  3.09 a     (2.75,3.14]
 2  2.91 b     (2.75,3.14]
 3  3.06 c     (2.75,3.14]
 4  2.69 d     (2.35,2.75]
 5  2.88 e     (2.75,3.14]
 6  2.98 f     (2.75,3.14]
 7  1.61 a     (1.57,1.96]
 8  2.16 b     (1.96,2.35]
 9  1.56 c     [1.18,1.57]
10  1.76 d     (1.57,1.96]
# … with 218 more rows

Our Hero…

Filtering

library(dplyr)

filter(salmon, mass > 2)

# A tibble: 74 × 3
    mass river mass_class 
   <dbl> <chr> <fct>      
 1  3.09 a     (2.75,3.14]
 2  2.91 b     (2.75,3.14]
 3  3.06 c     (2.75,3.14]
 4  2.69 d     (2.35,2.75]
 5  2.88 e     (2.75,3.14]
 6  2.98 f     (2.75,3.14]
 7  2.16 b     (1.96,2.35]
 8  3.3  f     (3.14,3.53]
 9  3.25 e     (3.14,3.53]
10  2.18 f     (1.96,2.35]
# … with 64 more rows

Filtering

filter(salmon, river == "a")

# A tibble: 38 × 3
    mass river mass_class 
   <dbl> <chr> <fct>      
 1  3.09 a     (2.75,3.14]
 2  1.61 a     (1.57,1.96]
 3  1.91 a     (1.57,1.96]
 4  2.13 a     (1.96,2.35]
 5  1.53 a     [1.18,1.57]
 6  1.75 a     (1.57,1.96]
 7  1.76 a     (1.57,1.96]
 8  1.72 a     (1.57,1.96]
 9  2.29 a     (1.96,2.35]
10  1.74 a     (1.57,1.96]
# … with 28 more rows

Logical Operators

Many Filters

filter(salmon, river == "a" & mass > 3)

# A tibble: 4 × 3
   mass river mass_class 
  <dbl> <chr> <fct>      
1  3.09 a     (2.75,3.14]
2  3.04 a     (2.75,3.14]
3  3.11 a     (2.75,3.14]
4  3.05 a     (2.75,3.14]

Many Filters

filter(salmon, river == "a" & mass > 3) |>
  count()

# A tibble: 1 × 1
      n
  <int>
1     4

filter(salmon, river == "b" & mass > 3) |>
  count()

# A tibble: 1 × 1
      n
  <int>
1     4

Pipes pass an object as the first argument to a function

5 |> sum(4)

[1] 9

salmon |> count()

# A tibble: 1 × 1
      n
  <int>
1   228

Pipes and Functional Programming

We often thing in terms of do this, then then, then this
Programming that way requires overwriting a lot of objects
It gets…. messy

Functional Programming Workflow

#filter the salmon data by river

#filter the salmon data by mass

#get the count of the salmon

Functional Programming Can get Verbose

#filter the salmon data by river
salmon_filtered <- filter(salmon, river == "a") 

#filter the salmon data by mass
salmon_filtered <- filter(salmon_filtered, mass > 3) 

#get the count of the salmon
count(salmon_filtered)

Pipes Make Programs Read Like Language!

salmon_filtered <- salmon |>
  
  #filter the salmon data by river
  filter(river == "a") |>
  
  #filter the salmon data by mass
  filter(mass > 3) |>
  
  #get the count of the salmon
  count()

Objectives

Think about how sampling influences data structure
Consider how we summarize our data
A little bit o’ Boolean
The split-apply-combine philosophy

Where are we going?

Split-Apply-Combine

Filtering and working with one chunk of the data is not enough
We often want to summarize information about many groups

Split-Apply-Combine

Grouping by River

salmon_rivers <- salmon |>

  group_by(river) |>
  
  summarize(mean_mass = mean(mass),
            sd_mass = sd(mass))

What can we do with this?

arrange(salmon_rivers, mean_mass)

# A tibble: 6 × 3
  river mean_mass sd_mass
  <chr>     <dbl>   <dbl>
1 d          1.89   0.450
2 e          1.98   0.491
3 c          2.04   0.548
4 f          2.04   0.612
5 b          2.10   0.608
6 a          2.11   0.511

Making Data Tell its Story

First, some fun!

First, some fun!

First, some fun!

First, some fun!

Where are we going?

Objectives

How much does one salmon weigh?

How much do these salmon weigh?

What can we say about the weights of all of these salmon?

Population versus Sample

What is a population?

What is a sample?

What are the properties of a good sample?

Bias from Unequal Representation

Bias from Unequal Change of Sampling

Bias from Unequal Change of Sampling

Good Sampling Schemes: Stratified Sampling

Good Sampling Schemes: Random Sampling

Stratified or Random?

Different Sampling for Different Populations

Objectives

Our Data

Sample versus Population Summarization

Accuracy versus Precision

Central Tendancy: Mean

Median - Dead Center

Central Tendancies

What about population-level variability?

What about population-level variability?

Sample Properties: Variance

Sample Properties: Standard Deviation

What about population-level variability?

Variability: Quantiles and Quartiles

Boxplots

Boxplot of One Population

Boxplot of Many Populations

Meh, I still like ridgelines

Objectives

What do we want to see?

Our Hero…

Filtering

Filtering

Logical Operators

Many Filters

Many Filters

Sidebar - the pipe!

Sidebar - the pipe!

Pipes pass an object as the first argument to a function

Pipes and Functional Programming

Functional Programming Workflow

Functional Programming Can get Verbose

Pipes Make Programs Read Like Language!

Objectives

Where are we going?

Split-Apply-Combine

Split-Apply-Combine

Think-Pair-Share

Grouping by River

What can we do with this?

What can we do with this?

Custom Counts