For this assignment we’re going to look at birthweights of babies in California from 2000 to 2013. The data has year, birthweight groups, and counts of the number of babies in each birthweight group. There’s also information on county, zip code, lat/long, etc. These may or may not be useful, but they are interesting grouping factors.
Use skimr to show that you’ve done so properly, and everything is as it should be.
Visualize this by looking at the summed number of individuals in each birthweight category in each year. Make the plot as easy to understand as possible. Extra credit for making it fancy and not just a default plot. Note, Truckee has multiple zip codes, so you’re going to want to make sure to sum over all zip codes in the city!
4a. Create a new column, lat_group, where you use cut_interval to cut
the data into 10 groups.
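As a minimal sketch of how cut_interval behaves, here it is on made-up data. The toy tibble and the column name latitude are assumptions for illustration only; check the real column names in the birthweight data before adapting this.

```r
library(dplyr)
library(ggplot2)  # cut_interval() comes from ggplot2

# Toy data: `latitude` is an assumed column name, not taken from the real file
toy <- tibble(latitude = seq(32.5, 42, length.out = 50))

# cut_interval(x, n = 10) splits the range of x into 10 equal-width bins
toy <- toy %>%
  mutate(lat_group = cut_interval(latitude, n = 10))

# The new column is a factor with one level per bin
nlevels(toy$lat_group)
```

Because the bins are equal in width rather than equal in count, some bins in real data can hold far more rows than others.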
4b. Calculate the summed birthweight count for each birthweight
group and latitude group - but also calculate the mean latitude in that
group.
4c. Plot! Remember, make your axes, labels, and titles informative!
That mean latitude will help you make a good plot. Trust me!
4d. Well, that was unsatisfying, given the different number of
births at different latitudes. Can you redo this, correcting for the number
of individuals in each latitude bin to make it more useful? So, percent
in each bin? Make those axes on the plot tell us what is going on! Hint:
to do this, you’ll need to group not just once, but twice. The first
time, group by latitude group and birthweight group; the second, you’ll just
use latitude group and, instead of summarize, use a mutate to calculate
percent in each birthweight category. Don’t forget to ungroup at the end!
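The group-twice pattern in the hint can be sketched on toy data as follows. Everything here (the tibble, the column names lat_group, weight_group, and count, and the numbers) is invented for illustration and is not the assignment's data.

```r
library(dplyr)

# Hypothetical toy counts; names and values are illustrative assumptions
toy <- tibble(
  lat_group    = c("south", "south", "south", "north", "north", "north"),
  weight_group = c("low", "normal", "low", "low", "normal", "normal"),
  count        = c(10, 80, 10, 5, 30, 15)
)

pct <- toy %>%
  # First grouping: one row per latitude group x birthweight group
  group_by(lat_group, weight_group) %>%
  summarize(total = sum(count), .groups = "drop") %>%
  # Second grouping: latitude group only, so sum(total) is the bin total
  group_by(lat_group) %>%
  mutate(percent = 100 * total / sum(total)) %>%
  ungroup()

pct
```

Using mutate in the second step keeps one row per birthweight category, whereas summarize would collapse each latitude group to a single row.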
4e. What did visualization by raw numbers versus percent tell you? How are they different? Why might they be different? What does this tell you about visualization and analysis of population trends in general?
Note that there is latitude and longitude information in this data. Can you use that in some way to plot out anything interesting in the data in terms of geographic distribution? Note, log(x+1) transformations may be your friend for some things, as may correcting by population size. Have fun with this! Feel free to look into geospatial visualization with ggplot2 or other packages, although we’ll do this more formally in a few weeks. Might not be necessary, but you never know.
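For the log(x+1) idea, base R's log1p does the same thing in one call. A small sketch on made-up counts (the vector below is an assumption, not real data):

```r
# Made-up counts spanning several orders of magnitude, including a zero
counts <- c(0, 9, 99, 9999)

# log1p(x) computes log(x + 1), so a count of zero maps to 0 instead of -Inf
log_counts <- log1p(counts)

log_counts
```

This kind of transformation can keep a few very large counts from swamping the color or size scale on a map, while still letting locations with zero births appear.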