Putting it all together

For your final project, I’m asking you to create an exploration of a dataset of your choice. This project should provide people with ways to explore the dataset in a way that will guide them towards learning something about relationships and patterns in the data. The final project will have two graded parts - a Shiny app for end-users and a final paper where you tell us just what you did. Your app and paper should explore two or more particular features of the data you find interesting, and provide an end-user a way to derive new and meaningful inferences from the data within the bounds of the controls you provide them with.

The Shiny App

Your app should present 1) the dataset you are looking at, and a full explanation of what it is (this can be a front page, a tabset, or whatever you would like), 2) provide users at least 3 ways to explore different aspects of the data, 3) provide model fits and statistical tests where it would help the user, 4) useful visual representations that are easy for the user to understand. It should be organized in a way that a user has no problem navigating around, and should not just be one giant pane of overwhelming data and analyses.

You will present the app to both your classmates and visitors in a presentation Zoom session. There, I will set you each up with your own breakout room where people can visit you and talk to you about the app. You will have a running app on http://shinyapps.io (or other server) to which you can give them a link. Half of the class will present for the first half-hour, the other half for the second half-hour, so you can visit each others projects!

In addition to your professor and TA visiting you and trying out your app, you will be required to submit the code and data for the app as a zipped up archive. Note - You might want to use the goodpractices library and the styler library to make sure your code is readable and accords to good style-guide principles. You will submit this as you usually submit homeworks.

Grading

For your app and presentation, you will be graded on 1. Ease of use
2. Interpretability of what a user learns from the app
3. Thorough exploration of the data
4. Quality of visualizations
5. Validity of analyses
6. Quality and readbility of code
7. Presentation of app to an end-user
8. Ability to answer questions about the data and results

Extra credit is available to those who want to publicize their app more widely.
Each of the above will be graded on a 5 point scale.
1 = Student shows little ability to execute this concept.
2 = Student shows a flawed attempt at execution of this step, or minimal effort.
3 = Student shows understanding of the concept, and is able to achieve a minimal satisfactory outcome.
4 = Student shows understanding, and presents a well-crafted execution of the concept.
5 = Student has achieved mastery of the concept, with a deep and compelling execution.

The Paper

Organizational Structure

The final paper should be broken up into something along the lines of the following sections. You may feel free to adapt this flexibly given your unique data set and set of problems and questions. But this is a general guide, particularly if you are lost.

1) Introduction

What is the dataset you are using?
- Describe the dataset in detail
- Visualize summary information about the data
Based on the data, what types of elements did you want people to be able to explore? What questions did you want them to be able to ask and answer?

2) Methods

How did you prepare and manupulate the data to get it ready for your app? Write out the workflow(s) step by step.
How did you decide on the ways you wanted the end-user to be able to explore the data? What information did you chose to give to them and why? What choices did you give them and why?
What was the setup of your app? Tell us about the layout you chose, why you chose it, and how it created a
What are the visualizations and analyses that you provided access to?
- Show us the types of visualizations you wanted to create to help the data tell its story
- What are the underlying models you let the user fit? Why those models? What is their data generating process and your error generating process?
- What statistical tests did you return to the user. Why? What did they tell them?
Include a working link to your app

3) Results

What are the most obvious things that you can learn from your app about your dataset?
- Based on what people could do in your app, is there any 2-3 visualizations or analyses that stand out? Show them here.
What are other visualizations and analyses that come out of your dataset?
- Anything that your app made you think you should go look at?
- Any statistical tests of patterns that you can see in your app, but are not formally tested there?

4) Discussion

What did you learn from your app and your exploration of this data?
Does the way you setup your app allow you to learn anything you did not expect?
What feedback did you get about your app?
Do your results suggest additional visualizations/analyses? Feel free to conduct them here.
What final conclusion can you draw about your data set?

Some Markdown Notes for the paper

As a quick note, you might want to massage you code chunks a bit in your presentation so we don’t see code, error messages, warnings, etc. Remember warning=FALSE, message=FALSE, echo=FALSE, and more are all your friends. For more - including how to resize graphs and such - see here.

If you want to output your statistical results as a table, use knitr::kable or kabelExtra. If you want to build tables that you fill in with text, I recommend this markdown table generator as markdown tables can be tricky.

You might want to use the goodpractices library and the styler library to make sure your code is readable and accords to good style-guide principles.

Last, here’s a markdown cheatsheet - although Rstudio’s help has a good set of materials as well.

Grading of the Paper

Remember, I’m looking for quality well commented, well thought-out, well styled code in addition to gorgeous visualizations that convery a message and unambiguous analyses (where possible). As a guide to the whole paper, here’s my rubric. Once you finish your paper, assess yourself. Heck, if you even want to write a short section where you assess yourself against this rubric and justify why we should give you a certain grade according to this rubric, feel free!

Exepectations	4 (Exemplary)	3 (Accomplished)	2 (Developing)	1 (Beginning)
Description of data	The student provides the source of the dataset. The description includes background on how the data were collected, with a focus on details of the data collection that would be relevant to how they answer their question (e.g., understanding of sampling design that may be relevant to meeting assumptions of statistical tests). The student provides exploratory summary statistics or visualisations that help the reader understand the scope, content, and coverage of the data.	The student provides the source of the dataset. The description includes background on how the data were collected, with a focus on details of the data collection that would be relevant to how they answer their question (e.g., understanding of sampling design that may be relevant to meeting assumptions of statistical tests). The student provides exploratory summary statistics and visualizations.	The student provides the source and brief description of the dataset. The student provides exploratory summary statistics or visualizations.	Student gives us name and source of data set and what type of data is in it.
Explanation and justification of question(s)	There is a single focal question that is testable given the data. There are additional questions that are subsets or follow ups of the focal question. These questions will also be testable given the data. The student has described the rationale behind the question, providing context for how they came up with this question.	There is a single focal question that is testable given the data. The student has described the rationale behind the question, providing context for how they came up with this question.	There is a single focal question that is testable given the data. The student has described the rationale behind the question.	There is a single focal question that relates to the data. The rationale for the question is unclear, however.
Description of workflow	The student provides a verbal description of the workflow used to answer their question. What steps did they take to answer their question. This can include everything from data tidying to visualization to analysis. Justification for the workflow is included in this description.	The student provides a verbal description of the workflow used to answer their question. What steps did they take to answer their question? This can include everything from data tidying to visualization to analysis.	The student provides a verbal description of the workflow of the analyses used to answer their question.	The student provides a broad-based verbal description of the workflow of their process, but is not able to break it down into specific steps. The description reads as an abstract rather than a concrete set of actions.
Selection and justification of statistical methods	The statistical methods used are appropriate and answer the question posed. A justification of statistical method choice includes a clear statement of the underlying model used, with a description of the data generating process and error generating process. The student has clearly described the assumptions of the test or tests that they used and provided support that these assumptions have been met (e.g., verbal descriptions, additional statistical tests, data visualisations).	The statistical methods used are appropriate and answer the question posed. The student has clearly described the assumptions of the test or tests that they used and provided support that these assumptions have been met (e.g., verbal descriptions, additional statistical tests, data visualisations).	The statistical methods used are appropriate and answer the question posed.	The studnet proposes statistical methods, but how they relate to the question being asked is unclear.
Code quality	The student adheres to an R style guide (e.g., http://adv-r.had.co.nz/Style.html). The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done. This code is modular, with complex problems being broken down into small, human readable, and logically discrete steps. Functions are used instead of repeated code chunks where appropriate. The report has been written in Rmarkdown	The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done. This code is modular, with complex problems being broken down into small, human readable, and logically discrete steps. Functions are used instead of repeated code chunks where appropriate.	The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done.	The student has made an effort to make the code readable via good commenting practices.
Presentation of results	Data visualisations are clearly relevant to the questions being asked and models being tested. Figures are appropriately captioned and can be interpreted with minimal additional context (i.e., can stand alone). Each visualisation conveys information that is related to the questions described in the introduction. Axes are well labeled, legends are clear, color schemes make key points easily understandable to the reader. Minutaue of font-sizes, visual aesthetics show clear attention to detail.	Data visualisations are clearly relevant to the questions being asked and models being tested. Figures are appropriately captioned and can be interpreted with minimal additional context (i.e., can stand alone). Each visualisation conveys information that is related to the questions described in the introduction.	Data visualisations are clearly relevant to the questions being asked and models being tested. Each visualisation conveys information that is related to the questions described in the introduction.	Visualizations convey information related to analyses and questions in a clear manner, but are difficult to interpret.
Discussion/evaluation of results	The conclusions are derived logically from the results and data visualisations. The student has examined and evaluated limitations in the analysis and has proposed ways to overcome these limitations. The student is able to synthesize multiple different results into a single strong conclusion.	The conclusions are derived logically from the results and data visualisations. The student has multiple conclusions drawn from analyses, but does not bring them together into a single strong point.	The conclusions are derived logically from the results and data visualisations.	The conclusions are derived from the results and data visualisations, but do not connect clarly and cleanly. Some analyses are ignored.
Quality of writing	Spelling and grammar count. Sections written so that they can be clearly understood by the reader. Student’s prose flow cleanly and clearly. Sentences are complete, organized, and are easy to understand. Clarity of communication is key.	Spelling and grammar count. Sections written so that they can be clearly understood by the reader. Clarity of communication is key.	Spelling and grammar count. Sections written so that they can be clearly understood by the reader.	Spelling and grammar are not great, but the writing is still clear.

So, each section of the rubric is out of 4.

Note: Using libraries not taught in class will be +1 point per library. Please note when you are using a new library for you.

An example from the past

This is before we moved to shiny apps, and, this needs a bit more organizational cohesion, and it was written in bullet points, which got big points off (write! and write well!) but overall, not a bad paper

Final Project

Intro to Data Science for Biology

Presentations May 12, 2023, Paper Due May 17, 2023

Submission links