Datasets

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

31 August, 2022

Introduction

R and R Studio

R

  • It is a programming language and software environment for statistical computing and graphics.
  • It was released in 1995 by professors from the University of Auckland as an open-source implementation of the S programming language created at Bell Labs around 1976.

R Studio

  • It is a free and open-source integrated development environment (IDE) for R, developed by the company RStudio (to be renamed Posit in October 2022).
  • A beta version was officially announced in 2011, and the full release followed in 2016.
  • It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management.

R

Libraries

  • R is organized based on libraries/packages on the Comprehensive R Archive Network (CRAN);
  • Each library is a collection of functions. Usually, a library focuses on solving a specific statistical problem;
  • CRAN Task Views allow us to browse packages by topic, for example Clinical Trials, Genetics, Pharmacokinetics and Reproducible Research.
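
As a minimal sketch of how a library is used in practice: a package is installed from CRAN once and then loaded in every session in which it is needed. The package name survival is only an illustrative choice, and the install line is commented out because it needs an internet connection.

```r
# Install a package from CRAN (needed only once per computer).
# "survival" is an illustrative choice, not a requirement of this course.
# install.packages("survival")

# Load a package to make its functions available in the current session.
# The stats package ships with R, so this line runs without installing anything.
library(stats)

# List the functions that the package provides and show the first few.
stats_functions <- ls("package:stats")
head(stats_functions)

# A function can also be called with the package::function notation.
middle_value <- stats::median(c(2, 4, 9))
middle_value
```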

R

What is a function?

A function takes inputs, performs operations on them, and returns outputs.

R

What is a function?

The function f takes objects in X and transforms them into objects in Y.
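
The idea can be sketched in R as follows. The function name bmi and its arguments are hypothetical, chosen only to illustrate inputs, an operation, and an output.

```r
# A function that takes two inputs (weight in kg, height in m),
# performs an operation, and returns one output (body mass index).
# The name bmi and its arguments are hypothetical, chosen for illustration.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Call the function with one patient's values ...
bmi(weight_kg = 81, height_m = 1.80)

# ... or with vectors holding the values of several patients at once.
bmi(weight_kg = c(70, 85), height_m = c(1.75, 1.80))
```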

R

Community

  • R has a very active and friendly online community;
  • Questions related to R can be asked on
    • Twitter: Use #Rstats;
    • Stack Overflow: Use the tag r and the package name.

R Studio

  • There are four panes:
  1. Source: Upper left pane, where you write the R script that you will run;

Source Pane

R Studio

  2. Console: Lower left pane, which shows the results of your R script;

Console Pane

R Studio

  3. Environment/History/Connections/Tutorial: Upper right pane, which has two important tabs:
  • Environment: To import datasets and manage all the active objects;
  • History: The history of all the commands you have run.

Environment/History/Connections/Tutorial Pane

R Studio

  4. Files/Plots/Packages/Help/Viewer: Lower right pane, which has five important tabs:
  • Files: Shows all the files in the directory where R will save the results. To change the directory, click on “. . .”;
  • Plots: When you run code in the Console that creates a plot, the Plots tab is selected automatically;
  • Packages: Lists all the libraries (add-ons to the R code) you have access to, and shows which ones are already loaded;
  • Help: This tab is selected automatically whenever you run help code in the Console;
  • Viewer: Shows reports generated by R.

Files/Plots/Packages/Help/Viewer Pane

R Studio - Customization

  • You can change the position of the panes: Tools > Global Options > Pane Layout.
  • You can change the appearance: Tools > Global Options > Appearance.

Appearance

Pane Layout

R Studio - Projects

  • Projects make it straightforward to divide your work into multiple projects, each with its own working directory, workspace, history, and source documents.
  • Go to File > New Project:
  1. Choose New Directory > New Project;
  2. Define the directory name and choose where it will be located;
  3. “Create a git repository” lets you track the changes made to your code with Git and share it through GitHub or GitLab. It is great for collaborating and sharing code;
  • Let's create our first project!

Datasets

Why is it important to handle data correctly?

Make a separate backup every day you work on your dataset.
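
A dated backup can be made from R itself. The sketch below is one possible approach under stated assumptions: the file name study_data.csv and the folder name backups are hypothetical, and a temporary folder is used so that the example runs anywhere.

```r
# Create a small example dataset file in a temporary folder.
# In practice, data_file would be the path to your own dataset.
work_dir  <- tempdir()
data_file <- file.path(work_dir, "study_data.csv")
write.csv(data.frame(id = 1:3, age = c(54, 61, 47)), data_file, row.names = FALSE)

# Build a backup file name that contains today's date.
backup_dir <- file.path(work_dir, "backups")
dir.create(backup_dir, showWarnings = FALSE)
backup_file <- file.path(
  backup_dir,
  paste0("study_data_", format(Sys.Date(), "%Y-%m-%d"), ".csv")
)

# Copy the file; overwrite = FALSE protects a backup already made today.
copied <- file.copy(from = data_file, to = backup_file, overwrite = FALSE)
```

Because the date is part of the file name, each working day produces a new file instead of replacing the previous backup.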

Why is it important to handle data correctly?

  • To avoid redoing work;
  • To reduce the probability of making mistakes;
  • To make the interaction with other researchers more efficient;
  • To make your statistical analysis reproducible.

Uppsala Scandal

  • Lönnstedt, O.M. and Eklöv, P., 2016. Environmentally relevant concentrations of microplastic particles influence larval fish ecology. Science, 352(6290), pp.1213-1216.
  • Authors reported experiments showing that fish that ate tiny ‘microplastics’ grew more slowly and were more likely to be eaten by predators.
  • A group of researchers complained that not all the data underlying the results of the study were available.
  • The only computer containing the study’s raw data was allegedly stolen and no backups existed on another machine or an online repository.
  • Uppsala University investigated those allegations last year and found no evidence of misconduct.
  • The paper was retracted by the authors.

Duke Scandal

  • Smoking and carcinoma of the lung. British medical journal. 1950 Sep 30;2(4682):739.
  • Mortality in relation to smoking: ten years’ observations of British doctors. British medical journal. 1964 May 30;1(5395):1399.

Personalized Medicine

  • Aim: Establish whether a patient’s genetic make-up can be used to identify therapeutic regimens that would provide better responses.
  • Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, Harpole D. Genomic signatures to guide the use of chemotherapeutics. Nature medicine. 2006 Nov 1;12(11):1294-300.
  • Major breakthrough: oncologists could choose the most appropriate chemotherapy regimen based on the predicted sensitivity of the patient’s cancer.

Keith Baggerly and Kevin Coombes

  • Biostatisticians at MD Anderson Cancer Center;
  • They found data that they could not understand;
  • Mislabelled data;
  • Non-reproducible steps of the analysis;
  • Results that seemed to be the opposite of those reported;
  • Duke researchers did not listen to them and stopped replying to them;
  • Medical journals published some of their communications, but did not give them much attention.

Clinical Trial

  • Started in 2007 based on Potti’s research;
  • Accrued 109 patients;
  • Baggerly KA, Coombes KR. Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics. 2009 Dec 1:1309-34.
  • The trial was suspended by Duke;
  • Duke’s final report concluded that the data had been modified in a non-random way.

Reproducible vs Replicable

Patil P, Peng RD, Leek JT. A visual tool for defining reproducibility and replicability. Nature human behaviour. 2019 Jul;3(7):650-2.

Data Policies

Nature

  • Authors must deposit their data in an approved data repository as part of the manuscript submission process; manuscripts will not otherwise be sent to review.
  • During the peer-review process, Editors, Editorial Board Members and referees are asked to evaluate whether the data repository(s) selected by the authors is appropriate, and may deem it necessary for authors to archive their data in additional repositories prior to publication.
  • More details
  • Data repositories

Science

  • Before publication, large data sets (including microarray data, protein or DNA sequences, atomic coordinates or electron microscopy maps for molecular and macromolecular structures, and climate data) must be deposited in an approved database and an accession number or a specific access address must be included in the published paper.
  • After publication, all data and materials necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of a Science Journal.
  • More details

What is wrong with this dataset?

Basic rules to organize data sets

  1. The preferred formats are Excel and .csv;
  2. Do not include Protected health information (any information in the medical record or designated record set that can be used to identify an individual and that was created, used, or disclosed in the course of providing a health care service such as diagnosis or treatment);
  3. Provide a dictionary of variables that describes each variable in more detail and indicates any coding scheme used for categorical variables (e.g., 0 = ‘Female’; 1 = ‘Male’)
  4. Each row should contain all the information for one sample unit (patient, mouse, . . . ). Avoid having multiple rows for the same patient unless the data is collected repeatedly over time;

Basic rules to organize data sets

  5. Each column (variable) must contain only one piece of information describing the sample unit. If necessary, create new variables;
  6. Use a specific code to indicate missing data (for example, NA);
  7. The first character of a variable name must be a letter. Subsequent characters can be letters, digits, underscores, or dots. Other special characters are not allowed;
  8. Do not use colors or highlighting to distinguish patient characteristics. Instead, create a column (variable) to indicate the characteristic;

Basic rules to organize data sets

  9. All notes should be made in a separate column;
  10. Each outcome should be converted to a single unit of measurement, and the unit should be listed in the variable dictionary rather than in the data;
  11. Do not include blank/hidden rows or columns;
  12. Do not include calculations and graphics;
  13. Mutually exclusive variables that are listed as separate variables should be combined into one variable (column);
  14. If you have already sent the data to the statistician but need to add new information, do not change your previous format. If you have to add new variables, add the new variables in columns after the existing ones.
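
A dataset that follows these rules can be sketched in R as below. All variable names and values are invented for illustration, and the dictionary layout is only one possible choice.

```r
# One row per patient, one piece of information per column,
# NA for missing values, and notes kept in their own column.
# All values below are invented for illustration.
patients <- data.frame(
  patient_id = c(1, 2, 3),
  sex        = c(0, 1, 1),
  weight_kg  = c(62.5, NA, 80.1),
  notes      = c(NA, "weight not recorded", NA)
)

# A variable dictionary stored as a second table.
dictionary <- data.frame(
  variable    = c("patient_id", "sex", "weight_kg", "notes"),
  description = c("Study identifier", "Sex", "Body weight", "Free-text notes"),
  coding      = c(NA, "0 = Female; 1 = Male", "kilograms", NA)
)

# Check that every variable name is a valid R name:
# make.names() leaves valid names unchanged and alters invalid ones.
valid_name <- make.names(names(patients)) == names(patients)

# Check that every variable in the dataset appears in the dictionary.
documented <- names(patients) %in% dictionary$variable
```

Both checks return one TRUE or FALSE per variable, so a name such as "2nd visit" or a column missing from the dictionary is easy to spot.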

What does R expect when importing a dataset?

Dataset formats

  • R expects a rectangular dataset with n rows and p columns.

Wide

  • Each row is a subject;
  • Each column is a different variable with its first row labeled with the variable name.

Long

  • A subject can have more than one row;
  • Each column is still a variable, but one variable indicates the repeated measure.

Why does the data format matter?

  • R functions sometimes require the data in a specific format (wide or long) to plot data and perform hypothesis tests.
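
A minimal sketch of converting a wide dataset into a long one with the base R function reshape(). The variable names (id, sbp_visit1, sbp_visit2) and the values are invented; other tools, such as the tidyr package, can do the same job.

```r
# Wide format: one row per subject, one column per visit.
wide <- data.frame(
  id         = c(1, 2, 3),
  sbp_visit1 = c(120, 135, 128),
  sbp_visit2 = c(118, 130, NA)
)

# Long format: one row per subject and visit.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("sbp_visit1", "sbp_visit2"),  # columns holding repeated measures
  v.names   = "sbp",                          # name of the combined measurement
  timevar   = "visit",                        # variable indicating the repeat
  times     = 1:2,                            # values of the visit variable
  idvar     = "id"                            # variable identifying the subject
)

# Sort so that the rows of each subject appear together.
long <- long[order(long$id, long$visit), ]
long
```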

Long format

Wide format

Importing datasets

  1. Go to Environment tab and click on ‘Import Dataset’;
  2. Choose the dataset file;
  3. Choose the name of the dataset and the missing data code;
  4. The dataset can be opened either by clicking on its name in the Environment tab or by typing its name in the Console.
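
The same import can also be written as code, which is easier to store and repeat. This is a sketch under assumptions: the file example_dataset.csv is created on the spot so that the example is self-contained, and -99 is a hypothetical missing data code.

```r
# Write a small example file so that the sketch is self-contained.
# In practice, csv_path would point to your own dataset.
csv_path <- file.path(tempdir(), "example_dataset.csv")
writeLines(
  c("patient_id,age,sbp",
    "1,54,120",
    "2,NA,135",
    "3,47,-99"),
  csv_path
)

# Import the file; na.strings lists the codes that stand for missing data.
# Here both NA and -99 are treated as missing (-99 is a hypothetical code).
dataset <- read.csv(csv_path, na.strings = c("NA", "-99"))

# Inspect the imported dataset.
str(dataset)    # variable names and types
head(dataset)   # first rows
# View(dataset) # opens the spreadsheet-style viewer in RStudio
```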

Environment tab

Storing your code

  • Any code that we run in the console goes to the History tab.
  • We should always store our code in an R script or a Quarto report.

History tab

R-scripts and Quarto documents

R Scripts

  • An R script is simply a text file containing the same commands that you would enter on the command line of R;
  • It can be organized into sections: Ctrl + Shift + R;
  • It is often used for scientific programming and more complex data analysis.
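
A short R script organized into sections might look like the sketch below. A comment line ending in four dashes is what RStudio treats as a section header; the built-in dataset mtcars stands in for an imported dataset, and the section names are only examples.

```r
# Load data ----
# The built-in dataset mtcars stands in for an imported dataset.
cars <- mtcars

# Clean data ----
# Recode transmission type from 0/1 to labeled categories.
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Summarize ----
# Mean miles per gallon by transmission type.
mean_mpg <- tapply(cars$mpg, cars$am, mean)
mean_mpg
```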

R-Markdown documents

  • It is a self-contained document that weaves together narrative text and code to produce elegantly formatted reports;
  • It is used for data analyses because it is reproducible;
  • R code chunks are created with Ctrl + Alt + I.

Quarto documents

  • It extends R-Markdown documents to multiple programming languages;
  • More details about Quarto here.

R-Markdown / Quarto

  • R-Markdown / Quarto embodies the idea of ‘literate programming’ from Donald Knuth, where you should think of programs as ‘works of literature’.

  • The idea is to blend your literature (text) and your program (code) together, so that the document can be read from top to bottom.

  • As an investigator, you are a storyteller with your data

    • With R-Markdown / Quarto, you can incorporate code into your narrative while maintaining reproducibility
  • When to use R-Markdown / Quarto vs R Scripts

    • Everyone has their own style, and there’s no ‘right way’ or ‘wrong way’

    • In this course we will be promoting the use of R-Markdown / Quarto

    • R Scripts are typically used when we don’t need to incorporate a narrative into our code

      • e.g. we have a script (program) that performs a specific function like cleaning or processing our data
    • R-Markdown / Quarto is typically used when we want to form a narrative about our code and results

      • e.g. We run a statistical test, and now we want to explain what it means and its implications

R-Markdown / Quarto

  • We can create a Quarto Document by clicking on the plus symbol in the top left corner of RStudio

  • We then select Quarto Document

R-Markdown / Quarto

  • A new window will open, where we can define the parameters of the Quarto Document

  • Quarto supports outputs such as HTML, PDF, and Word

  • For the purpose of this class we will focus on HTML outputs

  • Feel free to include a title for your document as well as your name as the author

  • Please keep Engine as Knitr

  • Please leave ‘Use visual markdown editor’ unchecked for now

  • ALL PARAMETERS CAN BE CHANGED LATER!

R-Markdown / Quarto

  • Great! You now have your very first Quarto Document

  • This is currently the ‘Source’ view of the quarto document

R-Markdown / Quarto

  • RStudio also has a ‘Visual’ editor, more akin to Microsoft Word, that we can switch to

  • Everything done in the visual editor gets translated to source (Code), and you can switch between them at any time.

R-Markdown / Quarto

  • This section is considered the ‘YAML’

  • Within this section, you can customize certain parameters of the document such as the title, author, and format

R-Markdown / Quarto

  • Everything inside this section is interpreted as plain text or Markdown, which is a lightweight formatting language

R-Markdown / Quarto

  • Everything inside the shaded section is called a ‘code chunk’

  • Everything inside this section is interpreted as R code

R-Markdown / Quarto

  • This is the Render button. It renders the current document into an HTML output
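
To see the YAML header, the text, and a code chunk side by side, the sketch below writes a minimal Quarto document from R. The title, author, and file name are placeholders. Rendering from code is left commented out because it assumes that the quarto R package and the Quarto command-line tool are installed.

```r
# The three backticks that open and close a code chunk in a Quarto document.
fence <- strrep("`", 3)

# The lines of a minimal Quarto document.
# The title, author, and file name are placeholders.
qmd_lines <- c(
  "---",                          # YAML header starts
  "title: \"My first report\"",
  "author: \"Your name\"",
  "format: html",
  "---",                          # YAML header ends
  "",
  "## Summary",                   # Markdown text
  "",
  "The code chunk below summarizes a built-in dataset.",
  "",
  paste0(fence, "{r}"),           # code chunk starts
  "summary(mtcars$mpg)",
  fence                           # code chunk ends
)

# Save the document to a temporary folder.
qmd_path <- file.path(tempdir(), "first_report.qmd")
writeLines(qmd_lines, qmd_path)

# Rendering from code does roughly what the Render button does.
# quarto::quarto_render(qmd_path)
```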

R-Markdown / Quarto

We won’t have the opportunity to go beyond the basics of report generation; however, you can learn more about the capabilities of R-Markdown from this book.
