56  Data Pipeline Overview

56.1 Get access to Data

Health-related data comes from many sources, including:

  • Electronic Health Records (e.g. Epic)
  • Lab/Clinical research data
  • Public datasets, e.g. NIH, UK Biobank, etc.

56.2 Handle and inspect data in the command line

Command-line tools are particularly useful for inspecting datasets of unknown structure (e.g. to find out which delimiter is used) and very large datasets (will the data fit into memory?).
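
The same kind of first look can also be sketched from within R. The example below is a minimal illustration, not a replacement for shell tools: it writes a small hypothetical file, peeks at its first lines, counts candidate delimiters in the header, and checks the file size. The file contents and the list of candidate delimiters are assumptions made for the example.

```r
# Write a small example file so the sketch is self-contained (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("id;age;sex", "1;34;female", "2;51;male"), path)

# Peek at the first lines without loading the whole file
first_lines <- readLines(path, n = 2)
print(first_lines)

# Count how often each candidate delimiter occurs in the header line
candidates <- c(comma = ",", semicolon = ";", tab = "\t", pipe = "|")
delim_counts <- sapply(candidates, function(d) {
  lengths(regmatches(first_lines[1], gregexpr(d, first_lines[1], fixed = TRUE)))
})
print(delim_counts)
likely_delim <- candidates[[which.max(delim_counts)]]

# File size in bytes: a rough guide to whether the data will fit into memory
size_bytes <- file.size(path)
print(size_bytes)
```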

56.3 Read Data into R
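
A minimal sketch of this step using base R's `read.csv()`. The file is a hypothetical example written to a temporary location; for large files, third-party readers such as `data.table::fread()` are a common alternative.

```r
# Create a small CSV so the example is self-contained (hypothetical data)
path <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex", "1,34,female", "2,51,male", "3,NA,female"), path)

# Read the file, treating "NA" and empty strings as missing values
dat <- read.csv(path, na.strings = c("NA", ""))

# Inspect the structure: column names, types, and first values
str(dat)
```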

56.4 Clean data names & values
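
As an illustration, the sketch below cleans column names and values with base R only. The data are hypothetical, and the use of -999 as a missing-value code is an assumption made for the example.

```r
# Hypothetical raw data with untidy names and values
dat <- data.frame(
  "Patient ID" = 1:3,
  "Age (years)" = c(34, -999, 51),
  "Sex" = c(" F", "M ", "f"),
  check.names = FALSE
)

# Clean names: lowercase, replace runs of non-alphanumeric characters
# with "_", then drop leading and trailing underscores
clean <- tolower(names(dat))
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_+|_+$", "", clean)
names(dat) <- clean

# Clean values: trim whitespace and harmonize case
dat$sex <- toupper(trimws(dat$sex))

# Recode the (assumed) missing-value code -999 to NA
dat$age_years[dat$age_years == -999] <- NA

print(dat)
```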

56.5 Define Data Types
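
A minimal sketch of setting column types explicitly, assuming a hypothetical data frame in which every column was read in as character.

```r
# Hypothetical data where all columns arrived as character
dat <- data.frame(
  id = c("1", "2", "3"),
  sex = c("F", "M", "F"),
  admission_date = c("2021-03-01", "2021-04-15", "2021-05-30")
)

# Convert each column to the type that matches its meaning
dat$id <- as.integer(dat$id)
dat$sex <- factor(dat$sex, levels = c("F", "M"))
dat$admission_date <- as.Date(dat$admission_date, format = "%Y-%m-%d")

# Check the resulting classes
sapply(dat, class)
```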

56.6 Reshape

Convert long to wide or vice versa, as needed.
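
One way to sketch both directions is base R's `reshape()`; `tidyr::pivot_wider()`/`pivot_longer()` and `data.table::dcast()`/`melt()` are common third-party alternatives. The data and column names below are hypothetical.

```r
# Hypothetical long data: one row per patient per visit
long <- data.frame(
  id = c(1, 1, 2, 2),
  visit = c("v1", "v2", "v1", "v2"),
  sbp = c(120, 125, 135, 130)
)

# Long to wide: one row per patient, one column per visit
wide <- reshape(long, idvar = "id", timevar = "visit", direction = "wide")
print(wide)

# Wide to long: back to one row per patient per visit
long2 <- reshape(
  wide,
  direction = "long",
  idvar = "id",
  varying = c("sbp.v1", "sbp.v2"),
  v.names = "sbp",
  timevar = "visit",
  times = c("v1", "v2")
)
print(long2)
```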

56.7 Join data sets

If you have data in multiple files that need to be merged, you can join them on one or more shared key columns.
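
A minimal sketch using base R's `merge()`, assuming two hypothetical tables that share an `id` column.

```r
# Two hypothetical tables sharing the key column "id"
demographics <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
labs <- data.frame(id = c(2, 3, 4), hba1c = c(5.6, 7.1, 6.3))

# Inner join: keep only ids present in both tables
inner <- merge(demographics, labs, by = "id")

# Left join: keep all rows of the first table
left <- merge(demographics, labs, by = "id", all.x = TRUE)

# Full outer join: keep all rows of both tables
full <- merge(demographics, labs, by = "id", all = TRUE)

print(full)
```

Comparing the number of rows before and after a join is a quick check for unexpected duplicates or unmatched keys.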

56.8 Transform data

Data transformations depend on the analysis or analyses you wish to perform. Different statistical tests or machine learning models (supervised or unsupervised) often require different transformations.
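
Three common transformations are sketched below with base R: a log transform, standardization, and binning a continuous variable. The values and the age cut points are hypothetical and chosen only for illustration.

```r
# Hypothetical right-skewed measurements
x <- c(1, 10, 100, 1000)

# Log transform to reduce skew
x_log <- log10(x)

# Standardize to mean 0 and standard deviation 1
x_z <- as.numeric(scale(x_log))

# Bin a continuous variable into ordered categories
age <- c(25, 47, 68)
age_group <- cut(
  age,
  breaks = c(0, 40, 65, Inf),
  labels = c("<=40", "41-65", ">65")
)
print(age_group)
```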

56.9 Visualize

Visualization is essential before, during, and after data preparation, hypothesis testing, and supervised and unsupervised learning.
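
A minimal sketch with base graphics, using simulated (hypothetical) data: a histogram for one variable, a boxplot by group, and a scatter plot of two variables. The plots are written to a temporary PDF so the example runs non-interactively; in an interactive session the `pdf()` and `dev.off()` calls can be dropped.

```r
# Simulated, hypothetical data
set.seed(2024)
age <- rnorm(200, mean = 55, sd = 12)
sbp <- 100 + 0.5 * age + rnorm(200, sd = 10)
group <- factor(sample(c("control", "treated"), 200, replace = TRUE))

# Open a PDF device and draw three panels side by side
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 9, height = 3)
par(mfrow = c(1, 3))
hist(age, main = "Distribution of age", xlab = "Age (years)")
boxplot(sbp ~ group, main = "SBP by group", xlab = "Group", ylab = "SBP (mmHg)")
plot(age, sbp, main = "SBP vs. age", xlab = "Age (years)", ylab = "SBP (mmHg)")
invisible(dev.off())
```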

56.10 Summarize & Aggregate
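
A minimal sketch of overall and per-group summaries in base R, using a small hypothetical dataset.

```r
# Hypothetical data: systolic blood pressure in two groups
dat <- data.frame(
  group = c("control", "control", "treated", "treated"),
  sbp = c(120, 130, 110, 114)
)

# Overall summary of a numeric variable
summary(dat$sbp)

# Counts per category
table(dat$group)

# Aggregate: mean SBP within each group
group_means <- aggregate(sbp ~ group, data = dat, FUN = mean)
print(group_means)
```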

56.11 Statistical Hypothesis Testing
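
As one example of this step, the sketch below compares two groups with `t.test()`, which performs Welch's two-sample t-test by default. The data are simulated, and the right test in practice depends on the data and the question.

```r
# Simulated, hypothetical measurements in two independent groups
set.seed(2024)
control <- rnorm(30, mean = 130, sd = 10)
treated <- rnorm(30, mean = 120, sd = 10)

# Welch two-sample t-test (does not assume equal variances)
tt <- t.test(treated, control)
print(tt)

# Individual results can be extracted from the returned object
tt$p.value
tt$conf.int
```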

56.12 Predictive Modeling

Perform classification, regression, or survival analysis.
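
A minimal classification sketch, assuming simulated data and using logistic regression via `glm()` as a simple stand-in for whichever model suits the problem. It fits on a training subset and evaluates on held-out cases.

```r
# Simulated, hypothetical data: binary outcome whose probability rises with age
set.seed(2024)
n <- 200
age <- rnorm(n, mean = 55, sd = 12)
outcome <- rbinom(n, size = 1, prob = plogis(-5.5 + 0.1 * age))
dat <- data.frame(age, outcome)

# Split into training and test sets
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit a logistic regression on the training set
fit <- glm(outcome ~ age, data = train, family = binomial)

# Predict probabilities on the test set and threshold at 0.5
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Held-out accuracy
accuracy <- mean(pred_class == test$outcome)
print(accuracy)
```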

56.13 Decomposition

Perform dimensionality reduction or matrix factorization.
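
As one example, principal component analysis (PCA) can be sketched with base R's `prcomp()`. The built-in `iris` dataset is used here only as a convenient stand-in for your own numeric features.

```r
# Numeric features only (built-in example data)
x <- as.matrix(iris[, 1:4])

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)$importance

# Projections of each case onto the first two components
scores <- pca$x[, 1:2]
head(scores)
```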

56.14 Clustering

Group cases based on similarity across multiple features.
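
A minimal sketch using k-means from base R, again with the built-in `iris` data as a stand-in. The choice of three clusters is an assumption made for the example; in practice the number of clusters has to be chosen and justified.

```r
# Scale features so that each contributes equally to distances
x <- scale(iris[, 1:4])

# K-means with 3 clusters (assumed); multiple random starts for stability
set.seed(2024)
km <- kmeans(x, centers = 3, nstart = 20)

# Cluster sizes
table(km$cluster)

# Compare clusters against a known label, where one is available
table(cluster = km$cluster, species = iris$Species)
```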

56.15 Saving data to disk

Save your cleaned dataset to disk.
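
Two common options are sketched below with a hypothetical dataset: RDS, which preserves R column types, and CSV, which is portable across tools but stores everything as plain text. Temporary file paths are used so the example is self-contained.

```r
# Hypothetical cleaned dataset
dat <- data.frame(id = 1:3, sex = factor(c("F", "M", "F")))

# RDS: R-specific format that preserves types such as factors
rds_path <- tempfile(fileext = ".rds")
saveRDS(dat, rds_path)
dat_rds <- readRDS(rds_path)

# CSV: portable plain text; types must be re-declared after reading
csv_path <- tempfile(fileext = ".csv")
write.csv(dat, csv_path, row.names = FALSE)
dat_csv <- read.csv(csv_path)

# The RDS round trip returns an identical object
identical(dat, dat_rds)
```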

56.16 Program your own functions!

For all the above operations, you will often be better off writing your own customized functions using the above base and third-party packages for your specific data needs and analysis goals.
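
For instance, a small custom helper might report the fraction of missing values per column. The function name `na_fraction` and the data are hypothetical.

```r
# Hypothetical helper: fraction of missing values in each column
na_fraction <- function(x) {
  stopifnot(is.data.frame(x))
  vapply(x, function(col) mean(is.na(col)), numeric(1))
}

# Example use on a small hypothetical dataset
dat <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA)
)
res <- na_fraction(dat)
print(res)
```

Wrapping such checks in a function makes it easy to apply them consistently to every dataset in a project.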

56.17 Always document your code!

Always remember to add in-line comments (#) to your functions, scripts, and Quarto documents for your future self, your collaborators, and the world.
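
A short sketch of what documented code can look like. The header uses roxygen2-style `#'` comments, which is one common convention and an assumption here rather than a requirement; the function `bmi` is a hypothetical example.

```r
#' Calculate body mass index (BMI)
#'
#' @param weight_kg Numeric vector: weight in kilograms.
#' @param height_m Numeric vector: height in meters.
#' @return Numeric vector: BMI in kg/m^2.
bmi <- function(weight_kg, height_m) {
  # Fail early, with an error, on non-numeric or non-positive inputs
  stopifnot(
    is.numeric(weight_kg),
    is.numeric(height_m),
    all(height_m > 0, na.rm = TRUE)
  )
  # BMI is weight divided by height squared
  weight_kg / height_m^2
}

bmi(70, 1.75)
```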

56.18 Share your code on GitHub

Consider sharing your code on GitHub to allow review by others. This can be done at any time during your work; in particular, consider publishing your code alongside published manuscripts.