
Bulk RNA-Seq

This tutorial is a brief guide to using BiT Age, a highly accurate bulk transcriptomic clock for C. elegans, through pyaging. The clock is described in Meyer and Schumacher (2021): https://doi.org/10.1111/acel.13320.

This tutorial needs only two packages: pandas and pyaging.

[1]:
import pandas as pd
import pyaging as pya

Download and load example data

Let’s download the C. elegans RNA-seq dataset from the BiT Age paper.

[2]:
pya.data.download_example_data('GSE65765')
|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE65765_CPM.pkl
|-----> 🎉 Done! [0.5749s]
[3]:
df = pd.read_pickle('pyaging_data/GSE65765_CPM.pkl')
[4]:
df.head()
[4]:
WBGene00197333 WBGene00198386 WBGene00015153 WBGene00002061 WBGene00255704 WBGene00235314 WBGene00001177 WBGene00169236 WBGene00219784 WBGene00015152 ... WBGene00010964 WBGene00014467 WBGene00014468 WBGene00014469 WBGene00014470 WBGene00010965 WBGene00014471 WBGene00010966 WBGene00010967 WBGene00014473
SRR1793993 0.0 0.0 3.780174 169.240815 1.907427 0.277444 59.320986 0.0 0.000000 1.283178 ... 858.949156 0.0 0.000000 0.0 0.052021 234.526846 0.017340 54.483057 78.117815 0.000000
SRR1793991 0.0 0.0 0.510354 412.628597 0.061861 0.061861 22.239044 0.0 0.015465 0.201048 ... 1049.982885 0.0 0.015465 0.0 0.015465 372.511713 0.000000 54.545971 59.618577 0.000000
SRR1793994 0.0 0.0 4.718708 274.733671 1.234644 0.118391 42.400721 0.0 0.000000 0.642691 ... 664.255412 0.0 0.101478 0.0 0.000000 253.220421 0.033826 19.483698 86.492735 0.016913
SRR1793992 0.0 0.0 2.389905 351.612558 0.505892 0.069778 20.497358 0.0 0.017445 1.308342 ... 1298.799849 0.0 0.034889 0.0 0.000000 472.206803 0.000000 89.508039 76.459508 0.000000

4 rows × 46755 columns
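
Before converting the table, it can help to confirm the layout shown above: samples as rows and genes as columns. The sketch below runs such a check on a tiny stand-in DataFrame built from a few values in the output above, so it works without downloading anything. The helper name `check_expression_frame` is hypothetical and is not part of pyaging.

```python
import pandas as pd

# A tiny stand-in for the real table: two samples, three genes, with values
# copied from the df.head() output above.
toy = pd.DataFrame(
    {
        "WBGene00015153": [3.780174, 0.510354],
        "WBGene00002061": [169.240815, 412.628597],
        "WBGene00001177": [59.320986, 22.239044],
    },
    index=["SRR1793993", "SRR1793991"],
)


def check_expression_frame(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: basic sanity checks on a samples x genes table."""
    return {
        "n_samples": frame.shape[0],
        "n_genes": frame.shape[1],
        # WormBase gene IDs start with "WBGene"; they should be the columns.
        "genes_in_columns": all(str(c).startswith("WBGene") for c in frame.columns),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in frame.dtypes),
        "n_missing": int(frame.isna().sum().sum()),
        "has_negative": bool((frame < 0).any().any()),
    }


report = check_expression_frame(toy)
print(report)
```

The same function could be called on the full `df`; if the gene IDs turned out to be in the index instead, transposing with `df.T` would restore the expected orientation.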

Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:
adata = pya.preprocess.df_to_adata(df)
|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0190s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0005s]
|-----> ⚙️ Log data statistics started
|-----------> There are 4 observations
|-----------> There are 46755 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0011s]
|-----> ⚙️ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> ✅ Impute missing values finished [0.0013s]
|-----> 🎉 Done! [0.0239s]

Note that the original DataFrame is stored under layers as X_original. This is what the adata object looks like:

[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 4 × 46755
    var: 'percent_na'
    layers: 'X_original'

Predict age

We can predict with a single clock or with several clocks in one call. Since this tutorial involves only one clock, let’s pass just one. The function is insensitive to the capitalization of the clock name.

[7]:
pya.pred.predict_age(adata, 'BiTAge')
|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0006s]
|-----> 🕒 Processing clock: bitage
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/bitage.pt
|-----------> ✅ Load clock finished [0.5446s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_bitage]
|-----------> ✅ Check features in adata finished [0.0424s]
|-----------> ⚙️ Predict ages with model started
|-----------------> The preprocessing method is binarize
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0044s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> 🎉 Done! [0.6613s]
[8]:
adata.obs.head()
[8]:
bitage
SRR1793993 182.353658
SRR1793991 27.337245
SRR1793994 241.629584
SRR1793992 32.178003
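
The log above reports that the preprocessing method for this clock is binarize. As a rough illustration of the idea, the sketch below sets a gene to 1 when its CPM exceeds the median of the non-zero CPM values in the same sample, and to 0 otherwise. That rule is an assumption based on our reading of the BiT Age paper; it is not copied from pyaging, and the function name `binarize_by_sample_median` and the toy data are hypothetical.

```python
import pandas as pd


def binarize_by_sample_median(cpm: pd.DataFrame) -> pd.DataFrame:
    """Assumed rule: 1 if a gene's CPM exceeds the sample's median non-zero CPM."""
    # Mask zeros as NaN so they do not enter the per-sample (row-wise) median.
    nonzero = cpm.where(cpm > 0)
    medians = nonzero.median(axis=1, skipna=True)
    # Compare every row against its own median; zeros always map to 0.
    return cpm.gt(medians, axis=0).astype(int)


# Toy samples x genes table (made-up values, for illustration only).
toy_cpm = pd.DataFrame(
    {
        "g1": [0.0, 6.0],
        "g2": [1.0, 0.0],
        "g3": [2.0, 2.0],
        "g4": [10.0, 4.0],
    },
    index=["s1", "s2"],
)

binary = binarize_by_sample_median(toy_cpm)
print(binary)
```

Because each sample is compared only against its own median, a transformation like this depends on the ranking of genes within a sample and not on the absolute scale of the counts.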

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:
# Re-run the whole pipeline with logging suppressed
pya.data.download_example_data('GSE65765', verbose=False)
df = pd.read_pickle('pyaging_data/GSE65765_CPM.pkl')
adata = pya.preprocess.df_to_adata(df, verbose=False)
# Clock names can also be passed as a list
pya.pred.predict_age(adata, ['BiTAge'], verbose=False)
[10]:
adata.obs.head()
[10]:
bitage
SRR1793993 182.353658
SRR1793991 27.337245
SRR1793994 241.629584
SRR1793992 32.178003
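
Since adata.obs is an ordinary pandas DataFrame, the usual pandas tools apply to the predictions. The sketch below rebuilds the table above by hand so that it runs without pyaging, then ranks the samples by predicted age. The column name and values are copied from the output above; the variable `obs` is a stand-in for `adata.obs`.

```python
import pandas as pd

# Stand-in for adata.obs, with values copied from the output above.
obs = pd.DataFrame(
    {"bitage": [182.353658, 27.337245, 241.629584, 32.178003]},
    index=["SRR1793993", "SRR1793991", "SRR1793994", "SRR1793992"],
)

# Sort samples from the lowest to the highest predicted age.
ranked = obs.sort_values("bitage")
print(ranked)

youngest = ranked.index[0]
oldest = ranked.index[-1]
# Difference between the highest and lowest prediction.
spread = ranked["bitage"].iloc[-1] - ranked["bitage"].iloc[0]
print(f"youngest: {youngest}, oldest: {oldest}, spread: {spread:.2f}")
```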

After age prediction, the predicted ages are added to adata.obs, one column per clock. The percentage of missing values for each clock and other metadata are stored in adata.uns.

[11]:
adata
[11]:
AnnData object with n_obs × n_vars = 4 × 46755
    obs: 'bitage'
    var: 'percent_na'
    uns: 'bitage_percent_na', 'bitage_missing_features', 'bitage_metadata'
    layers: 'X_original'

Get citation

The DOI, citation, and other metadata are automatically added to the AnnData object under adata.uns['CLOCKNAME_metadata'], where CLOCKNAME is the lowercase clock name (here, bitage).

[12]:
adata.uns['bitage_metadata']
[12]:
{'clock_name': 'bitage',
 'data_type': 'transcriptomics',
 'species': 'C elegans',
 'year': 2021,
 'approved_by_author': '✅',
 'citation': 'Meyer, David H., and Björn Schumacher. "BiT age: A transcriptome‐based aging clock near the theoretical limit of accuracy." Aging cell 20.3 (2021): e13320.',
 'doi': 'https://doi.org/10.1111/acel.13320',
 'notes': None,
 'version': None}
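
The metadata is a plain dictionary, so a reference line can be assembled from it directly. The sketch below copies the dictionary shown above so that it runs on its own; with a real AnnData object, `metadata` would be `adata.uns['bitage_metadata']`. The helper name `format_reference` is hypothetical and is not part of pyaging.

```python
# Copied from the output above; in practice this would come from
# adata.uns['bitage_metadata'].
metadata = {
    "clock_name": "bitage",
    "data_type": "transcriptomics",
    "species": "C elegans",
    "year": 2021,
    "approved_by_author": "✅",
    "citation": 'Meyer, David H., and Björn Schumacher. "BiT age: A transcriptome‐based aging clock near the theoretical limit of accuracy." Aging cell 20.3 (2021): e13320.',
    "doi": "https://doi.org/10.1111/acel.13320",
    "notes": None,
    "version": None,
}


def format_reference(meta: dict) -> str:
    """Hypothetical helper: join the citation and DOI into one line."""
    parts = [meta["citation"].strip()]
    # Append the DOI only when one is recorded for the clock.
    if meta.get("doi"):
        parts.append(meta["doi"])
    return " ".join(parts)


reference = format_reference(metadata)
print(reference)
```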