Open In Colab Open In nbviewer

RRBS DNA methylation#

This tutorial focuses on predicting age from Mus musculus reduced-representation bisulfite sequencing (RRBS) data. There are a few clocks available that were trained on RRBS data. Moreover, it is possible to use Horvath’s mammalian clocks by converting the genomic location to the probes in the Horvath methylation array.

[1]:
import pandas as pd
import pyaging as pya
import os
import numpy as np

Download and load example data#

Let’s download the publicly available dataset GSE130735 with RRBS samples from mouse. Given it is RRBS, there are millions of CpG sites.

[2]:
pya.data.download_example_data('GSE130735')
|-----> πŸ—οΈ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE130735_subset.pkl
|-----> πŸŽ‰ Done! [0.5425s]
[3]:
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')

It is important to note that the features for RRBS clocks are the genomic coordinates in the format below.

[4]:
df.head()
[4]:
chr1:3020814 chr1:3020842 chr1:3020877 chr1:3020891 chr1:3020945 chr1:3020971 chr1:3020987 chr1:3021012 chr1:3037802 chr1:3037820 ... chrY:1825397 chrY:4682362 chrY:32122892 chrY:85867071 chrY:85867083 chrY:85867117 chrY:85867137 chrY:85867139 chrY:85867178 chrY:88224179
GSM3752631 0.609 0.25 0.408 0.189 0.068 0.373 0.571 0.252 0.333 0.158 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
GSM3752625 NaN NaN 0.973 0.984 0.912 0.915 0.987 0.974 0.991 0.932 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
GSM3752634 NaN NaN 0.526 0.131 0.000 0.038 0.469 0.769 0.772 0.146 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
GSM3752620 0.931 0.92 0.988 0.949 0.897 0.921 0.907 0.958 1.000 0.867 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
GSM3752622 NaN NaN 0.205 0.382 0.091 0.132 0.174 0.227 0.108 0.053 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Γ— 1778324 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:
adata = pya.pp.df_to_adata(df, imputer_strategy='mean') # knn might be a bit slow
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.9882s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 1778324 features
|-----------> Total missing values: 6322346
|-----------> Percentage of missing values: 25.39%
|-----> βœ… Log data statistics finished [0.0205s]
|-----> βš™οΈ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> βœ… Impute missing values finished [0.4631s]
|-----> βš™οΈ Add imputer strategy to adata.uns started
|-----> βœ… Add imputer strategy to adata.uns finished [0.0087s]
|-----> πŸŽ‰ Done! [1.4897s]

This is what the adata object looks like:

Predict age with RRBS clocks#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input all four available mammalian clocks at once. The function is invariant to the capitalization of the clock name.

[6]:
pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0033s]
|-----> πŸ•’ Processing clock: thompson
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/thompson.pt
|-----------> βœ… Load clock finished [0.5324s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1 out of 582 features (0.17%) are missing: ['chr4:91376687'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_thompson]
|-----------> ⚠️ Check features in adata finished [0.0654s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0013s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0008s]
|-----> πŸ•’ Processing clock: meer
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/meer.pt
|-----------> βœ… Load clock finished [0.4402s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 225 out of 435 features (51.72%) are missing: ['chr10:111559529', 'chr10:115250413', 'chr10:127620127'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_meer]
|-----------> ⚠️ Check features in adata finished [0.0412s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0010s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸ•’ Processing clock: petkovich
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/petkovich.pt
|-----------> βœ… Load clock finished [0.5167s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 58 out of 90 features (64.44%) are missing: ['chr19:23893237', 'chr18:45589182', 'chr16:10502162'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_petkovich]
|-----------> ⚠️ Check features in adata finished [0.0161s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is petkovich
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0033s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0020s]
|-----> πŸ•’ Processing clock: stubbs
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/stubbs.pt
|-----------> βœ… Load clock finished [0.4679s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 8889 out of 17992 features (49.41%) are missing: ['chr1:10038066', 'chr1:106173313', 'chr1:106759301'], etc.
|-----------------> Using reference feature values for stubbs
|-----------------> Added prepared input matrix to adata.obsm[X_stubbs]
|-----------> ⚠️ Check features in adata finished [0.8672s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> The preprocessing method is quantile_normalization_and_scale_with_gold_standard
|-----------------> The postprocessing method is stubbs
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0263s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0014s]
|-----> πŸŽ‰ Done! [3.2757s]

All of the age predictions are in unit of months.

[7]:
adata.obs.head()
[7]:
thompson meer petkovich stubbs
GSM3752631 19.634113 7.315183 8.075177 0.957770
GSM3752625 -1.410461 0.028221 2.953822 -0.074265
GSM3752634 61.058783 21.322178 9.640489 1.389193
GSM3752620 -2.663815 1.611947 3.019351 -0.092710
GSM3752622 20.594114 7.592145 7.104766 0.667168
[8]:
adata.obs.head()
[8]:
thompson meer petkovich stubbs
GSM3752631 19.634113 7.315183 8.075177 0.957770
GSM3752625 -1.410461 0.028221 2.953822 -0.074265
GSM3752634 61.058783 21.322178 9.640489 1.389193
GSM3752620 -2.663815 1.611947 3.019351 -0.092710
GSM3752622 20.594114 7.592145 7.104766 0.667168

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:
pya.data.download_example_data('GSE130735', verbose=False)
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='mean', verbose=False)
pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'], verbose=False)
[10]:
adata.obs.head()
[10]:
thompson meer petkovich stubbs
GSM3752631 19.634113 7.315183 8.075177 0.957770
GSM3752625 -1.410461 0.028221 2.953822 -0.074265
GSM3752634 61.058783 21.322178 9.640489 1.389193
GSM3752620 -2.663815 1.611947 3.019351 -0.092710
GSM3752622 20.594114 7.592145 7.104766 0.667168

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[11]:
adata
[11]:
AnnData object with n_obs Γ— n_vars = 14 Γ— 1778324
    obs: 'thompson', 'meer', 'petkovich', 'stubbs'
    var: 'percent_na'
    uns: 'imputer_strategy', 'thompson_percent_na', 'thompson_missing_features', 'thompson_metadata', 'meer_percent_na', 'meer_missing_features', 'meer_metadata', 'petkovich_percent_na', 'petkovich_missing_features', 'petkovich_metadata', 'stubbs_percent_na', 'stubbs_missing_features', 'stubbs_metadata'
    layers: 'X_original', 'X_imputed'

Predict age with mammalian clocks#

We can predict age by converting the genomic locations directly into the probes from Horvath’s methylation array.

[12]:
os.system('git clone https://github.com/shorvath/MammalianMethylationConsortium.git')

# Let's read the manifest from the mammalian consortium
annotation_df = pd.read_csv('MammalianMethylationConsortium/Annotations, Amin Haghani/Mammals/Mus_musculus.grcm38.100.HorvathMammalMethylChip40.v1.csv', index_col=0)
annotation_df = annotation_df[~annotation_df.seqnames.isna()]
mm_genomic_locations = 'chr' + annotation_df['seqnames'].astype(str) + ':' + annotation_df['CGstart'].astype(int).astype(str)
mm_genomic_locations = mm_genomic_locations.tolist()
mammalian_probes = annotation_df['CGid'].tolist()
mm_loc_to_probe = dict(zip(mm_genomic_locations, mammalian_probes))

# Let's get the previous RRBS dataset and filter only for the genomic locations in the manifest file
df_columns_set = set(df.columns)
mm_loc_to_probe_set = set(mm_loc_to_probe.keys())
common_columns = df_columns_set.intersection(mm_loc_to_probe_set)
df_converted = df[list(common_columns)].copy()

# Then, convert the genomic location to the probe name
df_converted.columns = [mm_loc_to_probe[col] for col in df_converted.columns]

# Let's clean the GitHub
os.system('rm -r MammalianMethylationConsortium')
[12]:
0
[13]:
df_converted.head()
[13]:
cg05347424 cg26718996 cg07727941 cg16852837 cg12870762 cg26080798 cg02899039 cg12839061 cg05267150 cg13170453 ... cg02179016 cg20836420 cg18831685 cg08992395 cg13679010 cg12982463 cg17146242 cg13649253 cg07588415 cg14814195
GSM3752631 NaN NaN 0.000 0.0 0.015 0.000 0.005 NaN 0.023 0.000 ... NaN 0.000 0.028 NaN 0.000 0.000 0.018 0.000 0.021 NaN
GSM3752625 0.938 NaN 0.000 NaN 0.000 0.000 0.000 NaN 0.596 NaN ... NaN 0.000 NaN NaN 0.000 0.895 0.227 0.156 0.025 NaN
GSM3752634 0.125 NaN 0.000 NaN 0.627 0.017 0.033 NaN 0.745 NaN ... NaN 0.495 NaN NaN 0.014 0.278 0.519 0.786 0.012 NaN
GSM3752620 0.769 NaN 0.091 0.0 0.070 0.006 0.012 NaN 0.607 0.092 ... NaN 0.010 0.054 NaN 0.000 0.933 0.277 0.148 0.000 NaN
GSM3752622 NaN NaN 0.000 NaN 0.000 0.000 0.000 NaN 0.052 NaN ... NaN 0.000 NaN NaN 0.000 0.000 0.064 0.022 0.000 NaN

5 rows Γ— 5149 columns

Now we can finally put the dataframe into pyaging after defining the species as Mus musculus.

[14]:
df_converted['Mus musculus'] = 1
adata_mammalian = pya.pp.df_to_adata(df_converted, imputer_strategy='mean')
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.0057s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 5150 features
|-----------> Total missing values: 17862
|-----------> Percentage of missing values: 24.77%
|-----> βœ… Log data statistics finished [0.0013s]
|-----> βš™οΈ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> βœ… Impute missing values finished [0.0060s]
|-----> βš™οΈ Add imputer strategy to adata.uns started
|-----> βœ… Add imputer strategy to adata.uns finished [0.0004s]
|-----> πŸŽ‰ Done! [0.0174s]

Let’s use these five mammalian predictors.

[15]:
pya.pred.predict_age(adata_mammalian, ['Mammalian1', 'Mammalian2', 'Mammalian3', "MammalianLifespan", "MammalianFemale"])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0013s]
|-----> πŸ•’ Processing clock: mammalian1
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian1.pt
|-----------> βœ… Load clock finished [0.4780s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 274 out of 335 features (81.79%) are missing: ['cg00249943', 'cg00250826', 'cg00292639'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian1]
|-----------> ⚠️ Check features in adata finished [0.0173s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is anti_logp2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0083s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0017s]
|-----> πŸ•’ Processing clock: mammalian2
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian2.pt
|-----------> βœ… Load clock finished [0.4544s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 2406 out of 2572 features (93.55%) are missing: ['cg00020468', 'cg00096922', 'cg00098422'], etc.
|-----------------> Using reference feature values for mammalian2
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian2]
|-----------> ⚠️ Check features in adata finished [0.0407s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0196s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸ•’ Processing clock: mammalian3
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian3.pt
|-----------> βœ… Load clock finished [0.5081s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 2299 out of 2467 features (93.19%) are missing: ['cg00101675', 'cg06259996', 'cg15168457'], etc.
|-----------------> Using reference feature values for mammalian3
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian3]
|-----------> ⚠️ Check features in adata finished [0.0222s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian3
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0095s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸ•’ Processing clock: mammalianlifespan
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianlifespan.pt
|-----------> βœ… Load clock finished [0.4420s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 133 out of 152 features (87.50%) are missing: ['cg00039845', 'cg00300233', 'cg00810217'], etc.
|-----------------> Using reference feature values for mammalianlifespan
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianlifespan]
|-----------> ⚠️ Check features in adata finished [0.0043s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0018s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0009s]
|-----> πŸ•’ Processing clock: mammalianfemale
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianfemale.pt
|-----------> βœ… Load clock finished [0.4532s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 73 out of 101 features (72.28%) are missing: ['cg01145947', 'cg02053792', 'cg02407848'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianfemale]
|-----------> ⚠️ Check features in adata finished [0.0135s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is sigmoid
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0051s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0024s]
|-----> πŸŽ‰ Done! [3.0755s]

Note that RRBS clocks are in units of months whereas the mammalian clocks are in units of years.

[16]:
adata_mammalian.obs
[16]:
mammalian1 mammalian2 mammalian3 mammalianlifespan mammalianfemale
GSM3752631 2.537895 -0.009800 -0.048414 1.202134 0.732238
GSM3752625 3.353935 0.064448 -0.048203 1.480080 0.952105
GSM3752634 4.490610 0.813899 -0.035059 1.400278 0.978554
GSM3752620 3.603802 0.122934 -0.046958 1.644701 0.948646
GSM3752622 2.951263 0.005617 -0.047725 1.385282 0.741336
GSM3752637 5.718515 0.895781 -0.037649 1.399769 0.975114
GSM4558216 7.456245 0.684752 -0.012195 1.448119 0.785830
GSM3752643 5.881943 0.880053 -0.037656 1.408483 0.969979
GSM4558213 6.720080 0.855574 -0.026211 1.477236 0.821308
GSM3752640 6.452934 0.766701 -0.031763 1.371831 0.938162
GSM4558222 5.049247 0.120607 -0.045109 1.366729 0.789795
GSM4558219 6.098710 0.857243 -0.029878 1.422144 0.796568
GSM3752628 2.746949 0.098284 -0.047989 1.480599 0.794259
GSM3752617 2.739868 0.078172 -0.048249 1.484091 0.807197

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[17]:
adata.uns['thompson_metadata']
[17]:
{'clock_name': 'thompson',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': 'βœ…',
 'citation': 'Thompson, Michael J., et al. "A multi-tissue full lifespan epigenetic clock for mice." Aging (Albany NY) 10.10 (2018): 2832.',
 'doi': 'https://doi.org/10.18632/aging.101590',
 'notes': None,
 'version': None}
[18]:
adata.uns['meer_metadata']
[18]:
{'clock_name': 'meer',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': 'βŒ›',
 'citation': 'Meer, Margarita V., et al. "A whole lifespan mouse multi-tissue DNA methylation clock." Elife 7 (2018): e40675.',
 'doi': 'https://doi.org/10.7554/eLife.40675',
 'notes': None,
 'version': None}
[19]:
adata.uns['petkovich_metadata']
[19]:
{'clock_name': 'petkovich',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': 'βŒ›',
 'citation': 'Petkovich, Daniel A., et al. "Using DNA methylation profiling to evaluate biological age and longevity interventions." Cell metabolism 25.4 (2017): 954-960.',
 'doi': 'https://doi.org/10.1016/j.cmet.2017.03.016',
 'notes': None,
 'version': None}
[20]:
adata.uns['stubbs_metadata']
[20]:
{'clock_name': 'stubbs',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': 'βŒ›',
 'citation': 'Stubbs, Thomas M., et al. "Multi-tissue DNA methylation age predictor in mouse." Genome biology 18 (2017): 1-14.',
 'doi': 'https://doi.org/10.1186/s13059-017-1203-5',
 'notes': None,
 'version': None}
[21]:
adata_mammalian.uns['mammalian1_metadata']
[21]:
{'clock_name': 'mammalian1',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': 'βŒ›',
 'citation': 'Lu, A. T., et al. "Universal DNA methylation age across mammalian tissues." Nature aging 3.9 (2023): 1144-1166.',
 'doi': 'https://doi.org/10.1038/s43587-023-00462-6',
 'notes': None,
 'version': None}
[22]:
adata_mammalian.uns['mammalianlifespan_metadata']
[22]:
{'clock_name': 'mammalianlifespan',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': 'βŒ›',
 'citation': 'Li, Caesar Z., et al. "Epigenetic predictors of species maximum lifespan and other life history traits in mammals." bioRxiv (2023): 2023-11.',
 'doi': 'https://doi.org/10.1101/2023.11.02.565286',
 'notes': None,
 'version': None}