RRBS DNA methylation#

This tutorial focuses on predicting age from Mus musculus reduced-representation bisulfite sequencing (RRBS) data. There are a few clocks available that were trained on RRBS data. Moreover, it is possible to use Horvath’s mammalian clocks by converting the genomic location to the probes in the Horvath methylation array.

[1]:

import pandas as pd
import pyaging as pya
import os
import numpy as np

Download and load example data#

Let’s download the publicly available dataset GSE130735 with RRBS samples from mouse. Given it is RRBS, there are millions of CpG sites.

[2]:

pya.data.download_example_data('GSE130735')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE130735_subset.pkl
|-----> 🎉 Done! [0.5425s]

[3]:

df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')

It is important to note that the features for RRBS clocks are the genomic coordinates in the format below.

[4]:

df.head()

[4]:

	chr1:3020814	chr1:3020842	chr1:3020877	chr1:3020891	chr1:3020945	chr1:3020971	chr1:3020987	chr1:3021012	chr1:3037802	chr1:3037820	...	chrY:1825397	chrY:4682362	chrY:32122892	chrY:85867071	chrY:85867083	chrY:85867117	chrY:85867137	chrY:85867139	chrY:85867178	chrY:88224179
GSM3752631	0.609	0.25	0.408	0.189	0.068	0.373	0.571	0.252	0.333	0.158	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
GSM3752625	NaN	NaN	0.973	0.984	0.912	0.915	0.987	0.974	0.991	0.932	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
GSM3752634	NaN	NaN	0.526	0.131	0.000	0.038	0.469	0.769	0.772	0.146	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
GSM3752620	0.931	0.92	0.988	0.949	0.897	0.921	0.907	0.958	1.000	0.867	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
GSM3752622	NaN	NaN	0.205	0.382	0.091	0.132	0.174	0.227	0.108	0.053	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1778324 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:

adata = pya.pp.df_to_adata(df, imputer_strategy='mean') # knn might be a bit slow

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.9882s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 1778324 features
|-----------> Total missing values: 6322346
|-----------> Percentage of missing values: 25.39%
|-----> ✅ Log data statistics finished [0.0205s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> ✅ Impute missing values finished [0.4631s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0087s]
|-----> 🎉 Done! [1.4897s]

This is what the adata object looks like:

Predict age with RRBS clocks#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input all four available mammalian clocks at once. The function is invariant to the capitalization of the clock name.

[6]:

pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0033s]
|-----> 🕒 Processing clock: thompson
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/thompson.pt
|-----------> ✅ Load clock finished [0.5324s]
|-----------> ⚙️ Check features in adata started
|-----------------? 1 out of 582 features (0.17%) are missing: ['chr4:91376687'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_thompson]
|-----------> ⚠️ Check features in adata finished [0.0654s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0013s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0008s]
|-----> 🕒 Processing clock: meer
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/meer.pt
|-----------> ✅ Load clock finished [0.4402s]
|-----------> ⚙️ Check features in adata started
|-----------------? 225 out of 435 features (51.72%) are missing: ['chr10:111559529', 'chr10:115250413', 'chr10:127620127'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_meer]
|-----------> ⚠️ Check features in adata finished [0.0412s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0010s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> 🕒 Processing clock: petkovich
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/petkovich.pt
|-----------> ✅ Load clock finished [0.5167s]
|-----------> ⚙️ Check features in adata started
|-----------------? 58 out of 90 features (64.44%) are missing: ['chr19:23893237', 'chr18:45589182', 'chr16:10502162'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_petkovich]
|-----------> ⚠️ Check features in adata finished [0.0161s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is petkovich
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0033s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0020s]
|-----> 🕒 Processing clock: stubbs
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/stubbs.pt
|-----------> ✅ Load clock finished [0.4679s]
|-----------> ⚙️ Check features in adata started
|-----------------? 8889 out of 17992 features (49.41%) are missing: ['chr1:10038066', 'chr1:106173313', 'chr1:106759301'], etc.
|-----------------> Using reference feature values for stubbs
|-----------------> Added prepared input matrix to adata.obsm[X_stubbs]
|-----------> ⚠️ Check features in adata finished [0.8672s]
|-----------> ⚙️ Predict ages with model started
|-----------------> The preprocessing method is quantile_normalization_and_scale_with_gold_standard
|-----------------> The postprocessing method is stubbs
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0263s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0014s]
|-----> 🎉 Done! [3.2757s]

All of the age predictions are in unit of months.

[7]:

adata.obs.head()

[7]:

	thompson	meer	petkovich	stubbs
GSM3752631	19.634113	7.315183	8.075177	0.957770
GSM3752625	-1.410461	0.028221	2.953822	-0.074265
GSM3752634	61.058783	21.322178	9.640489	1.389193
GSM3752620	-2.663815	1.611947	3.019351	-0.092710
GSM3752622	20.594114	7.592145	7.104766	0.667168

[8]:

adata.obs.head()

[8]:

	thompson	meer	petkovich	stubbs
GSM3752631	19.634113	7.315183	8.075177	0.957770
GSM3752625	-1.410461	0.028221	2.953822	-0.074265
GSM3752634	61.058783	21.322178	9.640489	1.389193
GSM3752620	-2.663815	1.611947	3.019351	-0.092710
GSM3752622	20.594114	7.592145	7.104766	0.667168

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:

pya.data.download_example_data('GSE130735', verbose=False)
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='mean', verbose=False)
pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'], verbose=False)

[10]:

adata.obs.head()

[10]:

	thompson	meer	petkovich	stubbs
GSM3752631	19.634113	7.315183	8.075177	0.957770
GSM3752625	-1.410461	0.028221	2.953822	-0.074265
GSM3752634	61.058783	21.322178	9.640489	1.389193
GSM3752620	-2.663815	1.611947	3.019351	-0.092710
GSM3752622	20.594114	7.592145	7.104766	0.667168

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[11]:

adata

[11]:

AnnData object with n_obs × n_vars = 14 × 1778324
    obs: 'thompson', 'meer', 'petkovich', 'stubbs'
    var: 'percent_na'
    uns: 'imputer_strategy', 'thompson_percent_na', 'thompson_missing_features', 'thompson_metadata', 'meer_percent_na', 'meer_missing_features', 'meer_metadata', 'petkovich_percent_na', 'petkovich_missing_features', 'petkovich_metadata', 'stubbs_percent_na', 'stubbs_missing_features', 'stubbs_metadata'
    layers: 'X_original', 'X_imputed'

Predict age with mammalian clocks#

We can predict age by converting the genomic locations directly into the probes from Horvath’s methylation array.

[12]:

os.system('git clone https://github.com/shorvath/MammalianMethylationConsortium.git')

# Let's read the manifest from the mammalian consortium
annotation_df = pd.read_csv('MammalianMethylationConsortium/Annotations, Amin Haghani/Mammals/Mus_musculus.grcm38.100.HorvathMammalMethylChip40.v1.csv', index_col=0)
annotation_df = annotation_df[~annotation_df.seqnames.isna()]
mm_genomic_locations = 'chr' + annotation_df['seqnames'].astype(str) + ':' + annotation_df['CGstart'].astype(int).astype(str)
mm_genomic_locations = mm_genomic_locations.tolist()
mammalian_probes = annotation_df['CGid'].tolist()
mm_loc_to_probe = dict(zip(mm_genomic_locations, mammalian_probes))

# Let's get the previous RRBS dataset and filter only for the genomic locations in the manifest file
df_columns_set = set(df.columns)
mm_loc_to_probe_set = set(mm_loc_to_probe.keys())
common_columns = df_columns_set.intersection(mm_loc_to_probe_set)
df_converted = df[list(common_columns)].copy()

# Then, convert the genomic location to the probe name
df_converted.columns = [mm_loc_to_probe[col] for col in df_converted.columns]

# Let's clean the GitHub
os.system('rm -r MammalianMethylationConsortium')

[12]:

[13]:

df_converted.head()

[13]:

	cg05347424	cg26718996	cg07727941	cg16852837	cg12870762	cg26080798	cg02899039	cg12839061	cg05267150	cg13170453	...	cg02179016	cg20836420	cg18831685	cg08992395	cg13679010	cg12982463	cg17146242	cg13649253	cg07588415	cg14814195
GSM3752631	NaN	NaN	0.000	0.0	0.015	0.000	0.005	NaN	0.023	0.000	...	NaN	0.000	0.028	NaN	0.000	0.000	0.018	0.000	0.021	NaN
GSM3752625	0.938	NaN	0.000	NaN	0.000	0.000	0.000	NaN	0.596	NaN	...	NaN	0.000	NaN	NaN	0.000	0.895	0.227	0.156	0.025	NaN
GSM3752634	0.125	NaN	0.000	NaN	0.627	0.017	0.033	NaN	0.745	NaN	...	NaN	0.495	NaN	NaN	0.014	0.278	0.519	0.786	0.012	NaN
GSM3752620	0.769	NaN	0.091	0.0	0.070	0.006	0.012	NaN	0.607	0.092	...	NaN	0.010	0.054	NaN	0.000	0.933	0.277	0.148	0.000	NaN
GSM3752622	NaN	NaN	0.000	NaN	0.000	0.000	0.000	NaN	0.052	NaN	...	NaN	0.000	NaN	NaN	0.000	0.000	0.064	0.022	0.000	NaN

5 rows × 5149 columns

Now we can finally put the dataframe into pyaging after defining the species as Mus musculus.

[14]:

df_converted['Mus musculus'] = 1
adata_mammalian = pya.pp.df_to_adata(df_converted, imputer_strategy='mean')

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0057s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 5150 features
|-----------> Total missing values: 17862
|-----------> Percentage of missing values: 24.77%
|-----> ✅ Log data statistics finished [0.0013s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> ✅ Impute missing values finished [0.0060s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0004s]
|-----> 🎉 Done! [0.0174s]

Let’s use these five mammalian predictors.

[15]:

pya.pred.predict_age(adata_mammalian, ['Mammalian1', 'Mammalian2', 'Mammalian3', "MammalianLifespan", "MammalianFemale"])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0013s]
|-----> 🕒 Processing clock: mammalian1
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalian1.pt
|-----------> ✅ Load clock finished [0.4780s]
|-----------> ⚙️ Check features in adata started
|-----------------? 274 out of 335 features (81.79%) are missing: ['cg00249943', 'cg00250826', 'cg00292639'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian1]
|-----------> ⚠️ Check features in adata finished [0.0173s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is anti_logp2
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0083s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0017s]
|-----> 🕒 Processing clock: mammalian2
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalian2.pt
|-----------> ✅ Load clock finished [0.4544s]
|-----------> ⚙️ Check features in adata started
|-----------------? 2406 out of 2572 features (93.55%) are missing: ['cg00020468', 'cg00096922', 'cg00098422'], etc.
|-----------------> Using reference feature values for mammalian2
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian2]
|-----------> ⚠️ Check features in adata finished [0.0407s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian2
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0196s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> 🕒 Processing clock: mammalian3
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalian3.pt
|-----------> ✅ Load clock finished [0.5081s]
|-----------> ⚙️ Check features in adata started
|-----------------? 2299 out of 2467 features (93.19%) are missing: ['cg00101675', 'cg06259996', 'cg15168457'], etc.
|-----------------> Using reference feature values for mammalian3
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian3]
|-----------> ⚠️ Check features in adata finished [0.0222s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian3
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0095s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> 🕒 Processing clock: mammalianlifespan
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalianlifespan.pt
|-----------> ✅ Load clock finished [0.4420s]
|-----------> ⚙️ Check features in adata started
|-----------------? 133 out of 152 features (87.50%) are missing: ['cg00039845', 'cg00300233', 'cg00810217'], etc.
|-----------------> Using reference feature values for mammalianlifespan
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianlifespan]
|-----------> ⚠️ Check features in adata finished [0.0043s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0018s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0009s]
|-----> 🕒 Processing clock: mammalianfemale
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalianfemale.pt
|-----------> ✅ Load clock finished [0.4532s]
|-----------> ⚙️ Check features in adata started
|-----------------? 73 out of 101 features (72.28%) are missing: ['cg01145947', 'cg02053792', 'cg02407848'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianfemale]
|-----------> ⚠️ Check features in adata finished [0.0135s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is sigmoid
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0051s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0024s]
|-----> 🎉 Done! [3.0755s]

Note that RRBS clocks are in units of months whereas the mammalian clocks are in units of years.

[16]:

adata_mammalian.obs

[16]:

	mammalian1	mammalian2	mammalian3	mammalianlifespan	mammalianfemale
GSM3752631	2.537895	-0.009800	-0.048414	1.202134	0.732238
GSM3752625	3.353935	0.064448	-0.048203	1.480080	0.952105
GSM3752634	4.490610	0.813899	-0.035059	1.400278	0.978554
GSM3752620	3.603802	0.122934	-0.046958	1.644701	0.948646
GSM3752622	2.951263	0.005617	-0.047725	1.385282	0.741336
GSM3752637	5.718515	0.895781	-0.037649	1.399769	0.975114
GSM4558216	7.456245	0.684752	-0.012195	1.448119	0.785830
GSM3752643	5.881943	0.880053	-0.037656	1.408483	0.969979
GSM4558213	6.720080	0.855574	-0.026211	1.477236	0.821308
GSM3752640	6.452934	0.766701	-0.031763	1.371831	0.938162
GSM4558222	5.049247	0.120607	-0.045109	1.366729	0.789795
GSM4558219	6.098710	0.857243	-0.029878	1.422144	0.796568
GSM3752628	2.746949	0.098284	-0.047989	1.480599	0.794259
GSM3752617	2.739868	0.078172	-0.048249	1.484091	0.807197

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[17]:

adata.uns['thompson_metadata']

[17]:

{'clock_name': 'thompson',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': '✅',
 'citation': 'Thompson, Michael J., et al. "A multi-tissue full lifespan epigenetic clock for mice." Aging (Albany NY) 10.10 (2018): 2832.',
 'doi': 'https://doi.org/10.18632/aging.101590',
 'notes': None,
 'version': None}

[18]:

adata.uns['meer_metadata']

[18]:

{'clock_name': 'meer',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': '⌛',
 'citation': 'Meer, Margarita V., et al. "A whole lifespan mouse multi-tissue DNA methylation clock." Elife 7 (2018): e40675.',
 'doi': 'https://doi.org/10.7554/eLife.40675',
 'notes': None,
 'version': None}

[19]:

adata.uns['petkovich_metadata']

[19]:

{'clock_name': 'petkovich',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': '⌛',
 'citation': 'Petkovich, Daniel A., et al. "Using DNA methylation profiling to evaluate biological age and longevity interventions." Cell metabolism 25.4 (2017): 954-960.',
 'doi': 'https://doi.org/10.1016/j.cmet.2017.03.016',
 'notes': None,
 'version': None}

[20]:

adata.uns['stubbs_metadata']

[20]:

{'clock_name': 'stubbs',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': '⌛',
 'citation': 'Stubbs, Thomas M., et al. "Multi-tissue DNA methylation age predictor in mouse." Genome biology 18 (2017): 1-14.',
 'doi': 'https://doi.org/10.1186/s13059-017-1203-5',
 'notes': None,
 'version': None}

[21]:

adata_mammalian.uns['mammalian1_metadata']

[21]:

{'clock_name': 'mammalian1',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': '⌛',
 'citation': 'Lu, A. T., et al. "Universal DNA methylation age across mammalian tissues." Nature aging 3.9 (2023): 1144-1166.',
 'doi': 'https://doi.org/10.1038/s43587-023-00462-6',
 'notes': None,
 'version': None}

[22]:

adata_mammalian.uns['mammalianlifespan_metadata']

[22]:

{'clock_name': 'mammalianlifespan',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': '⌛',
 'citation': 'Li, Caesar Z., et al. "Epigenetic predictors of species maximum lifespan and other life history traits in mammals." bioRxiv (2023): 2023-11.',
 'doi': 'https://doi.org/10.1101/2023.11.02.565286',
 'notes': None,
 'version': None}