Open In Colab Open In nbviewer

Illumina Mammalian Methylation Arrays#

We just need two packages for this tutorial.

[1]:
import pandas as pd
import pyaging as pya

Download and load example data#

Let’s download the publicly avaiable dataset GSE223748 with Illumina’s Mammalian Methylation array. The CpG coverage of the this array (~37k) spans highly conserved CpG sequences. Let’s download a subset of that data.

[2]:
pya.data.download_example_data('GSE223748')
|-----> πŸ—οΈ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE223748_subset.pkl
|-----> πŸŽ‰ Done! [0.5310s]
[3]:
df = pd.read_pickle('pyaging_data/GSE223748_subset.pkl')
[4]:
df.head()
[4]:
cg00000165 cg00001209 cg00001364 cg00001582 cg00002920 cg00003994 cg00004555 cg00005112 cg00005271 cg00006213 ... rs7746156_II_F_C_37550 rs798149_II_F_C_37528 rs845016_II_F_C_37529 rs877309_II_F_C_37552 rs9292570_I_F_C_37499 rs9363764_II_F_C_37541 rs939290_II_F_C_37535 rs951295_I_F_C_37507 rs966367_II_F_C_37551 rs9839873_II_F_C_37532
204509080002_R01C02 0.094879 0.916154 0.890314 0.053583 0.490381 0.034852 0.159705 0.763959 0.973245 0.928975 ... 0.488592 0.491361 0.480024 0.500000 0.484252 0.489448 0.505585 0.505335 0.485003 0.510081
202897220142_R04C02 0.497077 0.441263 0.915314 0.047339 0.651029 0.037774 0.082634 0.415800 0.702857 0.821715 ... 0.508102 0.500299 0.507261 0.490684 0.499673 0.497256 0.564106 0.482151 0.486667 0.505236
204529320092_R01C02 0.321141 0.834158 0.881194 0.056124 0.688350 0.030225 0.086776 0.777588 0.974587 0.923934 ... 0.520404 0.509568 0.507549 0.501659 0.492823 0.487243 0.516018 0.471244 0.491066 0.491759
202794570004_R02C01 0.495226 0.924121 0.915812 0.050866 0.688335 0.032344 0.113318 0.872094 0.969189 0.917076 ... 0.499314 0.516132 0.487009 0.487146 0.469119 0.495125 0.548238 0.512283 0.514257 0.492520
203531420070_R05C02 0.183954 0.934332 0.924153 0.055032 0.717495 0.037108 0.098632 0.859614 0.973422 0.963446 ... 0.501432 0.509412 0.485055 0.497272 0.480637 0.467502 0.494246 0.500924 0.531334 0.503709

5 rows Γ— 37554 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:
adata = pya.pp.df_to_adata(df, imputer_strategy='knn')
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.0119s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0019s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 100 observations
|-----------> There are 37554 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> βœ… Log data statistics finished [0.0128s]
|-----> βš™οΈ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> βœ… Impute missing values finished [0.0079s]
|-----> πŸŽ‰ Done! [0.0404s]

This is what the adata object looks like:

[6]:
adata
[6]:
AnnData object with n_obs Γ— n_vars = 100 Γ— 37554
    var: 'percent_na'
    layers: 'X_original'

Predict age#

Mammalian predictors without species declaration#

We can either predict one clock at once or all at the same time. Let’s first start with the mammalian clocks that do not need the species Latin name for the conversion of the output into units of years.

[7]:
pya.pred.predict_age(adata, ['Mammalian1', 'MammalianLifespan', 'MammalianFemale'])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0013s]
|-----> πŸ•’ Processing clock: mammalian1
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian1.pt
|-----------> βœ… Load clock finished [0.5443s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian1]
|-----------> βœ… Check features in adata finished [0.0318s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is anti_logp2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0015s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸ•’ Processing clock: mammalianlifespan
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianlifespan.pt
|-----------> βœ… Load clock finished [0.4468s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianlifespan]
|-----------> βœ… Check features in adata finished [0.0127s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0006s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸ•’ Processing clock: mammalianfemale
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianfemale.pt
|-----------> βœ… Load clock finished [0.4320s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianfemale]
|-----------> βœ… Check features in adata finished [0.0095s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is sigmoid
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0015s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0005s]
|-----> πŸŽ‰ Done! [1.6454s]
[8]:
adata.obs.head()
[8]:
mammalian1 mammalianlifespan mammalianfemale
204509080002_R01C02 26.372437 93.886067 0.994351
202897220142_R04C02 1.176586 6.999176 0.991473
204529320092_R01C02 18.776438 73.335119 0.008419
202794570004_R02C01 0.890973 5.332615 0.941965
203531420070_R05C02 10.371315 68.409331 0.009133

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:
pya.data.download_example_data('GSE223748', verbose=False)
df = pd.read_pickle('pyaging_data/GSE223748_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='knn', verbose=False)
pya.pred.predict_age(adata, ['Mammalian1', 'MammalianLifespan', 'MammalianFemale'], verbose=False)
adata.obs.head()
[9]:
mammalian1 mammalianlifespan mammalianfemale
204509080002_R01C02 26.372437 93.886067 0.994351
202897220142_R04C02 1.176586 6.999176 0.991473
204529320092_R01C02 18.776438 73.335119 0.008419
202794570004_R02C01 0.890973 5.332615 0.941965
203531420070_R05C02 10.371315 68.409331 0.009133

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[10]:
adata
[10]:
AnnData object with n_obs Γ— n_vars = 100 Γ— 37554
    obs: 'mammalian1', 'mammalianlifespan', 'mammalianfemale'
    var: 'percent_na'
    uns: 'mammalian1_percent_na', 'mammalian1_missing_features', 'mammalian1_metadata', 'mammalianlifespan_percent_na', 'mammalianlifespan_missing_features', 'mammalianlifespan_metadata', 'mammalianfemale_percent_na', 'mammalianfemale_missing_features', 'mammalianfemale_metadata'
    layers: 'X_original'

Mammalian predictors with species declaration#

Mammalian2 and mammalian3 types of clocks require species declaration for the reverse transformation of the output into units of years. For the mammalian2 clocks, there are 1756 species in the dictionary with the available variables for reverse transformation; for the mammalian3, there are 1707 species. By default, Homo sapiens is the chosen species.

Let’s first have a look at the species that can be used for these clocks by loading the models themselves.

[11]:
logger = pya.logger.Logger('test_logger')
device = 'cpu'
dir = 'pyaging_data'
indent_level = 1

mammalian2_model = pya.pred.load_clock('Mammalian2', device, dir, logger, indent_level=indent_level)
mammalian3_model = pya.pred.load_clock('Mammalian3', device, dir, logger, indent_level=indent_level)
|-----> βš™οΈ Load clock started
|-----------> Data found in pyaging_data/mammalian2.pt
|-----> βœ… Load clock finished [0.4540s]
|-----> βš™οΈ Load clock started
|-----------> Data found in pyaging_data/mammalian3.pt
|-----> βœ… Load clock finished [0.4946s]

We need to filter the features for the ones that are not CpG sites.

[12]:
mammalian2_species = [feature for feature in mammalian2_model.features if feature[0:2] != 'cg']
mammalian3_species = [feature for feature in mammalian3_model.features if feature[0:2] != 'cg']
print(f"There are {len(mammalian2_species)} species Latin name features in mammalian2.")
print(f"There are {len(mammalian3_species)} species Latin name features in mammalian3.")
There are 1756 species Latin name features in mammalian2.
There are 1707 species Latin name features in mammalian3.
[13]:
mammalian2_species[0:10]
[13]:
['Anaxyrus americanus',
 'Anaxyrus boreas',
 'Anaxyrus canorus',
 'Anaxyrus cognatus',
 'Anaxyrus retiformis',
 'Anaxyrus terrestris',
 'Rhinella marina',
 'Dendrobates auratus',
 'Dendrobates leucomelas',
 'Hyla chrysoscelis']
[14]:
mammalian3_species[0:10]
[14]:
['Anaxyrus americanus',
 'Anaxyrus boreas',
 'Anaxyrus canorus',
 'Anaxyrus cognatus',
 'Rhinella marina',
 'Dendrobates auratus',
 'Dendrobates leucomelas',
 'Phyllobates vittatus',
 'Hyla chrysoscelis',
 'Hyla versicolor']

To chose a species, simply add the Latin name as a feature with value 1. In this subset version of the GSE223748 dataset, the species names are not available. Therefore, let’s use the naked mole rat (Heterocephalus glaber) as our species.

Let’s first check that it is available in the clocks.

[15]:
'Heterocephalus glaber' in  mammalian2_species
[15]:
True
[16]:
'Heterocephalus glaber' in  mammalian3_species
[16]:
True

Then, let’s add it as a feature to the pandas dataframe and create a new adata object.

[17]:
df['Heterocephalus glaber'] = 1
adata = pya.pp.df_to_adata(df, imputer_strategy='knn')
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.0291s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 100 observations
|-----------> There are 37555 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> βœ… Log data statistics finished [0.0045s]
|-----> βš™οΈ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> βœ… Impute missing values finished [0.0040s]
|-----> πŸŽ‰ Done! [0.0415s]

Finally, let’s make the predictions using the multi-tissue mammalian2 and mammalian3 clocks plus the blood-specific and skin-specific versions.

[18]:
pya.pred.predict_age(adata, ['Mammalian2', 'MammalianSkin2', 'MammalianBlood2', 'Mammalian3', 'MammalianSkin3', 'MammalianBlood3'])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0013s]
|-----> πŸ•’ Processing clock: mammalian2
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian2.pt
|-----------> βœ… Load clock finished [0.4827s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1755 out of 2572 features (68.23%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalian2
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian2]
|-----------> ⚠️ Check features in adata finished [0.0556s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0042s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸ•’ Processing clock: mammalianskin2
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianskin2.pt
|-----------> βœ… Load clock finished [0.4283s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1755 out of 2240 features (78.35%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalianskin2
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianskin2]
|-----------> ⚠️ Check features in adata finished [0.0447s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0022s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸ•’ Processing clock: mammalianblood2
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianblood2.pt
|-----------> βœ… Load clock finished [0.4847s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1755 out of 2257 features (77.76%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalianblood2
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianblood2]
|-----------> ⚠️ Check features in adata finished [0.0799s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian2
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0013s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0005s]
|-----> πŸ•’ Processing clock: mammalian3
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalian3.pt
|-----------> βœ… Load clock finished [0.5006s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1706 out of 2467 features (69.15%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalian3
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian3]
|-----------> ⚠️ Check features in adata finished [0.1086s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian3
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0059s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0005s]
|-----> πŸ•’ Processing clock: mammalianskin3
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianskin3.pt
|-----------> βœ… Load clock finished [0.5248s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1706 out of 2055 features (83.02%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalianskin3
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianskin3]
|-----------> ⚠️ Check features in adata finished [0.0743s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian3
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0024s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸ•’ Processing clock: mammalianblood3
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/mammalianblood3.pt
|-----------> βœ… Load clock finished [0.4440s]
|-----------> βš™οΈ Check features in adata started
|-----------------? 1706 out of 2097 features (81.35%) are missing: ['Anaxyrus americanus', 'Anaxyrus boreas', 'Anaxyrus canorus'], etc.
|-----------------> Using reference feature values for mammalianblood3
|-----------------> Added prepared input matrix to adata.obsm[X_mammalianblood3]
|-----------> ⚠️ Check features in adata finished [0.0856s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is mammalian3
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0024s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0006s]
|-----> πŸŽ‰ Done! [3.6574s]

During age prediction, if the other species are not present in the input data, they will show up as missing features and the value will be automatically replaced with 0. Therefore, those missing features are not necessarily CpG sites. To double check, one can simply go to the adata.uns to check for missing features.

[19]:
adata.uns['mammalian2_missing_features'][0:10]
[19]:
['Anaxyrus americanus',
 'Anaxyrus boreas',
 'Anaxyrus canorus',
 'Anaxyrus cognatus',
 'Anaxyrus retiformis',
 'Anaxyrus terrestris',
 'Rhinella marina',
 'Dendrobates auratus',
 'Dendrobates leucomelas',
 'Hyla chrysoscelis']

Finally, let’s look at the predictions.

[20]:
adata.obs.head()
[20]:
mammalian2 mammalianskin2 mammalianblood2 mammalian3 mammalianskin3 mammalianblood3
204509080002_R01C02 14.946335 6.562780 15.321810 9.211476 4.328208 13.111774
202897220142_R04C02 17.037182 3.966518 4.293048 17.737573 0.900470 4.240264
204529320092_R01C02 12.065347 13.392643 14.483315 5.950473 7.048246 5.498531
202794570004_R02C01 15.263569 19.451386 7.148743 15.870878 14.483290 11.685419
203531420070_R05C02 6.689490 6.809801 7.602141 4.106029 2.040104 6.293626
... ... ... ... ... ... ...
205128010037_R03C02 22.698138 22.948062 25.326321 14.432799 14.013267 21.264321
206116820044_R06C02 11.146012 9.927552 16.813175 5.092238 4.367878 6.303160
203203210055_R03C02 2.220405 3.114650 7.204062 1.323726 1.719778 6.897330
203203210003_R06C02 21.672136 28.800661 21.536502 16.231949 20.177062 24.678961
204027420026_R03C02 4.925257 6.656713 5.481918 4.350099 4.850269 3.033646

100 rows Γ— 6 columns

For curiosity let’s check the correlation between the clocks.

[21]:
adata.obs.corr('pearson')
[21]:
mammalian2 mammalianskin2 mammalianblood2 mammalian3 mammalianskin3 mammalianblood3
mammalian2 1.000000 0.658934 0.609392 0.910119 0.611179 0.717135
mammalianskin2 0.658934 1.000000 0.418290 0.607472 0.900305 0.594863
mammalianblood2 0.609392 0.418290 1.000000 0.463504 0.388197 0.754871
mammalian3 0.910119 0.607472 0.463504 1.000000 0.675022 0.729099
mammalianskin3 0.611179 0.900305 0.388197 0.675022 1.000000 0.646473
mammalianblood3 0.717135 0.594863 0.754871 0.729099 0.646473 1.000000

Again, having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[22]:
pya.data.download_example_data('GSE223748', verbose=False)
df = pd.read_pickle('pyaging_data/GSE223748_subset.pkl')
df['Heterocephalus glaber'] = 1
adata = pya.preprocess.df_to_adata(df, imputer_strategy='knn', verbose=False)
pya.pred.predict_age(adata, ['Mammalian2', 'MammalianSkin2', 'MammalianBlood2', 'Mammalian3', 'MammalianSkin3', 'MammalianBlood3'], verbose=False)
adata.obs.head()
[22]:
mammalian2 mammalianskin2 mammalianblood2 mammalian3 mammalianskin3 mammalianblood3
204509080002_R01C02 14.946335 6.562780 15.321810 9.211476 4.328208 13.111774
202897220142_R04C02 17.037182 3.966518 4.293048 17.737573 0.900470 4.240264
204529320092_R01C02 12.065347 13.392643 14.483315 5.950473 7.048246 5.498531
202794570004_R02C01 15.263569 19.451386 7.148743 15.870878 14.483290 11.685419
203531420070_R05C02 6.689490 6.809801 7.602141 4.106029 2.040104 6.293626

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[23]:
adata.uns['mammalian2_metadata']
[23]:
{'clock_name': 'mammalian2',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': 'βŒ›',
 'citation': 'Lu, A. T., et al. "Universal DNA methylation age across mammalian tissues." Nature aging 3.9 (2023): 1144-1166.',
 'doi': 'https://doi.org/10.1038/s43587-023-00462-6',
 'notes': None,
 'version': None}