Illumina Human Methylation Arrays#
This tutorial is a brief guide for the implementation of an array of bulk DNA-methylation epigenetic clocks that predict age in humans. In this notebook, we will demonstrate the breadth of epigenetic clock models available in pyaging by showing:
Horvathβs 2013 ElasticNet-based clock (paper);
AltumAge, a highly accurate deep-learning based clock (paper);
PCGrimAge, a principal-component based version of the GrimAge clock (paper);
GrimAge2, the latest version of GrimAge (paper);
DunedinPACE, a biomarker of the pace of aging (paper).
We just need two packages for this tutorial.
[1]:
import pandas as pd
import pyaging as pya
Download and load example data#
Letβs download the publicly avaiable dataset GSE139307 with Illuminaβs 450k array. The CpG coverage of the 450k array should be good enough for most clocks.
[2]:
pya.data.download_example_data('GSE139307')
|-----> ποΈ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE139307.pkl
|-----> π Done! [0.0007s]
[3]:
df = pd.read_pickle('pyaging_data/GSE139307.pkl')
[4]:
df.head()
[4]:
| dataset | tissue_type | age | gender | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | ... | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.94260649R | ch.X.967194F | ch.X.97129969R | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM4137709 | GSE139307 | sperm | 84.0 | M | 0.084811 | 0.920696 | 0.856851 | 0.084567 | 0.838699 | 0.247273 | ... | 0.061751 | 0.045942 | 0.037631 | 0.056455 | 0.249872 | 0.049022 | 0.085691 | 0.037435 | 0.077820 | 0.106234 |
| GSM4137710 | GSE139307 | sperm | 69.0 | M | 0.099626 | 0.919073 | 0.890024 | 0.115541 | 0.852584 | 0.198103 | ... | 0.075077 | 0.041849 | 0.032573 | 0.089790 | 0.250245 | 0.079095 | 0.079756 | 0.046229 | 0.091256 | 0.120241 |
| GSM4137711 | GSE139307 | sperm | 69.0 | M | 0.117228 | 0.920276 | 0.894317 | 0.117127 | 0.839258 | 0.213410 | ... | 0.068679 | 0.049515 | 0.058097 | 0.079919 | 0.299758 | 0.079305 | 0.089815 | 0.065364 | 0.086864 | 0.156005 |
| GSM4137712 | GSE139307 | sperm | 69.0 | M | 0.077096 | 0.910204 | 0.908400 | 0.073885 | 0.861615 | 0.163276 | ... | 0.070091 | 0.033289 | 0.038836 | 0.108213 | 0.295428 | 0.050731 | 0.099943 | 0.047597 | 0.078480 | 0.107480 |
| GSM4137713 | GSE139307 | sperm | 67.0 | M | 0.063524 | 0.911608 | 0.884643 | 0.079877 | 0.864654 | 0.176169 | ... | 0.082368 | 0.038411 | 0.048787 | 0.088631 | 0.316694 | 0.041873 | 0.079303 | 0.048823 | 0.089010 | 0.117903 |
5 rows Γ 485516 columns
For PCGrimAge and GrimAge2, both age and sex are features. Therefore, to get the full prediction, letβs convert the column gender into a column called female, with 1 being female and 0 being male.
[5]:
# needs only numerical data (doesn't work with strings)
df['female'] = (df['gender'] == 'F').astype(int)
Moreover, it is important to note that some probes are duplicated in the EPICv2 array, following the format cg#########_BC11 and cg#########_TC11 for the opposite strands. Given that at this moment most clocks have not been trained with EPICv2 data directly, it is recommended to average these probes. This is particularly the case for DunedinPACE, from which some clock probes were duplicated in the update from EPICv1. To remedy this issue, simply use the following function to aggregate any duplicated probes that may be present.
[6]:
df = pya.pp.epicv2_probe_aggregation(df)
|-----> ποΈ Starting epicv2_probe_aggregation function
|-----> βοΈ Looking for duplicated probes started
|-----------> in progress: 100.0000%
|-----------> There are no duplicated probes. Returning original data
|-----> π Done! [8.0469s]
Convert data to AnnData object#
AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.
[7]:
adata = pya.pp.df_to_adata(df, metadata_cols=['gender', 'tissue_type', 'dataset'], imputer_strategy='knn')
|-----> ποΈ Starting df_to_adata function
|-----> βοΈ Create anndata object started
|-----------? Dropping 1 columns with only NAs: ['cg01550828'], etc.
|-----> β οΈ Create anndata object finished [0.4365s]
|-----> βοΈ Add metadata to anndata started
|-----------> Adding provided metadata to adata.obs
|-----> β
Add metadata to anndata finished [0.0010s]
|-----> βοΈ Log data statistics started
|-----------> There are 37 observations
|-----------> There are 485513 features
|-----------> Total missing values: 489
|-----------> Percentage of missing values: 0.00%
|-----> β
Log data statistics finished [0.0292s]
|-----> βοΈ Impute missing values started
|-----------> Imputing missing values using knn strategy
|-----> β
Impute missing values finished [6.2591s]
|-----> βοΈ Add imputer strategy to adata.uns started
|-----> β
Add imputer strategy to adata.uns finished [0.0003s]
|-----> π Done! [6.8518s]
Note that the original DataFrame is stored in X_original under layers. is This is what the adata object looks like:
[8]:
adata
[8]:
AnnData object with n_obs Γ n_vars = 37 Γ 485513
obs: 'gender', 'tissue_type', 'dataset'
var: 'percent_na'
uns: 'imputer_strategy'
layers: 'X_original', 'X_imputed'
Predict age#
We can either predict one clock at once or all at the same time. For convenience, letβs simply input all four clocks of interest at once. The function is invariant to the capitalization of the clock name.
[9]:
pya.pred.predict_age(adata, ['Horvath2013', 'AltumAge', 'PCGrimAge', 'GrimAge2', 'DunedinPACE'])
|-----> ποΈ Starting predict_age function
|-----> βοΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> β
Set PyTorch device finished [0.0017s]
|-----> π Processing clock: horvath2013
|-----------> βοΈ Load clock started
|-----------------> Data found in pyaging_data/horvath2013.pt
|-----------> β
Load clock finished [0.0059s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> β
Check features in adata finished [0.0936s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is anti_log_linear
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0026s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0017s]
|-----> π Processing clock: altumage
|-----------> βοΈ Load clock started
|-----------------> Data found in pyaging_data/altumage.pt
|-----------> β
Load clock finished [0.0080s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> β
Check features in adata finished [0.1086s]
|-----------> βοΈ Predict ages with model started
|-----------------> The preprocessing method is scale
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0087s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0016s]
|-----> π Processing clock: pcgrimage
|-----------> βοΈ Load clock started
|-----------------> Data found in pyaging_data/pcgrimage.pt
|-----------> β
Load clock finished [0.2855s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> β
Check features in adata finished [0.1395s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.1859s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> π Processing clock: grimage2
|-----------> βοΈ Load clock started
|-----------------> Data found in pyaging_data/grimage2.pt
|-----------> β
Load clock finished [0.0032s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> β
Check features in adata finished [0.0921s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is cox_to_years
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0030s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0014s]
|-----> π Processing clock: dunedinpace
|-----------> βοΈ Load clock started
|-----------------> Data found in pyaging_data/dunedinpace.pt
|-----------> β
Load clock finished [0.0084s]
|-----------? β οΈ Clock 'dunedinpace' is for research purposes only. Please check the clock's documentation or notes for more information.
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> β οΈ Check features in adata finished [0.1133s]
|-----------> βοΈ Predict ages with model started
|-----------------> The preprocessing method is quantile_normalization_with_gold_standard
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0741s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0011s]
|-----> π Done! [4.5616s]
[10]:
adata.obs.head()
[10]:
| gender | tissue_type | dataset | horvath2013 | altumage | pcgrimage | grimage2 | dunedinpace | |
|---|---|---|---|---|---|---|---|---|
| GSM4137709 | M | sperm | GSE139307 | 33.624776 | 37.007213 | 95.506114 | 77.581057 | 1.326327 |
| GSM4137710 | M | sperm | GSE139307 | 28.829344 | 29.426899 | 83.934244 | 65.926346 | 1.215611 |
| GSM4137711 | M | sperm | GSE139307 | 28.316545 | 22.798928 | 82.709334 | 63.358341 | 1.271091 |
| GSM4137712 | M | sperm | GSE139307 | 24.850630 | 18.079173 | 84.269462 | 60.218880 | 1.276866 |
| GSM4137713 | M | sperm | GSE139307 | 25.942111 | 20.071985 | 84.356985 | 61.235919 | 1.262023 |
For curiosity, we can also check if there are any correlations amongst these clocks.
[11]:
adata.obs.iloc[:, 3:].corr('pearson')
[11]:
| horvath2013 | altumage | pcgrimage | grimage2 | dunedinpace | |
|---|---|---|---|---|---|
| horvath2013 | 1.000000 | 0.676242 | 0.211881 | 0.459193 | 0.354771 |
| altumage | 0.676242 | 1.000000 | 0.156456 | 0.440044 | 0.164101 |
| pcgrimage | 0.211881 | 0.156456 | 1.000000 | 0.859490 | 0.061491 |
| grimage2 | 0.459193 | 0.440044 | 0.859490 | 1.000000 | 0.183725 |
| dunedinpace | 0.354771 | 0.164101 | 0.061491 | 0.183725 | 1.000000 |
Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.
[12]:
pya.data.download_example_data('GSE139307', verbose=False)
df = pd.read_pickle('pyaging_data/GSE139307.pkl')
df['female'] = (df['gender'] == 'F').astype(int)
df = pya.pp.epicv2_probe_aggregation(df, verbose=False)
adata = pya.preprocess.df_to_adata(df, metadata_cols=['gender', 'tissue_type', 'dataset'], imputer_strategy='mean', verbose=False)
pya.pred.predict_age(adata, ['Horvath2013', 'AltumAge', 'PCGrimAge', 'GrimAge2', 'DunedinPACE'], verbose=False)
adata.obs.head()
[12]:
| gender | tissue_type | dataset | horvath2013 | altumage | pcgrimage | grimage2 | dunedinpace | |
|---|---|---|---|---|---|---|---|---|
| GSM4137709 | M | sperm | GSE139307 | 33.624776 | 37.007213 | 95.505780 | 77.581057 | 1.326308 |
| GSM4137710 | M | sperm | GSE139307 | 28.829344 | 29.426899 | 83.934244 | 65.926346 | 1.215614 |
| GSM4137711 | M | sperm | GSE139307 | 28.316545 | 22.805551 | 82.709334 | 63.358341 | 1.271033 |
| GSM4137712 | M | sperm | GSE139307 | 24.850630 | 18.060107 | 84.269462 | 60.218880 | 1.276866 |
| GSM4137713 | M | sperm | GSE139307 | 25.942111 | 20.071985 | 84.356985 | 61.235919 | 1.262023 |
After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.
[13]:
adata
[13]:
AnnData object with n_obs Γ n_vars = 37 Γ 485513
obs: 'gender', 'tissue_type', 'dataset', 'horvath2013', 'altumage', 'pcgrimage', 'grimage2', 'dunedinpace'
var: 'percent_na'
uns: 'imputer_strategy', 'horvath2013_percent_na', 'horvath2013_missing_features', 'horvath2013_metadata', 'altumage_percent_na', 'altumage_missing_features', 'altumage_metadata', 'pcgrimage_percent_na', 'pcgrimage_missing_features', 'pcgrimage_metadata', 'grimage2_percent_na', 'grimage2_missing_features', 'grimage2_metadata', 'dunedinpace_percent_na', 'dunedinpace_missing_features', 'dunedinpace_metadata'
layers: 'X_original', 'X_imputed'
We can also look at which features seem to be missing from each clock (if there are any).
[14]:
adata.uns['dunedinpace_missing_features']
[14]:
[]
Get citation#
The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].
[15]:
adata.uns['horvath2013_metadata']
[15]:
{'clock_name': 'horvath2013',
'data_type': 'methylation',
'species': 'Homo sapiens',
'year': 2013,
'approved_by_author': 'β',
'citation': 'Horvath, Steve. "DNA methylation age of human tissues and cell types." Genome biology 14.10 (2013): 1-20.',
'doi': 'https://doi.org/10.1186/gb-2013-14-10-r115',
'notes': None,
'version': None}
[16]:
adata.uns['altumage_metadata']
[16]:
{'clock_name': 'altumage',
'data_type': 'methylation',
'species': 'Homo sapiens',
'year': 2022,
'approved_by_author': 'β
',
'citation': 'de Lima Camillo, Lucas Paulo, Louis R. Lapierre, and Ritambhara Singh. "A pan-tissue DNA-methylation epigenetic clock based on deep learning." npj Aging 8.1 (2022): 4.',
'doi': 'https://doi.org/10.1038/s41514-022-00085-y',
'notes': None,
'version': None}
[17]:
adata.uns['pcgrimage_metadata']
[17]:
{'clock_name': 'pcgrimage',
'data_type': 'methylation',
'species': 'Homo sapiens',
'year': 2022,
'approved_by_author': 'β',
'citation': 'Higgins-Chen, Albert T., et al. "A computational solution for bolstering reliability of epigenetic clocks: Implications for clinical trials and longitudinal tracking." Nature aging 2.7 (2022): 644-661.',
'doi': 'https://doi.org/10.1038/s43587-022-00248-2',
'notes': None,
'research_only': None,
'version': None}
[18]:
adata.uns['grimage2_metadata']
[18]:
{'clock_name': 'grimage2',
'data_type': 'methylation',
'species': 'Homo sapiens',
'year': 2022,
'approved_by_author': 'β',
'citation': 'Lu, Ake T., et al. "DNA methylation GrimAge version 2." Aging (Albany NY) 14.23 (2022): 9484.',
'doi': 'https://doi.org/10.18632/aging.204434',
'notes': None,
'version': None}
[19]:
adata.uns['dunedinpace_metadata']
[19]:
{'clock_name': 'dunedinpace',
'data_type': 'methylation',
'species': 'Homo sapiens',
'year': 2022,
'approved_by_author': 'β
',
'citation': 'Belsky, Daniel W., et al. "DunedinPACE, a DNA methylation biomarker of the pace of aging." Elife 11 (2022): e73420.',
'doi': 'https://doi.org/10.7554/eLife.73420',
'notes': "This model is for research purposes only. Commercial users should contact exclusive DunedinPACE licensee TruDiagnosticTM. The automatic failure if fewer than 80% of the CpG probes are available is not implemented and left to the user's discretion.",
'version': None,
'research_only': True}