Open In Colab Open In nbviewer

Bulk ATAC-Seq#

This tutorial is a brief guide for the implementation of the two ATAC clocks developed by Morandini et al. Link to paper.

We just need two packages for this tutorial.

[1]:
import pandas as pd
import pyaging as pya

Download and load example data#

If you have your own ATAC-Seq data, please follow the recommendations in the Ocampo paper. Specifically, one needs to count the number of reads for each of the peak regions from the paper (file here). This can be done through the code found on their GitHub using featureCounts.

For testing purposes, let’s download an example of input for the ATAC clocks. For instructions on how to go from raw sequencing reads to the data table, please refer to the paper.

[2]:
pya.data.download_example_data('GSE193140')
|-----> πŸ—οΈ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE193140.pkl
|-----> πŸŽ‰ Done! [0.4942s]
[3]:
df = pd.read_pickle('pyaging_data/GSE193140.pkl')
[4]:
df.head()
[4]:
chr1:817100-817691 chr1:826742-828191 chr1:841908-843021 chr1:844055-844921 chr1:857908-859108 chr1:869571-870271 chr1:898378-899076 chr1:904303-905702 chr1:906675-907111 chr1:912617-913368 ... chrY:21073148-21074236 chrY:21174455-21175401 chrY:21177324-21177828 chrY:21180682-21181317 chrY:21239902-21241040 chrY:21248553-21249961 chrY:21256824-21257260 chrY:21259823-21260874 chrY:22086084-22086722 chrY:22499696-22500344
CR_124 182 2652 15 11 9 843 2 714 556 37 ... 62 104 65 31 90 20 50 21 2 2
CR_122 96 2688 27 25 40 1097 13 786 167 12 ... 11 13 31 25 270 37 29 18 9 12
CR_121 137 2785 42 46 69 1297 8 638 351 24 ... 0 0 0 0 0 0 0 0 0 0
CR_120 169 2819 29 35 46 1373 20 931 301 10 ... 7 9 8 33 151 47 50 18 32 14
CR_119 205 3005 18 45 37 1025 33 1138 241 36 ... 15 18 17 12 57 25 7 8 7 7

5 rows Γ— 80400 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:
adata = pya.preprocess.df_to_adata(df)
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.0289s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0004s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 157 observations
|-----------> There are 80400 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> βœ… Log data statistics finished [0.0049s]
|-----> βš™οΈ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> βœ… Impute missing values finished [0.0053s]
|-----> πŸŽ‰ Done! [0.0419s]

Note that the original DataFrame is stored in X_original under layers. is This is what the adata object looks like:

[6]:
adata
[6]:
AnnData object with n_obs Γ— n_vars = 157 Γ— 80400
    var: 'percent_na'
    layers: 'X_original'

Predict age#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input all two clocks of interest at once. The function is invariant to the capitalization of the clock name.

[7]:
pya.pred.predict_age(adata, ['OcampoATAC1', 'OcampoATAC2'])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0006s]
|-----> πŸ•’ Processing clock: ocampoatac1
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/ocampoatac1.pt
|-----------> βœ… Load clock finished [0.5113s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_ocampoatac1]
|-----------> βœ… Check features in adata finished [3.8480s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> The preprocessing method is tpm_norm_log1p
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.1635s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸ•’ Processing clock: ocampoatac2
|-----------> βš™οΈ Load clock started
|-----------------> Data found in pyaging_data/ocampoatac2.pt
|-----------> βœ… Load clock finished [0.4514s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_ocampoatac2]
|-----------> βœ… Check features in adata finished [4.9598s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> The preprocessing method is tpm_norm_log1p
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0690s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸŽ‰ Done! [10.1175s]
[8]:
adata.obs.head()
[8]:
ocampoatac1 ocampoatac2
CR_124 29.527124 28.114206
CR_122 39.003097 40.061162
CR_121 40.716008 43.095199
CR_120 32.380372 33.033456
CR_119 36.440711 38.301516

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:
pya.data.download_example_data('GSE193140', verbose=False)
df = pd.read_pickle('pyaging_data/GSE193140.pkl')
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['OcampoATAC1', 'OcampoATAC2'], verbose=False)
[10]:
adata.obs.head()
[10]:
ocampoatac1 ocampoatac2
CR_124 29.527124 28.114206
CR_122 39.003097 40.061162
CR_121 40.716008 43.095199
CR_120 32.380372 33.033456
CR_119 36.440711 38.301516

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[11]:
adata
[11]:
AnnData object with n_obs Γ— n_vars = 157 Γ— 80400
    obs: 'ocampoatac1', 'ocampoatac2'
    var: 'percent_na'
    uns: 'ocampoatac1_percent_na', 'ocampoatac1_missing_features', 'ocampoatac1_metadata', 'ocampoatac2_percent_na', 'ocampoatac2_missing_features', 'ocampoatac2_metadata'
    layers: 'X_original'

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[12]:
adata.uns['ocampoatac1_metadata']
[12]:
{'clock_name': 'ocampoatac1',
 'data_type': 'atac',
 'species': 'Homo sapiens',
 'year': 2023,
 'approved_by_author': 'βŒ›',
 'citation': 'Morandini, Francesco, et al. "ATAC-clock: An aging clock based on chromatin accessibility." GeroScience (2023): 1-18.',
 'doi': 'https://doi.org/10.1007/s11357-023-00986-0',
 'notes': None,
 'version': None}