Bulk ATAC-Seq#

This tutorial is a brief guide for the implementation of the two ATAC clocks developed by Morandini et al. Link to paper.

We just need two packages for this tutorial.

[1]:

import pandas as pd
import pyaging as pya

Download and load example data#

If you have your own ATAC-Seq data, please follow the recommendations in the Ocampo paper. Specifically, one needs to count the number of reads for each of the peak regions from the paper (file here). This can be done through the code found on their GitHub using featureCounts.

For testing purposes, let’s download an example of input for the ATAC clocks. For instructions on how to go from raw sequencing reads to the data table, please refer to the paper.

[2]:

pya.data.download_example_data('GSE193140')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE193140.pkl
|-----> 🎉 Done! [0.4942s]

[3]:

df = pd.read_pickle('pyaging_data/GSE193140.pkl')

[4]:

df.head()

[4]:

	chr1:817100-817691	chr1:826742-828191	chr1:841908-843021	chr1:844055-844921	chr1:857908-859108	chr1:869571-870271	chr1:898378-899076	chr1:904303-905702	chr1:906675-907111	chr1:912617-913368	...	chrY:21073148-21074236	chrY:21174455-21175401	chrY:21177324-21177828	chrY:21180682-21181317	chrY:21239902-21241040	chrY:21248553-21249961	chrY:21256824-21257260	chrY:21259823-21260874	chrY:22086084-22086722	chrY:22499696-22500344
CR_124	182	2652	15	11	9	843	2	714	556	37	...	62	104	65	31	90	20	50	21	2	2
CR_122	96	2688	27	25	40	1097	13	786	167	12	...	11	13	31	25	270	37	29	18	9	12
CR_121	137	2785	42	46	69	1297	8	638	351	24	...	0	0	0	0	0	0	0	0	0	0
CR_120	169	2819	29	35	46	1373	20	931	301	10	...	7	9	8	33	151	47	50	18	32	14
CR_119	205	3005	18	45	37	1025	33	1138	241	36	...	15	18	17	12	57	25	7	8	7	7

5 rows × 80400 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[5]:

adata = pya.preprocess.df_to_adata(df)

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0289s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0004s]
|-----> ⚙️ Log data statistics started
|-----------> There are 157 observations
|-----------> There are 80400 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0049s]
|-----> ⚙️ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> ✅ Impute missing values finished [0.0053s]
|-----> 🎉 Done! [0.0419s]

Note that the original DataFrame is stored in X_original under layers. is This is what the adata object looks like:

[6]:

adata

[6]:

AnnData object with n_obs × n_vars = 157 × 80400
    var: 'percent_na'
    layers: 'X_original'

Predict age#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input all two clocks of interest at once. The function is invariant to the capitalization of the clock name.

[7]:

pya.pred.predict_age(adata, ['OcampoATAC1', 'OcampoATAC2'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0006s]
|-----> 🕒 Processing clock: ocampoatac1
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/ocampoatac1.pt
|-----------> ✅ Load clock finished [0.5113s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_ocampoatac1]
|-----------> ✅ Check features in adata finished [3.8480s]
|-----------> ⚙️ Predict ages with model started
|-----------------> The preprocessing method is tpm_norm_log1p
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.1635s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> 🕒 Processing clock: ocampoatac2
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/ocampoatac2.pt
|-----------> ✅ Load clock finished [0.4514s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_ocampoatac2]
|-----------> ✅ Check features in adata finished [4.9598s]
|-----------> ⚙️ Predict ages with model started
|-----------------> The preprocessing method is tpm_norm_log1p
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0690s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> 🎉 Done! [10.1175s]

[8]:

adata.obs.head()

[8]:

	ocampoatac1	ocampoatac2
CR_124	29.527124	28.114206
CR_122	39.003097	40.061162
CR_121	40.716008	43.095199
CR_120	32.380372	33.033456
CR_119	36.440711	38.301516

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[9]:

pya.data.download_example_data('GSE193140', verbose=False)
df = pd.read_pickle('pyaging_data/GSE193140.pkl')
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['OcampoATAC1', 'OcampoATAC2'], verbose=False)

[10]:

adata.obs.head()

[10]:

	ocampoatac1	ocampoatac2
CR_124	29.527124	28.114206
CR_122	39.003097	40.061162
CR_121	40.716008	43.095199
CR_120	32.380372	33.033456
CR_119	36.440711	38.301516

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[11]:

adata

[11]:

AnnData object with n_obs × n_vars = 157 × 80400
    obs: 'ocampoatac1', 'ocampoatac2'
    var: 'percent_na'
    uns: 'ocampoatac1_percent_na', 'ocampoatac1_missing_features', 'ocampoatac1_metadata', 'ocampoatac2_percent_na', 'ocampoatac2_missing_features', 'ocampoatac2_metadata'
    layers: 'X_original'

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[12]:

adata.uns['ocampoatac1_metadata']

[12]:

{'clock_name': 'ocampoatac1',
 'data_type': 'atac',
 'species': 'Homo sapiens',
 'year': 2023,
 'approved_by_author': '⌛',
 'citation': 'Morandini, Francesco, et al. "ATAC-clock: An aging clock based on chromatin accessibility." GeroScience (2023): 1-18.',
 'doi': 'https://doi.org/10.1007/s11357-023-00986-0',
 'notes': None,
 'version': None}