Open In Colab Open In nbviewer

Bulk histone mark ChIP-Seq#

This tutorial is a brief guide for the implementation of the seven histone-mark-specific clocks and the pan-histone-mark clock developed ourselves. Link to preprint.

We just need two packages for this tutorial.

[ ]:
!pip install pybigwig -q
[1]:
import pandas as pd
import pyaging as pya

Download and load example data#

Let’s download an example of H3K4me3 ChIP-Seq bigWig file from the ENCODE project.

[2]:
pya.data.download_example_data('ENCFF386QWG')
|-----> πŸ—οΈ Starting download_example_data function
|-----------> Downloading data to pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> πŸŽ‰ Done! [2075.2977s]

To exemplify that multiple bigWigs can be turned into a df object at once, let’s just repeat the file path.

[3]:
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'])
|-----> πŸ—οΈ Starting bigwig_to_df function
|-----> βš™οΈ Load Ensembl genome metadata started
|-----------> Downloading data to pyaging_data/Ensembl-105-EnsDb-for-Homo-sapiens-genes.csv
|-----------> in progress: 100.0000%
|-----> βœ… Load Ensembl genome metadata finished [20.4846s]
|-----> βš™οΈ Processing bigWig files started
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> βœ… Processing bigWig files finished [12.4227s]
|-----> πŸŽ‰ Done! [44.8066s]
[4]:
df.index = ['sample1', 'sample2'] # just to avoid an annoying anndata warning that samples have same names
[5]:
df.head()
[5]:
ENSG00000223972 ENSG00000227232 ENSG00000278267 ENSG00000243485 ENSG00000284332 ENSG00000237613 ENSG00000268020 ENSG00000240361 ENSG00000186092 ENSG00000238009 ... ENSG00000237801 ENSG00000237040 ENSG00000124333 ENSG00000228410 ENSG00000223484 ENSG00000124334 ENSG00000270726 ENSG00000185203 ENSG00000182484 ENSG00000227159
sample1 0.028616 0.030415 0.027783 0.028616 0.028616 0.028616 0.044171 0.036474 0.030784 0.03181 ... 0.034435 0.006822 1.413119 0.029424 0.140005 0.049786 0.069296 0.332126 0.028596 0.028616
sample2 0.028616 0.030415 0.027783 0.028616 0.028616 0.028616 0.044171 0.036474 0.030784 0.03181 ... 0.034435 0.006822 1.413119 0.029424 0.140005 0.049786 0.069296 0.332126 0.028596 0.028616

2 rows Γ— 62241 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[6]:
adata = pya.preprocess.df_to_adata(df)
|-----> πŸ—οΈ Starting df_to_adata function
|-----> βš™οΈ Create anndata object started
|-----> βœ… Create anndata object finished [0.0282s]
|-----> βš™οΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0004s]
|-----> βš™οΈ Log data statistics started
|-----------> There are 2 observations
|-----------> There are 62241 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> βœ… Log data statistics finished [0.0010s]
|-----> βš™οΈ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> βœ… Impute missing values finished [0.0016s]
|-----> πŸŽ‰ Done! [0.0333s]

Note that the original DataFrame is stored in X_original under layers. This is what the adata object looks like:

[7]:
adata
[7]:
AnnData object with n_obs Γ— n_vars = 2 Γ— 62241
    var: 'percent_na'
    layers: 'X_original'

Predict age#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input a few clocks of interest at once. The function is invariant to the capitalization of the clock name.

[8]:
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'])
|-----> πŸ—οΈ Starting predict_age function
|-----> βš™οΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> βœ… Set PyTorch device finished [0.0031s]
|-----> πŸ•’ Processing clock: camilloh3k4me3
|-----------> βš™οΈ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k4me3.pt
|-----------------> in progress: 100.0000%
|-----------> βœ… Load clock finished [5.0816s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k4me3]
|-----------> βœ… Check features in adata finished [0.0076s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0250s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0013s]
|-----> πŸ•’ Processing clock: camilloh3k9me3
|-----------> βš™οΈ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k9me3.pt
|-----------------> in progress: 100.0000%
|-----------> βœ… Load clock finished [12.2932s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k9me3]
|-----------> βœ… Check features in adata finished [0.0187s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0030s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> πŸ•’ Processing clock: camillopanhistone
|-----------> βš™οΈ Load clock started
|-----------------> Downloading data to pyaging_data/camillopanhistone.pt
|-----------------> in progress: 100.0000%
|-----------> βœ… Load clock finished [222.1535s]
|-----------> βš™οΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camillopanhistone]
|-----------> βœ… Check features in adata finished [0.0201s]
|-----------> βš™οΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> βœ… Predict ages with model finished [0.0109s]
|-----------> βš™οΈ Add predicted ages and clock metadata to adata started
|-----------> βœ… Add predicted ages and clock metadata to adata finished [0.0008s]
|-----> πŸŽ‰ Done! [239.8831s]
[9]:
adata.obs.head()
[9]:
camilloh3k4me3 camilloh3k9me3 camillopanhistone
sample1 53.998544 44.3229 54.021884
sample2 53.998544 44.3229 54.021884

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[10]:
pya.data.download_example_data('ENCFF386QWG', verbose=False)
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'], verbose=False)
df.index = ['sample1', 'sample2']
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'], verbose=False)
[11]:
adata.obs.head()
[11]:
camilloh3k4me3 camilloh3k9me3 camillopanhistone
sample1 53.998544 44.3229 54.021884
sample2 53.998544 44.3229 54.021884

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[12]:
adata
[12]:
AnnData object with n_obs Γ— n_vars = 2 Γ— 62241
    obs: 'camilloh3k4me3', 'camilloh3k9me3', 'camillopanhistone'
    var: 'percent_na'
    uns: 'camilloh3k4me3_percent_na', 'camilloh3k4me3_missing_features', 'camilloh3k4me3_metadata', 'camilloh3k9me3_percent_na', 'camilloh3k9me3_missing_features', 'camilloh3k9me3_metadata', 'camillopanhistone_percent_na', 'camillopanhistone_missing_features', 'camillopanhistone_metadata'
    layers: 'X_original'

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[13]:
adata.uns['camilloh3k4me3_metadata']
[13]:
{'clock_name': 'camilloh3k4me3',
 'data_type': 'histone mark',
 'species': 'Homo sapiens',
 'year': 2023,
 'approved_by_author': 'βœ…',
 'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
 'doi': 'https://doi.org/10.1101/2023.08.21.554165',
 'notes': None,
 'version': None}