Bulk histone mark ChIP-Seq

Bulk histone mark ChIP-Seq#

This tutorial is a brief guide for the implementation of the seven histone-mark-specific clocks and the pan-histone-mark clock developed ourselves. Link to preprint.

We just need two packages for this tutorial.

[ ]:

!pip install pybigwig -q

[1]:

import pandas as pd
import pyaging as pya

Download and load example data#

Let’s download an example of H3K4me3 ChIP-Seq bigWig file from the ENCODE project.

[2]:

pya.data.download_example_data('ENCFF386QWG')

|-----> 🏗️ Starting download_example_data function
|-----------> Downloading data to pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> 🎉 Done! [2075.2977s]

To exemplify that multiple bigWigs can be turned into a df object at once, let’s just repeat the file path.

[3]:

df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'])

|-----> 🏗️ Starting bigwig_to_df function
|-----> ⚙️ Load Ensembl genome metadata started
|-----------> Downloading data to pyaging_data/Ensembl-105-EnsDb-for-Homo-sapiens-genes.csv
|-----------> in progress: 100.0000%
|-----> ✅ Load Ensembl genome metadata finished [20.4846s]
|-----> ⚙️ Processing bigWig files started
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> ✅ Processing bigWig files finished [12.4227s]
|-----> 🎉 Done! [44.8066s]

[4]:

df.index = ['sample1', 'sample2'] # just to avoid an annoying anndata warning that samples have same names

[5]:

df.head()

[5]:

	ENSG00000223972	ENSG00000227232	ENSG00000278267	ENSG00000243485	ENSG00000284332	ENSG00000237613	ENSG00000268020	ENSG00000240361	ENSG00000186092	ENSG00000238009	...	ENSG00000237801	ENSG00000237040	ENSG00000124333	ENSG00000228410	ENSG00000223484	ENSG00000124334	ENSG00000270726	ENSG00000185203	ENSG00000182484	ENSG00000227159
sample1	0.028616	0.030415	0.027783	0.028616	0.028616	0.028616	0.044171	0.036474	0.030784	0.03181	...	0.034435	0.006822	1.413119	0.029424	0.140005	0.049786	0.069296	0.332126	0.028596	0.028616
sample2	0.028616	0.030415	0.027783	0.028616	0.028616	0.028616	0.044171	0.036474	0.030784	0.03181	...	0.034435	0.006822	1.413119	0.029424	0.140005	0.049786	0.069296	0.332126	0.028596	0.028616

2 rows × 62241 columns

Convert data to AnnData object#

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

[6]:

adata = pya.preprocess.df_to_adata(df)

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0282s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0004s]
|-----> ⚙️ Log data statistics started
|-----------> There are 2 observations
|-----------> There are 62241 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0010s]
|-----> ⚙️ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> ✅ Impute missing values finished [0.0016s]
|-----> 🎉 Done! [0.0333s]

Note that the original DataFrame is stored in X_original under layers. This is what the adata object looks like:

[7]:

adata

[7]:

AnnData object with n_obs × n_vars = 2 × 62241
    var: 'percent_na'
    layers: 'X_original'

Predict age#

We can either predict one clock at once or all at the same time. For convenience, let’s simply input a few clocks of interest at once. The function is invariant to the capitalization of the clock name.

[8]:

pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0031s]
|-----> 🕒 Processing clock: camilloh3k4me3
|-----------> ⚙️ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k4me3.pt
|-----------------> in progress: 100.0000%
|-----------> ✅ Load clock finished [5.0816s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k4me3]
|-----------> ✅ Check features in adata finished [0.0076s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0250s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0013s]
|-----> 🕒 Processing clock: camilloh3k9me3
|-----------> ⚙️ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k9me3.pt
|-----------------> in progress: 100.0000%
|-----------> ✅ Load clock finished [12.2932s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k9me3]
|-----------> ✅ Check features in adata finished [0.0187s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0030s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> 🕒 Processing clock: camillopanhistone
|-----------> ⚙️ Load clock started
|-----------------> Downloading data to pyaging_data/camillopanhistone.pt
|-----------------> in progress: 100.0000%
|-----------> ✅ Load clock finished [222.1535s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camillopanhistone]
|-----------> ✅ Check features in adata finished [0.0201s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0109s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata started
|-----------> ✅ Add predicted ages and clock metadata to adata finished [0.0008s]
|-----> 🎉 Done! [239.8831s]

[9]:

adata.obs.head()

[9]:

	camilloh3k4me3	camilloh3k9me3	camillopanhistone
sample1	53.998544	44.3229	54.021884
sample2	53.998544	44.3229	54.021884

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.

[10]:

pya.data.download_example_data('ENCFF386QWG', verbose=False)
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'], verbose=False)
df.index = ['sample1', 'sample2']
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'], verbose=False)

[11]:

adata.obs.head()

[11]:

	camilloh3k4me3	camilloh3k9me3	camillopanhistone
sample1	53.998544	44.3229	54.021884
sample2	53.998544	44.3229	54.021884

After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.

[12]:

adata

[12]:

AnnData object with n_obs × n_vars = 2 × 62241
    obs: 'camilloh3k4me3', 'camilloh3k9me3', 'camillopanhistone'
    var: 'percent_na'
    uns: 'camilloh3k4me3_percent_na', 'camilloh3k4me3_missing_features', 'camilloh3k4me3_metadata', 'camilloh3k9me3_percent_na', 'camilloh3k9me3_missing_features', 'camilloh3k9me3_metadata', 'camillopanhistone_percent_na', 'camillopanhistone_missing_features', 'camillopanhistone_metadata'
    layers: 'X_original'

Get citation#

The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].

[13]:

adata.uns['camilloh3k4me3_metadata']

[13]:

{'clock_name': 'camilloh3k4me3',
 'data_type': 'histone mark',
 'species': 'Homo sapiens',
 'year': 2023,
 'approved_by_author': '✅',
 'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
 'doi': 'https://doi.org/10.1101/2023.08.21.554165',
 'notes': None,
 'version': None}