Bulk histone mark ChIP-Seq#
This tutorial is a brief guide for the implementation of the seven histone-mark-specific clocks and the pan-histone-mark clock developed ourselves. Link to preprint.
We just need two packages for this tutorial.
[ ]:
!pip install pybigwig -q
[1]:
import pandas as pd
import pyaging as pya
Download and load example data#
Letβs download an example of H3K4me3 ChIP-Seq bigWig file from the ENCODE project.
[2]:
pya.data.download_example_data('ENCFF386QWG')
|-----> ποΈ Starting download_example_data function
|-----------> Downloading data to pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> π Done! [2075.2977s]
To exemplify that multiple bigWigs can be turned into a df object at once, letβs just repeat the file path.
[3]:
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'])
|-----> ποΈ Starting bigwig_to_df function
|-----> βοΈ Load Ensembl genome metadata started
|-----------> Downloading data to pyaging_data/Ensembl-105-EnsDb-for-Homo-sapiens-genes.csv
|-----------> in progress: 100.0000%
|-----> β
Load Ensembl genome metadata finished [20.4846s]
|-----> βοΈ Processing bigWig files started
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> β
Processing bigWig files finished [12.4227s]
|-----> π Done! [44.8066s]
[4]:
df.index = ['sample1', 'sample2'] # just to avoid an annoying anndata warning that samples have same names
[5]:
df.head()
[5]:
| ENSG00000223972 | ENSG00000227232 | ENSG00000278267 | ENSG00000243485 | ENSG00000284332 | ENSG00000237613 | ENSG00000268020 | ENSG00000240361 | ENSG00000186092 | ENSG00000238009 | ... | ENSG00000237801 | ENSG00000237040 | ENSG00000124333 | ENSG00000228410 | ENSG00000223484 | ENSG00000124334 | ENSG00000270726 | ENSG00000185203 | ENSG00000182484 | ENSG00000227159 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample1 | 0.028616 | 0.030415 | 0.027783 | 0.028616 | 0.028616 | 0.028616 | 0.044171 | 0.036474 | 0.030784 | 0.03181 | ... | 0.034435 | 0.006822 | 1.413119 | 0.029424 | 0.140005 | 0.049786 | 0.069296 | 0.332126 | 0.028596 | 0.028616 |
| sample2 | 0.028616 | 0.030415 | 0.027783 | 0.028616 | 0.028616 | 0.028616 | 0.044171 | 0.036474 | 0.030784 | 0.03181 | ... | 0.034435 | 0.006822 | 1.413119 | 0.029424 | 0.140005 | 0.049786 | 0.069296 | 0.332126 | 0.028596 | 0.028616 |
2 rows Γ 62241 columns
Convert data to AnnData object#
AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.
[6]:
adata = pya.preprocess.df_to_adata(df)
|-----> ποΈ Starting df_to_adata function
|-----> βοΈ Create anndata object started
|-----> β
Create anndata object finished [0.0282s]
|-----> βοΈ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> β οΈ Add metadata to anndata finished [0.0004s]
|-----> βοΈ Log data statistics started
|-----------> There are 2 observations
|-----------> There are 62241 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> β
Log data statistics finished [0.0010s]
|-----> βοΈ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> β
Impute missing values finished [0.0016s]
|-----> π Done! [0.0333s]
Note that the original DataFrame is stored in X_original under layers. This is what the adata object looks like:
[7]:
adata
[7]:
AnnData object with n_obs Γ n_vars = 2 Γ 62241
var: 'percent_na'
layers: 'X_original'
Predict age#
We can either predict one clock at once or all at the same time. For convenience, letβs simply input a few clocks of interest at once. The function is invariant to the capitalization of the clock name.
[8]:
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'])
|-----> ποΈ Starting predict_age function
|-----> βοΈ Set PyTorch device started
|-----------> Using device: cpu
|-----> β
Set PyTorch device finished [0.0031s]
|-----> π Processing clock: camilloh3k4me3
|-----------> βοΈ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k4me3.pt
|-----------------> in progress: 100.0000%
|-----------> β
Load clock finished [5.0816s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k4me3]
|-----------> β
Check features in adata finished [0.0076s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0250s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0013s]
|-----> π Processing clock: camilloh3k9me3
|-----------> βοΈ Load clock started
|-----------------> Downloading data to pyaging_data/camilloh3k9me3.pt
|-----------------> in progress: 100.0000%
|-----------> β
Load clock finished [12.2932s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camilloh3k9me3]
|-----------> β
Check features in adata finished [0.0187s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0030s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0007s]
|-----> π Processing clock: camillopanhistone
|-----------> βοΈ Load clock started
|-----------------> Downloading data to pyaging_data/camillopanhistone.pt
|-----------------> in progress: 100.0000%
|-----------> β
Load clock finished [222.1535s]
|-----------> βοΈ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------------> Added prepared input matrix to adata.obsm[X_camillopanhistone]
|-----------> β
Check features in adata finished [0.0201s]
|-----------> βοΈ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> β
Predict ages with model finished [0.0109s]
|-----------> βοΈ Add predicted ages and clock metadata to adata started
|-----------> β
Add predicted ages and clock metadata to adata finished [0.0008s]
|-----> π Done! [239.8831s]
[9]:
adata.obs.head()
[9]:
| camilloh3k4me3 | camilloh3k9me3 | camillopanhistone | |
|---|---|---|---|
| sample1 | 53.998544 | 44.3229 | 54.021884 |
| sample2 | 53.998544 | 44.3229 | 54.021884 |
Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set verbose to False.
[10]:
pya.data.download_example_data('ENCFF386QWG', verbose=False)
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'], verbose=False)
df.index = ['sample1', 'sample2']
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'], verbose=False)
[11]:
adata.obs.head()
[11]:
| camilloh3k4me3 | camilloh3k9me3 | camillopanhistone | |
|---|---|---|---|
| sample1 | 53.998544 | 44.3229 | 54.021884 |
| sample2 | 53.998544 | 44.3229 | 54.021884 |
After age prediction, the clocks are added to adata.obs. Moreover, the percent of missing values for each clock and other metadata are included in adata.uns.
[12]:
adata
[12]:
AnnData object with n_obs Γ n_vars = 2 Γ 62241
obs: 'camilloh3k4me3', 'camilloh3k9me3', 'camillopanhistone'
var: 'percent_na'
uns: 'camilloh3k4me3_percent_na', 'camilloh3k4me3_missing_features', 'camilloh3k4me3_metadata', 'camilloh3k9me3_percent_na', 'camilloh3k9me3_missing_features', 'camilloh3k9me3_metadata', 'camillopanhistone_percent_na', 'camillopanhistone_missing_features', 'camillopanhistone_metadata'
layers: 'X_original'
Get citation#
The doi, citation, and some metadata are automatically added to the AnnData object under adata.uns[CLOCKNAME_metadata].
[13]:
adata.uns['camilloh3k4me3_metadata']
[13]:
{'clock_name': 'camilloh3k4me3',
'data_type': 'histone mark',
'species': 'Homo sapiens',
'year': 2023,
'approved_by_author': 'β
',
'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
'doi': 'https://doi.org/10.1101/2023.08.21.554165',
'notes': None,
'version': None}