Preprocess

Please note that most functions are helper functions and are not meant to be used directly.

pyaging.preprocess._preprocess

pyaging.preprocess._preprocess.bigwig_to_df(bw_files, dir='pyaging_data', verbose=True)

Convert bigWig files to a DataFrame, extracting signal data for genomic regions.

This function processes a list of bigWig files, extracting signal data (such as chromatin accessibility or histone modification levels) for each gene based on genomic annotations from Ensembl. It computes the mean signal over the genomic region of each gene, applies an arcsinh transformation for normalization, and organizes the data into a DataFrame format.

Parameters:

bw_files Union[str, List[str]]: A bigWig file path or a list of bigWig file paths. A single string is converted to a one-element list.
dir str (default: 'pyaging_data'): Retained for backward compatibility. Files downloaded from Hugging Face go to the standard Hugging Face cache.
verbose bool (default: True): Whether to log the output to console with the logger.

Return type:

DataFrame

Returns:

pd.DataFrame A DataFrame where each row represents a bigWig file and each column corresponds to a gene. The values in the DataFrame are the transformed signal data for each gene in each bigWig file.

Raises:

ImportError – If pyBigWig is not installed and the function is called.

Notes

The function uses Ensembl gene annotations and only processes genes on the standard chromosomes (1-22, X). Non-standard chromosomes and regions outside annotated genes are not processed. The signal is normalized with the arcsinh function.

pyBigWig is an optional dependency. To use this function, install pyaging with the ‘bigwig’ extra: pip install pyaging[bigwig]

Examples

>>> bigwig_files = ["sample1.bw", "sample2.bw"]
>>> signals_df = bigwig_to_df(bigwig_files)
# This returns a DataFrame where rows are bigWig files and columns are genes, with signal values.
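
The per-gene computation described above (mean signal over a gene's region, then an arcsinh transform) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not pyaging's implementation: the signal array, the gene coordinates, and the helper name mean_arcsinh_signal are all hypothetical, and reading real bigWig files with pyBigWig is deliberately left out.

```python
import numpy as np

# Hypothetical per-base signal along one chromosome (10 positions).
signal = np.array([0.0, 0.0, 2.0, 4.0, 6.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Hypothetical gene coordinates as half-open [start, end) intervals.
genes = {"geneA": (2, 5), "geneB": (6, 10)}


def mean_arcsinh_signal(signal, start, end):
    """Mean signal over [start, end), followed by an arcsinh transform."""
    return float(np.arcsinh(signal[start:end].mean()))


# One value per gene, i.e. one row of the resulting DataFrame.
row = {gene: mean_arcsinh_signal(signal, s, e) for gene, (s, e) in genes.items()}
```

The arcsinh transform grows like a logarithm for large values but is defined at zero, so regions with no signal simply map to 0.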

pyaging.preprocess._preprocess.df_to_adata(df, metadata_cols=[], imputer_strategy='knn', verbose=True)

Converts a pandas DataFrame to an AnnData object.

This function transforms a DataFrame containing biological data (such as gene expression levels or methylation data) into an AnnData object. Along the way it handles missing values and logs data statistics. It is intended for preparing datasets for downstream analyses in bioinformatics and computational biology.

Parameters:

df DataFrame: The DataFrame containing biological data. Rows represent samples, and columns represent features.
metadata_cols List[str] (default: []): A list with the name of the columns in ‘df’ which are part of the metadata. They will be added to adata.obs rather than adata.X.
imputer_strategy str (default: 'knn'): The strategy for imputing missing values in ‘df’. Supported strategies are ‘mean’, ‘median’, ‘constant’ (fills with 0), and ‘knn’.
verbose bool (default: True): Whether to log the output to console with the logger.

Return type:

AnnData

Returns:

anndata.AnnData The AnnData object containing the processed data, metadata, and additional annotations.

Raises:

TypeError – If the input ‘df’ is not a pandas DataFrame.

Notes

The AnnData object produced by this function can be passed to downstream computational biology analyses, such as differential expression analysis, clustering, or trajectory inference. The metadata and annotations stored in the object stay attached to the data throughout those analyses.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(5, 3), columns=["gene1", "gene2", "gene3"])
>>> adata = df_to_adata(df)
# This returns an AnnData object with the imputed data from 'df'.
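
The role of metadata_cols can be illustrated with pandas alone. The sketch below only shows the split between metadata and features. It does not build an AnnData object, and the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical input: two feature columns and one metadata column.
df = pd.DataFrame(
    {
        "gene1": [0.1, 0.5, np.nan],
        "gene2": [1.0, 2.0, 3.0],
        "tissue": ["blood", "liver", "blood"],
    },
    index=["s1", "s2", "s3"],
)
metadata_cols = ["tissue"]

# Metadata columns are set aside (they would end up in adata.obs) ...
metadata = df[metadata_cols]
# ... and the remaining columns form the feature matrix (adata.X).
features = df.drop(columns=metadata_cols)
```

In this example the single missing value in gene1 stays in the feature matrix, where the chosen imputer strategy would later fill it.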

pyaging.preprocess._preprocess.epicv2_probe_aggregation(df, verbose=True)

Aggregates probes targeting the same CpG site in a DataFrame from the Illumina Methylation EPIC array v2.

Probes targeting the same CpG site are identified by their shared prefix (e.g., “cgXXXXXXX”), and their values are averaged to create a single feature for each unique CpG site. This reduces the dimensionality of the data by consolidating multiple probes for the same CpG site into a single value.

Parameters:

df DataFrame: The input DataFrame containing probe data. Each column represents a probe, and the column names are expected to follow the format “cgXXXXXXX_YYYY”.
verbose bool (default: True): Whether to log the output to console with the logger. Defaults to True.

Returns:

pandas.DataFrame: A new DataFrame with averaged values for each unique CpG site. The columns of this DataFrame correspond to unique CpG sites, and the column names are the CpG site identifiers (e.g., “cgXXXXXXX”).
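
Averaging probes that share a CpG prefix can be sketched with a pandas groupby. This is an illustrative sketch, not necessarily how pyaging implements it: the helper name aggregate_probes and the probe names and values are hypothetical, and every column is assumed to follow the “cgXXXXXXX_YYYY” format.

```python
import pandas as pd

# Hypothetical EPIC v2 data: two probes for one CpG site, one probe for another.
df = pd.DataFrame(
    {
        "cg00000029_TC21": [0.2, 0.4],
        "cg00000029_BC21": [0.4, 0.6],
        "cg00000109_TC21": [0.9, 0.8],
    },
    index=["sample1", "sample2"],
)


def aggregate_probes(df):
    """Average all columns that share the prefix before the first underscore."""
    # Transpose so probes become rows, group rows by CpG prefix, average,
    # then transpose back so samples are rows again.
    return df.T.groupby(lambda probe: probe.split("_")[0]).mean().T


aggregated = aggregate_probes(df)
```

The result has one column per unique CpG site, so the three probe columns above collapse into two.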

pyaging.preprocess._preprocess_utils

pyaging.preprocess._preprocess_utils.add_metadata_to_anndata(adata, metadata, logger, indent_level=1)

Adds metadata to an AnnData object’s observation (obs) attribute.

This function enriches an AnnData object by integrating metadata. The metadata, provided as a pandas DataFrame, is aligned with the observation names in the AnnData object, ensuring consistency and completeness of data annotations. This process is crucial for downstream analyses where metadata (e.g., sample conditions, phenotypes) is key for interpretation.

Parameters:

adata AnnData: The AnnData object to which metadata will be added. The obs attribute of this object will be modified.
metadata Optional[DataFrame]: A pandas DataFrame containing the metadata. Each row corresponds to an observation, and columns represent different metadata fields.
logger Logger: A logging object for documenting the process and any observations.
indent_level int (default: 1): The level of indentation for the logger, with 1 being the default.

Return type:

None

Notes

The metadata DataFrame’s index should match the observation names in the AnnData object for proper alignment. This function will reindex the metadata to match the AnnData obs_names, ensuring that each sample in the AnnData object is associated with its corresponding metadata.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from anndata import AnnData
>>> adata = AnnData(np.random.rand(5, 3))
>>> # Name the observations so that they match the metadata index.
>>> adata.obs_names = [f"Sample_{i}" for i in range(5)]
>>> metadata = pd.DataFrame(
...     {"Condition": ["A", "B", "A", "B", "A"]},
...     index=[f"Sample_{i}" for i in range(5)],
... )
>>> add_metadata_to_anndata(adata, metadata, logger)
# Adds the 'Condition' metadata to the AnnData object.
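
The alignment step described in the Notes can be sketched with pandas reindexing. This is a simplified illustration with hypothetical sample names. It shows only how rows are matched to observation names, not how pyaging writes them into adata.obs.

```python
import pandas as pd

# Hypothetical observation names, in the order stored in the AnnData object.
obs_names = ["Sample_0", "Sample_1", "Sample_2"]

# Metadata in a different order, with one sample missing.
metadata = pd.DataFrame(
    {"Condition": ["B", "A"]},
    index=["Sample_2", "Sample_0"],
)

# Reindexing reorders rows to follow obs_names and inserts NaN
# for observations that have no metadata row.
aligned = metadata.reindex(obs_names)
```

Because reindexing silently fills unmatched observations with NaN, a metadata index that does not match the observation names yields empty metadata rather than an error.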

pyaging.preprocess._preprocess_utils.add_unstructured_data(adata, imputer_strategy, logger, indent_level=1)

Adds unstructured data, such as imputer strategy and data type, to an AnnData object.

This function is designed to annotate an AnnData object with additional unstructured information, enhancing data transparency and traceability. Key information, like the imputation strategy used and the type of biological data represented, is stored in the unstructured (uns) attribute of the AnnData object. This enrichment is vital for ensuring clarity and reproducibility in bioinformatics analyses.

Parameters:

adata AnnData: The AnnData object to which the unstructured data will be added.
imputer_strategy str: The strategy used for imputing missing values in the dataset, which will be recorded in the AnnData object for reference.
logger Logger: A logging object for documenting the process and any important observations.
indent_level int (default: 1): The level of indentation for the logger, with 1 being the default.

Return type:

None

Notes

This function updates the ‘uns’ attribute of the AnnData object with the ‘imputer_strategy’ key.

Example

>>> import numpy as np
>>> from anndata import AnnData
>>> adata = AnnData(np.random.rand(5, 3))
>>> add_unstructured_data(adata, "mean", logger)
# Records the imputer strategy 'mean' under the 'imputer_strategy' key of adata.uns.

pyaging.preprocess._preprocess_utils.create_anndata_object(df, logger, indent_level=1)

Creates an AnnData object from a pandas DataFrame.

This function constructs an AnnData object, a central data structure for storing and manipulating high-dimensional biological data such as single-cell genomics data. It takes a pandas DataFrame and returns an AnnData object suitable for downstream analyses in bioinformatics pipelines.

Parameters:

df DataFrame: A pandas DataFrame with sample names as the index and the feature names as columns.
logger Logger: A logging object for documenting the process and any relevant observations.
indent_level int (default: 1): The level of indentation for the logger, with 1 being the default.

Return type:

AnnData

Returns:

anndata.AnnData An AnnData object populated with the data, observation names, and variable names.

Notes

AnnData objects are widely used in computational biology for storing large, annotated datasets. Their structured format ensures easy access and manipulation of data for various analytical purposes.

This function is essential for converting raw or processed data into a format readily usable with tools and libraries that support AnnData objects, facilitating a seamless integration into existing bioinformatics workflows.

Example

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 5))
>>> ann_data = create_anndata_object(data, logger)
# Creates an AnnData object with 100 observations and 5 variables.

pyaging.preprocess._preprocess_utils.impute_missing_values(adata, strategy, logger, indent_level=1)

Imputes missing values in a given adata object using a specified strategy.

This function handles missing data by applying one of several imputation strategies. It checks the .X in the adata object for missing values and applies the chosen imputation method, which can be mean, median, constant, or K-nearest neighbors (KNN). The function is useful in preprocessing steps for datasets where missing data could affect subsequent analyses. It also adds the number of missing values for each sample and each feature.

Parameters:

adata AnnData: An adata object containing .X with potential missing values.
strategy str: The imputation strategy to apply. Valid options are ‘mean’, ‘median’, ‘constant’, and ‘knn’.
logger Logger: A logging object for tracking the progress and outcomes of the function.
indent_level int (default: 1): The level of indentation for the logger, with 1 being the default.

Raises:

ValueError – If an invalid imputation strategy is specified.

Return type:

None

Notes

The ‘constant’ strategy fills missing values with 0 by default. The ‘knn’ strategy uses the K-nearest neighbors algorithm to estimate missing values based on similar samples. This function is particularly useful in datasets where missing values are common, such as in biological or medical data.

The function ensures that no imputation is performed if there are no missing values in the dataset, thus preserving the original data integrity.

Examples

>>> impute_missing_values(adata, "mean", logger)
# Imputes missing values in adata.X using the mean of each column. The function returns None.
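
The column-wise strategies can be sketched with NumPy. This is a simplified stand-in, not pyaging's code: the helper name impute is hypothetical, it works on a plain array rather than an AnnData object, and the ‘knn’ strategy is intentionally left out.

```python
import numpy as np


def impute(X, strategy):
    """Fill NaNs column-wise; return the data unchanged if nothing is missing."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    if not missing.any():
        return X  # no imputation needed
    if strategy == "mean":
        fill = np.nanmean(X, axis=0)
    elif strategy == "median":
        fill = np.nanmedian(X, axis=0)
    elif strategy == "constant":
        fill = np.zeros(X.shape[1])
    else:
        # 'knn' is not implemented in this sketch, so it also ends up here.
        raise ValueError(f"Invalid imputer strategy: {strategy!r}")
    # Broadcast the per-column fill values into the missing positions.
    return np.where(missing, fill, X)


X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
X_mean = impute(X, "mean")
X_zero = impute(X, "constant")
```

Each fill value is computed from the observed entries of its own column, so a feature with many missing values does not influence the others.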

pyaging.preprocess._preprocess_utils.load_ensembl_metadata(dir, logger, indent_level=1)

Load and filter Ensembl genome metadata specific to Homo sapiens.

This function downloads the Ensembl gene metadata for Homo sapiens from the public pyaging Hugging Face data repository and filters it to include only genes on specified chromosomes.

Parameters:

dir str: Retained for backward compatibility. Files downloaded from Hugging Face go to the standard Hugging Face cache.
logger Logger: A logging object for recording the progress and status of the download and filtering process.
indent_level int (default: 1): The indentation level for logging messages. It helps to organize the log output when this function is part of larger workflows.

Return type:

DataFrame

Returns:

pd.DataFrame A DataFrame containing filtered gene metadata from Ensembl. Rows correspond to genes, indexed by their Ensembl gene IDs, and columns include various gene attributes.

Notes

The function currently filters genes based on a predefined set of chromosomes (1-22, X). If different chromosomes or additional filtering criteria are needed, modifications to the function will be required.

Examples

>>> logger = LoggerManager.gen_logger("ensembl_metadata")
>>> ensembl_genes = load_ensembl_metadata("pyaging_data", logger)
# This returns a DataFrame with Ensembl gene metadata for Homo sapiens filtered by specified chromosomes.
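
The chromosome filter can be sketched with pandas. The table below is a hypothetical stand-in for the downloaded metadata: the column names (chr, start, end) and the gene IDs are assumptions for illustration, not the actual schema of the Ensembl file.

```python
import pandas as pd

# Hypothetical gene metadata indexed by (made-up) gene IDs.
genes = pd.DataFrame(
    {
        "chr": ["1", "X", "Y", "MT", "22"],
        "start": [100, 200, 300, 400, 500],
        "end": [150, 260, 390, 450, 580],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
)

# Standard chromosomes as described in the Notes: 1-22 and X.
standard_chromosomes = [str(i) for i in range(1, 23)] + ["X"]

# Keep only the genes located on those chromosomes.
filtered = genes[genes["chr"].isin(standard_chromosomes)]
```

Genes on Y and on the mitochondrial chromosome are dropped in this example, which mirrors the (1-22, X) restriction stated above.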

pyaging.preprocess._preprocess_utils.log_data_statistics(X, logger, indent_level=1)

Logs various statistical properties of a given dataset.

This function provides a quick summary of key statistics for a numpy array. It calculates and logs the number of observations (rows), features (columns), total missing values, and the percentage of missing values in the dataset. This function is particularly useful for initial data exploration and quality assessment in data analysis workflows.

Parameters:

X ndarray: A numpy array containing the dataset to be analyzed.
logger Logger: A logging object for documenting the statistics and observations.
indent_level int (default: 1): The level of indentation for the logger, with 1 being the default.

Return type:

None

Notes

Understanding the basic statistics of a dataset is crucial in data preprocessing and analysis. This function highlights potential issues with data, like high levels of missing values, which could impact subsequent analyses.

The function is designed to work seamlessly with datasets of varying sizes and complexities. The statistical summary provided helps in making informed decisions about further steps in data processing, such as imputation or feature selection.

Example

>>> import numpy as np
>>> data = np.random.rand(100, 5)
>>> log_data_statistics(data, logger)
# Logs number of observations, features, and details about missing values.
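
The statistics listed above are straightforward to compute with NumPy. The sketch below only computes them and does not go through a logger. The array and variable names are illustrative.

```python
import numpy as np

# Small example array: 2 observations, 3 features, 2 missing entries.
X = np.array([[1.0, np.nan, 3.0], [4.0, 5.0, np.nan]])

n_obs, n_features = X.shape
n_missing = int(np.isnan(X).sum())
pct_missing = 100 * n_missing / X.size

summary = (
    f"{n_obs} observations, {n_features} features, "
    f"{n_missing} missing values ({pct_missing:.2f}%)"
)
```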