API Reference
- coregtor.create_model_input(raw_ge_data, target_gene, t_factors)
Prepare gene expression data for training.
This function splits the gene expression DataFrame into features (X) and target (Y) for supervised learning.
- Parameters:
raw_ge_data (pd.DataFrame) – Gene expression data in samples x genes format
target_gene (str) – Name of the target gene to predict. Must be present in raw_ge_data columns
t_factors (pd.DataFrame) – A DataFrame containing transcription factor gene names. It must have a column named ‘gene_name’ listing the TF genes. This DataFrame is used to filter the input gene expression data.
- Returns:
X - Feature matrix (samples x genes) excluding target gene; Y - Target vector (samples x 1) containing only target gene expression
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
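The split described above can be sketched with plain pandas. This is a minimal illustration of the documented behavior, not the library's implementation: the helper name `split_features_target` and the toy gene names are hypothetical, and it assumes the target gene is dropped from X even when it appears in the transcription factor list.

```python
import pandas as pd

def split_features_target(raw_ge_data, target_gene, t_factors):
    """Hypothetical stand-in illustrating the documented X/Y split."""
    if target_gene not in raw_ge_data.columns:
        raise ValueError(f"{target_gene!r} is not a column of raw_ge_data")
    tf_names = set(t_factors["gene_name"])
    # Keep TF columns present in the data, excluding the target itself (assumption).
    feature_cols = [g for g in raw_ge_data.columns
                    if g in tf_names and g != target_gene]
    X = raw_ge_data[feature_cols]
    Y = raw_ge_data[[target_gene]]  # double brackets keep Y a one-column DataFrame
    return X, Y

# Toy data in samples x genes format; gene names are made up.
raw = pd.DataFrame({
    "TF1": [1.0, 2.0, 3.0],
    "TF2": [0.5, 0.1, 0.9],
    "GENE_A": [4.0, 5.0, 6.0],
    "GENE_B": [7.0, 8.0, 9.0],
})
tfs = pd.DataFrame({"gene_name": ["TF1", "TF2", "TF3"]})
X, Y = split_features_target(raw, "GENE_A", tfs)
```

Here `GENE_B` is dropped because it is not a listed transcription factor, and `TF3` is ignored because it has no column in the expression data.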
- coregtor.create_model(X, Y, model='rf', model_options={'max_depth': 5, 'n_estimators': 1000})
Train an ensemble regression model to predict the expression of a target gene using transcription factors in the gene expression data as input features.
Two ensemble models are currently supported: sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.ExtraTreesRegressor.
- Parameters:
X (pd.DataFrame) – Gene expression data (samples x genes) for the transcription factors. This can be generated using the create_model_input method.
Y (pd.DataFrame) – Gene expression data for the target gene. This can be generated using the create_model_input method.
model (str) – The type of ensemble model. Use rf (default) for sklearn.ensemble.RandomForestRegressor or et for sklearn.ensemble.ExtraTreesRegressor.
model_options (dict, optional) – Dictionary of key-value pairs specifying training options. See the scikit-learn documentation for RandomForestRegressor or ExtraTreesRegressor for the available options.
- Returns:
Trained sklearn ensemble model
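The model selection described above can be sketched directly against scikit-learn. This is an assumption-laden illustration, not the library's code: `build_ensemble` is a hypothetical helper, merging user options over the documented defaults is a choice made for this sketch, and `random_state` and the small `n_estimators` are only there to keep the toy run fast and repeatable.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Short names taken from the documentation; the helper itself is hypothetical.
MODEL_CLASSES = {"rf": RandomForestRegressor, "et": ExtraTreesRegressor}

def build_ensemble(X, y, model="rf", model_options=None):
    if model not in MODEL_CLASSES:
        raise ValueError(f"Unknown model {model!r}; expected one of {sorted(MODEL_CLASSES)}")
    options = {"max_depth": 5, "n_estimators": 1000}  # documented defaults
    options.update(model_options or {})  # assumption: user options override defaults
    estimator = MODEL_CLASSES[model](**options)
    return estimator.fit(X, y)

# Toy data: four samples, two transcription factors, one target value per sample.
X = [[0.1, 1.0], [0.4, 0.8], [0.9, 0.2], [0.7, 0.5]]
y = [1.0, 1.5, 3.0, 2.4]
fitted = build_ensemble(X, y, model="et",
                        model_options={"n_estimators": 10, "random_state": 0})
```

scikit-learn regressors expect a one-dimensional target, so a one-column DataFrame such as Y would typically be flattened (for example with `Y.values.ravel()`) before fitting.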
- coregtor.tree_paths(model, X, Y)
Extract all root-to-leaf decision paths from a trained ensemble model.
This is the main entry point for path extraction. It processes a trained ensemble model to extract all unique decision paths.
- Parameters:
model – sklearn ensemble model
ge_data (pd.DataFrame) – Gene expression data used to train the model. This is required to extract gene names.
X (pd.DataFrame) – Input of the model. This is required to get gene names
Y (pd.DataFrame) – Training output of the model. This is required to get gene name of the target
- Returns:
- DataFrame containing all decision tree paths with columns:
tree: Tree index within the ensemble (0-based)
source: First gene in the decision path (root decision)
target: Target gene being predicted (constant across all rows)
path_length: Number of decision nodes in the path
node1, node2, …: Genes used at each decision level
Unused node columns are filled with None
- Return type:
pd.DataFrame
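Root-to-leaf path extraction for a single tree can be sketched in plain Python. The arrays below mirror the layout of scikit-learn's `tree_.children_left`, `tree_.children_right`, and `tree_.feature` (where -1 marks a missing child and -2 an undefined feature at a leaf). The function name and toy tree are hypothetical, and deduplication of paths is left out of this sketch.

```python
def root_to_leaf_paths(children_left, children_right, feature, feature_names):
    """Return each root-to-leaf path as a list of gene names, one per split node."""
    paths = []
    stack = [(0, [])]  # (node id, genes seen so far); node 0 is the root
    while stack:
        node, genes = stack.pop()
        if children_left[node] == -1:  # leaf: no children
            paths.append(genes)
            continue
        genes = genes + [feature_names[feature[node]]]
        # Push the right child first so the left branch is visited first.
        stack.append((children_right[node], genes))
        stack.append((children_left[node], genes))
    return paths

# Toy tree: node 0 splits on TF1; its left child (node 1) splits on TF2.
children_left = [1, 3, -1, -1, -1]
children_right = [2, 4, -1, -1, -1]
feature = [0, 1, -2, -2, -2]
names = ["TF1", "TF2"]

paths = root_to_leaf_paths(children_left, children_right, feature, names)
# (source, path_length) for each path, matching the documented columns.
summary = [(p[0], len(p)) for p in paths]
```

In a fitted ensemble, the same traversal would be repeated for every tree in `model.estimators_`, with the tree index recorded alongside each path.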
- coregtor.create_context(data, method='tree_paths', **kwargs)
Generate a context for every unique root gene in the tree paths using the specified method.
The default method is tree_paths. Given a table of all paths in a random forest, this function builds a dictionary of the sub-paths between each root gene and the target gene at the leaf. Each key is the gene at the root of a path (source), and each value is the list of sub-paths from that root to the leaf, excluding the root and the leaf themselves.
- Parameters:
data (Union[DataFrame, Any]) – Input data in the format appropriate for the method. For tree_paths: a DataFrame with ‘source’ and ‘node*’ columns.
method (str) – One of ‘tree_paths’, ‘tree’ (default: ‘tree_paths’)
**kwargs – Method-specific arguments
- Returns:
{source_gene: [list of subpaths]}
- Return type:
dict
- Raises:
ValueError – If method is unknown
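The {source_gene: [list of subpaths]} structure can be illustrated with a small pure-Python sketch. It assumes, for illustration only, that the ‘node*’ columns hold the genes that follow the root on each path; `context_from_paths` and the toy rows are hypothetical and stand in for the DataFrame input.

```python
from collections import defaultdict

def context_from_paths(rows):
    """Group the genes that follow each root gene, keyed by the root (source)."""
    context = defaultdict(list)
    for row in rows:
        # Order node columns numerically: node1, node2, ..., node10, ...
        node_keys = sorted((k for k in row if k.startswith("node")),
                           key=lambda k: int(k[4:]))
        # Unused node columns are None and are skipped.
        subpath = [row[k] for k in node_keys if row[k] is not None]
        context[row["source"]].append(subpath)
    return dict(context)

# Each dict stands in for one row of the path table.
rows = [
    {"source": "TF1", "node1": "TF2", "node2": "TF3"},
    {"source": "TF1", "node1": "TF4", "node2": None},
    {"source": "TF2", "node1": "TF1", "node2": None},
]
context = context_from_paths(rows)
```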
- coregtor.transform_context(context_set, method='gene_frequency', **kwargs)
Transform context sets into feature representations that are easier to compare.
- Parameters:
context_set (dict) – Dictionary with structure {source: [[gene1, gene2, …], …]} (output from the create_context method)
method (str) – Transformation method to apply. Currently available: “gene_frequency”, which returns a gene frequency histogram
**kwargs – Method-specific parameters passed to the transformation function
- Returns:
Transformed representation (format depends on method)
- Return type:
pd.DataFrame
- Raises:
ValueError – If method is unknown
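A gene frequency histogram over a context set can be sketched as follows. This is a guess at the general idea rather than the library's transformation: the sketch uses raw occurrence counts, whereas the actual method may normalize or weight them, and the function name and toy context are hypothetical.

```python
from collections import Counter

def gene_frequency(context_set):
    """Count how often each gene appears in the sub-paths of every source."""
    counts = {src: Counter(g for sub in subs for g in sub)
              for src, subs in context_set.items()}
    # A shared, sorted gene vocabulary gives every source a vector of equal length.
    genes = sorted({g for c in counts.values() for g in c})
    table = {src: [c[g] for g in genes] for src, c in counts.items()}
    return genes, table

context = {
    "TF1": [["TF2", "TF3"], ["TF3"]],
    "TF4": [["TF3"], ["TF2"]],
}
genes, table = gene_frequency(context)
```

Each entry of `table` corresponds to one row of a sources x genes frequency matrix, with columns ordered as in `genes`.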
- coregtor.compare_context(transformed_data, method, transformation_type=None, **kwargs)
Compare contexts using specified similarity/distance metric.
- Parameters:
transformed_data (DataFrame) – Output from transform_context(): a DataFrame with sources as rows
method (str) – Similarity/distance metric name (e.g., ‘cosine’)
transformation_type (str) – Type of transformation used (for validation). If None, attempts to read from DataFrame metadata.
**kwargs – Metric-specific parameters: convert_to_distance (bool): Convert similarity to distance (1 - similarity)
- Returns:
Symmetric pairwise similarity/distance matrix (sources x sources)
- Return type:
pd.DataFrame
- Raises:
ValueError – If method is unknown or incompatible with transformation type
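The pairwise comparison can be illustrated with cosine similarity in plain Python. This sketch only shows the metric and the documented 1 - similarity conversion. `cosine_matrix` and the toy vectors are hypothetical, and treating a zero vector as similarity 0 is a choice made here, not documented behavior.

```python
import math

def cosine_matrix(table, convert_to_distance=False):
    """Symmetric sources x sources matrix of cosine similarity (or 1 - similarity)."""
    result = {}
    for a, va in table.items():
        result[a] = {}
        for b, vb in table.items():
            dot = sum(x * y for x, y in zip(va, vb))
            norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
            sim = dot / norm if norm else 0.0  # assumption: zero vectors get similarity 0
            result[a][b] = 1.0 - sim if convert_to_distance else sim
    return result

# Rows of a toy frequency table: one vector per source gene.
table = {"TF1": [1, 2], "TF4": [1, 1], "TF5": [2, 4]}
similarity = cosine_matrix(table)
distance = cosine_matrix(table, convert_to_distance=True)
```

TF1 and TF5 point in the same direction, so their similarity is 1 up to floating-point rounding, even though their raw counts differ.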
- coregtor.identify_coregulators(comparison_matrix, target_gene, method='hierarchical', **kwargs)
Identify putative co-regulatory modules from context similarity matrix of root genes.
Clusters genes with similar contexts to generate hypotheses about genes that may function as co-regulators.
- Parameters:
comparison_matrix (DataFrame) – Output from compare_context()
target_gene (str) – Name of the target gene being regulated
method (str) – Clustering method (‘hierarchical’)
**kwargs – Method-specific parameters. For method=’hierarchical’:
n_clusters (int): Number of clusters
distance_threshold (float): Distance threshold
linkage (str): Linkage method (‘average’, ‘complete’, ‘ward’, ‘single’)
min_module_size (int): Minimum genes per module
- Returns:
Tuple of (modules_df, model)
modules_df: DataFrame with columns [‘target_gene’, ‘gene_cluster’, ‘n_genes’, ‘cluster_id’]
model: Fitted clustering model
- Return type:
Tuple[DataFrame, any]
- Raises:
ValueError – If method is unknown
NotImplementedError – If method is not yet implemented
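The idea behind the hierarchical option can be sketched with a small average-linkage routine in plain Python. This is a teaching sketch, not the fitted clustering model the function returns. `average_linkage_modules` is a hypothetical name, and it assumes that clusters stop merging once the closest pair is at or above the distance threshold and that modules smaller than `min_module_size` are discarded.

```python
def average_linkage_modules(dist, labels, distance_threshold, min_module_size=2):
    """Merge the closest clusters until none is closer than the threshold."""
    clusters = [[i] for i in range(len(labels))]

    def cluster_distance(a, b):
        # Average linkage: mean pairwise distance between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, ia, ib = min(
            (cluster_distance(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if d >= distance_threshold:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]

    modules = [[labels[i] for i in c] for c in clusters]
    return [m for m in modules if len(m) >= min_module_size]

# Toy symmetric distance matrix over four root genes.
labels = ["TF1", "TF2", "TF3", "TF4"]
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
modules = average_linkage_modules(dist, labels, distance_threshold=0.5)
```

With this toy matrix, TF1/TF2 and TF3/TF4 form two candidate modules, and the two groups stay separate because their average distance of 0.9 exceeds the threshold.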