API Reference
- coregtor.create_model_input(raw_ge_data, target_gene, t_factors=None)
Prepare gene expression data for training.
This function splits the gene expression DataFrame into features (X) and target (Y) for supervised learning.
Additionally, if a list of t_factors is provided, X is filtered to include only those transcription factors.
- Parameters:
raw_ge_data (pd.DataFrame) – Gene expression data in samples x genes format. No processing is done.
target_gene (str) – Name of the target gene to predict. Must be present in the raw_ge_data columns.
t_factors (list, optional) – A list of transcription factor gene names. Defaults to None.
- Returns:
X - Feature matrix (samples x genes) excluding target gene; Y - Target vector (samples x 1) containing only target gene expression
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
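The split described above can be pictured with a small stand-in. This is a minimal sketch, not the library's code: it uses a plain dict of column lists in place of a pd.DataFrame, and the helper name `split_features_target` and all gene names and values are hypothetical.

```python
def split_features_target(table, target_gene, t_factors=None):
    """Split a column-oriented table into features X and target Y.

    `table` maps gene name -> list of per-sample expression values
    (a hypothetical stand-in for a samples x genes DataFrame).
    """
    if target_gene not in table:
        raise KeyError(f"{target_gene!r} is not a column of the input table")

    # Y holds only the target gene's column.
    Y = {target_gene: list(table[target_gene])}

    # X holds every other gene ...
    X = {gene: list(vals) for gene, vals in table.items() if gene != target_gene}

    # ... optionally restricted to the supplied transcription factors.
    if t_factors is not None:
        allowed = set(t_factors)
        X = {gene: vals for gene, vals in X.items() if gene in allowed}
    return X, Y


# Toy data: 3 samples, 4 genes (made-up values).
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [0.5, 0.1, 0.9],
    "GENE_C": [4.0, 4.2, 3.9],
    "TARGET": [2.0, 2.5, 3.1],
}
X, Y = split_features_target(expression, "TARGET", t_factors=["GENE_A", "GENE_C"])
```

With the toy table, X keeps only GENE_A and GENE_C, and Y keeps only the TARGET column.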
- coregtor.create_model(X, Y, method='rf', options={'max_depth': 5, 'n_estimators': 1000})
Train an ensemble regression model to predict the expression of the target gene Y using the expression values of other genes in the gene expression data X.
Currently two ensemble models are supported: sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.ExtraTreesRegressor.
- Parameters:
X (pd.DataFrame) – Gene expression data (samples by genes). This can be generated using the create_model_input() method.
Y (pd.DataFrame) – Gene expression data for the target gene. This can be generated using the create_model_input() method.
method (str) – The type of ensemble-based model. This must be a valid model in the sklearn.ensemble module. Use rf (default) for the random forest regressor or et for the extra trees regressor.
options (dict, optional) – Dictionary of key-value pairs specifying training options. See the scikit-learn model docs for the available options: RandomForestRegressor or ExtraTreesRegressor.
- Returns:
Trained sklearn ensemble model
- coregtor.tree_paths(model, X, Y)
Extract all root-to-leaf decision paths from a trained ensemble model.
This is the main entry point for path extraction. It processes a trained ensemble model to extract all unique decision paths.
- Parameters:
model – sklearn ensemble model
X (pd.DataFrame) – Input of the model. This is required to get gene names
Y (pd.DataFrame) – Training output of the model. This is required to get gene name of the target
- Returns:
- DataFrame containing all decision tree paths with columns:
tree: Tree index within the ensemble (0-based)
source: First gene in the decision path (root decision)
target: Target gene being predicted (constant across all rows)
path_length: Number of decision nodes in the path
node1, node2, …: Genes used at each decision level (excluding source)
Unused node columns are filled with None
- Return type:
pd.DataFrame
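To make the column layout above concrete, the following sketch enumerates root-to-leaf paths on a toy tree and arranges them into rows with the same column names. It is an illustration only and does not use a fitted sklearn model. The nested-dict tree format and the helpers `root_to_leaf_paths` and `paths_to_rows` are hypothetical. It also assumes that path_length counts the source gene and that repeated gene sequences within one tree are kept once.

```python
def root_to_leaf_paths(node, prefix=()):
    """Yield the genes tested along each root-to-leaf path of a toy tree.

    A decision node is {"gene": name, "left": child, "right": child};
    a leaf is represented by None (hypothetical format).
    """
    if node is None:  # reached a leaf: emit the genes seen so far
        yield list(prefix)
        return
    here = prefix + (node["gene"],)
    yield from root_to_leaf_paths(node.get("left"), here)
    yield from root_to_leaf_paths(node.get("right"), here)


def paths_to_rows(trees, target):
    """Arrange paths as rows: tree, source, target, path_length, node1, node2, ..."""
    rows = []
    for tree_index, root in enumerate(trees):
        unique = []
        for path in root_to_leaf_paths(root):
            if path and path not in unique:  # assumption: keep each sequence once
                unique.append(path)
        for path in unique:
            rows.append({
                "tree": tree_index,
                "source": path[0],         # gene at the root decision
                "target": target,          # same for every row
                "path_length": len(path),  # assumption: source is counted
                "nodes": path[1:],         # genes below the root
            })
    # Pad every row to the same number of node columns with None.
    width = max((len(r["nodes"]) for r in rows), default=0)
    for r in rows:
        nodes = r.pop("nodes")
        for i in range(width):
            r[f"node{i + 1}"] = nodes[i] if i < len(nodes) else None
    return rows


toy_tree = {
    "gene": "G1", "left": None,
    "right": {"gene": "G2", "left": None,
              "right": {"gene": "G3", "left": None, "right": None}},
}
rows = paths_to_rows([toy_tree], target="TARGET")
```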
- coregtor.create_context(data, method='tree_paths', **kwargs)
Generates context for all unique roots in the tree using the specified method.
By default, tree_paths is used. Given a table of all paths in a random forest, this function builds a dictionary of the sub-paths between each root gene and the target gene at the leaf. Each key is the gene at the root of a path (source); each value is the list of sub-paths from that root to the leaf, excluding the root and the leaf.
- Parameters:
data (Union[DataFrame, Any]) – Input data in the format appropriate for the method. For tree_paths: a DataFrame with ‘source’ and ‘node*’ columns.
method (str) – One of ‘tree_paths’, ‘tree’ (default: ‘tree_paths’).
**kwargs – Method-specific arguments.
- Returns:
{source_gene: [list of subpaths]}
- Return type:
dict
- Raises:
CoRegTorError – If method is unknown
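The tree_paths behaviour described above (group sub-paths by root gene, dropping the root and the leaf) can be sketched as follows. This is a simplified illustration rather than the library's implementation: rows are plain dicts instead of DataFrame rows, and the helper name `context_from_paths` is hypothetical.

```python
from collections import defaultdict


def context_from_paths(rows):
    """Group each row's node genes under its source gene.

    Every row needs a "source" key and "node1", "node2", ... keys,
    with None marking unused node columns.
    """
    context = defaultdict(list)
    for row in rows:
        # Order node columns numerically so node10 sorts after node2.
        node_cols = sorted(
            (key for key in row if key.startswith("node")),
            key=lambda key: int(key[4:]),
        )
        # The sub-path is the genes below the root, skipping padding.
        subpath = [row[key] for key in node_cols if row[key] is not None]
        context[row["source"]].append(subpath)
    return dict(context)


path_rows = [
    {"source": "G1", "node1": "G2", "node2": "G3"},
    {"source": "G1", "node1": "G4", "node2": None},
    {"source": "G5", "node1": "G2", "node2": None},
]
context = context_from_paths(path_rows)
```

Here `context` maps G1 to two sub-paths and G5 to one, matching the {source_gene: [list of subpaths]} shape documented above.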
- coregtor.transform_context(context_set, method='gene_frequency', **kwargs)
Transform context sets into feature representations easier for comparison.
- Parameters:
context_set (dict) – Dictionary with structure {source: [[gene1, gene2, …], …]} (output of the create_context() method).
method (str) – Transformation method to apply. Currently available: “gene_frequency”, which returns a gene frequency histogram.
**kwargs – Method-specific parameters passed to the transformation function.
- Returns:
Transformed representation (format depends on method)
- Return type:
pd.DataFrame
- Raises:
CoRegTorError – If method is unknown
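As a rough picture of what a gene frequency histogram looks like, the sketch below counts how often each gene appears in the sub-paths of every source. It is an assumption-laden illustration: the helper name `gene_frequency_table` is hypothetical, raw counts are used (the library may normalize them), and a dict of dicts stands in for the returned DataFrame.

```python
from collections import Counter


def gene_frequency_table(context_set):
    """Return {source: {gene: count}} with one column per gene seen anywhere."""
    counts = {
        source: Counter(gene for subpath in subpaths for gene in subpath)
        for source, subpaths in context_set.items()
    }
    # Use a shared, sorted gene vocabulary so every row has the same columns.
    genes = sorted({gene for counter in counts.values() for gene in counter})
    return {
        source: {gene: counter.get(gene, 0) for gene in genes}
        for source, counter in counts.items()
    }


toy_context = {"G1": [["G2", "G3"], ["G2"]], "G5": [["G3"]]}
freq = gene_frequency_table(toy_context)
```

Each source becomes one row over the same gene columns, which is what makes rows directly comparable in a later step.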
- coregtor.compare_context(transformed_data, method, transformation_type='gene_frequency', **kwargs)
Compare contexts using specified similarity/distance metric.
- Parameters:
transformed_data – Output from transform_context() - DataFrame with sources as rows
method – Similarity/distance metric name (e.g., ‘cosine’)
transformation_type – Type of transformation used. If None, attempts to read it from the DataFrame metadata.
**kwargs – Metric-specific parameters - convert_to_distance (bool): Convert similarity to distance (1 - similarity)
- Returns:
Symmetric pairwise similarity/distance matrix (sources x sources)
- Return type:
pd.DataFrame
- Raises:
CoRegTorError – If method is unknown or incompatible with transformation type
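For the ‘cosine’ metric and the convert_to_distance option mentioned above, a pairwise comparison might look like the sketch below. It is not the library's code: the helper name `cosine_matrix` is hypothetical, the input is a dict of dicts rather than a DataFrame, and treating an all-zero row as having similarity 0 is an assumption.

```python
import math


def cosine_matrix(table, convert_to_distance=False):
    """Pairwise cosine similarity (or 1 - similarity) between rows of `table`.

    `table` is {source: {gene: value}} with identical gene keys in every row.
    """
    sources = list(table)
    genes = list(table[sources[0]]) if sources else []

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in genes)
        norm_a = math.sqrt(sum(a[g] ** 2 for g in genes))
        norm_b = math.sqrt(sum(b[g] ** 2 for g in genes))
        if norm_a == 0 or norm_b == 0:
            return 0.0  # assumption: an empty profile is similar to nothing
        return dot / (norm_a * norm_b)

    result = {}
    for s in sources:
        result[s] = {}
        for t in sources:
            similarity = cosine(table[s], table[t])
            result[s][t] = 1.0 - similarity if convert_to_distance else similarity
    return result


toy_freq = {"G1": {"G2": 2, "G3": 1}, "G5": {"G2": 0, "G3": 1}}
similarity = cosine_matrix(toy_freq)
distance = cosine_matrix(toy_freq, convert_to_distance=True)
```

Both outputs are sources x sources and symmetric. In the distance form the diagonal is zero up to floating-point rounding.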
- coregtor.identify_coregulators(distance_matrix, target_gene, method='hierarchical', options={}, note='')
Identify co-regulatory modules from gene distance matrix.
- Parameters:
distance_matrix – Distance matrix DataFrame from context comparison.
target_gene – Target gene identifier.
method – Clustering method name (default: ‘hierarchical’).
options – Method-specific parameters dictionary.
- Returns:
Dict containing:
model: Fitted clustering model or None
clusters_df: DataFrame of all clusters
best: Best cluster information dict or None
best_df: Best cluster as single-row DataFrame or None
methodology: Complete parameter string
validation_scores: Validation scores dict (validation_index only)
- Return type:
dict