API Reference

coregtor.create_model_input(raw_ge_data, target_gene, t_factors=None)

Prepare gene expression data for training.

This function splits the gene expression DataFrame into features (X) and target (Y) for supervised learning.

If a list of t_factors is provided, X is additionally filtered to include only those transcription factors.

Parameters:
  • raw_ge_data (pd.DataFrame) – Gene expression data in samples x genes format. No preprocessing is applied.

  • target_gene (str) – Name of the target gene to predict. Must be present in the columns of raw_ge_data.

  • t_factors (list, optional) – A list of transcription factor gene names. Defaults to None, in which case X is not filtered.

Returns:

  • X: Feature matrix (samples x genes) excluding the target gene

  • Y: Target vector (samples x 1) containing only the target gene's expression

Return type:

tuple[pd.DataFrame, pd.DataFrame]
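The split described above can be approximated with plain pandas. The sketch below is not coregtor's implementation: `split_features_target` is a hypothetical stand-in, and the gene names and values are invented for illustration. It also assumes that t_factors entries missing from the data are skipped silently, which the documentation does not state.

```python
import pandas as pd


def split_features_target(raw_ge_data, target_gene, t_factors=None):
    """Hypothetical stand-in that mimics the documented behaviour."""
    if target_gene not in raw_ge_data.columns:
        raise KeyError(f"{target_gene!r} is not a column of raw_ge_data")
    # Y keeps only the target gene, as a one-column DataFrame (samples x 1).
    Y = raw_ge_data[[target_gene]]
    # X keeps every other gene.
    X = raw_ge_data.drop(columns=[target_gene])
    if t_factors is not None:
        # Assumption: names missing from the data are skipped, not an error.
        wanted = set(t_factors)
        X = X[[gene for gene in X.columns if gene in wanted]]
    return X, Y


# Toy expression table: 3 samples x 4 genes (values are invented).
expr = pd.DataFrame(
    {"TF1": [1.0, 2.0, 3.0], "TF2": [0.5, 0.1, 0.9],
     "GENE_A": [4.0, 5.0, 6.0], "TARGET": [7.0, 8.0, 9.0]},
    index=["s1", "s2", "s3"],
)
X, Y = split_features_target(expr, "TARGET", t_factors=["TF1", "TF2", "TF9"])
print(list(X.columns), list(Y.columns))
```

Keeping Y as a one-column DataFrame rather than a Series matches the documented return type and preserves the target gene's name for later steps.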

coregtor.create_model(X, Y, method='rf', options={'max_depth': 5, 'n_estimators': 1000})

Train an ensemble regression model to predict the expression of the target gene Y using the expression values of other genes in the gene expression data X.

Two ensemble models are currently supported: sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.ExtraTreesRegressor.

Parameters:
  • X (pd.DataFrame) – Gene expression data (samples x genes). This can be generated using the create_model_input() function.

  • Y (pd.DataFrame) – Gene expression data for the target gene. This can be generated using the create_model_input() function.

  • method (str) – The type of ensemble model, taken from the sklearn.ensemble module. Use ‘rf’ (default) for the random forest regressor or ‘et’ for the extra trees regressor.

  • options (dict, optional) – Dictionary of key-value pairs specifying training options. See the scikit-learn documentation for RandomForestRegressor or ExtraTreesRegressor for the available options.

Returns:

Trained sklearn ensemble model
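As an illustration of how the method codes could map onto scikit-learn estimators, here is a minimal sketch. `fit_ensemble` and `ESTIMATORS` are hypothetical names, not part of coregtor, and the sketch assumes that a user-supplied options dict replaces the defaults entirely rather than being merged with them. The data are random and the small n_estimators value is chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Assumed mapping from the documented method codes to scikit-learn classes.
ESTIMATORS = {"rf": RandomForestRegressor, "et": ExtraTreesRegressor}


def fit_ensemble(X, Y, method="rf", options=None):
    """Hypothetical stand-in for the documented training step."""
    if method not in ESTIMATORS:
        raise ValueError(f"unknown method {method!r}; expected 'rf' or 'et'")
    # Fall back to the documented defaults when no options are given.
    if options is None:
        options = {"max_depth": 5, "n_estimators": 1000}
    model = ESTIMATORS[method](**options)
    # scikit-learn expects a 1-D target, so flatten the (samples x 1) frame.
    model.fit(X, Y.to_numpy().ravel())
    return model


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 3)), columns=["TF1", "TF2", "TF3"])
Y = pd.DataFrame({"TARGET": 2.0 * X["TF1"] + rng.normal(scale=0.1, size=40)})

model = fit_ensemble(
    X, Y, method="et",
    options={"max_depth": 3, "n_estimators": 20, "random_state": 0},
)
print(type(model).__name__, len(model.estimators_))
```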

coregtor.tree_paths(model, X, Y)

Extract all root-to-leaf decision paths from a trained ensemble model.

This is the main entry point for path extraction. It processes a trained ensemble model to extract all unique decision paths.

Parameters:
  • model – Trained sklearn ensemble model.

  • X (pd.DataFrame) – Input features of the model. Required to obtain the gene names.

  • Y (pd.DataFrame) – Training output of the model. Required to obtain the name of the target gene.

Returns:

DataFrame containing all decision tree paths with columns:
  • tree: Tree index within the ensemble (0-based)

  • source: First gene in the decision path (root decision)

  • target: Target gene being predicted (constant across all rows)

  • path_length: Number of decision nodes in the path

  • node1, node2, …: Genes used at each decision level (excluding source)

  • Unused node columns are filled with None

Return type:

pd.DataFrame
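To make the idea of root-to-leaf path extraction concrete, the sketch below walks each fitted tree through scikit-learn's `tree_` arrays (`children_left`, `children_right`, `feature`). It is not coregtor's code: `root_to_leaf_paths` is a hypothetical helper, the data are random, the target name is a placeholder, and the sketch neither removes duplicate paths nor builds the node1, node2, … columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def root_to_leaf_paths(fitted_tree, gene_names):
    """Return every root-to-leaf path as a list of gene names (hypothetical helper)."""
    t = fitted_tree.tree_
    paths = []
    stack = [(0, [])]  # (node id, genes seen so far); node 0 is the root
    while stack:
        node, genes = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == right:  # leaves have both children set to -1
            paths.append(genes)
            continue
        genes = genes + [gene_names[t.feature[node]]]
        stack.append((left, genes))
        stack.append((right, genes))
    return paths


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["TF1", "TF2", "TF3"])
y = 2.0 * X["TF1"] - X["TF2"] + rng.normal(scale=0.1, size=60)
forest = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

rows = []
for tree_index, est in enumerate(forest.estimators_):
    for genes in root_to_leaf_paths(est, list(X.columns)):
        if genes:  # skip a degenerate tree that is a single leaf
            rows.append({"tree": tree_index, "source": genes[0],
                         "target": "TARGET", "path_length": len(genes)})
paths_df = pd.DataFrame(rows)
print(paths_df.head())
```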

coregtor.create_context(data, method='tree_paths', **kwargs)

Generate the context for all unique roots in the tree using the specified method.

The default method is tree_paths. Given a table of all paths in a random forest, this function builds a dictionary of the sub paths between each root gene and the target gene at the leaf. Each key is the name of the gene at the root of a path (source), and each value is the list of sub paths from that root to the leaf, excluding both the root and the leaf.

Parameters:
  • data (Union[DataFrame, Any]) – Input data in the format appropriate for the method. For tree_paths: a DataFrame with ‘source’ and ‘node*’ columns.

  • method (str) – One of ‘tree_paths’ or ‘tree’ (default: ‘tree_paths’).

  • **kwargs – Method-specific arguments

Returns:

{source_gene: [list of subpaths]}

Return type:

dict

Raises:

CoRegTorError – If method is unknown
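The shape of the returned dictionary is easiest to see on a small table. The sketch below is one reading of the tree_paths description, not the library's implementation: `subpaths_by_source` is a hypothetical helper, the path table is invented, and the sketch assumes that each row contributes exactly one sub path made of its non-empty node columns.

```python
import pandas as pd


def subpaths_by_source(paths_df):
    """Group the intermediate genes of each path by its root gene (hypothetical helper)."""
    node_cols = [c for c in paths_df.columns if c.startswith("node")]
    # Sort node1, node2, ... numerically so node10 does not precede node2.
    node_cols.sort(key=lambda c: int(c[len("node"):]))
    context = {}
    for _, row in paths_df.iterrows():
        # Drop the None padding used for unused node columns.
        subpath = [row[c] for c in node_cols if pd.notna(row[c])]
        context.setdefault(row["source"], []).append(subpath)
    return context


# Invented path table with the columns documented for tree_paths().
paths_df = pd.DataFrame({
    "tree": [0, 0, 1],
    "source": ["TF1", "TF1", "TF2"],
    "target": ["TARGET"] * 3,
    "path_length": [3, 2, 2],
    "node1": ["TF2", "TF3", "TF1"],
    "node2": ["TF3", None, None],
})
context = subpaths_by_source(paths_df)
print(context)
```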

coregtor.transform_context(context_set, method='gene_frequency', **kwargs)

Transform context sets into feature representations that are easier to compare.

Parameters:
  • context_set (dict) – Dictionary with structure {source: [[gene1, gene2, …], …]} (output of the create_context() function).

  • method (str) – Transformation method to apply. Currently available: “gene_frequency”, which returns a gene frequency histogram.

  • **kwargs – Method-specific parameters passed to the transformation function

Returns:

Transformed representation (format depends on method)

Return type:

pd.DataFrame

Raises:

CoRegTorError – If method is unknown
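A gene frequency histogram can be pictured as a sources x genes count table. The sketch below is an assumption about what such a table might look like, not coregtor's output format: `gene_frequency_table` is a hypothetical helper, and it uses raw counts, whereas the library may normalise them.

```python
from collections import Counter

import pandas as pd


def gene_frequency_table(context_set):
    """Count how often each gene appears in the sub paths of each source (hypothetical helper)."""
    counts = {
        source: Counter(gene for subpath in subpaths for gene in subpath)
        for source, subpaths in context_set.items()
    }
    genes = sorted({gene for counter in counts.values() for gene in counter})
    # Rows are sources, columns are genes; a Counter returns 0 for unseen genes.
    return pd.DataFrame(
        [[counts[source][gene] for gene in genes] for source in counts],
        index=list(counts), columns=genes,
    )


context_set = {"TF1": [["TF2", "TF3"], ["TF3"]], "TF2": [["TF1"]]}
table = gene_frequency_table(context_set)
print(table)
```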

coregtor.compare_context(transformed_data, method, transformation_type='gene_frequency', **kwargs)

Compare contexts using specified similarity/distance metric.

Parameters:
  • transformed_data – Output from transform_context(): a DataFrame with sources as rows.

  • method – Name of the similarity/distance metric (e.g., ‘cosine’).

  • transformation_type – Type of transformation used. If None, the function attempts to read it from the DataFrame metadata.

  • **kwargs – Metric-specific parameters. convert_to_distance (bool): convert similarity to distance (1 - similarity).

Returns:

Symmetric pairwise similarity/distance matrix (sources x sources)

Return type:

pd.DataFrame

Raises:

CoRegTorError – If method is unknown or incompatible with transformation type
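For the ‘cosine’ metric, the relationship between similarity and the convert_to_distance option can be sketched with NumPy. `cosine_matrix` is a hypothetical helper written for illustration, not the library's implementation, and its handling of all-zero rows (similarity 0 to everything) is an assumption.

```python
import numpy as np
import pandas as pd


def cosine_matrix(table, convert_to_distance=False):
    """Pairwise cosine similarity between the rows of a table (hypothetical helper)."""
    values = table.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    # Assumption: all-zero rows get similarity 0 to everything instead of NaN.
    unit = np.divide(values, norms, out=np.zeros_like(values), where=norms > 0)
    sim = unit @ unit.T
    # Distance is defined as 1 - similarity, as in the parameter description.
    result = 1.0 - sim if convert_to_distance else sim
    return pd.DataFrame(result, index=table.index, columns=table.index)


# Invented frequency table: TF1 and TF3 have proportional profiles.
table = pd.DataFrame(
    [[0, 1, 2], [1, 0, 0], [0, 2, 4]],
    index=["TF1", "TF2", "TF3"], columns=["G1", "G2", "G3"],
)
dist = cosine_matrix(table, convert_to_distance=True)
print(dist.round(3))
```

Because cosine similarity ignores vector length, sources whose gene counts differ only by a constant factor end up at distance 0 from each other.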

coregtor.identify_coregulators(distance_matrix, target_gene, method='hierarchical', options={}, note='')

Identify co-regulatory modules from a gene distance matrix.

Parameters:
  • distance_matrix – Distance matrix DataFrame produced by the context comparison step.

  • target_gene – Target gene identifier.

  • method – Clustering method name (default: ‘hierarchical’).

  • options – Method-specific parameters dictionary.

Returns:

Dict containing:
  • model: Fitted clustering model or None

  • clusters_df: DataFrame of all clusters

  • best: Best cluster information dict or None

  • best_df: Best cluster as single-row DataFrame or None

  • methodology: Complete parameter string

  • validation_scores: Validation scores dict (validation_index only)

Return type:

dict
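Hierarchical clustering of a precomputed distance matrix can be sketched with SciPy. This is an illustration only: the documentation does not say which clustering library, linkage method, or cut criterion coregtor uses, so average linkage and a two-cluster cut are assumptions, and the distances are invented.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

genes = ["TF1", "TF2", "TF3", "TF4"]
# Invented symmetric distances: TF1/TF2 are close, and so are TF3/TF4.
distance_matrix = pd.DataFrame(
    [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.85, 0.9],
     [0.9, 0.85, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]],
    index=genes, columns=genes,
)

# linkage() expects the condensed (upper-triangle) form of a distance matrix.
condensed = squareform(distance_matrix.to_numpy())
dendrogram = linkage(condensed, method="average")
# Cut the dendrogram into at most two flat clusters.
labels = fcluster(dendrogram, t=2, criterion="maxclust")

assignments = pd.DataFrame({"gene": genes, "cluster": labels})
print(assignments)
```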