API Reference

coregtor.create_model_input(raw_ge_data, target_gene, t_factors)

Prepare gene expression data for training.

This function splits the gene expression DataFrame into features (X) and target (Y) for supervised learning.

Parameters:
  • raw_ge_data (pd.DataFrame) – Gene expression data in samples x genes format

  • target_gene (str) – Name of the target gene to predict. Must be present in raw_ge_data columns

  • t_factors (pd.DataFrame) – A DataFrame containing transcription factor gene names. It must have a column named ‘gene_name’ listing the TF genes. This DataFrame is used to filter the input gene expression data.

Returns:

  • X – Feature matrix (samples x genes), excluding the target gene

  • Y – Target vector (samples x 1) containing only the target gene's expression

Return type:

tuple[pd.DataFrame, pd.DataFrame]
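The split described above can be illustrated with plain pandas. This is a minimal sketch, not coregtor's implementation: the helper name split_features_target and the toy data are hypothetical, and it assumes that filtering keeps only the TF genes listed in t_factors['gene_name'] that are present in the expression table.

```python
import pandas as pd


def split_features_target(raw_ge_data, target_gene, t_factors):
    """Hypothetical stand-in showing the X/Y split described above."""
    if target_gene not in raw_ge_data.columns:
        raise ValueError(f"{target_gene!r} is not a column of raw_ge_data")
    # Assumption: keep only TF genes that are measured, and never the target itself.
    tf_names = set(t_factors["gene_name"])
    feature_cols = [
        g for g in raw_ge_data.columns if g in tf_names and g != target_gene
    ]
    X = raw_ge_data[feature_cols]
    Y = raw_ge_data[[target_gene]]  # double brackets keep a samples x 1 DataFrame
    return X, Y


# Toy expression table: 3 samples x 4 genes.
raw_ge_data = pd.DataFrame(
    {
        "TF1": [1.0, 2.0, 3.0],
        "TF2": [0.5, 0.1, 0.9],
        "GENE_A": [4.0, 5.0, 6.0],
        "TARGET": [2.0, 4.1, 6.2],
    },
    index=["s1", "s2", "s3"],
)
t_factors = pd.DataFrame({"gene_name": ["TF1", "TF2", "TF_ABSENT"]})

X, Y = split_features_target(raw_ge_data, "TARGET", t_factors)
```

Here GENE_A is dropped because it is not listed as a transcription factor, and TF_ABSENT is ignored because it has no expression column.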

coregtor.create_model(X, Y, model='rf', model_options={'max_depth': 5, 'n_estimators': 1000})

Train an ensemble regression model to predict the expression of a target gene using transcription factors in the gene expression data as input features.

Two ensemble models are currently supported: sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.ExtraTreesRegressor.

Parameters:
  • X (pd.DataFrame) – Gene expression data (sample by genes) of transcription factors. This can be generated using the create_model_input method

  • Y (pd.DataFrame) – Gene expression data for the target gene. This can be generated using the create_model_input method.

  • model (str) – The type of ensemble-based model. Must be one of the supported sklearn.ensemble models: rf (default) for the random forest regressor, or et for the extra trees regressor

  • model_options (dict, optional) – Dictionary of key-value pairs specifying training options. See the scikit-learn documentation for the available options: RandomForestRegressor or ExtraTreesRegressor

Returns:

Trained sklearn ensemble model
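As an illustration of how the model string and model_options might map onto scikit-learn, here is a hedged sketch. The helper fit_ensemble and the MODEL_CLASSES lookup are hypothetical names, not part of coregtor; only the two scikit-learn regressors and the documented default options come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Assumed mapping from the short names documented above to sklearn classes.
MODEL_CLASSES = {"rf": RandomForestRegressor, "et": ExtraTreesRegressor}


def fit_ensemble(X, Y, model="rf", model_options=None):
    """Hypothetical stand-in: fit the chosen ensemble regressor on X and Y."""
    if model not in MODEL_CLASSES:
        raise ValueError(f"Unknown model {model!r}; expected one of {sorted(MODEL_CLASSES)}")
    if model_options is None:
        # Defaults taken from the documented signature.
        model_options = {"max_depth": 5, "n_estimators": 1000}
    estimator = MODEL_CLASSES[model](**model_options)
    # sklearn regressors expect a 1-D target, so flatten the samples x 1 frame.
    estimator.fit(X, Y.to_numpy().ravel())
    return estimator


# Synthetic data: the target depends on TF1 and TF2 plus a little noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 3)), columns=["TF1", "TF2", "TF3"])
Y = pd.DataFrame(
    {"TARGET": 2.0 * X["TF1"] - X["TF2"] + rng.normal(scale=0.1, size=40)}
)

fitted = fit_ensemble(
    X, Y, model="et", model_options={"n_estimators": 25, "random_state": 0}
)
```

The sketch uses None as the default for model_options and fills in the documented values inside the function, which avoids sharing one mutable dictionary across calls.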

coregtor.tree_paths(model, X, Y)

Extract all root-to-leaf decision paths from a trained ensemble model.

This is the main entry point for path extraction. It processes a trained ensemble model to extract all unique decision paths.

Parameters:
  • model – sklearn ensemble model

  • ge_data (pd.DataFrame) – Gene expression data used to train the model, required to extract gene names. This parameter does not appear in the signature shown for this function, which takes X and Y instead

  • X (pd.DataFrame) – Input of the model. This is required to get gene names

  • Y (pd.DataFrame) – Training output of the model. This is required to get gene name of the target

Returns:

DataFrame containing all decision tree paths with columns:
  • tree: Tree index within the ensemble (0-based)

  • source: First gene in the decision path (root decision)

  • target: Target gene being predicted (constant across all rows)

  • path_length: Number of decision nodes in the path

  • node1, node2, …: Genes used at each decision level

  • Unused node columns are filled with None

Return type:

pd.DataFrame
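To make the output layout concrete, the sketch below walks each fitted tree through scikit-learn's tree_ arrays (children_left, children_right, feature) and builds a table with the columns listed above. It is an illustration only: the helper names are hypothetical, it assumes node1 repeats the root gene stored in source, and it does not deduplicate paths.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def split_genes_per_path(tree, feature_names):
    """Return the split genes along every root-to-leaf path of one fitted tree."""
    t = tree.tree_
    paths, stack = [], [(0, [])]  # (node id, genes seen so far); node 0 is the root
    while stack:
        node, genes = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # sklearn marks leaves with child index -1
            if genes:
                paths.append(genes)
            continue
        genes = genes + [feature_names[t.feature[node]]]
        stack.append((right, genes))
        stack.append((left, genes))
    return paths


def paths_table(model, X, Y):
    """Hypothetical stand-in: one row per root-to-leaf path in the ensemble."""
    feature_names = list(X.columns)
    target = Y.columns[0]
    found = [
        (tree_idx, genes)
        for tree_idx, est in enumerate(model.estimators_)
        for genes in split_genes_per_path(est, feature_names)
    ]
    max_len = max((len(genes) for _, genes in found), default=0)
    rows = []
    for tree_idx, genes in found:
        row = {
            "tree": tree_idx,
            "source": genes[0],
            "target": target,
            "path_length": len(genes),
        }
        # Pad shorter paths with None so every row has the same node columns.
        padded = genes + [None] * (max_len - len(genes))
        row.update({f"node{k}": g for k, g in enumerate(padded, start=1)})
        rows.append(row)
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["TF1", "TF2", "TF3"])
Y = pd.DataFrame({"TARGET": 2.0 * X["TF1"] - X["TF2"]})
model = RandomForestRegressor(n_estimators=3, max_depth=2, random_state=0)
model.fit(X, Y.to_numpy().ravel())

paths = paths_table(model, X, Y)
```

With max_depth=2, no path in this toy forest can contain more than two decision nodes, so the table has at most the columns node1 and node2.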

coregtor.create_context(data, method='tree_paths', **kwargs)

Generate a context for every unique root gene in the trees, using the specified method.

By default, the tree_paths method is used. Given a table of all paths in a random forest, this function builds a dictionary of the sub-paths between each root gene and the target gene at the leaf. Each key is the gene at the root of a path (source); each value is the list of sub-paths that start at that root, with the root and the leaf excluded.

Parameters:
  • data (Union[DataFrame, Any]) – Input data in the format appropriate for the method. For tree_paths: a DataFrame with ‘source’ and ‘node*’ columns

  • method (str) – One of ‘tree_paths’, ‘tree’ (default: ‘tree_paths’)

  • **kwargs – Method-specific arguments

Returns:

{source_gene: [list of subpaths]}

Return type:

dict

Raises:

ValueError – If method is unknown
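The dictionary structure for the tree_paths method can be sketched as follows. The helper context_from_paths and the toy table are hypothetical, and the sketch assumes that node1 holds the root gene (the same value as source), so the root is dropped by skipping the first node.

```python
import pandas as pd


def context_from_paths(paths):
    """Hypothetical stand-in: map each root gene to its list of sub-paths."""
    # Order node columns numerically (node1, node2, ..., node10, ...).
    node_cols = sorted(
        (c for c in paths.columns if c.startswith("node")),
        key=lambda c: int(c[len("node"):]),
    )
    context = {}
    for _, row in paths.iterrows():
        genes = [g for g in row[node_cols] if pd.notna(g)]  # skip None padding
        # Assumption: node1 is the root gene, so the sub-path starts at node2.
        context.setdefault(row["source"], []).append(genes[1:])
    return context


paths = pd.DataFrame(
    [
        {"tree": 0, "source": "TF1", "target": "TARGET", "path_length": 3,
         "node1": "TF1", "node2": "TF2", "node3": "TF3"},
        {"tree": 0, "source": "TF1", "target": "TARGET", "path_length": 2,
         "node1": "TF1", "node2": "TF3", "node3": None},
        {"tree": 1, "source": "TF2", "target": "TARGET", "path_length": 2,
         "node1": "TF2", "node2": "TF1", "node3": None},
    ]
)

context = context_from_paths(paths)
# context == {"TF1": [["TF2", "TF3"], ["TF3"]], "TF2": [["TF1"]]}
```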

coregtor.transform_context(context_set, method='gene_frequency', **kwargs)

Transform context sets into feature representations that are easier to compare.

Parameters:
  • context_set (dict) – Dictionary with structure {source: [[gene1, gene2, …], …]} (output from the create_context method)

  • method (str) – Transformation method to apply. Currently available: “gene_frequency”, which returns a gene frequency histogram

  • **kwargs – Method-specific parameters passed to the transformation function

Returns:

Transformed representation (format depends on method)

Return type:

pd.DataFrame

Raises:

ValueError – If method is unknown
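A gene frequency histogram can be pictured as a sources x genes count table. The sketch below is an assumption about the general idea, not coregtor's code: the helper gene_frequency is a hypothetical name, and it uses raw counts, whereas the library may normalize or weight them differently.

```python
from collections import Counter

import pandas as pd


def gene_frequency(context_set):
    """Hypothetical stand-in: count how often each gene appears per source."""
    counts = {
        source: Counter(gene for subpath in subpaths for gene in subpath)
        for source, subpaths in context_set.items()
    }
    # Rows are sources, columns are genes; genes never seen for a source get 0.
    table = pd.DataFrame.from_dict(counts, orient="index")
    return table.fillna(0).astype(int).sort_index(axis=1)


context_set = {"TF1": [["TF2", "TF3"], ["TF3"]], "TF2": [["TF1"]]}
freq = gene_frequency(context_set)
```

For this toy input, the TF1 row counts TF3 twice and TF2 once, and the TF2 row counts TF1 once.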

coregtor.compare_context(transformed_data, method, transformation_type=None, **kwargs)

Compare contexts using specified similarity/distance metric.

Parameters:
  • transformed_data (DataFrame) – Output from transform_context() - DataFrame with sources as rows

  • method (str) – Similarity/distance metric name (e.g., ‘cosine’)

  • transformation_type (str) – Type of transformation used (for validation). If None, the function attempts to read it from the DataFrame metadata

  • **kwargs – Metric-specific parameters, for example convert_to_distance (bool): convert similarity to distance (1 - similarity)

Returns:

Symmetric pairwise similarity/distance matrix (sources x sources)

Return type:

pd.DataFrame

Raises:

ValueError – If method is unknown or incompatible with transformation type
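For the ‘cosine’ metric, the pairwise matrix and the 1 - similarity conversion can be sketched with NumPy. The helper cosine_matrix is a hypothetical name and the handling of all-zero rows is an assumption; coregtor's own validation and metadata handling are not reproduced here.

```python
import numpy as np
import pandas as pd


def cosine_matrix(transformed_data, convert_to_distance=False):
    """Hypothetical stand-in: pairwise cosine similarity between source rows."""
    values = transformed_data.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: all-zero rows get similarity 0 to others
    unit = values / norms
    similarity = unit @ unit.T
    result = 1.0 - similarity if convert_to_distance else similarity
    return pd.DataFrame(
        result, index=transformed_data.index, columns=transformed_data.index
    )


# Toy frequency table: TF1 and TF3 share the same profile up to scale.
freq = pd.DataFrame(
    [[0, 1, 2], [1, 0, 0], [0, 2, 4]],
    index=["TF1", "TF2", "TF3"],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)

sim = cosine_matrix(freq)
dist = cosine_matrix(freq, convert_to_distance=True)
```

Because cosine similarity ignores vector length, TF1 and TF3 come out as maximally similar even though their counts differ by a factor of two.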

coregtor.identify_coregulators(comparison_matrix, target_gene, method='hierarchical', **kwargs)

Identify putative co-regulatory modules from context similarity matrix of root genes.

Clusters genes with similar contexts to generate hypotheses about genes that may function as co-regulators.

Parameters:
  • comparison_matrix (DataFrame) – Output from compare_context()

  • target_gene (str) – Name of the target gene being regulated

  • method (str) – Clustering method (‘hierarchical’)

  • **kwargs – Method-specific parameters. For method=‘hierarchical’:
      - n_clusters (int): Number of clusters
      - distance_threshold (float): Distance threshold
      - linkage (str): Linkage method (‘average’, ‘complete’, ‘ward’, ‘single’)
      - min_module_size (int): Minimum genes per module

Return type:

Tuple[DataFrame, any]

Returns:

Tuple of (modules_df, model):
  • modules_df: DataFrame with columns [‘target_gene’, ‘gene_cluster’, ‘n_genes’, ‘cluster_id’]

  • model: Fitted clustering model

Raises:
  • ValueError – If method is unknown

  • NotImplementedError – If method is not yet implemented
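The clustering step can be illustrated with SciPy's hierarchical clustering on a precomputed distance matrix. This is a swapped-in technique for illustration, not necessarily the clustering backend coregtor uses: the helper hierarchical_modules is hypothetical, the second return value is SciPy's linkage matrix rather than a fitted model object, and the input is assumed to be a distance matrix (for example, produced with convert_to_distance=True).

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def hierarchical_modules(distance_matrix, target_gene, n_clusters=2,
                         linkage_method="average", min_module_size=2):
    """Hypothetical stand-in: group genes whose contexts are close together."""
    # linkage() expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(distance_matrix.to_numpy(), checks=False)
    tree = linkage(condensed, method=linkage_method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    rows = []
    for cluster_id in sorted(set(labels)):
        genes = [
            g for g, lab in zip(distance_matrix.index, labels) if lab == cluster_id
        ]
        if len(genes) >= min_module_size:  # drop modules that are too small
            rows.append({
                "target_gene": target_gene,
                "gene_cluster": genes,
                "n_genes": len(genes),
                "cluster_id": int(cluster_id),
            })
    columns = ["target_gene", "gene_cluster", "n_genes", "cluster_id"]
    return pd.DataFrame(rows, columns=columns), tree


genes = ["TF_A", "TF_B", "TF_C", "TF_D"]
distance_matrix = pd.DataFrame(
    [
        [0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0],
    ],
    index=genes,
    columns=genes,
)

modules_df, tree = hierarchical_modules(distance_matrix, "TARGET", n_clusters=2)
```

In this toy matrix, TF_A and TF_B are close to each other and far from TF_C and TF_D, so two modules of two genes each are returned.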