Indices#

PPI Based#

Protein-Protein Interaction (PPI) related indices.

The core methods accept a NetworkX graph as the PPI input, which enables the use of multiple PPI network sources. Three sources are currently supported:

  • HIPPIE PPI (Alanis-Lobato et al. [BALANS17])

  • StringDB PPI (Szklarczyk et al. [BSKK+22])

  • BioGRID Database (Oughtred et al. [BORC+21])

Three wrapper functions are provided that accept a shared dataset cache. The cache is described in the load_datasets() documentation.

tfitpy.indices.ppi.ppi_all_scores(sources, datasets=None, pairs=None, **kwargs)[source]#

Compute all 6 PPI scores in a single pass per database.

For each of the three PPI databases (hippie, stringdb, biogrid), makes one pass over all TF pairs to compute both the shortest-path proximity score and the shared-partners hypergeometric score simultaneously.

This is equivalent to calling shortest_path_score and shared_partners separately for each database, but with half the graph traversals.

Parameters:
  • sources (list) – List of source TF identifiers in the regulatory module.

  • datasets (dict) – Dataset cache dict containing ‘hippie’, ‘stringdb’, ‘biogrid’ NetworkX graphs. Must be provided.

  • pairs (list) – Optional precomputed list of (tf1, tf2) tuples. If None, generated from sources via generate_tf_pairs().

Returns:

A dict with 6 keys:

  • shortest_PPI_path_score_hippie

  • shortest_PPI_path_score_stringdb

  • shortest_PPI_path_score_biogrid

  • shared_PPI_partners_score_hippie

  • shared_PPI_partners_score_stringdb

  • shared_PPI_partners_score_biogrid

Return type:

dict

Raises:

ValueError – If datasets is None or any required db key is missing.
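The cache check and the naming of the six result keys can be sketched as follows. This is a minimal illustration, not the library's code: the helper names `check_ppi_datasets` and `expected_score_keys` are hypothetical, and only the database keys and score names are taken from the description above.

```python
# Hypothetical helpers; only the key names come from the documentation above.
REQUIRED_DBS = ("hippie", "stringdb", "biogrid")


def check_ppi_datasets(datasets):
    """Raise ValueError if the cache is None or lacks a required PPI key."""
    if datasets is None:
        raise ValueError("datasets must be provided")
    missing = [db for db in REQUIRED_DBS if db not in datasets]
    if missing:
        raise ValueError(f"missing dataset keys: {missing}")


def expected_score_keys():
    """Return the six score names, grouped by score type."""
    keys = [f"shortest_PPI_path_score_{db}" for db in REQUIRED_DBS]
    keys += [f"shared_PPI_partners_score_{db}" for db in REQUIRED_DBS]
    return keys
```

A caller could compare the keys of the returned dict against `expected_score_keys()` to confirm that all three databases were scored.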

tfitpy.indices.ppi.shared_partners(sources, ppi_network=None, pairs=None)[source]#

Computes the shared PPI partners score for a TF regulatory module.

For each TF pair derived from sources, computes the hypergeometric shared-partners score and aggregates all pairwise scores into a single module-level index by taking the mean. Based on Lai et al. [BLCHW14].

Parameters:
  • sources (list) – List of source TF identifiers in the regulatory module.

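  • ppi_network (Graph) – An undirected NetworkX graph representing the PPI network. Must be provided.

  • pairs (list) – Optional precomputed list of (tf1, tf2) tuples. If None, all unique pairs are generated from sources via generate_tf_pairs().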

Returns:

A tuple (final_score, pairs_df) where:

  • final_score (float): The mean hypergeometric score across all valid TF pairs. Returns 0.0 if no valid scores exist.

  • pairs_df (DataFrame): A DataFrame with one row per TF pair, containing columns: tf1, tf2, score, p_value, common_partners.

Return type:

tuple

Raises:

ValueError – If ppi_network is None.
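The module-level aggregation amounts to enumerating TF pairs and averaging their scores. The sketch below assumes that generate_tf_pairs() yields all unique unordered pairs (modelled here with `itertools.combinations`) and that an invalid pairwise score is represented as `None`; both are assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def all_unique_pairs(sources):
    """Stand-in for pair generation: every unordered pair, in input order."""
    return list(combinations(sources, 2))


def aggregate_mean(pair_scores):
    """Mean of the valid scores; 0.0 when no valid score exists.

    Assumption: an invalid pairwise score is represented as None.
    """
    valid = [s for s in pair_scores if s is not None]
    return sum(valid) / len(valid) if valid else 0.0
```

For three TFs this yields three pairs, so the module score is the mean of at most three pairwise values.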

tfitpy.indices.ppi.shared_partners_biogrid(sources, datasets=None, pairs=None, **kwargs)[source]#

PPI Shared Partner score using the BioGRID PPI network.

tfitpy.indices.ppi.shared_partners_hippie(sources, datasets=None, pairs=None, **kwargs)[source]#

PPI Shared Partner score using the HIPPIE PPI network.

tfitpy.indices.ppi.shared_partners_pairwise(tf1, tf2, ppi_graph, background_size)[source]#

Computes the hypergeometric shared-partners score for a single TF pair.

For a given pair of transcription factors, retrieves their respective PPI partner sets, computes the overlap, and returns a significance score S = -log10(P) where P is the upper-tail hypergeometric p-value. Based on Lai et al. [BLCHW14].

Parameters:
  • tf1 (str) – identifier of TF1.

  • tf2 (str) – identifier of TF2.

  • ppi_graph (Graph) – An undirected NetworkX graph representing the PPI network.

  • background_size (int) – Total number of proteins used as the population size for the hypergeometric test. Typically the number of nodes in the PPI graph.

Returns:

A tuple (S, p, c) where:

  • S (float): The significance score -log10(P). Returns 0.0 if either TF has no partners or if there is no overlap. Returns inf if P rounds to zero.

  • p (float): The p-value.

  • c (int): The number of common partners.

Return type:

tuple
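The pairwise score can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the library's implementation: the network is a plain dict mapping each protein to its set of partners instead of a NetworkX graph, the function names are hypothetical, and returning p = 1.0 in the no-overlap case is an assumption (the description above only fixes S = 0.0 there).

```python
import math


def hypergeom_upper_tail(c, n1, n2, population):
    """P(X >= c) when drawing n2 items from a population holding n1 successes."""
    total = math.comb(population, n2)
    upper = min(n1, n2)
    tail = sum(
        math.comb(n1, k) * math.comb(population - n1, n2 - k)
        for k in range(c, upper + 1)
    )
    return tail / total


def shared_partners_sketch(tf1, tf2, adjacency, background_size):
    """Return (S, p, c) for one TF pair from a dict-of-sets network."""
    partners1 = adjacency.get(tf1, set())
    partners2 = adjacency.get(tf2, set())
    c = len(partners1 & partners2)
    if not partners1 or not partners2 or c == 0:
        return 0.0, 1.0, c  # p = 1.0 here is an assumption
    p = hypergeom_upper_tail(c, len(partners1), len(partners2), background_size)
    s = math.inf if p == 0.0 else -math.log10(p)
    return s, p, c
```

With two partners each, one of them shared, and a background of 10 proteins, the tail probability is 17/45, giving S of roughly 0.42.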

tfitpy.indices.ppi.shared_partners_stringdb(sources, datasets=None, pairs=None, **kwargs)[source]#

PPI Shared Partner score using the STRING PPI network.

tfitpy.indices.ppi.shortest_path_pairwise(tf1, tf2, ppi_graph)[source]#

Compute proximity score from shortest path length.

Return type:

Tuple[float, float]
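The path length underlying the proximity score can be found with a breadth-first search on an unweighted, undirected network. The sketch below uses a dict-of-sets adjacency instead of a NetworkX graph and a hypothetical function name. It stops at the path length, because the mapping from length to proximity score is not specified in this section.

```python
from collections import deque


def bfs_path_length(adjacency, start, goal):
    """Number of edges on a shortest path, or None if no path exists."""
    if start not in adjacency or goal not in adjacency:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

Returning `None` for disconnected or unknown proteins keeps "no path" distinct from a path of length zero.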

tfitpy.indices.ppi.shortest_path_score(sources, ppi_network=None, pairs=None)[source]#

Computes the aggregate shortest-path score for a TF regulatory module.

For each TF pair derived from sources, computes the shortest-path score and aggregates all pairwise scores into a single module-level index by taking the mean. Based on Lai et al. [BLCHW14].

Parameters:
  • sources (list) – List of source TF identifiers in the regulatory module.

  • ppi_network (Graph) – An undirected NetworkX graph representing the PPI network. Must be provided.

  • pairs (list) – Optional precomputed list of (tf1, tf2) tuples. If None, all unique pairs are generated from sources via generate_tf_pairs().

Returns:

A tuple (final_score, pairs_df) where:

  • final_score (float): The mean proximity score across all TF pairs. Returns 0.0 if no valid scores exist.

  • pairs_df (DataFrame): A DataFrame with one row per TF pair, containing columns: tf1, tf2, score, path_length.

Return type:

tuple

Raises:

ValueError – If ppi_network is None.

Gene Ontology Based#

Gene Ontology (GO) functional similarity indices.

The core methods accept a GODag and gene2go dict as inputs, enabling reuse across any organism or annotation source. One source is currently supported:

Three semantic similarity methods are implemented using Best-Match Average (BMA):

  • Lin similarity

  • Resnik similarity

  • Jiang-Conrath similarity
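At the term level, all three measures are built from information content (IC) and the most informative common ancestor (MICA). The sketch below shows their textbook forms on toy inputs. It is not the library's code: the `ancestors` and `ic` dicts are hypothetical inputs (each term's ancestor set includes the term itself). Jiang-Conrath is given as a distance only, since conventions for turning it into a similarity vary and the one used here is not stated.

```python
def mica_ic(t1, t2, ancestors, ic):
    """IC of the most informative common ancestor; 0.0 if there is none."""
    common = ancestors[t1] & ancestors[t2]
    return max((ic[t] for t in common), default=0.0)


def resnik(t1, t2, ancestors, ic):
    """Resnik similarity: the IC of the MICA."""
    return mica_ic(t1, t2, ancestors, ic)


def lin(t1, t2, ancestors, ic):
    """Lin similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denom = ic[t1] + ic[t2]
    return 2.0 * mica_ic(t1, t2, ancestors, ic) / denom if denom > 0 else 0.0


def jiang_conrath_distance(t1, t2, ancestors, ic):
    """Jiang-Conrath distance: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic[t1] + ic[t2] - 2.0 * mica_ic(t1, t2, ancestors, ic)
```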

tfitpy.indices.go.go_all_scores(sources, datasets=None, pairs=None, **kwargs)[source]#

Compute all 3 GO similarity scores in a single pass over pairs.

For each TF pair, fetches GO term lists once and computes lin, resnik, and jc similarity in sequence — avoiding 3 separate pair loops and repeated gene2go lookups. Uses a row-level terms cache so each gene’s GO terms are fetched only once regardless of how many pairs it appears in.

TermCounts is built once per call rather than once per method.

Parameters:
  • sources (list) – List of gene identifiers in the regulatory module.

  • datasets (dict) – Dataset cache dict containing ‘go’ with keys: ‘godag’, ‘gene2go’. Must be provided.

  • pairs (list) – Optional precomputed list of (g1, g2) tuples. If None, generated from sources via generate_tf_pairs().

Returns:

A dict with 3 keys:

  • goa_similarity_lin

  • goa_similarity_resnik

  • goa_similarity_jc

Return type:

dict

Raises:

ValueError – If datasets is None or ‘go’ key is missing.

tfitpy.indices.go.similarity_score(sources, method, godag=None, gene2go=None, termcounts=None, pairs=None)[source]#

Compute GO semantic similarity for a gene module and aggregate by mean.

For each pair derived from sources, computes BMA similarity using the specified method and aggregates into a single module-level score.

Parameters:
  • sources (list) – List of gene identifiers in the regulatory module.

  • method (str) – Similarity method — one of ‘lin’, ‘resnik’, ‘jc’.

  • godag (GODag) – Loaded GODag object. Must be provided.

  • gene2go (dict) – Mapping of gene symbol → set of GO term IDs. Must be provided.

  • termcounts (TermCounts) – Pre-computed TermCounts. Computed from gene2go if None.

  • pairs (list) – Optional precomputed list of (g1, g2) tuples. If None, all unique pairs are generated from sources via generate_tf_pairs().

Returns:

A tuple (final_score, pairs_df) where:

  • final_score (float): Mean similarity across all pairs. 0.0 if none.

  • pairs_df (DataFrame): One row per pair with columns: tf1, tf2, score, n_terms_tf1, n_terms_tf2.

Return type:

tuple

Raises:

ValueError – If godag or gene2go is None, or method is invalid.

tfitpy.indices.go.similarity_score_pairwise(gene1, gene2, method, godag, gene2go, termcounts)[source]#

Compute GO semantic similarity for a single gene pair.

Parameters:
  • gene1 (str) – First gene identifier (HGNC symbol).

  • gene2 (str) – Second gene identifier (HGNC symbol).

  • method (str) – Similarity method — one of ‘lin’, ‘resnik’, ‘jc’.

  • godag (GODag) – Loaded GODag object.

  • gene2go (dict) – Mapping of gene symbol → set of GO term IDs.

  • termcounts (TermCounts) – Pre-computed TermCounts object for IC calculation.

Return type:

float

Returns:

BMA similarity score as a float in [0, 1]. 0.0 if either gene has no annotations or no valid term-level scores exist.

Raises:

ValueError – If method is not one of ‘lin’, ‘resnik’, ‘jc’.
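Best-Match Average combines term-level similarities into a gene-level score. The sketch below uses one common definition: the average of the mean row-wise and mean column-wise best matches. Whether the library uses this exact variant is an assumption, and `term_sim` is a hypothetical callable standing in for the chosen Lin, Resnik, or Jiang-Conrath measure.

```python
def best_match_average(terms1, terms2, term_sim):
    """BMA of two term collections; 0.0 if either collection is empty."""
    if not terms1 or not terms2:
        return 0.0
    # Best match in terms2 for every term of gene 1, and vice versa.
    row_best = [max(term_sim(a, b) for b in terms2) for a in terms1]
    col_best = [max(term_sim(a, b) for a in terms1) for b in terms2]
    row_mean = sum(row_best) / len(row_best)
    col_mean = sum(col_best) / len(col_best)
    return 0.5 * (row_mean + col_mean)
```

Averaging both directions keeps the score symmetric in the two genes even when they carry different numbers of annotations.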

GRN Dataset Based#

tfitpy.indices.grn.grn_set_metrics(source, target, grn_data)[source]#

Calculate set-based metrics treating predictions as a single set.

Parameters:
  • source (List[str]) – Predicted regulator genes.

  • target (str) – Target gene name (for validation/reference).

  • grn_data (DataFrame) – Ground truth regulators for this target. Must contain columns: [‘regulator’, ‘target’, ‘score’].

Returns:

A dictionary containing:

  • precision: Precision score (TP / (TP + FP)).

  • recall: Recall score (TP / (TP + FN)).

  • jaccard: Jaccard index (intersection over union).

Return type:

dict
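The three metrics reduce to set arithmetic on predicted and ground-truth regulators. The sketch below works on plain lists rather than the grn_data DataFrame. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption.

```python
def set_metrics(predicted, truth):
    """Precision, recall and Jaccard index of predicted vs. true regulators."""
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)   # predicted and true
    fp = len(pred - true)   # predicted but not true
    fn = len(true - pred)   # true but not predicted
    union = len(pred | true)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "jaccard": tp / union if union else 0.0,
    }
```

For predictions A, B, C against true regulators B, C, D, E, two are correct, so precision is 2/3, recall is 1/2, and the Jaccard index is 2/5.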