SINr Core

Graph Embeddings

exception sinr.graph_embeddings.DimensionFilteredException

Bases: Exception

Exception raised when trying to access a dimension removed by filtering.

class sinr.graph_embeddings.InterpretableDimension(idx, type)

Bases: object

Internal class: should be used to encapsulate data about a dimension instead of a plain dict.

add_interpreter(obj, value)

Adding an element that would help to interpret the meaning of the dimension

Parameters:
  • obj (str) – a descriptor or a stereotype of the dimension

  • value (float) – a value describing the relevance of the descriptor for this dimension

get_dict()

The dict that can be processed with the interpreters

Returns:

a dict of interpreters for the dimension

Return type:

dict

get_idx()

Getter of the idx attribute

Returns:

the id of the dimension

Return type:

int

get_interpreter(id)

Get a specific interpreter

Parameters:

id (int) – id of the interpreter

Returns:

the interpreter of id for this dimension

Return type:

an interpreter as a tuple (obj: str, value: float) if there is a value

get_interpreters()

Get the list of interpreters, the objects that describe the dimension

Returns:

the list of interpreters

Return type:

list

get_value()

Getter for the value parameter, which is a boolean to detect if numerical values are used in the interpreters or not

Returns:

the value attribute

Return type:

bool

sort(on_value=True)

Sort the interpreters: by value if on_value is True, otherwise by the string descriptor of each interpreter

Parameters:

on_value (bool, optional) – sorting on values or not, defaults to True

topk(topk)

Selecting only the topk interpreters

Parameters:

topk (int) – number of interpreters to keep

with_value()

Setting the value to True

Returns:

the self object

Return type:

InterpretableDimension
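
As a rough illustration of the data this class encapsulates, the sketch below keeps interpreters as (obj, value) tuples, the form documented for get_interpreter, and mimics sort and topk with plain Python functions. It is a stand-in and not the library's code: the descending order on values, the helper names and the sample words are all assumptions.

```python
# Interpreters modelled as (obj, value) tuples, as documented for get_interpreter.
interpreters = [("apple", 0.42), ("pear", 0.77), ("fig", 0.10)]

def sort_interpreters(items, on_value=True):
    """Sort by value (descending, an assumption) or alphabetically by descriptor."""
    if on_value:
        return sorted(items, key=lambda pair: pair[1], reverse=True)
    return sorted(items, key=lambda pair: pair[0])

def keep_topk(items, topk):
    """Keep only the first topk interpreters of an already sorted list."""
    return items[:topk]

by_value = sort_interpreters(interpreters)
by_name = sort_interpreters(interpreters, on_value=False)
top2 = keep_topk(by_value, 2)
```

Sorting before truncating is what makes the kept interpreters the most relevant ones under the chosen order.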

class sinr.graph_embeddings.InterpretableWordsModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)

Bases: ModelBuilder

Object that should be used after training word or graph embeddings using the SINr object to get interpretable word vectors. The InterpretableWordsModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. No need to use parent methods starting by “with”, those are included in the build function. Just provide the name of the model and build it.

build()

Build InterpretableWordsModelBuilder which contains the vocabulary, the embeddings and the communities.
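
The workflow described for this builder could look like the sketch below, which only uses calls documented in this reference (SINr.load_from_cooc_pkl, run, and the builder constructor and build). The wrapper function and its argument names are hypothetical, and it assumes build() returns the SINrVectors object, as documented for the parent ModelBuilder.

```python
def build_interpretable_model(cooc_matrix_path, name):
    """Train SINr embeddings from a pickled co-occurrence matrix and build a model."""
    # Imported here so the sketch can be loaded without the package installed.
    from sinr.graph_embeddings import SINr, InterpretableWordsModelBuilder

    sinr_model = SINr.load_from_cooc_pkl(cooc_matrix_path)  # documented classmethod
    sinr_model.run()  # community detection + vector extraction, Louvain by default
    # build() already applies the relevant "with" steps for this builder.
    return InterpretableWordsModelBuilder(sinr_model, name).build()
```

A call would then be build_interpretable_model(path, "my_model"), where path points to a pickle produced with sinr.text.cooccurrence and the model name is a placeholder.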

class sinr.graph_embeddings.ModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)

Bases: object

Object that should be used after the training of word or graph embeddings using the SINr object. The ModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently.

Attributes

Attributes should not be read

build()

To get the SINrVectors object

with_all()
with_communities()

To keep the interpretability of the model using the communities.

with_embeddings_nfm()

Adding NFM (Node Recall + Node Predominance) vectors to the SINrVectors object.

with_embeddings_nr(threshold=0)

Adding Node Recall vectors to the SINrVectors object.

Parameters:

threshold (float) – (Default value = 0)

with_graph()

To keep the underlying graph ; useful to get co-occ statistics, degree of nodes or to label communities with central nodes.

with_np()

Storing Node predominance values in order to label dimensions for instance.

with_vocabulary()

To deal with word vectors or graph when nodes have labels.

exception sinr.graph_embeddings.NoCommunityDetectedException

Bases: Exception

Exception raised when no community detection has been performed thus leaving self.communities to its default value None.

exception sinr.graph_embeddings.NoEmbeddingExtractedException

Bases: Exception

Exception raised when no embedding extraction has been performed thus leaving self.nr, self.np and self.nfm to their default value None.

exception sinr.graph_embeddings.NoInterpretabilityException

Bases: Exception

Raised when the communities were not included in the model that was built. It is thus not interpretable anymore.

exception sinr.graph_embeddings.NoIntruderPickableException

Bases: Exception

Raised when no intruder could be found with the percentages provided

exception sinr.graph_embeddings.NoVocabularyException

Bases: Exception

Raised when no vocabulary was included in the model that was built. One cannot play with words.

class sinr.graph_embeddings.OnlyGraphModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)

Bases: ModelBuilder

Object that should be used after training word or graph embeddings using the SINr object to get interpretable vectors. The OnlyGraphModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. No need to use parent methods starting by “with”, those are included in the “build” function. Just provide the name of the model and build it.

build()

Build OnlyGraphModelBuilder which contains solely the embeddings.

class sinr.graph_embeddings.SINr(graph, lgcc, wrd_to_idx, n_jobs=-1)

Bases: object

Object that can be used to extract word or graph embeddings using the SINr approach. This object cannot then be used to inspect the resulting vectors. Instead, using the ModelBuilder class, a SINrVectors object should be created that will allow to use the resulting vectors.

Attributes

Attributes should not be read

detect_communities(gamma=1, algo=None, inspect=True, par='balanced')

Runs community detection on the graph

Parameters:
  • gamma (int, optional) – For the Louvain algorithm, which is the default algorithm (ignore this parameter if param algo is used), controls the size of the communities. The greater it is, the smaller the communities. The default is 1.

  • algo (networkit.algo.community, optional) – Community detection algorithm. The default, None, runs a Louvain algorithm

  • inspect (boolean, optional) – Whether or not one wants to get insight about the communities extracted. The default is True.

  • par – Parallelisation strategy for networkit community detection (Louvain), see https://networkit.github.io/dev-docs/python_api/community.html#networkit.community.PLM for more details, “none randomized” allows randomness in Louvain in single thread mode. To force determinism pass the “none” parallelisation strategy. The default is balanced.
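
To see why a greater gamma yields smaller communities, the sketch below evaluates modularity with a resolution parameter on a toy graph. The formula used is the usual resolution-weighted modularity; that it matches the role of gamma in the Louvain implementation is assumed here, and the function name and graph are illustrative only.

```python
def modularity(edges, communities, gamma=1.0):
    """Resolution-weighted modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    score = 0.0
    for community in communities:
        # Edges with both endpoints inside the community.
        inside = sum(1 for u, v in edges if u in community and v in community)
        # Sum of the degrees of the nodes of the community.
        volume = sum(degree[node] for node in community)
        score += inside / m - gamma * (volume / (2 * m)) ** 2
    return score

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = [{0, 1, 2}, {3, 4, 5}]
one_block = [{0, 1, 2, 3, 4, 5}]
```

With gamma=1 the split into two triangles scores higher than the single block; with gamma=0.1 the single block wins, so a small gamma favours large communities.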

extract_embeddings(communities=None)

Extract the embeddings based on the graph and the partition in communities previously detected.

Parameters:

communities (networkit.Partition) – Community structures (Default value = None)

get_communities()

Return the networkit.Partition community object.

get_cooc_graph()

Return the graph.

get_nfm()

Return the NFM matrix.

get_np()

Return the NP matrix.

get_nr()

Return the NR matrix.

get_out_of_LgCC_coms(communities)

Get communities that are not in the Largest Connected Component (LgCC).

Parameters:

communities (Partition) – Partition object of the communities as obtained by calling a Networkit community detection algorithm

Returns:

Indices of the communities outside the LgCC

Return type:

list[int]

get_vocabulary()

Return the vocabulary.

get_wrd_to_id()

Return the word to index map.

classmethod load_from_adjacency_matrix(matrix_object, labels=None, n_jobs=-1)

Build a sinr object from an adjacency matrix as a sparse one (csr)

Parameters:
  • matrix_object (csr_matrix) – Matrix describing the graph.

  • labels – (Default value = None)

  • n_jobs (int, optional) – Number of jobs that should be used. The default is -1.

classmethod load_from_cooc_pkl(cooc_matrix_path, n_jobs=-1)

Build a sinr object from a co-occurrence matrix stored as a pickle : useful to deal with textual data. Co-occurrence matrices should for instance be generated using sinr.text.cooccurrence

Parameters:
  • cooc_matrix_path (string) – Path to the cooccurrence matrix generated using sinr.text.cooccurrence : the file should be a pickle

  • n_jobs (int, optional) – Number of jobs that should be used. The default is -1.

classmethod load_from_graph(graph, n_jobs=-1)

Build a sinr object from a networkit graph object

Parameters:
  • graph (networkit) – Networkit graph object.

  • n_jobs (int, optional) – Number of jobs that should be used. The default is -1.

run(algo=None)

Runs the training of the embedding, i.e. community detection + vectors extraction

Parameters:

algo (networkit.algo.community, optional) – Community detection algorithm. The default, None, runs a Louvain algorithm

size_of_voc()

Returns the size of the vocabulary.

transfert_communities_labels(community_labels, refine=False)

Transfer communities computed on one graph to another, used mainly with co-occurrence graphs.

Parameters:
  • community_labels (list[set[str]]) – a list of communities described by sets of labels describing the nodes

  • refine (bool) – (Default value = False)

Returns:

Initializes a partition where nodes are all singletons. Then, when communities in parameters contain labels that are in the graph at hand, these communities are transferred.
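
The transfer described above can be pictured with plain Python structures. This is a simplified sketch and not the library's implementation: the function name, the membership list standing in for a networkit partition, and the way new community ids are numbered are assumptions, and the refine option is not modelled.

```python
def transfer_communities(node_labels, community_labels):
    """Return a community id per node, starting from singletons."""
    # Every node starts alone in its own community.
    membership = list(range(len(node_labels)))
    label_to_node = {label: node for node, label in enumerate(node_labels)}
    next_id = len(node_labels)
    for labels in community_labels:
        # Only labels present in the graph at hand can be transferred.
        nodes = [label_to_node[label] for label in labels if label in label_to_node]
        for node in nodes:
            membership[node] = next_id
        next_id += 1
    return membership

node_labels = ["cat", "dog", "car", "bus"]
source_communities = [{"cat", "dog", "wolf"}, {"car", "train"}]
membership = transfer_communities(node_labels, source_communities)
```

Here "cat" and "dog" end up together, "car" gets its own transferred community, and "bus", which appears in no source community, stays a singleton.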

class sinr.graph_embeddings.SINrVectors(name, n_jobs=-1, n_neighbors=20)

Bases: object

After training word or graph embeddings using SINr object, use the ModelBuilder object to build SINrVectors. SINrVectors is the object to manipulate the model, explore the embedding space and its interpretability

binarize()

Binarize the vectors

cosine_dist(obj1, obj2)

Return the cosine distance between two specified items of the model

Parameters:
  • obj1 (int or str) – first object to get embedding

  • obj2 (int or str) – second object to get embedding

Returns:

cosine distance between obj1 and obj2

Return type:

float

cosine_sim(obj1, obj2)

Return the cosine similarity between two specified items of the model

Parameters:
  • obj1 (int or str) – first object to get embedding

  • obj2 (int or str) – second object to get embedding

Returns:

cosine similarity between obj1 and obj2

Return type:

float
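
For reference, the two measures can be written out on dense vectors as below. This is a standalone sketch on Python lists, not the SINrVectors methods themselves; it assumes the common convention that cosine distance is 1 minus cosine similarity, and the handling of null vectors is a choice made for the example.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two dense vectors given as equal-length lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for null vectors
    return dot / (norm_u * norm_v)

def cosine_dist(u, v):
    """Cosine distance, taken here as 1 - cosine similarity."""
    return 1.0 - cosine_sim(u, v)

same_direction = cosine_sim([1.0, 0.0, 2.0], [2.0, 0.0, 4.0])
orthogonal = cosine_dist([1.0, 0.0], [0.0, 3.0])
```

Vectors pointing in the same direction have a similarity close to 1, and orthogonal vectors have a distance of 1.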

dim_nnz_count(dim)

Count the number of non zero values in a dimension.

Parameters:

dim (int) – index of the dimension

Returns:

the number of non zero values in the dimension

Return type:

int

dim_nnz_thresholds(step=100, diff_tol=0.005)

Give the minimal and the maximal number of non zero values to have for a dimension to be kept and not lower the model’s similarity. Taking into account the datasets MEN, WS353, SCWS and SimLex-999.

Parameters:
  • step – step to search thresholds (default value : 100)

  • diff_tol – difference of similarity tolerated with the low threshold (default value : 0.005)

Returns:

thresholds (low, high)

Return type:

tuple of int

get_communities_as_labels_sets()

Get the partition of communities as a list of sets, each containing the labels associated with the nodes in the community.

Returns:

List of communities, each represented by a set of the labels associated with its nodes

Return type:

list[set[str]]

Raises:

NoInterpretabilityException – SINrVectors was not exported with interpretable dimensions

get_community_membership(obj)

Get the community index of a node or label.

Parameters:

obj (int or str) – an integer of the node or of its label

Returns:

the community of a specific object

get_community_sets(idx)

Get the indices of the nodes in a specific community.

Parameters:

idx (int) – index of the community

Returns:

the set of ids of nodes belonging to this community

get_dimension_descriptors(obj, topk=-1)

Returns the objects that constitute the dimension of obj, i.e. the members of the community of obj

Parameters:
  • obj (int or str) – an object for which to return the descriptors

  • topk – top values to retrieve for obj (Default value = -1)

Returns:

a set of objects, the community of obj

get_dimension_descriptors_idx(index, topk=-1)

Returns the objects that constitute the dimension at index, i.e. the members of the corresponding community

Parameters:
  • index (int) – the index of the dimension

  • topk (int) – -1 returns all the members of the community, a positive int returns just the topk members with highest nr values on the community (Default value = -1)

Returns:

a set of objects, the members of the community

get_dimension_stereotypes(obj, topk=5)

Get the words with the highest values on dimension obj.

Parameters:
  • obj (int or str) – id of a word, or label of a word (then turned into the id of its community)

  • topk (int) – topk value to consider on the dimension (Default value = 5)

Returns:

the topk words that describe this dimension (highest values)

get_dimension_stereotypes_idx(idx, topk=5)

Get the indices of the words with the highest values on dimension idx.

Parameters:
  • idx (int) – dimension to fetch topk on

  • topk (int) – topk value to consider on the dimension (Default value = 5)

Returns:

the topk words that describe this dimension (highest values)

get_matching_communities(sinr_vector)

Get the matching between two partitions with common vocabularies

Parameters:

sinr_vector (SINrVectors) – Small model (target)

Returns:

Two lists. In the first, each index is a community index of the self object (src) and the value is its matching community in the parameter sinr_vector (tgt), if it exists. In the second, each index is a community index of the object in parameter and the value is its matching community in the self object.

Return type:

(list[int],list[int])

get_my_vector(obj, row=True)

Get the column or the row obj.

Parameters:
  • obj (int) – Index of the row/column to return.

  • row (bool) – Return a row if True else a column. Defaults to True.

Returns:

A row/column.

Return type:

np.ndarray

get_nnv()

Get the number of null vectors in the embedding matrix.

Returns:

number of null vectors

get_nnz()

Get the count of non-zero values in the embedding matrix.

Returns:

number of non zero values

get_number_of_dimensions()

Get the number of dimensions of model.

Returns:

Number of dimensions of the model.

Return type:

int

get_nz_dims(obj)

Get the indices of non-zero dimensions.

Parameters:

obj – An int or string for which to get non-zero dimensions

Returns:

set of indices of non zero dimensions

get_obj_descriptors(obj, topk_dim=5, topk_val=-1)

Returns the descriptors of the dimensions of obj.

Parameters:
  • obj (int or str) – an id or a word/label

  • topk_dim (int) – topk dimensions to consider to describe obj (Default value = 5)

  • topk_val (int) – -1 returns all the members of the community, a positive int returns just the topk members with highest nr values on the community (Default value = -1)

Returns:

the dimensions (and the objects that constitute these dimensions) that matter to describe obj

get_obj_stereotypes(obj, topk_dim=5, topk_val=3)

Get the top dimensions for a word.

Parameters:
  • obj (int or str) – the word to consider

  • topk_dim (int) – topk dimension to consider (Default value = 5)

  • topk_val (int) – topk values to describe each dimension (Default value = 3)

Returns:

the most useful dimensions to describe a word and, for each dimension, the topk words that describe this dimension (highest values)

get_obj_stereotypes_and_descriptors(obj, topk_dim=5, topk_val=3)

Get the stereotypes and descriptors for obj.

Parameters:
  • obj (int or str) – object for which to fetch stereotypes and descriptors

  • topk_dim (int) – number of dimensions to consider (Default value = 5)

  • topk_val (int) – number of values per dimension (Default value = 3)

Returns:

both stereotypes and descriptors

get_topk_dims(obj, topk=5)

Get topk dimensions for an object.

Parameters:
  • obj (int or str) – the object for which to get topk dimensions

  • topk (int) – (Default value = 5)

Returns:

the topk dimensions for obj

Return type:

list[int]
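
Selecting the top dimensions of a single vector can be sketched as below. This is an illustration on a dense Python list, not the library's implementation; the function name and the ordering of the result (largest value first) are assumptions.

```python
import heapq

def topk_dims(vector, topk=5):
    """Indices of the topk largest values of a dense vector, largest first."""
    return heapq.nlargest(topk, range(len(vector)), key=lambda i: vector[i])

vector = [0.0, 0.3, 0.05, 0.9, 0.0, 0.4]
best = topk_dims(vector, topk=3)
```

For this vector the three strongest dimensions are 3, 5 and 1, in that order.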

get_union_topk(prct: int)
Parameters:

prct (int) – percentage of the vocabulary among the top for each dimension

Returns:

list of the ids of words that are among the top prct of the dims, can be useful to pick intruders

Return type:

int list

get_value_dim_per_word(obj, dim_index)

Get the value of a dimension for a word.

Parameters:
  • obj (str or int) – a word or its index

  • dim_index (int) – the index of the dimension to retrieve

Returns:

the value for a given vector on a given dimension

get_value_obj_dim(obj, dim)

Get the value of obj in dimension dim.

Parameters:
  • obj (int or str) – an object for which to return the value

  • dim (int) – the index of the dimension for which to return the value

Returns:

The value of obj at dimension dim

Return type:

float

get_vectors_using_self_space(sinr_vector)

Transpose the vectors of the sinr_vector object in parameter in the embedding space of the self object, using matching communities

Parameters:

sinr_vector (SINrVectors) – Small model (target)

Returns:

Copy of the self model (the big one) with vectors of the parameter (small one) transposed to its referential

Return type:

SINrVectors

get_vocabulary_size()
Returns:

Number of words that constitute the vocabulary

Return type:

int

inter_sim(intruder, topk, dist=True)

Get the average cosine distance (or cosine similarity) between top words and the intruder word

Parameters:
  • intruder (int) – id of the intruder word

  • topk (int) – number of top words to consider

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity

Returns:

average cosine distance (or cosine similarity) between top words and the intruder word

Return type:

float

intra_sim(topks, dist=True)

Get the average cosine distance (or cosine similarity) between top words

Parameters:
  • topks (int) – number of top words to pick

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity

Returns:

average cosine distance (or cosine similarity) between top words

Return type:

float

labels: bool
light_model_save()

Save a minimal version of the model that is readable as a dict for evaluation on word-embeddings-benchmark https://github.com/kudkudak/word-embeddings-benchmarks

load(path=None)

Load a SINrVectors model.

Parameters:

path (string) – Path of the pickle file of the model.

classmethod load_from_w2v(w2v_path, name, n_jobs=-1, n_neighbors=20)

Initializing a SINrVectors object using a file in the word2vec format

Parameters:
  • w2v_path (str) – path of the file in word2vec format which contains vectors

  • name (str) – name of the model, useful to save it

most_similar(obj)

Get the most similar objects of the one passed as a parameter using the cosine of their vectors.

Parameters:

obj (int or str) – the object for which to fetch the nearest neighbors

obj_nnz_count(obj)

Count the number of non zero values in a word vector.

Parameters:

obj (string) – word

Returns:

the number of non zero values in the word vector

Return type:

int

pct_nnz()

Get the percentage of non-zero values in the embedding matrix.

Returns:

percentage of non-zero values in the embedding matrix

pick_intruder(dim, union=None, prctbot=50, prcttop=10)

Pick an intruder word for a dimension

Parameters:
  • dim (int) – the index of the dimension for which to return intruders

  • union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)

  • prctbot (int) – bottom prctbot to pick (defaults to 50)

  • prcttop (int) – top prcttop to pick (defaults to 10)

Returns:

id of an intruder word from the dimension

Return type:

int

remove_communities_dim_nnz(threshold_min=None, threshold_max=None)

Remove the dimensions (communities) that are the least activated and those that are the most activated.

Parameters:
  • threshold_min (int) – minimal number of non zero values to have for a dimension to be kept

  • threshold_max (int) – maximal number of non zero values to have for a dimension to be kept
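
The filtering rule can be illustrated on a small dense matrix. This is a sketch under assumptions, not the library's code: the helper names are hypothetical, the bounds are treated as inclusive, and a threshold left to None is taken to mean no bound on that side.

```python
def dim_nnz_counts(matrix):
    """Number of non-zero values in each column (dimension) of a dense matrix."""
    n_dims = len(matrix[0])
    return [sum(1 for row in matrix if row[d] != 0) for d in range(n_dims)]

def dims_to_keep(matrix, threshold_min=None, threshold_max=None):
    """Indices of dimensions whose non-zero count lies within the thresholds."""
    kept = []
    for dim, count in enumerate(dim_nnz_counts(matrix)):
        if threshold_min is not None and count < threshold_min:
            continue
        if threshold_max is not None and count > threshold_max:
            continue
        kept.append(dim)
    return kept

# Rows are words, columns are dimensions (communities).
matrix = [
    [0.5, 0.0, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.1, 0.7, 0.3, 0.0],
    [0.4, 0.0, 0.0, 0.0],
]
counts = dim_nnz_counts(matrix)
kept = dims_to_keep(matrix, threshold_min=1, threshold_max=3)
```

Dimension 0 is active for every word and dimension 3 for none, so both are dropped and only dimensions 1 and 2 remain.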

save(path=None)

Save a SINrVectors model.

Parameters:

path (string) – Path of the pickle file of the model.

set_communities(com)

Set the communities from the partition in communities.

Parameters:

com (networkit.Partition) – partition in communities

set_graph(G)

Set the graph property.

Parameters:

G (networkit.Graph) – A networkit graph

set_n_jobs(n_jobs)

Set the number of jobs.

Parameters:

n_jobs – number of jobs

set_np(np)

Set the embedding matrix.

Parameters:

np (scipy.sparse.csr_matrix) – a sparse matrix of the embeddings

set_vectors(embeddings)

Set the embedding vectors and initialize nearest neighbors.

Parameters:

embeddings (scipy.sparse.csr_matrix) – initialize the vectors and build the nearest neighbors data structure using sklearn

set_vocabulary(voc)

Set the vocabulary for word-co-occurrence graphs.

Parameters:

voc – set the vocabulary when dealing with words or nodes with labels. label parameter is set to True. By default, labels from the vocab will be used.

sparsify(k)

Sparsify the vectors keeping activated the top k dimensions

Parameters:

k (int) – number of top dimensions to keep activated
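
On a single dense vector, keeping the top k dimensions activated amounts to the sketch below. It is an illustration only: the standalone function, the tie-breaking by position and the use of 0.0 for deactivated dimensions are assumptions.

```python
def sparsify(vector, k):
    """Keep the k largest values of a dense vector and zero out the rest."""
    if k <= 0:
        return [0.0] * len(vector)
    # Indices of the k largest values; ties are broken by position.
    top = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)[:k]
    keep = set(top)
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]

sparse = sparsify([0.1, 0.6, 0.0, 0.3, 0.2], k=2)
```

Only the two strongest dimensions (indices 1 and 3) keep their values.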

class sinr.graph_embeddings.ThresholdedModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)

Bases: ModelBuilder

Object that should be used after the training of word or graph embeddings using the SINr object to get interpretable word vectors. The ThresholdedModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. Values in the vectors that are lower than the threshold will be discarded. Vectors are then sparser and more interpretable. No need to use parent methods starting by “with”, those are included in the build function. Just provide the name of the model and build it.

build(threshold=0.01)

Build ThresholdedModelBuilder which contains the vocabulary, the embeddings with values thresholded above a minimum and the communities.

Parameters:

threshold – (Default value = 0.01)

sinr.graph_embeddings.get_compact_lgcc(graph, word_to_idx)

Get a compacted graph with only nodes inside the largest connected component. Get the words with ids corresponding to the new node ids.

Parameters:
  • graph (networkit graph) – The input graph

  • word_to_idx (dictionary) – The words mapped to their initial ids

Returns:

The new graph and dictionary of words

Return type:

networkit graph, dictionary
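
The idea of compacting to the largest connected component can be sketched without networkit, using an edge list in place of a graph object. The function name and data layout are hypothetical; node ids are renumbered in increasing order of their old ids, which is an assumption about how compaction proceeds.

```python
from collections import deque

def compact_lgcc(edges, word_to_idx):
    """Keep the largest connected component and renumber its nodes from 0."""
    neighbors = {idx: set() for idx in word_to_idx.values()}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, largest = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node] - component:
                component.add(other)
                queue.append(other)
        seen |= component
        if len(component) > len(largest):
            largest = component
    new_id = {old: new for new, old in enumerate(sorted(largest))}
    new_edges = [(new_id[u], new_id[v]) for u, v in edges if u in largest]
    new_words = {w: new_id[i] for w, i in word_to_idx.items() if i in largest}
    return new_edges, new_words

edges = [(0, 1), (2, 3), (3, 4)]
word_to_idx = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5}
new_edges, new_words = compact_lgcc(edges, word_to_idx)
```

The component made of nodes 2, 3 and 4 is the largest, so its words are kept and mapped to the new ids 0, 1 and 2.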

sinr.graph_embeddings.get_graph_from_matrix(matrix)

Build a graph from a sparse adjacency matrix.

Parameters:

matrix (scipy.sparse.coo_matrix) – A sparse matrix describing a graph

sinr.graph_embeddings.get_lgcc(graph)

Return the nodes that are outside the Largest Connected Component (LgCC) of the graph.

Parameters:

graph (networkit graph) – The graph for which to retrieve out of LgCC nodes

SINr NFM

sinr.nfm.compute_NP(adjacency, membership_matrix)

Compute the node-predominance based on the adjacency matrix and the community-membership matrix of the graph.

Parameters:
  • adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.

  • membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.

Returns:

NP measures for each node and each community

Return type:

Scipy.sparse.csr_matrix

sinr.nfm.compute_NR(adjacency, membership_matrix)

Compute the node-recall based on the adjacency matrix and the community-membership matrix of the graph.

Parameters:
  • adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.

  • membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.

Returns:

NR measures for each node and each community

Return type:

Scipy.sparse.csr_matrix
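
As an intuition for node recall, the sketch below computes, for each node, the share of its degree that goes to each community. This reading of the measure is an assumption made for the example, and the code works on dense Python lists with a membership vector rather than on the sparse matrices the function expects.

```python
def node_recall(adjacency, membership):
    """Share of each node's degree that goes to each community.

    adjacency is a dense symmetric matrix (list of lists); membership[u] is the
    community index of node u.
    """
    n_communities = max(membership) + 1
    recall = []
    for row in adjacency:
        # Weight of the links from this node towards each community.
        towards = [0.0] * n_communities
        for neighbor, weight in enumerate(row):
            towards[membership[neighbor]] += weight
        degree = sum(towards)
        recall.append([w / degree if degree else 0.0 for w in towards])
    return recall

# Path graph 0 - 1 - 2 - 3 with communities {0, 1} and {2, 3}.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
nr = node_recall(adjacency, [0, 0, 1, 1])
```

The end nodes send all of their degree to their own community, while the two middle nodes split theirs evenly between both communities.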

sinr.nfm.distributed_degree(adjacency)

Make values in the adjacency matrix be between 0 and 1 depending on how the degree of the node is distributed over each community.

Parameters:

adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.

Returns:

l1 normalized adjacency matrix.

Return type:

Scipy.sparse.csr_matrix

sinr.nfm.get_community_weights(adjacency, membership_matrix)

Get the total weight of each community in terms of degree.

Parameters:
  • adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.

  • membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.

Returns:

Degree-based weight of each community.

Return type:

Scipy.sparse.csr_matrix

sinr.nfm.get_membership(vector)

Return the membership matrix based on the community membership vector.

Parameters:

vector (list[int]) – The vector of community index for each node

Returns:

The community membership matrix of shape (#nodes x #communities).

Return type:

Scipy.sparse.csr_matrix
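
The relation between the membership vector and the (#nodes x #communities) matrix can be shown with a dense stand-in. This sketch uses nested lists instead of a sparse matrix, assumes community indices are contiguous and start at 0, and the function name is hypothetical.

```python
def membership_matrix(vector):
    """Dense 0/1 matrix of shape (#nodes x #communities) from a membership vector."""
    n_communities = max(vector) + 1
    return [[1 if community == c else 0 for c in range(n_communities)]
            for community in vector]

matrix = membership_matrix([0, 0, 1, 2, 1])
```

Each row has a single 1, placed in the column of the community the node belongs to.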

sinr.nfm.get_nfm_embeddings(G, vector, compute_np=False, merge=False)

Compute the Node F-Measure metrics to build the embedding matrix using the graph and community structure detected.

Parameters:
  • G (networkit.Graph) – Graph on which to compute the embeddings

  • vector (list[int]) – The node-community membership vector

  • compute_np (bool, optional) – Compute the node predominance metric, defaults to False

  • merge (bool, optional) – Merge the NR and NP measure in a common matrix, defaults to False

Returns:

The node predominance, node recall and merged matrix (nfm) if applicable.

Return type:

tuple[Scipy.sparse.csr_matrix, Scipy.sparse.csr_matrix, Scipy.sparse.csr_matrix]

Loader

sinr.strategy_loader.load_adj_mat(matrix, labels=None)

Load a cooccurrence matrix.

Parameters:
  • matrix (csr_matrix) – an adjacency matrix

  • labels – a dict matching labels with nodes (Default value = None)

Returns:

The loaded cooccurrence matrix and the word index.

Return type:

tuple(dict(), scipy.sparse.coo_matrix)

sinr.strategy_loader.load_pkl_text(mat_path)

Load a cooccurrence matrix.

Parameters:

mat_path (str) – Path to the cooccurrence matrix.

Returns:

The loaded cooccurrence matrix and the word index.

Return type:

tuple(dict(), scipy.sparse.coo_matrix)

Visualization

class sinr.viz.SINrViz(sinr_vectors: SINrVectors)

Bases: object

Visualization package for SINr embeddings. The goal is to visualize and interpret the dimensions of the embeddings produced.

compare_stereotypes(args, topk_dim=5)

Make a heatmap comparing top dimensions for elements in args (words).

Parameters:
  • args (list[int]) – A list of indices (words).

  • topk_dim (int, optional) – Number of top dimensions to fetch, defaults to 5

Logger

Module contents

Top-level package for SINr Embeddings.