SINr Core
Graph Embeddings
- exception sinr.graph_embeddings.DimensionFilteredException
Bases:
Exception
Exception raised when trying to access a dimension removed by filtering.
- class sinr.graph_embeddings.InterpretableDimension(idx, type)
Bases:
object
Internal class: should be used to encapsulate data about a dimension instead of using a simple dict.
- add_interpreter(obj, value)
Adding an element that would help to interpret the meaning of the dimension
- Parameters:
obj (str) – a descriptor or a stereotype of the dimension
value (float) – a value describing the relevance of the descriptor for this dimension
- get_dict()
The dict that can be processed with the interpreters
- Returns:
a dict of interpreters for the dimension
- Return type:
dict
- get_idx()
Getter of the idx attribute
- Returns:
the id of the dimension
- Return type:
int
- get_interpreter(id)
Get a specific interpreter
- Parameters:
id (int) – id of the interpreter
- Returns:
the interpreter of id for this dimension
- Return type:
an interpreter as a tuple (obj: str, value: float) if there is a value
- get_interpreters()
Get the list of interpreters, the objects that describe the dimension
- Returns:
the list of interpreters
- Return type:
list
- get_value()
Getter for the value parameter, which is a boolean to detect if numerical values are used in the interpreters or not
- Returns:
the value attribute
- Return type:
bool
- sort(on_value=True)
Sort the interpreters: according to their values if on_value is True, according to their str descriptors if False
- Parameters:
on_value (bool, optional) – sorting on values or not, defaults to True
- topk(topk)
Selecting only the topk interpreters
- Parameters:
topk (int) – number of interpreters to keep
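The sort and topk behaviour described above can be sketched with a small stand-in class. This is not the library's implementation: the class name DimensionSketch is hypothetical, interpreters are assumed to be stored as (obj, value) tuples (as the return type of get_interpreter suggests), and sorting on values is assumed to be descending.

```python
# Hypothetical stand-in for illustration only; not sinr's implementation.
class DimensionSketch:
    def __init__(self, idx):
        self.idx = idx
        self.interpreters = []  # list of (obj: str, value: float) tuples

    def add_interpreter(self, obj, value):
        self.interpreters.append((obj, value))

    def sort(self, on_value=True):
        if on_value:
            # Assumption: most relevant (highest value) first.
            self.interpreters.sort(key=lambda pair: pair[1], reverse=True)
        else:
            # Otherwise sort alphabetically on the str descriptor.
            self.interpreters.sort(key=lambda pair: pair[0])

    def topk(self, k):
        # Keep only the first k interpreters in the current order.
        self.interpreters = self.interpreters[:k]


dim = DimensionSketch(idx=3)
dim.add_interpreter("cat", 0.2)
dim.add_interpreter("dog", 0.7)
dim.add_interpreter("bird", 0.4)
dim.sort(on_value=True)
dim.topk(2)
top_two = dim.interpreters
```

Sorting before truncating matters in this sketch: topk keeps whatever comes first in the current order.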
- with_value()
Setting the value to True
- Returns:
the self object
- Return type:
InterpretableDimension
- class sinr.graph_embeddings.InterpretableWordsModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)
Bases:
ModelBuilder
Object that should be used after training word or graph embeddings using the SINr object to get interpretable word vectors. The InterpretableWordsModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. No need to use parent methods starting by “with”, those are included in the build function. Just provide the name of the model and build it.
- build()
Build InterpretableWordsModelBuilder which contains the vocabulary, the embeddings and the communities.
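A possible end-to-end use of the classes documented on this page, written as a sketch. It relies only on the signatures listed here (SINr.load_from_cooc_pkl, SINr.run, InterpretableWordsModelBuilder, build); the function name build_interpretable_model and its two arguments are hypothetical, and it is assumed that build() returns the SINrVectors object, as stated for ModelBuilder.build.

```python
def build_interpretable_model(cooc_matrix_path, name):
    """Train SINr on a pickled co-occurrence matrix and build a model.

    Hypothetical helper: both arguments are placeholders supplied by the caller.
    """
    # Imported lazily so the sketch can be loaded without sinr installed.
    import sinr.graph_embeddings as ge

    sinr = ge.SINr.load_from_cooc_pkl(cooc_matrix_path)
    sinr.run()  # community detection + vectors extraction
    builder = ge.InterpretableWordsModelBuilder(sinr, name)
    # Assumption: build() returns the SINrVectors object.
    return builder.build()
```

The "with" methods of the parent ModelBuilder are not called here, since the description above says the build function already includes them.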
- class sinr.graph_embeddings.ModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)
Bases:
object
Object that should be used after the training of word or graph embeddings using the SINr object. The ModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently.
Attributes
Attributes should not be read
- build()
To get the SINrVectors object
- with_all()
- with_communities()
To keep the interpretability of the model using the communities.
- with_embeddings_nfm()
Adding NFM (Node Recall + Node Predominance) vectors to the SINrVectors object.
- with_embeddings_nr(threshold=0)
Adding Node Recall vectors to the SINrVectors object.
- Parameters:
threshold (float) – (Default value = 0)
- with_graph()
To keep the underlying graph; useful to get co-occurrence statistics, degrees of nodes, or to label communities with central nodes.
- with_np()
Storing Node predominance values in order to label dimensions for instance.
- with_vocabulary()
To deal with word vectors or graph when nodes have labels.
- exception sinr.graph_embeddings.NoCommunityDetectedException
Bases:
Exception
Exception raised when no community detection has been performed thus leaving self.communities to its default value None.
- exception sinr.graph_embeddings.NoEmbeddingExtractedException
Bases:
Exception
Exception raised when no embedding extraction has been performed thus leaving self.nr, self.np and self.nfm to their default value None.
- exception sinr.graph_embeddings.NoInterpretabilityException
Bases:
Exception
Raised when the communities were not included in the model that was built. It is thus not interpretable anymore.
- exception sinr.graph_embeddings.NoIntruderPickableException
Bases:
Exception
Raised when no intruder could be found with the percentages provided
- exception sinr.graph_embeddings.NoVocabularyException
Bases:
Exception
Raised when no vocabulary was included in the model that was built. One cannot play with words.
- class sinr.graph_embeddings.OnlyGraphModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)
Bases:
ModelBuilder
Object that should be used after training word or graph embeddings using the SINr object to get interpretable vectors. The OnlyGraphModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. No need to use parent methods starting by “with”, those are included in the “build” function. Just provide the name of the model and build it.
- build()
Build OnlyGraphModelBuilder which contains solely the embeddings.
- class sinr.graph_embeddings.SINr(graph, lgcc, wrd_to_idx, n_jobs=-1)
Bases:
object
Object that can be used to extract word or graph embeddings using the SINr approach. This object cannot then be used to inspect the resulting vectors. Instead, using the ModelBuilder class, a SINrVectors object should be created that will allow to use the resulting vectors.
…
Attributes
Attributes should not be read
- detect_communities(gamma=1, algo=None, inspect=True, par='balanced')
Runs community detection on the graph
- Parameters:
gamma (int, optional) – For Louvain algorithm which is the default algorithm (ignore this parameter if param algo is used), allows to control the size of the communities. The greater it is, the smaller the communities. The default is 1.
algo (networkit.algo.community, optional) – Community detection algorithm. The default, None, allows to run a Louvain algorithm
inspect (boolean, optional) – Whether or not one wants to get insight about the communities extracted. The default is True.
par – Parallelisation strategy for networkit community detection (Louvain), see https://networkit.github.io/dev-docs/python_api/community.html#networkit.community.PLM for more details, “none randomized” allows randomness in Louvain in single thread mode. To force determinism pass the “none” parallelisation strategy. The default is balanced.
- extract_embeddings(communities=None)
Extract the embeddings based on the graph and the partition in communities previously detected.
- Parameters:
communities (networkit.Partition) – Community structures (Default value = None)
- get_cooc_graph()
Return the graph.
- get_nfm()
Return the NFM matrix.
- get_np()
Return the NP matrix.
- get_nr()
Return the NR matrix.
- get_out_of_LgCC_coms(communities)
Get communities that are not in the Largest Connected Component (LgCC).
- Parameters:
communities (Partition) – Partition object of the communities as obtained by calling a Networkit community detection algorithm
- Returns:
Indices of the communities outside the LgCC
- Return type:
list[int]
- get_vocabulary()
Return the vocabulary.
- get_wrd_to_id()
Return the word to index map.
- classmethod load_from_adjacency_matrix(matrix_object, labels=None, n_jobs=-1)
Build a sinr object from an adjacency matrix as a sparse one (csr)
- Parameters:
matrix_object (csr_matrix) – Matrix describing the graph.
labels – (Default value = None)
n_jobs (int, optional) – Number of jobs that should be used. The default is -1.
- classmethod load_from_cooc_pkl(cooc_matrix_path, n_jobs=-1)
Build a sinr object from a co-occurrence matrix stored as a pickle: useful to deal with textual data. Co-occurrence matrices should for instance be generated using sinr.text.cooccurrence.
- Parameters:
cooc_matrix_path (string) – Path to the co-occurrence matrix generated using sinr.text.cooccurrence: the file should be a pickle
n_jobs (int, optional) – Number of jobs that should be used. The default is -1.
- classmethod load_from_graph(graph, n_jobs=-1)
Build a sinr object from a networkit graph object
- Parameters:
graph (networkit) – Networkit graph object.
n_jobs (int, optional) – Number of jobs that should be used. The default is -1.
- run(algo=None)
Runs the training of the embedding, i.e. community detection + vectors extraction
- Parameters:
algo (networkit.algo.community, optional) – Community detection algorithm. The default, None, allows to run a Louvain algorithm
- size_of_voc()
Returns the size of the vocabulary.
- transfert_communities_labels(community_labels, refine=False)
Transfer communities computed on one graph to another, used mainly with co-occurrence graphs.
- Parameters:
community_labels (list[set[str]]) – a list of communities described by sets of labels describing the nodes
refine (bool) – (Default value = False)
- Returns:
Initializes a partition where nodes are all singletons. Then, when communities in parameters contain labels that are in the graph at hand, these communities are transferred.
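The transfer described above can be sketched in plain Python. This is a simplified illustration, not the library's code: the function name transfer_communities is hypothetical, the partition is represented as a plain list mapping each node to a community id instead of a networkit Partition, and the refine parameter is ignored.

```python
def transfer_communities(community_labels, word_to_idx):
    """Return a list giving, for each node of the target graph, its community id."""
    n = len(word_to_idx)
    # Start with every node alone in its own community (singletons).
    membership = list(range(n))
    next_id = n
    for labels in community_labels:
        # Keep only the labels that exist in the graph at hand.
        nodes = [word_to_idx[w] for w in labels if w in word_to_idx]
        if not nodes:
            continue  # nothing to transfer for this community
        for node in nodes:
            membership[node] = next_id
        next_id += 1
    return membership


word_to_idx = {"cat": 0, "dog": 1, "car": 2, "bus": 3, "sun": 4}
community_labels = [{"cat", "dog", "wolf"}, {"car", "bus"}, {"moon"}]
membership = transfer_communities(community_labels, word_to_idx)
```

In this example "wolf" and "moon" are not in the target graph and are skipped, while "sun" belongs to no transferred community and stays a singleton.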
- class sinr.graph_embeddings.SINrVectors(name, n_jobs=-1, n_neighbors=20)
Bases:
object
After training word or graph embeddings using the SINr object, use the ModelBuilder object to build SINrVectors. SINrVectors is the object to manipulate the model, explore the embedding space and its interpretability.
- binarize()
Binarize the vectors
- cosine_dist(obj1, obj2)
Return cosine distance between the specified items of the model
- Parameters:
obj1 (int or str) – first object to get embedding
obj2 (int or str) – second object to get embedding
- Returns:
cosine distance between obj1 and obj2
- Return type:
float
- cosine_sim(obj1, obj2)
Return cosine similarity between the specified items of the model
- Parameters:
obj1 (int or str) – first object to get embedding
obj2 (int or str) – second object to get embedding
- Returns:
cosine similarity between obj1 and obj2
- Return type:
float
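For reference, cosine similarity and cosine distance on sparse vectors can be sketched as follows. This is a generic illustration and not the library's code: vectors are assumed to be stored as {dimension: value} dicts, the distance is taken as 1 minus the similarity, and returning 0.0 for null vectors is a convention chosen for this sketch.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(val * v[dim] for dim, val in u.items() if dim in v)
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for null vectors
    return dot / (norm_u * norm_v)


def cosine_dist(u, v):
    """Cosine distance, taken here as 1 - cosine similarity."""
    return 1.0 - cosine_sim(u, v)


a = {0: 1.0, 2: 1.0}
b = {0: 1.0, 1: 1.0}
sim = cosine_sim(a, b)    # the vectors share one of their two dimensions
dist = cosine_dist(a, b)
```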
- dim_nnz_count(dim)
Count the number of non-zero values in a dimension.
- Parameters:
dim (int) – index of the dimension
- Returns:
the number of non zero values in the dimension
- Return type:
int
- dim_nnz_thresholds(step=100, diff_tol=0.005)
Give the minimum and maximum numbers of non-zero values a dimension may have to be kept without lowering the model’s similarity, taking into account the MEN, WS353, SCWS and SimLex-999 datasets.
- Parameters:
step – step used to search for thresholds (Default value = 100)
diff_tol – difference of similarity tolerated with the low threshold (Default value = 0.005)
- Returns:
thresholds (low, high)
- Return type:
tuple of int
- get_communities_as_labels_sets()
Get the partition in communities as a list of sets, each containing the labels associated with the nodes of one community.
- Returns:
List of communities, each represented by the set of labels associated with its nodes
- Return type:
list[set[str]]
- Raises:
NoInterpretabilityException – SINrVectors was not exported with interpretable dimensions
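The conversion from a partition to sets of labels can be sketched as below. The names communities_as_label_sets, membership and idx_to_word are hypothetical; the partition is assumed to be given as a list mapping each node index to a community id, which is a simplification of the networkit Partition used by the library.

```python
def communities_as_label_sets(membership, idx_to_word):
    """Group node labels by community id and return one set per community."""
    groups = {}
    for node, community in enumerate(membership):
        groups.setdefault(community, set()).add(idx_to_word[node])
    # One set per community, ordered by community id.
    return [groups[c] for c in sorted(groups)]


membership = [0, 1, 0, 1, 2]
idx_to_word = ["cat", "car", "dog", "bus", "sun"]
label_sets = communities_as_label_sets(membership, idx_to_word)
```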
- get_community_membership(obj)
Get the community index of a node or label.
- Parameters:
obj (int or str) – an integer of the node or of its label
- Returns:
the community of a specific object
- get_community_sets(idx)
Get the indices of the nodes in a specific community.
- Parameters:
idx (int) – index of the community
- Returns:
the set of ids of nodes belonging to this community
- get_dimension_descriptors(obj, topk=-1)
Returns the objects that constitute the dimension of obj, i.e. the members of the community of obj
- Parameters:
obj (int or str) – an object for which to return the descriptors
topk – top values to retrieve for obj (Default value = -1)
- Returns:
a set of objects, the community of obj
- get_dimension_descriptors_idx(index, topk=-1)
Returns the objects that constitute the dimension at index, i.e. the members of the corresponding community
- Parameters:
index (int) – the index of the dimension
topk (int) – -1 returns all the members of the community, a positive int returns just the topk members with the highest nr values on the community (Default value = -1)
- Returns:
a set of objects, the members of the community
- get_dimension_stereotypes(obj, topk=5)
Get the words with the highest values on dimension obj.
- Parameters:
obj (int or str) – id of a word, or label of a word (then turned into the id of its community)
topk (int) – topk value to consider on the dimension (Default value = 5)
- Returns:
the topk words that describe this dimension (highest values)
- get_dimension_stereotypes_idx(idx, topk=5)
Get the indices of the words with the highest values on dimension idx.
- Parameters:
idx (int) – dimension to fetch topk on
topk (int) – topk value to consider on the dimension (Default value = 5)
- Returns:
the topk words that describe this dimension (highest values)
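Selecting the stereotypes of a dimension amounts to ranking words by their value on that dimension. A minimal sketch, under the assumption that vectors are stored as one {dimension: value} dict per word index; the function name dimension_stereotypes is hypothetical, and skipping zero values and breaking ties by word index are choices made for this example.

```python
def dimension_stereotypes(vectors, dim, topk=5):
    """Return the indices of the topk words with the highest value on dim."""
    scored = [(vec.get(dim, 0.0), idx) for idx, vec in enumerate(vectors)]
    # Words with a zero value on the dimension cannot describe it.
    nonzero = [(val, idx) for val, idx in scored if val != 0.0]
    # Highest value first; ties are broken by the smallest word index.
    nonzero.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in nonzero[:topk]]


vectors = [{0: 0.1, 1: 0.9}, {0: 0.8}, {1: 0.3}, {0: 0.5, 1: 0.2}]
top_dim0 = dimension_stereotypes(vectors, dim=0, topk=2)
top_dim1 = dimension_stereotypes(vectors, dim=1)
```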
- get_matching_communities(sinr_vector)
Get the matching between two partitions with common vocabularies
- Parameters:
sinr_vector (SINrVectors) – Small model (target)
- Returns:
Two lists. In the first, each index is a community index of the self object (src) and the value is its matching community in the parameter sinr_vector (tgt), if it exists. In the second, each index is a community index of the object in parameter and the value is its matching community in the self object.
- Return type:
(list[int],list[int])
- get_my_vector(obj, row=True)
Get the column or the row obj.
- Parameters:
obj (int) – Index of the row/column to return.
row (bool) – Return a row if True else a column. Defaults to True.
- Returns:
A row/column.
- Return type:
np.ndarray
- get_nnv()
Get the number of null vectors in the embedding matrix.
- Returns:
number of null vectors
- get_nnz()
Get the count of non-zero values in the embedding matrix.
- Returns:
number of non zero values
- get_number_of_dimensions()
Get the number of dimensions of model.
- Returns:
Number of dimensions of the model.
- Return type:
int
- get_nz_dims(obj)
Get the indices of non-zero dimensions.
- Parameters:
obj – An int or string for which to get non-zero dimensions
- Returns:
set of indices of non zero dimensions
- get_obj_descriptors(obj, topk_dim=5, topk_val=-1)
Returns the descriptors of the dimensions of obj.
- Parameters:
topk_dim (int) – topk dimensions to consider to describe obj (Default value = 5)
obj (int or str) – an id or a word/label
topk_val (int) – -1 returns all the members of the community, a positive int returns just the topk members with the highest nr values on the community (Default value = -1)
- Returns:
the dimensions (and the objects that constitute these dimensions) that matter to describe obj
- get_obj_stereotypes(obj, topk_dim=5, topk_val=3)
Get the top dimensions for a word.
- Parameters:
obj (int or str) – the word to consider
topk_dim (int) – topk dimension to consider (Default value = 5)
topk_val (int) – topk values to describe each dimension (Default value = 3)
- Returns:
the most useful dimensions to describe a word and for each dimension,
the topk words that describe this dimension (highest values)
- get_obj_stereotypes_and_descriptors(obj, topk_dim=5, topk_val=3)
Get the stereotypes and descriptors for obj.
- Parameters:
obj (int or str) – object for which to fetch stereotypes and descriptors
topk_dim (int) – number of dimensions to consider (Default value = 5)
topk_val (int) – number of values per dimension (Default value = 3)
- Returns:
both stereotypes and descriptors
- get_topk_dims(obj, topk=5)
Get topk dimensions for an object.
- Parameters:
obj (int or str) – the object for which to get topk dimensions
topk (int) – (Default value = 5)
- Returns:
the topk dimensions for obj
- Return type:
list[int]
- get_union_topk(prct: int)
- Parameters:
prct (int) – percentage of the vocabulary among the top for each dimension
- Returns:
list of the ids of words that are among the top prct of the dims, can be useful to pick intruders
- Return type:
int list
- get_value_dim_per_word(obj, dim_index)
Get the value of a dimension for a word.
- Parameters:
obj (str or int) – a word or its index
dim_index (int) – the index of the dimension to retrieve
- Returns:
the value for a given vector on a given dimension
- get_value_obj_dim(obj, dim)
Get the value of obj in dimension dim.
- Parameters:
obj (int or str) – an object for which to return the value
dim (int) – the index of the dimension for which to return the value
- Returns:
The value of obj at dimension dim
- Return type:
float
- get_vectors_using_self_space(sinr_vector)
Transpose the vectors of the sinr_vector object in parameter in the embedding space of the self object, using matching communities
- Parameters:
sinr_vector (SINrVectors) – Small model (target)
- Returns:
Copy of the self model (the big one) with vectors of the parameter (small one) transposed to its referential
- Return type:
SINrVectors
- get_vocabulary_size()
- Returns:
Number of words that constitute the vocabulary
- Return type:
int
- inter_sim(intruder, topk, dist=True)
Get the average cosine distance (or cosine similarity) between top words and the intruder word
- Parameters:
intruder (int) – id of the intruder word
topk (int) – number of top words to consider
dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity
- Returns:
average cosine distance (or cosine similarity) between top words and the intruder word
- Return type:
float
- intra_sim(topks, dist=True)
Get the average cosine distance (or cosine similarity) between top words
- Parameters:
topks (int) – number of top words to pick
dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity
- Returns:
average cosine distance (or cosine similarity) between top words
- Return type:
float
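The two scores above can be sketched on plain lists of floats. This is an illustration only: the real methods take word ids and a number of top words, whereas the hypothetical functions intra_score and inter_score below receive the vectors directly, and the distance is assumed to be 1 minus the cosine similarity.

```python
import math
from itertools import combinations


def _cos(u, v):
    """Cosine similarity of two dense, non-null vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_score(top_vectors, dist=True):
    """Average cosine distance (or similarity) over all pairs of top words."""
    sims = [_cos(u, v) for u, v in combinations(top_vectors, 2)]
    mean = sum(sims) / len(sims)
    return 1.0 - mean if dist else mean


def inter_score(intruder_vector, top_vectors, dist=True):
    """Average cosine distance (or similarity) between the intruder and top words."""
    sims = [_cos(intruder_vector, v) for v in top_vectors]
    mean = sum(sims) / len(sims)
    return 1.0 - mean if dist else mean


top = [[1.0, 0.0], [1.0, 0.0]]
intruder = [0.0, 1.0]
intra = intra_score(top)            # top words are identical
inter = inter_score(intruder, top)  # intruder is orthogonal to them
```

A low intra distance together with a high inter distance suggests that the top words are coherent and that the intruder stands out.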
- labels: bool
- light_model_save()
Save a minimal version of the model that is readable as a dict for evaluation on word-embeddings-benchmark https://github.com/kudkudak/word-embeddings-benchmarks
- load(path=None)
Load a SINrVectors model.
- Parameters:
path (string) – Path of the pickle file of the model.
- classmethod load_from_w2v(w2v_path, name, n_jobs=-1, n_neighbors=20)
Initialize a SINrVectors object from a file in the word2vec format.
- Parameters:
w2v_path (str) – path of the file in word2vec format which contains the vectors
name (str) – name of the model, useful to save it
- most_similar(obj)
Get the most similar objects of the one passed as a parameter using the cosine of their vectors.
- Parameters:
obj (int or str) – the object for which to fetch the nearest neighbors
- obj_nnz_count(obj)
Count the number of non-zero values in a word vector.
- Parameters:
obj (string) – word
- Returns:
the number of non zero values in the word vector
- Return type:
int
- pct_nnz()
Get the percentage of non-zero values in the embedding matrix.
- Returns:
percentage of non-zero values in the embedding matrix
- pick_intruder(dim, union=None, prctbot=50, prcttop=10)
Pick an intruder word for a dimension
- Parameters:
dim (int) – the index of the dimension for which to return intruders
union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)
prctbot (int) – bottom prctbot to pick (defaults to 50)
prcttop (int) – top prcttop to pick (defaults to 10)
- Returns:
id of an intruder word for the dimension
- Return type:
int
- remove_communities_dim_nnz(threshold_min=None, threshold_max=None)
Remove dimensions (communities) which are the least activated and those which are the most activated.
- Parameters:
threshold_min (int) – minimal number of non zero values to have for a dimension to be kept
threshold_max (int) – maximal number of non zero values to have for a dimension to be kept
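Filtering dimensions on their number of non-zero values can be sketched as follows. The helper names dim_nnz_counts and dims_to_keep are hypothetical, vectors are assumed to be {dimension: value} dicts, and both thresholds are treated as inclusive bounds, following the wording of the parameters above.

```python
def dim_nnz_counts(vectors, n_dims):
    """Count, for each dimension, how many vectors have a non-zero value on it."""
    counts = [0] * n_dims
    for vec in vectors:
        for dim, val in vec.items():
            if val != 0:
                counts[dim] += 1
    return counts


def dims_to_keep(counts, threshold_min=None, threshold_max=None):
    """Indices of dimensions whose count lies within the given bounds."""
    kept = []
    for dim, count in enumerate(counts):
        if threshold_min is not None and count < threshold_min:
            continue  # too rarely activated
        if threshold_max is not None and count > threshold_max:
            continue  # too often activated
        kept.append(dim)
    return kept


vectors = [{0: 1.0, 1: 0.5}, {0: 0.2}, {0: 0.3, 2: 0.4}, {0: 0.9, 1: 0.1}]
counts = dim_nnz_counts(vectors, n_dims=3)
kept = dims_to_keep(counts, threshold_min=2, threshold_max=3)
```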
- save(path=None)
Save a SINrVectors model.
- Parameters:
path (string) – Path of the pickle file of the model.
- set_communities(com)
Set the communities from the partition in communities.
- Parameters:
com (networkit.Partition) – partition in communities
- set_graph(G)
Set the graph property.
- Parameters:
G (networkit.Graph) – A networkit graph
- set_n_jobs(n_jobs)
Set the number of jobs.
- Parameters:
n_jobs – number of jobs
- set_np(np)
Set the embedding matrix.
- Parameters:
np (scipy.sparse.csr_matrix) – a sparse matrix of the embeddings
- set_vectors(embeddings)
Set the embedding vectors and initialize nearest neighbors.
- Parameters:
embeddings (scipy.sparse.csr_matrix) – initialize the vectors and build the nearest neighbors data structure using sklearn
- set_vocabulary(voc)
Set the vocabulary for word-co-occurrence graphs.
- Parameters:
voc – set the vocabulary when dealing with words or nodes with labels. The labels parameter is set to True. By default, labels from the vocabulary will be used.
- sparsify(k)
Sparsify the vectors keeping activated the top k dimensions
- Parameters:
k (int) – number of top dimensions to keep activated
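Keeping only the top k dimensions of a vector can be sketched as below, assuming a vector stored as a {dimension: value} dict. The function name sparsify_vector is hypothetical and ties are broken by the smallest dimension index, which is a choice made for this example.

```python
def sparsify_vector(vec, k):
    """Keep the k largest values of a {dimension: value} vector."""
    # Highest value first; ties are broken by the smallest dimension index.
    top = sorted(vec.items(), key=lambda item: (-item[1], item[0]))[:k]
    return dict(top)


vec = {0: 0.1, 3: 0.6, 5: 0.3, 7: 0.05}
sparse = sparsify_vector(vec, k=2)
```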
- class sinr.graph_embeddings.ThresholdedModelBuilder(sinr, name, n_jobs=-1, n_neighbors=31)
Bases:
ModelBuilder
Object that should be used after the training of word or graph embeddings using the SINr object to get interpretable word vectors. The ThresholdedModelBuilder will make use of the SINr object to build a SINrVectors object that will allow to use the resulting vectors efficiently. Values in the vectors that are lower than the threshold will be discarded. Vectors are then sparser and more interpretable. No need to use parent methods starting by “with”, those are included in the build function. Just provide the name of the model and build it.
- build(threshold=0.01)
Build ThresholdedModelBuilder which contains the vocabulary, the embeddings with values thresholded above a minimum and the communities.
- Parameters:
threshold – (Default value = 0.01)
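The thresholding described for this builder (values lower than the threshold are discarded) can be sketched on a single vector. The function name threshold_vector is hypothetical and the vector is assumed to be a {dimension: value} dict; the library works on sparse matrices instead.

```python
def threshold_vector(vec, threshold=0.01):
    """Discard the values of a vector that are lower than the threshold."""
    return {dim: val for dim, val in vec.items() if val >= threshold}


vec = {0: 0.005, 1: 0.01, 2: 0.4}
thresholded = threshold_vector(vec)
```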
- sinr.graph_embeddings.get_compact_lgcc(graph, word_to_idx)
Get a compacted graph with only nodes inside the largest connected component. Get the words with ids corresponding to the new node ids.
- Parameters:
graph (networkit graph) – The input graph
word_to_idx (dictionary) – The words mapped to their initial ids
- Returns:
The new graph and dictionary of words
- Return type:
networkit graph, dictionary
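The idea of compacting a graph to its largest connected component can be sketched without networkit. In this illustration the graph is assumed to be an undirected adjacency dict {node: set of neighbours}, the function name compact_lgcc is hypothetical, and surviving nodes are renumbered from 0 in increasing order of their original ids, which may differ from what the library does.

```python
from collections import deque


def compact_lgcc(adjacency, word_to_idx):
    """Return the largest connected component with node ids renumbered from 0."""
    seen, best = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    # Renumber surviving nodes 0..n-1, keeping their original relative order.
    new_id = {old: new for new, old in enumerate(sorted(best))}
    new_adj = {new_id[u]: {new_id[v] for v in adjacency[u]} for u in best}
    new_words = {w: new_id[i] for w, i in word_to_idx.items() if i in new_id}
    return new_adj, new_words


adjacency = {0: {1}, 1: {0}, 2: {3, 4}, 3: {2}, 4: {2}}
word_to_idx = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
new_adj, new_words = compact_lgcc(adjacency, word_to_idx)
```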
- sinr.graph_embeddings.get_graph_from_matrix(matrix)
Build a graph from a sparse adjacency matrix.
- Parameters:
matrix (scipy.sparse.coo_matrix) – A sparse matrix describing a graph
- sinr.graph_embeddings.get_lgcc(graph)
Return the nodes that are outside the Largest Connected Component (LgCC) of the graph.
- Parameters:
graph (networkit graph) – The graph for which to retrieve out of LgCC nodes
SINr NFM
- sinr.nfm.compute_NP(adjacency, membership_matrix)
Compute the node-predominance based on the adjacency matrix and the community-membership matrix of the graph.
- Parameters:
adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.
membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.
- Returns:
NP measures for each node and each community
- Return type:
Scipy.sparse.csr_matrix
- sinr.nfm.compute_NR(adjacency, membership_matrix)
Compute the node-recall based on the adjacency matrix and the community-membership matrix of the graph.
- Parameters:
adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.
membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.
- Returns:
NR measures for each node and each community
- Return type:
Scipy.sparse.csr_matrix
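A sketch of the node-recall computation, under an assumption that is not stated on this page: the node recall of a node for a community is taken to be the share of the node's weighted degree that goes to nodes of that community. The function name node_recall and the dict-based graph representation are hypothetical; the library works on sparse matrices.

```python
def node_recall(adjacency, membership):
    """adjacency: {node: {neighbour: weight}}; membership: list node -> community."""
    recall = {}
    for node, neighbours in adjacency.items():
        degree = sum(neighbours.values())
        row = {}
        # Accumulate the weight of the edges going to each community.
        for nb, weight in neighbours.items():
            com = membership[nb]
            row[com] = row.get(com, 0.0) + weight
        # l1-normalise so that the values of a node sum to 1 over communities.
        recall[node] = {c: w / degree for c, w in row.items()} if degree else {}
    return recall


adjacency = {
    0: {1: 1.0, 2: 1.0, 3: 2.0},
    1: {0: 1.0},
    2: {0: 1.0},
    3: {0: 2.0},
}
membership = [0, 0, 1, 1]
nr = node_recall(adjacency, membership)
```

In this example node 0 sends a quarter of its weighted degree to community 0 and three quarters to community 1.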
- sinr.nfm.distributed_degree(adjacency)
Make values in the adjacency matrix be between 0 and 1 depending on how the degree of the node is distributed over each community.
- Parameters:
adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.
- Returns:
l1 normalized adjacency matrix.
- Return type:
Scipy.sparse.csr_matrix
- sinr.nfm.get_community_weights(adjacency, membership_matrix)
Get the total weight of each community in terms of degree.
- Parameters:
adjacency (Scipy.sparse.csr_matrix) – Adjacency matrix of the graph.
membership_matrix (Scipy.sparse.csr_matrix) – Community membership matrix.
- Returns:
Degree-based weight of each community.
- Return type:
Scipy.sparse.csr_matrix
- sinr.nfm.get_membership(vector)
Return the membership matrix based on the community membership vector.
- Parameters:
vector (list[int]) – The vector of community index for each node
- Returns:
The community membership matrix of shape (#nodes x #communities).
- Return type:
Scipy.sparse.csr_matrix
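The shape of the membership matrix (#nodes x #communities) can be illustrated with a dense list of lists. This is only a sketch: the function name membership_matrix is hypothetical, community ids are assumed to be numbered from 0, and the library returns a sparse csr matrix rather than nested lists.

```python
def membership_matrix(vector):
    """Build a dense one-hot matrix of shape (#nodes x #communities)."""
    n_communities = max(vector) + 1  # assumes ids are 0..n_communities-1
    return [
        [1 if vector[node] == com else 0 for com in range(n_communities)]
        for node in range(len(vector))
    ]


matrix = membership_matrix([0, 2, 1, 0])
```

Each row contains exactly one 1, in the column of the community the node belongs to.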
- sinr.nfm.get_nfm_embeddings(G, vector, compute_np=False, merge=False)
Compute the Node F-Measure metrics to build the embedding matrix using the graph and community structure detected.
- Parameters:
G (networkit.Graph) – Graph on which to compute the embeddings
vector (list[int]) – The node-community membership vector
compute_np (bool, optional) – Compute the node predominance metric, defaults to False
merge (bool, optional) – Merge the NR and NP measure in a common matrix, defaults to False
- Returns:
The node predominance, node recall and merged matrix (nfm) if applicable.
- Return type:
tuple[Scipy.sparse.csr_matrix, Scipy.sparse.csr_matrix, Scipy.sparse.csr_matrix]
Loader
- sinr.strategy_loader.load_adj_mat(matrix, labels=None)
Load a cooccurrence matrix.
- Parameters:
matrix (csr_matrix) – an adjacency matrix
labels – a dict matching labels with nodes (Default value = None)
- Returns:
The loaded cooccurrence matrix and the word index.
- Return type:
tuple(dict(), scipy.sparse.coo_matrix)
- sinr.strategy_loader.load_pkl_text(mat_path)
Load a cooccurrence matrix.
- Parameters:
mat_path (str) – Path to the cooccurrence matrix.
- Returns:
The loaded cooccurrence matrix and the word index.
- Return type:
tuple(dict(), scipy.sparse.coo_matrix)
Visualization
- class sinr.viz.SINrViz(sinr_vectors: SINrVectors)
Bases:
object
Visualization package for SINr embeddings. The goal is to visualize and interpret the dimensions of the embeddings produced.
- compare_stereotypes(args, topk_dim=5)
Make a heatmap comparing top dimensions for elements in args (words).
- Parameters:
args (list[int]) – A list of indices (words).
topk_dim (int, optional) – Number of top dimensions to fetch, defaults to 5
Logger
Module contents
Top-level package for SINr Embeddings.