SINr Text

Co-occurrence

class sinr.text.cooccurrence.Cooccurrence

Bases: object

Class for constructing a cooccurrence matrix from a corpus.

A dictionary mapping the vocabulary of the corpus to ids in lexicographic order is constructed along with the cooccurrence matrix.

fit(corpus, window=2)

Perform a pass through the corpus to construct the cooccurrence matrix.

Parameters:
  • corpus (list[list[str]]) – List of lists of strings (words) from the corpus.

  • window (int) – The length of the (symmetric) context window used for cooccurrence, defaults to 2
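To make the window parameter concrete, the sketch below counts cooccurrences within a symmetric window in plain Python. It illustrates the general technique only and is not the library's implementation: the helper name count_cooccurrences is hypothetical, and reading window as the number of neighbours considered on each side of a word is an assumption.

```python
from collections import Counter


def count_cooccurrences(corpus, window=2):
    """Count unordered word pairs found within `window` tokens of each other.

    Hypothetical helper for illustration only; not part of sinr.
    """
    counts = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Looking only to the right counts each unordered pair once.
            for neighbour in sentence[i + 1 : i + 1 + window]:
                pair = tuple(sorted((word, neighbour)))
                counts[pair] += 1
    return counts


corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = count_cooccurrences(corpus, window=1)
```

With window=1 only adjacent words are paired, so "the" and "sat" never cooccur here; widening the window to 2 adds that pair.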

classmethod load(filename)

Load Cooccurrence object from pickle.

Parameters:

filename (str) – Path to the pickle file.

Returns:

An instance of the Cooccurrence class.

Return type:

sinr.text.cooccurrence.Cooccurrence

save(filename)

Save cooccurrence object to a pickle file.

Parameters:

filename (str) – Output path of the pickle file.

Co-occurrence Cython

sinr.text.cooccurrence_cython.construct_cooccurrence_matrix()

Construct the word-id dictionary and cooccurrence matrix for a given corpus, using a given window size.

The dictionary is constructed in lexicographic order. The matrix counts the cooccurrences of words regardless of the order in which they appear in the corpus. Consequently, the cooccurrence matrix is upper triangular (undirected graph in SINr).

Parameters:
  • corpus (list[str]) – The sentences from the corpus.

  • dictionary (dict[str, int]) – A dictionary mapping words to their ids in the vocabulary (in lexicographic order)

  • window_size (int) – The size of the symmetric moving window.

Returns:

The cooccurrence matrix built from the corpus.

Return type:

scipy.sparse.coo_matrix
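The two ideas described above, ids assigned in lexicographic order and an upper triangular matrix, can be sketched in plain Python as follows. This is an illustration under assumptions, not the Cython implementation: the helper names are hypothetical, and sentences are taken to be already split into tokens.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Map each word to an id, assigning ids in lexicographic order."""
    words = sorted({word for sentence in corpus for word in sentence})
    return {word: idx for idx, word in enumerate(words)}


def upper_triangular_triplets(corpus, dictionary, window_size=2):
    """Return (rows, cols, data) with row <= col for every stored entry."""
    counts = Counter()
    for sentence in corpus:
        ids = [dictionary[word] for word in sentence]
        for i, left in enumerate(ids):
            for right in ids[i + 1 : i + 1 + window_size]:
                # Store the pair with the smaller id first (upper triangle).
                counts[(min(left, right), max(left, right))] += 1
    entries = sorted(counts.items())
    rows = [r for (r, _), _ in entries]
    cols = [c for (_, c), _ in entries]
    data = [n for _, n in entries]
    return rows, cols, data


corpus = [["b", "a", "c"], ["a", "b"]]
dictionary = build_vocabulary(corpus)
rows, cols, data = upper_triangular_triplets(corpus, dictionary, window_size=1)
```

Triplets in this form can be handed to scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n, n)), where n is the vocabulary size.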

PMI Filtering

sinr.text.pmi.pmi(X, py=None, min_pmi=0, alpha=0.0, beta=1)
Parameters:
  • X (scipy.sparse.csr_matrix) – (word, word) sparse matrix

  • py (numpy.ndarray) – (1, word) shape, probability of context words. (Default value = None)

  • min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0

  • alpha (float) – Smoothing factor. pmi(x,y; alpha) = p_xy /(p_x * (p_y + alpha)), defaults to 0.0

  • beta (int) – Smoothing factor. pmi(x,y) = log ( Pxy / (Px x Py^beta) ), defaults to 1.0

Returns:

A dictionary containing the PPMI matrix, the probability of words and the exponential PMI matrix ‘(pmi, px, py, exp_pmi)’. The PMI matrix is a (word, word) sparse matrix.

Raises:

ValueError – if beta > 1 or beta < 0 (“beta value {} is not in range ]0,1]”).

Return type:

list[scipy.sparse.csr_matrix, numpy.ndarray, numpy.ndarray, scipy.sparse.csr_matrix]
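The role of alpha, beta and min_pmi can be illustrated on a single pair of words. The sketch below is not the library's code: it assumes a natural logarithm, assumes the two smoothing formulas combine as log(p_xy / (p_x * (p_y^beta + alpha))), and uses hypothetical helper names.

```python
import math


def smoothed_pmi(p_xy, p_x, p_y, alpha=0.0, beta=1.0):
    """PMI of one word pair with both smoothing factors applied.

    Combining alpha and beta in a single formula is an assumption of this sketch.
    """
    if beta > 1 or beta < 0:
        raise ValueError("beta value {} is not in range ]0,1]".format(beta))
    return math.log(p_xy / (p_x * (p_y ** beta + alpha)))


def threshold(pmi_value, min_pmi=0):
    """Reset values smaller than min_pmi to zero."""
    return pmi_value if pmi_value >= min_pmi else 0.0


plain = smoothed_pmi(0.1, 0.2, 0.25)               # log(0.1 / 0.05)
smoothed = smoothed_pmi(0.1, 0.2, 0.25, beta=0.75)  # larger denominator
```

Because p_y is below 1, raising it to a power beta < 1 increases the denominator, so the smoothed value is lower than the plain one.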

sinr.text.pmi.pmi_filter(X, py=None, min_pmi=0, alpha=0.0, beta=1)

Filter a (word, word) matrix by computing the PMI. Exclude the records for which the PMI is lower than a threshold min_pmi.

Parameters:
  • X (scipy.sparse.csr_matrix) – (word, word) sparse matrix

  • py (numpy.ndarray) – (1, word) shape, probability of context words. (Default value = None)

  • min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0

  • alpha (float) – Smoothing factor. pmi(x,y; alpha) = p_xy /(p_x * (p_y + alpha)), defaults to 0.0

  • beta (int) – Smoothing factor. pmi(x,y) = log ( Pxy / (Px x Py^beta) ), defaults to 1.0

Returns:

A dictionary containing the PPMI matrix, the probability of words and the exponential PMI matrix ‘(pmi, px, py, exp_pmi)’. The PMI matrix is a (word, word) sparse matrix.

Raises:

ValueError – if beta > 1 or beta < 0 (“beta value {} is not in range ]0,1]”).

Return type:

scipy.sparse.coo_matrix
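A minimal sketch of PMI-based filtering on (row, col, count) triplets follows. It is an illustration only: the helper name is hypothetical, probabilities are estimated from row and column sums of a full (not triangular) matrix, no smoothing is applied, and keeping the original counts for the surviving entries is an assumption.

```python
import math


def filter_by_pmi(rows, cols, data, min_pmi=0.0):
    """Keep only the cooccurrence entries whose PMI reaches min_pmi."""
    total = sum(data)
    row_sums, col_sums = {}, {}
    for r, c, n in zip(rows, cols, data):
        row_sums[r] = row_sums.get(r, 0) + n
        col_sums[c] = col_sums.get(c, 0) + n
    kept = []
    for r, c, n in zip(rows, cols, data):
        # p_xy / (p_x * p_y) simplifies to n * total / (row_sum * col_sum).
        pmi_value = math.log(n * total / (row_sums[r] * col_sums[c]))
        if pmi_value >= min_pmi:
            kept.append((r, c, n))
    return kept


rows, cols, data = [0, 0, 1, 1], [0, 1, 0, 1], [8, 2, 2, 8]
kept = filter_by_pmi(rows, cols, data, min_pmi=0.0)
```

In this toy matrix the diagonal pairs occur more often than chance predicts and are kept, while the off-diagonal pairs have negative PMI and are dropped.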

Preprocess Text

class sinr.text.preprocess.Corpus(register, language, input_path)

Bases: object

LANGUAGE_EN = 'en'
LANGUAGE_FR = 'fr'
REGISTER_NEWS = 'news'
REGISTER_WEB = 'web'

class sinr.text.preprocess.VRTMaker(corpus: Corpus, output_path, n_jobs=1, spacy_size='lg')

Bases: object

do_txt_to_vrt(separator='sentence')

Build VRT format file and write to output filepath.

Parameters:

separator (str) – Whether the text is preprocessed by sentences (the default) or by custom documents (separator tag)

sinr.text.preprocess.extract_text(corpus_path, exceptions_path=None, lemmatize=True, stop_words=False, lower_words=True, number=False, punct=False, exclude_pos=[], en='chunking', min_freq=50, alpha=True, exclude_en=[], min_length_word=3, min_length_doc=2, dict_filt=[])

Extracts the text from a VRT corpus file.

Parameters:
  • corpus_path – str

  • lemmatize – bool (Default value = True)

  • stop_words – bool (Default value = False)

  • lower_words – bool (Default value = True)

  • number – bool (Default value = False)

  • punct – bool (Default value = False)

  • exclude_pos – list (Default value = [])

  • en – str (“chunking”, “tagging”, “deleting”) (Default value = “chunking”)

  • min_freq – int (Default value = 50)

  • alpha – bool (Default value = True)

  • exclude_en – list (Default value = [])

  • min_length_word – int (Default value = 3)

  • min_length_doc (int) – The minimum number of tokens for a document (or sentence) to be kept (Default value = 2)

  • dict_filt (list) – List of words used to restrict the output to a specific vocabulary

Returns:

text (list[list[str]]) – A list of documents, each a list of words
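How a few of these filters interact can be sketched on documents that are already tokenized. This is not the library's code, which reads a VRT file and relies on linguistic annotations: the helper name is hypothetical, and the order of the filters and the inclusive thresholds are assumptions.

```python
from collections import Counter


def filter_documents(docs, lower_words=True, min_freq=50,
                     min_length_word=3, min_length_doc=2):
    """Apply a few of the documented filters to already tokenized documents."""
    if lower_words:
        docs = [[tok.lower() for tok in doc] for doc in docs]
    # Drop tokens that are too short.
    docs = [[tok for tok in doc if len(tok) >= min_length_word] for doc in docs]
    # Drop tokens that are too rare across the whole corpus.
    freq = Counter(tok for doc in docs for tok in doc)
    docs = [[tok for tok in doc if freq[tok] >= min_freq] for doc in docs]
    # Drop documents left with too few tokens.
    return [doc for doc in docs if len(doc) >= min_length_doc]


docs = [["The", "Cat", "sleeps"], ["A", "cat", "sleeps"], ["Dogs", "bark"]]
text = filter_documents(docs, min_freq=2)
```

The third document disappears entirely because both of its words fall below the frequency threshold, which leaves it shorter than min_length_doc.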

sinr.text.preprocess.open_corpus(corpus_path)
Parameters:

corpus_path

Evaluate

sinr.text.evaluate.best_predicted_word(sinr_vec, word_a, word_b, word_c)

Solve analogy of the type A is to B as C is to D

Parameters:
  • sinr_vec – SINrVectors object

  • word_a – string

  • word_b – string

  • word_c – string

Returns:

best predicted word of the dataset (word D) or None if not in the vocab.
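Analogies of this kind are commonly solved with the vector offset method: look for the word whose vector is closest to b - a + c. The sketch below shows that method on toy vectors; whether SINr computes the prediction exactly this way is an assumption, and the helper names and vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, word_a, word_b, word_c):
    """Return the word closest to b - a + c, or None if a word is unknown."""
    if any(w not in vectors for w in (word_a, word_b, word_c)):
        return None
    target = [b - a + c for a, b, c in
              zip(vectors[word_a], vectors[word_b], vectors[word_c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (word_a, word_b, word_c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target), default=None)


vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.2, 0.1, 0.0],
}
predicted = solve_analogy(vectors, "man", "woman", "king")
```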

sinr.text.evaluate.best_predicted_word_k(sinr_vec, word_a, word_b, word_c, k=1)

Predict the best word for the analogy A is to B as C is to D with k best words.

Parameters:
  • sinr_vec – SINrVectors object

  • word_a – string

  • word_b – string

  • word_c – string

  • k – int, number of best words to return (default is 1)

Returns:

list of k best predicted words of the dataset or None if not in the vocab.

Return type:

list of strings

sinr.text.evaluate.clf_fit(X_train, y_train, clf=XGBClassifier(...))

Fit a classification model according to the given training data.

Parameters:
  • X_train (list of vectors) – training data

  • y_train (numpy.ndarray) – labels

  • clf (classifier (ex.: xgboost.XGBClassifier, sklearn.svm.SVC)) – classifier

Returns:

Fitted classifier

Return type:

classifier

sinr.text.evaluate.clf_score(clf, X_test, y_test, scoring='accuracy', params={})

Evaluate classification on given test data.

Parameters:
  • clf (classifier (ex.: xgboost.XGBClassifier, sklearn.svm.SVC)) – classifier

  • X_test (list of vectors) – test data

  • y_test (numpy.ndarray) – labels

  • scoring (str) – scikit-learn scorer object, default=’accuracy’

  • params (dictionary) – parameters for the scorer object

Returns:

Score

Return type:

float

sinr.text.evaluate.clf_xgb_interpretability(sinr_vec, xgb, interpreter, topk_dim=10, topk=5, importance_type='gain')

Interpretability of the main dimensions used by the xgboost classifier.

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object from which the data were vectorized

  • xgb (xgboost.XGBClassifier) – fitted xgboost classifier

  • interpreter (str) – whether stereotypes or descriptors are requested

  • topk_dim (int) – Number of features requested among the main features used by the classifier (Default value = 10)

  • topk (int) – topk value to consider on each dimension (Default value = 5)

  • importance_type – ‘weight’: the number of times a feature is used to split the data across all trees; ‘gain’: the average gain across all splits the feature is used in; ‘cover’: the average coverage across all splits the feature is used in; ‘total_gain’: the total gain across all splits the feature is used in; ‘total_cover’: the total coverage across all splits the feature is used in

Returns:

Interpreters of dimensions, importance of dimensions

Return type:

list of set of object, list of tuple

sinr.text.evaluate.compute_analogy_normalized(sinr_vec, word_a, word_b, word_c)

Solve analogy of the type A is to B as C is to D with normalized values

Parameters:
  • sinr_vec – SINrVectors object

  • word_a – string

  • word_b – string

  • word_c – string

Returns:

best predicted word of the dataset

Return type:

string

sinr.text.evaluate.compute_analogy_sparse_normalized(sinr_vec, word_a, word_b, word_c, n=100)

Solve analogy of the type A is to B as C is to D with sparsification and normalization.

Parameters:
  • sinr_vec – SINrVectors object

  • word_a – string

  • word_b – string

  • word_c – string

  • n – int, number of dimensions to keep after sparsification

Returns:

best predicted word of the dataset (word D) or None if not in the vocab.

Return type:

string

sinr.text.evaluate.compute_analogy_value_zero(sinr_vec, word_a, word_b, word_c)

Solve analogy of the type A is to B as C is to D with only positive values in the resulting vector

Parameters:
  • sinr_vec – SINrVectors object

  • word_a – string

  • word_b – string

  • word_c – string

Returns:

best predicted word of the dataset (word D) or None if not in the vocab.

Return type:

string

sinr.text.evaluate.compute_direct_bias_sinr(sinr_vec, word_list, gender_direction, c=1)

Computes the direct bias of a set of words with respect to the gender direction using cosine similarity.

Parameters:
  • sinr_vec – SINr model.

  • word_list – List of words to analyze (professions in config.json).

  • gender_direction – Gender direction vector.

  • c – Exponent applied to cosine similarity (default is c=1).

Returns:

Direct bias value.

Return type:

float
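As a rough illustration, direct bias is usually defined as the mean over the word list of |cos(w, g)|^c, where g is the gender direction. The sketch below applies that definition to plain lists; that SINr averages absolute cosines in exactly this way is an assumption, and the helper names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direct_bias(word_vectors, gender_direction, c=1):
    """Mean of |cos(w, g)| ** c over the given word vectors."""
    total = sum(abs(cosine(w, gender_direction)) ** c for w in word_vectors)
    return total / len(word_vectors)


gender_direction = [1.0, 0.0]
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = direct_bias(word_vectors, gender_direction, c=2)
```

A larger exponent c shrinks the contribution of weakly aligned words, so the measure becomes stricter about what counts as biased.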

sinr.text.evaluate.compute_indirect_bias_sinr(sinr_vec, word1, word2, direction)

Compute the indirect bias of a SINr model.

Parameters:
  • sinr_vec – SINr model.

  • word1 – The first word.

  • word2 – The second word.

  • direction – The gender direction.

Returns:

The gender component of the similarity between the two words.

sinr.text.evaluate.dist_ratio(sinr_vec, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)

DistRatio of the model

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)

  • prctbot (int) – bottom prctbot to pick (defaults to 50)

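  • prcttop (int) – top prcttop to pick (defaults to 10)

  • nbtopk (int) – number of top words to pick (defaults to 5)

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity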

Returns:

DistRatio of the model

Return type:

float

sinr.text.evaluate.dist_ratio_dim(sinr_vec, dim, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)

DistRatio for one dimension of the model

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • dim (int) – the index of the dimension for which to get the DistRatio

  • union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)

  • prctbot (int) – bottom prctbot to pick (defaults to 50)

  • prcttop (int) – top prcttop to pick (defaults to 10)

  • nbtopk (int) – number of top words to pick (defaults to 5)

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity

Returns:

DistRatio for dimension dim

Return type:

float

sinr.text.evaluate.eval_analogy(sinr_vec, dataset, analogy_func)

Compare the predicted with the expected word.

Parameters:
  • sinr_vec – SINrVectors object

  • dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

  • analogy_func – function to use for analogy prediction

Returns:

error rate

Return type:

float

sinr.text.evaluate.eval_analogy_by_category_k(sinr_vec, dataset, analogy_func, k=1)

Evaluate analogy by category with k best words

Parameters:
  • sinr_vec – SINrVectors object

  • dataset – sklearn.datasets.base.Bunch

  • analogy_func – function to use for analogy prediction

  • k – int, number of best words to return (default is 1)

Returns:

dictionary with categories as keys and error rates as values

Return type:

dict

sinr.text.evaluate.eval_analogy_k(sinr_vec, dataset, analogy_func, k=1)

Compare the predicted with the expected word with k best words.

Parameters:
  • sinr_vec – SINrVectors object

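  • dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

  • analogy_func – function to use for analogy prediction

  • k – int, number of best words to return (default is 1)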

Returns:

error rate

Return type:

float
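One way to read an error rate with k best words: an analogy counts as solved when the expected word appears among the first k predictions. The sketch below follows that reading on made-up data; the helper name is hypothetical, and counting out-of-vocabulary analogies (prediction None) as errors is an assumption.

```python
def error_rate_at_k(predictions, expected, k=1):
    """Share of analogies whose expected word is not among the first k predictions."""
    errors = 0
    for ranked, gold in zip(predictions, expected):
        # `ranked` is None when a word of the analogy is out of vocabulary.
        if ranked is None or gold not in ranked[:k]:
            errors += 1
    return errors / len(expected)


# Made-up ranked predictions and expected answers.
predictions = [["queen", "princess"], ["paris", "lyon"], None, ["walked", "walking"]]
expected = ["queen", "lyon", "rome", "walking"]
rate_at_1 = error_rate_at_k(predictions, expected, k=1)
rate_at_2 = error_rate_at_k(predictions, expected, k=2)
```

The error rate can only stay equal or decrease as k grows, since a larger k accepts more candidate answers.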

sinr.text.evaluate.eval_similarity(sinr_vec, dataset, print_missing=True)

Evaluate similarity with Spearman correlation

Parameters:
  • sinr_vec – SINrVectors object

  • dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

  • print_missing – boolean (default : True)

Returns:

Spearman correlation between cosine similarity and human rated similarity

Return type:

float
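Spearman correlation is the Pearson correlation of the ranks of two score lists. The pure-Python sketch below shows the computation on made-up scores; in practice scipy.stats.spearmanr does the same job, and the helper names here are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


model_scores = [0.9, 0.1, 0.5, 0.7]  # cosine similarities (made-up values)
human_scores = [9.0, 1.5, 4.0, 8.0]  # human ratings (made-up values)
rho = spearman(model_scores, human_scores)
```

Only the ordering of the pairs matters, so the model is not penalized for producing scores on a different scale than the human ratings.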

sinr.text.evaluate.fetch_SimLex(which='665')

Fetch SimLex datasets for testing relatedness similarity

Parameters:

which (str) – dataset (default value = “665”)

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_analogy(langage)

Fetch dataset for testing analogies

Parameters:

langage (str) – language of the dataset

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix of 4 words per column

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_MEN()

Fetch MEN dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_SCWS()

Fetch SCWS dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_WS353()

Fetch WS353 dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.find_txt_files(directory)

Find all text files in a directory and its subdirectories.

Parameters:

directory (str) – path to the directory

Returns:

list of text files

Return type:

list
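A recursive search of this kind can be written with os.walk. The sketch below is an illustration rather than the library's code: the helper name is hypothetical, and treating "text files" as files ending in .txt is an assumption.

```python
import os
import tempfile


def list_txt_files(directory):
    """Recursively collect the paths of all .txt files under `directory`."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".txt"):
                found.append(os.path.join(root, name))
    return sorted(found)


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.md", os.path.join("sub", "c.txt")):
        open(os.path.join(tmp, rel), "w").close()
    names = [os.path.relpath(p, tmp) for p in list_txt_files(tmp)]
```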

sinr.text.evaluate.format_lines(content)

Format lines of the content.

Parameters:

content (str) – content of the file

Returns:

formatted content

Return type:

str

sinr.text.evaluate.identify_gender_direction_sinr(sinr_vec, definitional_pairs, method='pca', positive_end='brother', negative_end='sister')

Identifies the gender direction in a SINr model.

Parameters:
  • sinr_vec – SINr model.

  • positive_end – word representing the masculine gender.

  • negative_end – word representing the feminine gender.

  • definitional_pairs – list of word pairs defining gender.

  • method – method used to compute the gender direction (‘single’, ‘sum’, ‘pca’).

Returns:

A vector representing the gender direction.

sinr.text.evaluate.load_config(path)

sinr.text.evaluate.normalize_vector(vector)

Normalize a vector.

Parameters:

vector (numpy.ndarray) – vector to normalize

Returns:

normalized vector

Return type:

numpy.ndarray

sinr.text.evaluate.plot_category_error_rates(sinr_vec, file_path, best_predicted_word_k, ks)

Plot error rates by category for different values of k

Parameters:
  • sinr_vec – SINrVectors object

  • file_path – path to the dataset file

  • best_predicted_word_k – function to use for analogy prediction

  • ks – list of k values to evaluate - [1, 2, 5, 10]

sinr.text.evaluate.plot_global_error_rates(sinr_vec, file_path, best_predicted_word_k, ks)

Plot global error rates for different values of k

Parameters:
  • sinr_vec – SINrVectors object

  • file_path – path to the dataset file

  • best_predicted_word_k – function to use for analogy prediction

  • ks – list of k values to evaluate - [1, 2, 5, 10]

sinr.text.evaluate.project_vector(v, u)

sinr.text.evaluate.reject_vector(v, u)

Compute the orthogonal projection of a vector onto a given direction.
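For reference, projection and rejection in the usual linear algebra sense can be sketched as follows: the projection of v onto u is the component of v along u, and the rejection is what remains once that component is removed. The argument order (v is decomposed with respect to the direction u) is an assumption, and the functions below are plain-list stand-ins, not the library's code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(v, u):
    """Component of v along the direction u: (v.u / u.u) * u."""
    scale = dot(v, u) / dot(u, u)
    return [scale * a for a in u]


def reject(v, u):
    """Component of v orthogonal to u: v minus its projection onto u."""
    return [a - b for a, b in zip(v, project(v, u))]


v, u = [3.0, 4.0], [1.0, 0.0]
along = project(v, u)
orthogonal = reject(v, u)
```

The two components always add back up to v, and the rejection has a zero dot product with u.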

sinr.text.evaluate.remove_invalid_lines(content)

Remove invalid lines from the content.

Parameters:

content (str) – content of the file

Returns:

cleaned content

Return type:

str

sinr.text.evaluate.similarity_MEN_WS353_SCWS(sinr_vec, print_missing=True)

Evaluate similarity with MEN, WS353 and SCWS datasets

Parameters:
  • sinr_vec – SINrVectors object

  • print_missing – boolean (default : True)

Returns:

Spearman correlation for MEN, WS353 and SCWS datasets

Return type:

dict

sinr.text.evaluate.vectorizer(sinr_vec, X, y=[])

Vectorize preprocessed documents into SINr embeddings

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • X (list[list[str]]) – preprocessed documents: a list of documents, each a list of words

  • y (numpy.ndarray) – documents labels

Returns:

list of vectors

Module contents

Package to preprocess text into word co-occurrence graphs prior to running SINr.