SINr Text
Co-occurrence
- class sinr.text.cooccurrence.Cooccurrence
Bases:
object
Class for constructing a cooccurrence matrix from a corpus.
A dictionary mapping the vocabulary of the corpus in lexicographic order is constructed together with the cooccurrence matrix.
- fit(corpus, window=2)
Perform a pass through the corpus to construct the cooccurrence matrix.
- Parameters:
corpus (list[list[str]]) – List of lists of strings (words) from the corpus.
window (int) – The length of the (symmetric) context window used for cooccurrence, defaults to 2
- classmethod load(filename)
Load Cooccurrence object from pickle.
- Parameters:
filename (str) – Path to the pickle file.
- Returns:
An instance of the Cooccurrence class.
- Return type:
SINr.cooccurrence.Cooccurrence
- save(filename)
Save cooccurrence object to a pickle file.
- Parameters:
filename (str) – Output path to the filename of the pickle file.
Co-occurrence Cython
- sinr.text.cooccurrence_cython.construct_cooccurrence_matrix()
Construct the word-id dictionary and cooccurrence matrix for a given corpus, using a given window size.
The dictionary is constructed in lexicographic order. The matrix counts the cooccurrences of words regardless of the order in which they appear in the corpus. Consequently, the cooccurrence matrix is upper triangular (undirected graph in SINr).
- Parameters:
corpus (list[str]) – The sentences from the corpus.
dictionary (dict[str, int]) – A dictionary mapping words to their ids in the vocabulary (in lexicographic order)
window_size (int) – The size of the symmetric moving window.
- Returns:
The cooccurrence matrix built from the corpus.
- Return type:
scipy.sparse.coo
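The counting described above can be sketched in plain Python. This is a minimal illustration, not the Cython implementation: the function name `cooccurrence_counts` is hypothetical, counts are kept in a `Counter` rather than a `scipy.sparse` matrix, and ignoring a word co-occurring with itself is an assumption.

```python
from collections import Counter


def cooccurrence_counts(corpus, window=2):
    """Count symmetric-window cooccurrences, stored in upper-triangular form."""
    # Vocabulary ids follow the lexicographic order of the words.
    words = sorted({word for sentence in corpus for word in sentence})
    vocab = {word: idx for idx, word in enumerate(words)}
    counts = Counter()
    for sentence in corpus:
        ids = [vocab[word] for word in sentence]
        for pos, left in enumerate(ids):
            # Looking only to the right visits each unordered pair once.
            for right in ids[pos + 1 : pos + 1 + window]:
                if left == right:
                    continue  # assumption: self cooccurrences are ignored
                # Order the pair so that row < column (upper triangle).
                counts[(min(left, right), max(left, right))] += 1
    return vocab, counts


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, counts = cooccurrence_counts(corpus, window=2)
print(vocab)
print(dict(counts))
```

Storing each pair under `(min, max)` is what makes the result upper triangular: the pair ("sat", "the") is counted in a single cell whichever word comes first in the sentence.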
PMI Filtering
- sinr.text.pmi.pmi(X, py=None, min_pmi=0, alpha=0.0, beta=1)
- Parameters:
X (scipy.sparse.csr_matrix) – (word, word) sparse matrix
py (numpy.ndarray) – (1, word) shape, probability of context words. (Default value = None)
min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0
alpha (float) – Smoothing factor. pmi(x,y; alpha) = p_xy / (p_x * (p_y + alpha)), defaults to 0.0
beta (int) – Smoothing factor. pmi(x,y) = log(p_xy / (p_x * p_y^beta)), defaults to 1
- Returns:
A dictionary containing the PPMI matrix, the probability of words and the exponential PMI matrix ‘(pmi, px, py, exp_pmi)’. (word, word) PMI value sparse matrix.
- Raises:
ValueError – if beta > 1 or beta < 0 (message: “beta value {} is not in range ]0,1]”).
- Return type:
list[scipy.sparse.csr_matrix, numpy.ndarray, numpy.ndarray, scipy.sparse.csr_matrix]
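As a worked illustration of the formulas in the parameter list, the sketch below computes PMI from a small dictionary of counts. It is not the library's code: `pmi_from_counts` is a hypothetical name, the input is a plain dict instead of a sparse matrix, and combining `alpha` and `beta` in a single denominator is an assumption (with the defaults `alpha=0.0`, `beta=1` both documented formulas coincide).

```python
import math


def pmi_from_counts(counts, min_pmi=0.0, alpha=0.0, beta=1.0):
    """PMI for a dict {(row_word, col_word): count}; values below min_pmi are dropped."""
    if beta > 1 or beta < 0:
        raise ValueError("beta value {} is not in range ]0,1]".format(beta))
    total = sum(counts.values())
    row_sum, col_sum = {}, {}
    for (x, y), n in counts.items():
        row_sum[x] = row_sum.get(x, 0) + n
        col_sum[y] = col_sum.get(y, 0) + n
    result = {}
    for (x, y), n in counts.items():
        p_xy = n / total
        p_x = row_sum[x] / total
        p_y = col_sum[y] / total
        # Assumption: both smoothing factors act on the context probability.
        value = math.log(p_xy / (p_x * (p_y ** beta + alpha)))
        if value >= min_pmi:  # smaller values stay at zero, i.e. are not stored
            result[(x, y)] = value
    return result


counts = {("a", "b"): 4, ("a", "c"): 1, ("d", "b"): 1, ("d", "c"): 4}
scores = pmi_from_counts(counts, min_pmi=0)
print(scores)
```

With these counts every marginal probability is 0.5, so the frequent pairs get log(0.4 / 0.25) = log(1.6) and the rare pairs get a negative PMI, which the `min_pmi=0` threshold removes.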
- sinr.text.pmi.pmi_filter(X, py=None, min_pmi=0, alpha=0.0, beta=1)
Filter a (word, word) matrix by computing the PMI. Exclude the records for which the PMI is lower than the threshold min_pmi.
- Parameters:
X (scipy.sparse.csr_matrix) – (word, word) sparse matrix
py (numpy.ndarray) – (1, word) shape, probability of context words. (Default value = None)
min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0
alpha (float) – Smoothing factor. pmi(x,y; alpha) = p_xy / (p_x * (p_y + alpha)), defaults to 0.0
beta (int) – Smoothing factor. pmi(x,y) = log(p_xy / (p_x * p_y^beta)), defaults to 1
- Returns:
A dictionary containing the PPMI matrix, the probability of words and the exponential PMI matrix ‘(pmi, px, py, exp_pmi)’. (word, word) PMI value sparse matrix.
- Raises:
ValueError – if beta > 1 or beta < 0 (message: “beta value {} is not in range ]0,1]”).
- Return type:
scipy.sparse.coo_matrix
Preprocess Text
- class sinr.text.preprocess.Corpus(register, language, input_path)
Bases:
object
- LANGUAGE_EN = 'en'
- LANGUAGE_FR = 'fr'
- REGISTER_NEWS = 'news'
- REGISTER_WEB = 'web'
- class sinr.text.preprocess.VRTMaker(corpus: Corpus, output_path, n_jobs=1, spacy_size='lg')
Bases:
object
- do_txt_to_vrt(separator='sentence')
Build VRT format file and write to output filepath.
- Parameters:
separator (str) – Whether to preprocess by sentences (default value) or by custom documents (separator tag)
- sinr.text.preprocess.extract_text(corpus_path, exceptions_path=None, lemmatize=True, stop_words=False, lower_words=True, number=False, punct=False, exclude_pos=[], en='chunking', min_freq=50, alpha=True, exclude_en=[], min_length_word=3, min_length_doc=2, dict_filt=[])
Extracts the text from a VRT corpus file.
- Parameters:
corpus_path – str
lemmatize – bool (Default value = True)
stop_words – bool (Default value = False)
lower – bool
number – bool (Default value = False)
punct – bool (Default value = False)
exclude_pos – list (Default value = [])
en – str (“chunking”, “tagging”, “deleting”) (Default value = “chunking”)
min_freq – int (Default value = 50)
alpha – bool (Default value = True)
exclude_en – list (Default value = [])
lower_words – (Default value = True)
min_length_word – (Default value = 3)
min_length_doc (int) – The minimal number of token for a document (or sentence) to be kept (Default value = 2)
dict_filt (list) – List of words to keep only specific vocabulary
- Returns:
text (list(list(str))): A list of documents containing words
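The effect of the `min_freq`, `min_length_word` and `min_length_doc` thresholds can be sketched on already tokenized documents. This is only an illustration under stated assumptions: `filter_documents` is a hypothetical helper, it does not read VRT files or use spaCy annotations, frequencies are counted after lowercasing and before any other filtering, and all thresholds are treated as inclusive.

```python
from collections import Counter


def filter_documents(docs, min_freq=50, min_length_word=3, min_length_doc=2,
                     lower_words=True):
    """Keep frequent, long-enough words, then drop documents that became too short."""
    if lower_words:
        docs = [[word.lower() for word in doc] for doc in docs]
    # Assumption: word frequency is measured over the whole corpus.
    freq = Counter(word for doc in docs for word in doc)
    kept = []
    for doc in docs:
        tokens = [
            word for word in doc
            if len(word) >= min_length_word and freq[word] >= min_freq
        ]
        if len(tokens) >= min_length_doc:
            kept.append(tokens)
    return kept


docs = [
    ["The", "cat", "sat", "on", "the", "mat"],
    ["A", "cat", "sat"],
    ["Dogs", "bark"],
]
filtered = filter_documents(docs, min_freq=2)
print(filtered)
```

The result has the same `list(list(str))` shape as the documented return value, so it could be passed to a cooccurrence step expecting a list of tokenized documents.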
- sinr.text.preprocess.open_corpus(corpus_path)
- Parameters:
corpus_path –
Evaluate
- sinr.text.evaluate.best_predicted_word(sinr_vec, word_a, word_b, word_c)
Solve analogy of the type A is to B as C is to D
- Parameters:
sinr_vec – SINrVectors object
word_a – string
word_b – string
word_c – string
- Returns:
best predicted word of the dataset (word D) or None if not in the vocab.
- sinr.text.evaluate.best_predicted_word_k(sinr_vec, word_a, word_b, word_c, k=1)
Predict the best word for the analogy A is to B as C is to D with k best words.
- Parameters:
sinr_vec – SINrVectors object
word_a – string
word_b – string
word_c – string
k – int, number of best words to return (default is 1)
- Returns:
list of k best predicted words of the dataset or None if not in the vocab.
- Return type:
list of strings
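A common way to solve "A is to B as C is to D" is the vector-offset method: rank candidate words by cosine similarity to `B - A + C`. The sketch below assumes that formulation, which this page does not confirm for SINr; it uses a plain dict of vectors in place of a SINrVectors object, and `analogy_top_k` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_top_k(vectors, word_a, word_b, word_c, k=1):
    """Return the k words closest to B - A + C, or None if a query word is unknown."""
    if any(word not in vectors for word in (word_a, word_b, word_c)):
        return None
    target = [
        b - a + c
        for a, b, c in zip(vectors[word_a], vectors[word_b], vectors[word_c])
    ]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (word_a, word_b, word_c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:k]


vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, 0.0],
}
print(analogy_top_k(vectors, "man", "woman", "king", k=1))
print(analogy_top_k(vectors, "man", "woman", "king", k=2))
```

Returning `None` for an out-of-vocabulary query word mirrors the documented behaviour of the functions above.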
- sinr.text.evaluate.clf_fit(X_train, y_train, clf=XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=None, ...))
Fit a classification model according to the given training data.
- Parameters:
X_train (list of vectors) – training data
y_train (numpy.ndarray) – labels
clf (classifier, e.g. xgboost.XGBClassifier, sklearn.svm.SVC) – classifier
- Returns:
Fitted classifier
- Return type:
classifier
- sinr.text.evaluate.clf_score(clf, X_test, y_test, scoring='accuracy', params={})
Evaluate classification on given test data.
- Parameters:
clf (classifier, e.g. xgboost.XGBClassifier, sklearn.svm.SVC) – classifier
X_test (list of vectors) – test data
y_test (numpy.ndarray) – labels
scoring (str) – scikit-learn scorer object, default=’accuracy’
params (dictionary) – parameters for the scorer object
- Returns:
Score
- Return type:
float
- sinr.text.evaluate.clf_xgb_interpretability(sinr_vec, xgb, interpreter, topk_dim=10, topk=5, importance_type='gain')
Interpretability of the main dimensions used by the xgboost classifier.
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object from which the data were vectorized
xgb (xgboost.XGBClassifier) – fitted xgboost classifier
interpreter (str) – whether stereotypes or descriptors are requested
topk_dim (int) – Number of features requested among the main features used by the classifier (Default value = 10)
topk (int) – topk value to consider on each dimension (Default value = 5)
importance_type – ‘weight’: the number of times a feature is used to split the data across all trees; ‘gain’: the average gain across all splits the feature is used in; ‘cover’: the average coverage across all splits the feature is used in; ‘total_gain’: the total gain across all splits the feature is used in; ‘total_cover’: the total coverage across all splits the feature is used in
- Returns:
Interpreters of dimensions, importance of dimensions
- Return type:
list of set of object, list of tuple
- sinr.text.evaluate.compute_analogy_normalized(sinr_vec, word_a, word_b, word_c)
Solve analogy of the type A is to B as C is to D with normalized values
- Parameters:
sinr_vec – SINrVectors object
word_a – string
word_b – string
word_c – string
- Returns:
best predicted word of the dataset
- Return type:
string
- sinr.text.evaluate.compute_analogy_sparse_normalized(sinr_vec, word_a, word_b, word_c, n=100)
Solve analogy of the type A is to B as C is to D with sparsification and normalization.
- Parameters:
sinr_vec – SINrVectors object
word_a – string
word_b – string
word_c – string
n – int, number of dimensions to keep after sparsification
- Returns:
best predicted word of the dataset (word D) or None if not in the vocab.
- Return type:
string
- sinr.text.evaluate.compute_analogy_value_zero(sinr_vec, word_a, word_b, word_c)
Solve analogy of the type A is to B as C is to D with only positive values in the resulting vector
- Parameters:
sinr_vec – SINrVectors object
word_a – string
word_b – string
word_c – string
- Returns:
best predicted word of the dataset (word D) or None if not in the vocab.
- Return type:
string
- sinr.text.evaluate.compute_direct_bias_sinr(sinr_vec, word_list, gender_direction, c=1)
Computes the direct bias of a set of words with respect to the gender direction using cosine similarity.
- Parameters:
sinr_vec – SINr model.
word_list – List of words to analyze (professions in config.json).
gender_direction – Gender direction vector.
c – Exponent applied to cosine similarity (default is c=1).
- Returns:
float: Direct bias value.
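Direct bias is usually defined as the mean of |cos(w, g)|^c over the analysed words, where g is the gender direction. The sketch below assumes that definition; `direct_bias` is a hypothetical stand-in that takes a plain dict of vectors instead of a SINr model, and silently skipping unknown words is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direct_bias(vectors, word_list, gender_direction, c=1):
    """Mean of |cos(word, gender_direction)| ** c over the known words."""
    words = [w for w in word_list if w in vectors]  # assumption: skip unknown words
    total = sum(abs(cosine(vectors[w], gender_direction)) ** c for w in words)
    return total / len(words)


vectors = {
    "nurse": [-1.0, 0.0],    # fully aligned with the direction (negative side)
    "engineer": [1.0, 1.0],  # partly aligned
    "table": [0.0, 1.0],     # orthogonal, so it contributes no bias
}
gender_direction = [1.0, 0.0]
print(direct_bias(vectors, ["nurse", "engineer", "table"], gender_direction, c=1))
print(direct_bias(vectors, ["nurse", "engineer", "table"], gender_direction, c=2))
```

A larger exponent `c` down-weights weakly aligned words: here the contribution of "engineer" drops from about 0.71 to 0.5 when `c` goes from 1 to 2.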
- sinr.text.evaluate.compute_indirect_bias_sinr(sinr_vec, word1, word2, direction)
Compute the indirect bias between two words in a SINr model.
- Parameters:
sinr_vec – SINr model.
word1 – The first word.
word2 – The second word.
direction – The gender direction.
- Returns:
The gender component of the similarity between the two words.
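One standard formulation of indirect bias compares the similarity of two unit vectors before and after their component along the gender direction is removed. The sketch below follows that formulation as an assumption about what "gender component of the similarity" means here; `indirect_bias` and its helpers are hypothetical and operate on raw vectors rather than on a SINr model.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def remove_direction(v, g):
    """Component of v orthogonal to the unit vector g."""
    scale = dot(v, g)
    return [a - scale * b for a, b in zip(v, g)]


def indirect_bias(w, v, g):
    """Share of the similarity w.v that disappears once g is removed."""
    w, v, g = unit(w), unit(v), unit(g)
    w_perp, v_perp = remove_direction(w, g), remove_direction(v, g)
    similarity = dot(w, v)
    # Undefined when similarity is 0 or when a vector lies entirely along g.
    residual = dot(w_perp, v_perp) / math.sqrt(dot(w_perp, w_perp) * dot(v_perp, v_perp))
    return (similarity - residual) / similarity


g = [1.0, 0.0, 0.0]
# The two vectors share only their component along g.
print(indirect_bias([1.0, 1.0, 0.0], [1.0, 0.0, 1.0], g))
# The two vectors have no component along g at all.
print(indirect_bias([0.0, 1.0, 0.0], [0.0, 1.0, 1.0], g))
```

A value near 1 means the similarity between the two words is carried almost entirely by the gender direction; a value near 0 means the direction plays no role.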
- sinr.text.evaluate.dist_ratio(sinr_vec, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)
DistRatio of the model
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)
prctbot (int) – bottom prctbot to pick (defaults to 50)
prcttop (int) – top prcttop to pick (defaults to 10)
- Returns:
DistRatio of the model
- Return type:
float
- sinr.text.evaluate.dist_ratio_dim(sinr_vec, dim, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)
DistRatio for one dimension of the model
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
dim (int) – the index of the dimension for which to get the DistRatio
union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)
prctbot (int) – bottom prctbot to pick (defaults to 50)
prcttop (int) – top prcttop to pick (defaults to 10)
nbtopk (int) – number of top words to pick (defaults to 5)
dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity
- Returns:
DistRatio for dimension dim
- Return type:
float
- sinr.text.evaluate.eval_analogy(sinr_vec, dataset, analogy_func)
Compare the predicted with the expected word.
- Parameters:
sinr_vec – SINrVectors object
dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Returns:
error rate
- Return type:
float
- sinr.text.evaluate.eval_analogy_by_category_k(sinr_vec, dataset, analogy_func, k=1)
Evaluate analogy by category with k best words
- Parameters:
sinr_vec – SINrVectors object
dataset – sklearn.datasets.base.Bunch
analogy_func – function to use for analogy prediction
k – int, number of best words to return (default is 1)
- Returns:
dictionary with categories as keys and error rates as values
- Return type:
dict
- sinr.text.evaluate.eval_analogy_k(sinr_vec, dataset, analogy_func, k=1)
Compare the predicted with the expected word with k best words.
- Parameters:
sinr_vec – SINrVectors object
dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Returns:
error rate
- Return type:
float
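The error rate with k candidates can be read as the share of analogies whose expected word is missing from the k best predictions. The sketch below illustrates that reading with a simplified interface: `top_k_error_rate`, the quadruple format and the lookup-table predictor are all hypothetical, and counting an unanswerable analogy (prediction `None`) as an error is an assumption.

```python
def top_k_error_rate(quadruples, predict, k=1):
    """Fraction of (a, b, c, expected) items whose expected word is not in the top k."""
    errors = 0
    for word_a, word_b, word_c, expected in quadruples:
        predictions = predict(word_a, word_b, word_c, k) or []  # None counts as a miss
        if expected not in predictions:
            errors += 1
    return errors / len(quadruples)


# Hypothetical predictor backed by a fixed table of ranked answers.
ranked_answers = {
    ("man", "woman", "king"): ["queen", "princess"],
    ("paris", "france", "rome"): ["spain", "italy"],
}


def predict(word_a, word_b, word_c, k):
    answers = ranked_answers.get((word_a, word_b, word_c))
    return None if answers is None else answers[:k]


quadruples = [
    ("man", "woman", "king", "queen"),
    ("paris", "france", "rome", "italy"),
]
print(top_k_error_rate(quadruples, predict, k=1))
print(top_k_error_rate(quadruples, predict, k=2))
```

Raising k can only lower the error rate, which is why the plotting helpers on this page evaluate several values of k side by side.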
- sinr.text.evaluate.eval_similarity(sinr_vec, dataset, print_missing=True)
Evaluate similarity with Spearman correlation
- Parameters:
sinr_vec – SINrVectors object
dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
print_missing – boolean (default : True)
- Returns:
Spearman correlation between cosine similarity and human rated similarity
- Return type:
float
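Spearman correlation is the Pearson correlation of the ranks of two score lists. The sketch below computes it with the standard library only, on precomputed scores; in the documented function the model scores would be cosine similarities from a SINrVectors object. The helper names are hypothetical, and ties receive their average rank.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


model_scores = [0.9, 0.1, 0.5, 0.3]   # e.g. cosine similarities of word pairs
human_scores = [9.0, 1.5, 7.0, 2.0]   # e.g. human similarity ratings
print(spearman(model_scores, human_scores))
```

Because only ranks matter, the two score lists do not need to share a scale: the example pairs are ordered identically by the model and by the raters, so the correlation is maximal.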
- sinr.text.evaluate.fetch_SimLex(which='665')
Fetch SimLex datasets for testing relatedness similarity
- Parameters:
which (str) – dataset (default value = “665”)
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_analogy(langage)
Fetch dataset for testing analogies
- Parameters:
langage (str) – language of the dataset
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 4 words per column
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_data_MEN()
Fetch MEN dataset for testing relatedness similarity
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_data_SCWS()
Fetch SCWS dataset for testing relatedness similarity
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_data_WS353()
Fetch WS353 dataset for testing relatedness similarity
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.find_txt_files(directory)
Find all text files in a directory and its subdirectories.
- Parameters:
directory (str) – path to the directory
- Returns:
list of text files
- Return type:
list
- sinr.text.evaluate.format_lines(content)
Format lines of the content.
- Parameters:
content (str) – content of the file
- Returns:
formatted content
- Return type:
str
- sinr.text.evaluate.identify_gender_direction_sinr(sinr_vec, definitional_pairs, method='pca', positive_end='brother', negative_end='sister')
Identifies the gender direction in a SINr model.
- Parameters:
sinr_vec – SINr model.
definitional_pairs – list of word pairs defining gender.
method – method used to compute the gender direction (‘single’, ‘sum’, ‘pca’).
positive_end – word representing the masculine gender.
negative_end – word representing the feminine gender.
- Returns:
A vector representing the gender direction.
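The ‘single’ and ‘sum’ methods can be pictured as difference vectors. The sketch below is an interpretation, not the library's code: `gender_direction` is a hypothetical function over a plain dict of vectors, ‘single’ is taken to be the normalized difference between the two end words, ‘sum’ the normalized sum of the differences over all pairs, each pair is assumed to be ordered (positive, negative), and ‘pca’ is left out.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gender_direction(vectors, definitional_pairs, method="single",
                     positive_end="brother", negative_end="sister"):
    """Unit vector pointing from the negative end towards the positive end."""
    if method == "single":
        diff = [p - n for p, n in zip(vectors[positive_end], vectors[negative_end])]
    elif method == "sum":
        dim = len(vectors[positive_end])
        diff = [0.0] * dim
        # Assumption: each pair is ordered (positive word, negative word).
        for pos_word, neg_word in definitional_pairs:
            for i in range(dim):
                diff[i] += vectors[pos_word][i] - vectors[neg_word][i]
    else:
        raise ValueError("only 'single' and 'sum' are sketched here")
    return unit(diff)


vectors = {
    "brother": [1.0, 0.5],
    "sister": [-1.0, 0.5],
    "he": [2.0, 1.0],
    "she": [-2.0, 1.0],
}
pairs = [("brother", "sister"), ("he", "she")]
print(gender_direction(vectors, pairs, method="single"))
print(gender_direction(vectors, pairs, method="sum"))
```

In this toy space the second coordinate is shared by every pair, so it cancels out and both methods recover the first axis as the direction.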
- sinr.text.evaluate.load_config(path)
- sinr.text.evaluate.normalize_vector(vector)
Normalize a vector.
- Parameters:
vector (numpy.ndarray) – vector to normalize
- Returns:
normalized vector
- Return type:
numpy.ndarray
- sinr.text.evaluate.plot_category_error_rates(sinr_vec, file_path, best_predicted_word_k, ks)
Plot error rates by category for different values of k
- Parameters:
sinr_vec – SINrVectors object
file_path – path to the dataset file
best_predicted_word_k – function to use for analogy prediction
ks – list of k values to evaluate, e.g. [1, 2, 5, 10]
- sinr.text.evaluate.plot_global_error_rates(sinr_vec, file_path, best_predicted_word_k, ks)
Plot global error rates for different values of k
- Parameters:
sinr_vec – SINrVectors object
file_path – path to the dataset file
best_predicted_word_k – function to use for analogy prediction
ks – list of k values to evaluate, e.g. [1, 2, 5, 10]
- sinr.text.evaluate.project_vector(v, u)
- sinr.text.evaluate.reject_vector(v, u)
Compute the orthogonal projection of a vector onto a given direction.
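In the usual linear-algebra sense, projecting v onto u keeps the part of v that lies along u, and rejecting v from u keeps the remainder, which is orthogonal to u. The sketch below shows both operations; the names `project` and `reject` are hypothetical, and whether the two library functions follow exactly these conventions is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(v, u):
    """Component of v along the direction u (u need not be normalized)."""
    scale = dot(v, u) / dot(u, u)
    return [scale * x for x in u]


def reject(v, u):
    """Component of v orthogonal to u: v minus its projection onto u."""
    return [a - b for a, b in zip(v, project(v, u))]


v = [3.0, 4.0]
u = [2.0, 0.0]
along = project(v, u)
orthogonal = reject(v, u)
print(along, orthogonal)
```

The two parts always add back up to v, and the rejected part has a zero dot product with u, which is the property used when a direction such as gender is removed from a word vector.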
- sinr.text.evaluate.remove_invalid_lines(content)
Remove invalid lines from the content.
- Parameters:
content (str) – content of the file
- Returns:
cleaned content
- Return type:
str
- sinr.text.evaluate.similarity_MEN_WS353_SCWS(sinr_vec, print_missing=True)
Evaluate similarity with MEN, WS353 and SCWS datasets
- Parameters:
sinr_vec – SINrVectors object
print_missing – boolean (default : True)
- Returns:
Spearman correlation for MEN, WS353 and SCWS datasets
- Return type:
dict
- sinr.text.evaluate.vectorizer(sinr_vec, X, y=[])
Vectorize preprocessed documents into SINr embeddings
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
X (text (list(list(str))): A list of documents containing words) – preprocessed documents
y (numpy.ndarray) – documents labels
- Returns:
list of vectors
Module contents
Package to preprocess text into word co-occurrence graphs prior to running SINr.