SINr Text

Co-occurrence

class sinr.text.cooccurrence.Cooccurrence

Bases: object

Class for constructing a cooccurrence matrix from a corpus.

A dictionary mapping the vocabulary of the corpus in lexicographic order is constructed, along with the cooccurrence matrix.

fit(corpus, window=2)

Perform a pass through the corpus to construct the cooccurrence matrix.

Parameters:
  • corpus (list[list[str]]) – List of lists of strings (words) from the corpus.

  • window (int) – The length of the (symmetric) context window used for cooccurrence, defaults to 2
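A minimal usage sketch (the toy corpus and variable names are illustrative):

    from sinr.text.cooccurrence import Cooccurrence

    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # tokenized documents
    c = Cooccurrence()
    c.fit(corpus, window=2)  # count pairs within a symmetric 2-token window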

classmethod load(filename)

Load Cooccurrence object from pickle.

Parameters:

filename (str) – Path to the pickle file.

Returns:

An instance of Cooccurrence.

Return type:

sinr.text.cooccurrence.Cooccurrence

save(filename)

Save cooccurrence object to a pickle file.

Parameters:

filename (str) – Output path for the pickle file.
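Saving and loading follow a simple pickle round trip (the path is illustrative):

    c.save("cooc.pk")                  # serialize the fitted object
    c = Cooccurrence.load("cooc.pk")   # restore it later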

Co-occurrence Cython

sinr.text.cooccurrence_cython.construct_cooccurrence_matrix()

Construct the word-id dictionary and cooccurrence matrix for a given corpus, using a given window size.

The dictionary is constructed in lexicographic order. The matrix counts cooccurrences of words regardless of the order in which they appear in the corpus; consequently, the cooccurrence matrix is upper triangular (an undirected graph in SINr).

Parameters:
  • corpus (list[str]) – The sentences from the corpus.

  • dictionary (dict[str, int]) – A mapping of words to their ids in the vocabulary (in lexicographic order).

  • window_size (int) – The size of the symmetric moving window.

Returns:

The cooccurrence matrix built from the corpus.

Return type:

scipy.sparse.coo_matrix
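The counting scheme can be illustrated with a pure-Python sketch (a simplified stand-in for the Cython implementation, not the actual code):

    from collections import Counter

    def cooc_counts(corpus, dictionary, window_size=2):
        """Count pairs (i, j) with i <= j: upper-triangular, order-insensitive counts."""
        counts = Counter()
        for sentence in corpus:
            ids = [dictionary[w] for w in sentence.split() if w in dictionary]
            for pos, i in enumerate(ids):
                # only look rightward; (min, max) keys make the counts symmetric
                for j in ids[pos + 1 : pos + 1 + window_size]:
                    counts[(min(i, j), max(i, j))] += 1
        return counts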

PMI Filtering

sinr.text.pmi.pmi(X, py=None, min_pmi=0, alpha=0.0, beta=1)
Parameters:
  • X (scipy.sparse.csr_matrix) – (word, word) sparse matrix

  • py (numpy.ndarray) – array of shape (1, word) holding the probability of context words. (Default value = None)

  • min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0

  • alpha (float) – Smoothing factor: pmi(x, y; alpha) = log(p_xy / (p_x * (p_y + alpha))), defaults to 0.0

  • beta (float) – Smoothing factor: pmi(x, y; beta) = log(p_xy / (p_x * p_y^beta)), defaults to 1

Returns:

The PPMI matrix, the word probabilities px and py, and the exponential PMI matrix, as (pmi, px, py, exp_pmi).

Return type:

list[scipy.sparse.csr_matrix, numpy.ndarray, numpy.ndarray, scipy.sparse.csr_matrix]

Raises:

ValueError – if beta is not in the range ]0, 1].
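For intuition, the PPMI computation can be sketched directly with scipy (an illustrative re-implementation under the smoothing definitions above, not the library's actual code; how alpha and beta combine is an assumption):

    import numpy as np
    from scipy.sparse import csr_matrix

    def ppmi_sketch(X, min_pmi=0, alpha=0.0, beta=1.0):
        """PPMI on a (word, word) csr_matrix; entries below min_pmi are dropped."""
        total = X.sum()
        px = np.asarray(X.sum(axis=1)).ravel() / total  # word probabilities
        py = np.asarray(X.sum(axis=0)).ravel() / total  # context-word probabilities
        rows, cols = X.nonzero()
        pxy = np.asarray(X[rows, cols]).ravel() / total
        # assumed combination of the two smoothings described above
        pmi = np.log(pxy / (px[rows] * (py[cols] ** beta + alpha)))
        pmi[pmi < min_pmi] = 0.0  # PPMI-style thresholding
        out = csr_matrix((pmi, (rows, cols)), shape=X.shape)
        out.eliminate_zeros()
        return out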

sinr.text.pmi.pmi_filter(X, py=None, min_pmi=0, alpha=0.0, beta=1)

Filter a (word, word) matrix by computing the PMI. Exclude the entries whose PMI is lower than the threshold min_pmi.

Parameters:
  • X (scipy.sparse.csr_matrix) – (word, word) sparse matrix

  • py (numpy.ndarray) – array of shape (1, word) holding the probability of context words. (Default value = None)

  • min_pmi (int) – Minimum value of PMI. All values smaller than min_pmi are reset to zero, defaults to 0

  • alpha (float) – Smoothing factor: pmi(x, y; alpha) = log(p_xy / (p_x * (p_y + alpha))), defaults to 0.0

  • beta (float) – Smoothing factor: pmi(x, y; beta) = log(p_xy / (p_x * p_y^beta)), defaults to 1

Returns:

The (word, word) sparse matrix filtered by PMI: entries with a PMI below min_pmi are removed.

Return type:

scipy.sparse.coo_matrix

Raises:

ValueError – if beta is not in the range ]0, 1].
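A typical pipeline step (the toy matrix stands in for a matrix produced by Cooccurrence.fit):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sinr.text.pmi import pmi_filter

    # Toy (word, word) count matrix; in practice it comes from a fitted Cooccurrence
    X = csr_matrix(np.array([[0, 3, 1], [3, 0, 2], [1, 2, 0]], dtype=float))
    filtered = pmi_filter(X, min_pmi=0)  # drop pairs with PMI below 0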

Preprocess Text

class sinr.text.preprocess.Corpus(register, language, input_path)

Bases: object

LANGUAGE_EN = 'en'
LANGUAGE_FR = 'fr'
REGISTER_NEWS = 'news'
REGISTER_WEB = 'web'
class sinr.text.preprocess.VRTMaker(corpus: Corpus, output_path, n_jobs=1, spacy_size='lg')

Bases: object

do_txt_to_vrt(separator='sentence')

Build a VRT-format file and write it to the output filepath.

Parameters:

separator (str) – Whether to split the corpus into sentences (default) or into custom documents delimited by a separator tag, defaults to 'sentence'
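A sketch of turning a raw text file into a VRT file (paths are illustrative, and where the output lands is an assumption):

    from sinr.text.preprocess import Corpus, VRTMaker

    corpus = Corpus(Corpus.REGISTER_WEB, Corpus.LANGUAGE_EN, "corpus.txt")
    VRTMaker(corpus, ".", n_jobs=4).do_txt_to_vrt()  # writes the .vrt file under output_path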

sinr.text.preprocess.extract_text(corpus_path, exceptions_path=None, lemmatize=True, stop_words=False, lower_words=True, number=False, punct=False, exclude_pos=[], en='chunking', min_freq=50, alpha=True, exclude_en=[], min_length_word=3, min_length_doc=2, dict_filt=[])

Extracts the text from a VRT corpus file.

Parameters:
  • corpus_path (str) – Path to the VRT corpus file.

  • lemmatize (bool) – (Default value = True)

  • stop_words (bool) – (Default value = False)

  • lower_words (bool) – (Default value = True)

  • number (bool) – (Default value = False)

  • punct (bool) – (Default value = False)

  • exclude_pos (list) – (Default value = [])

  • en (str) – One of “chunking”, “tagging”, “deleting” (Default value = “chunking”)

  • min_freq (int) – (Default value = 50)

  • alpha (bool) – (Default value = True)

  • exclude_en (list) – (Default value = [])

  • min_length_word (int) – (Default value = 3)

  • min_length_doc (int) – The minimal number of tokens for a document (or sentence) to be kept (Default value = 2)

  • dict_filt (list) – List of words used to restrict the extraction to a specific vocabulary

Returns:

text (list(list(str))): A list of documents containing words
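For example (the path is illustrative):

    from sinr.text.preprocess import extract_text

    docs = extract_text("corpus.vrt", lemmatize=True, min_freq=50)
    # docs is a list of documents, each a list of word strings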

sinr.text.preprocess.open_corpus(corpus_path)

Open a corpus from its path.

Parameters:

corpus_path (str) – Path to the corpus file.

Evaluate

sinr.text.evaluate.clf_fit(X_train, y_train, clf=XGBClassifier())

Fit a classification model according to the given training data.

Parameters:
  • X_train (list of vectors) – training data

  • y_train (numpy.ndarray) – labels

  • clf (classifier (e.g. xgboost.XGBClassifier, sklearn.svm.SVC)) – classifier, defaults to an XGBClassifier with default parameters

Returns:

Fitted classifier

Return type:

classifier
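A sketch of the training step (X_train and y_train are assumed to come from vectorizer, documented later in this section):

    from sinr.text.evaluate import clf_fit

    clf = clf_fit(X_train, y_train)  # defaults to an XGBClassifier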

sinr.text.evaluate.clf_score(clf, X_test, y_test, scoring='accuracy', params={})

Evaluate classification on given test data.

Parameters:
  • clf (classifier (e.g. xgboost.XGBClassifier, sklearn.svm.SVC)) – classifier

  • X_test (list of vectors) – test data

  • y_test (numpy.ndarray) – labels

  • scoring (str) – scikit-learn scorer object, default=’accuracy’

  • params (dictionary) – parameters for the scorer object

Returns:

Score

Return type:

float
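Continuing the sketch above (X_test and y_test are illustrative held-out data):

    from sinr.text.evaluate import clf_score

    accuracy = clf_score(clf, X_test, y_test, scoring="accuracy")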

sinr.text.evaluate.clf_xgb_interpretability(sinr_vec, xgb, interpreter, topk_dim=10, topk=5, importance_type='gain')

Interpretability of the main dimensions used by the xgboost classifier.

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object from which the data was vectorized

  • xgb (xgboost.XGBClassifier) – fitted xgboost classifier

  • interpreter (str) – whether stereotypes or descriptors are requested

  • topk_dim (int) – Number of features requested among the main features used by the classifier (Default value = 10)

  • topk (int) – topk value to consider on each dimension (Default value = 5)

  • importance_type (str) – ‘weight’: the number of times a feature is used to split the data across all trees; ‘gain’: the average gain across all splits the feature is used in; ‘cover’: the average coverage across all splits the feature is used in; ‘total_gain’: the total gain across all splits the feature is used in; ‘total_cover’: the total coverage across all splits the feature is used in (Default value = ‘gain’)

Returns:

Interpreters of dimensions, importance of dimensions

Return type:

list of set of object, list of tuple
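A usage sketch (clf is a fitted XGBClassifier from clf_fit, sinr_vec an illustrative SINrVectors model, and "stereotypes" one of the interpreter values named in the description above):

    from sinr.text.evaluate import clf_xgb_interpretability

    interpreters, importances = clf_xgb_interpretability(
        sinr_vec, clf, interpreter="stereotypes", topk_dim=10, topk=5
    )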

sinr.text.evaluate.dist_ratio(sinr_vec, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)

DistRatio of the model

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)

  • prctbot (int) – bottom prctbot to pick (defaults to 50)

  • prcttop (int) – top prcttop to pick (defaults to 10)

  • nbtopk (int) – number of top words to pick (defaults to 5)

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity

Returns:

DistRatio of the model

Return type:

float
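For example (sinr_vec is an illustrative SINrVectors model):

    from sinr.text.evaluate import dist_ratio

    ratio = dist_ratio(sinr_vec, prctbot=50, prcttop=10)  # DistRatio over all dimensions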

sinr.text.evaluate.dist_ratio_dim(sinr_vec, dim, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)

DistRatio for one dimension of the model

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • dim (int) – the index of the dimension for which to get the DistRatio

  • union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)

  • prctbot (int) – bottom prctbot to pick (defaults to 50)

  • prcttop (int) – top prcttop to pick (defaults to 10)

  • nbtopk (int) – number of top words to pick (defaults to 5)

  • dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity

Returns:

DistRatio for dimension dim

Return type:

float

sinr.text.evaluate.eval_similarity(sinr_vec, dataset, print_missing=True)

Evaluate similarity with Spearman correlation

Parameters:
  • sinr_vec – SINrVectors object

  • dataset – sklearn.datasets.base.Bunch dictionary-like object. Keys of interest: ‘X’: matrix with two words per row, ‘y’: vector with scores

  • print_missing – boolean (default : True)

Returns:

Spearman correlation between cosine similarity and human rated similarity

Return type:

float
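For example, evaluating on MEN (sinr_vec is an illustrative SINrVectors model):

    from sinr.text.evaluate import eval_similarity, fetch_data_MEN

    spearman = eval_similarity(sinr_vec, fetch_data_MEN())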

sinr.text.evaluate.fetch_SimLex(which='665')

Fetch a SimLex dataset for testing relatedness similarity

Parameters:

which (str) – the SimLex variant to fetch (Default value = “665”)

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix with two words per row, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_MEN()

Fetch MEN dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix with two words per row, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_SCWS()

Fetch SCWS dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix with two words per row, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.fetch_data_WS353()

Fetch WS353 dataset for testing relatedness similarity

Returns:

dictionary-like object. Keys of interest: ‘X’: matrix with two words per row, ‘y’: vector with scores

Return type:

sklearn.datasets.base.Bunch

sinr.text.evaluate.similarity_MEN_WS353_SCWS(sinr_vec, print_missing=True)

Evaluate similarity with MEN, WS353 and SCWS datasets

Parameters:
  • sinr_vec – SINrVectors object

  • print_missing – boolean (default : True)

Returns:

Spearman correlation for MEN, WS353 and SCWS datasets

Return type:

dict
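Or all three benchmarks at once:

    from sinr.text.evaluate import similarity_MEN_WS353_SCWS

    scores = similarity_MEN_WS353_SCWS(sinr_vec)  # dict of Spearman correlations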

sinr.text.evaluate.vectorizer(sinr_vec, X, y=[])

Vectorize preprocessed documents into SINr embeddings

Parameters:
  • sinr_vec (SINrVectors) – SINrVectors object

  • X (list(list(str))) – preprocessed documents, a list of documents containing words

  • y (numpy.ndarray) – document labels

Returns:

list of vectors
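A sketch tying the pipeline together (docs comes from extract_text; sinr_vec is an illustrative SINrVectors model):

    from sinr.text.evaluate import vectorizer

    X = vectorizer(sinr_vec, docs)  # one embedding vector per document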

Module contents

Package to preprocess text into word co-occurrence graphs prior to running SINr.