SINr Text
Co-occurrence
- class sinr.text.cooccurrence.Cooccurrence
Bases:
object
Class for constructing a cooccurrence matrix from a corpus.
A dictionary mapping the vocabulary of the corpus to word ids in lexicographic order is constructed, along with the cooccurrence matrix.
- fit(corpus, window=2)
Perform a pass through the corpus to construct the cooccurrence matrix.
- Parameters:
corpus (list[list[str]]) – List of lists of strings (words) from the corpus.
window (int) – The length of the (symmetric) context window used for cooccurrence, defaults to 2
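A minimal sketch of fitting a matrix on a toy tokenized corpus (the sentences are illustrative):

    from sinr.text.cooccurrence import Cooccurrence

    # A corpus is a list of tokenized documents (list[list[str]]).
    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
    cooc = Cooccurrence()
    cooc.fit(corpus, window=2)  # symmetric context window of 2 tokens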
- classmethod load(filename)
Load Cooccurrence object from pickle.
- Parameters:
filename (str) – Path to the pickle file.
- Returns:
An instance of the Cooccurrence class.
- Return type:
sinr.text.cooccurrence.Cooccurrence
- save(filename)
Save cooccurrence object to a pickle file.
- Parameters:
filename (str) – Output path for the pickle file.
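save and load round-trip the object through a pickle file; a short usage sketch (the path is illustrative):

    cooc.save("cooccurrence.pk")                 # serialize to a pickle file
    cooc = Cooccurrence.load("cooccurrence.pk")  # restore the same object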
Co-occurrence Cython
- sinr.text.cooccurrence_cython.construct_cooccurrence_matrix()
Construct the word-id dictionary and cooccurrence matrix for a given corpus, using a given window size.
The dictionary is constructed in lexicographic order. The matrix counts cooccurrences of words regardless of the order in which they appear in the corpus; consequently, the cooccurrence matrix is upper triangular (an undirected graph in SINr).
- Parameters:
corpus (list[str]) – The sentences from the corpus.
dictionary (dict[str, int]) – A dictionary mapping words to their ids in the vocabulary (in lexicographic order)
window_size (int) – The size of the symmetric moving window.
- Returns:
The cooccurrence matrix built from the corpus.
- Return type:
scipy.sparse.coo_matrix
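A hedged sketch of calling the Cython routine directly; the positional argument order and the prebuilt lexicographic dictionary are assumptions based on the parameter list above (the dictionary may instead be filled in place):

    from sinr.text.cooccurrence_cython import construct_cooccurrence_matrix

    sentences = ["the cat sat on the mat", "the dog sat on the rug"]
    vocab = sorted({w for s in sentences for w in s.split()})
    word2id = {w: i for i, w in enumerate(vocab)}  # word -> id, lexicographic order

    # Arguments assumed positional: corpus, dictionary, window_size.
    matrix = construct_cooccurrence_matrix(sentences, word2id, 2)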
PMI Filtering
- sinr.text.pmi.pmi(X, py=None, min_pmi=0, alpha=0.0, beta=1)
- Parameters:
X (scipy.sparse.csr_matrix) – (word, word) sparse matrix
py (numpy.ndarray) – array of shape (1, word) holding the probability of context words. (Default value = None)
min_pmi (int) – Minimum value of PMI; all values smaller than min_pmi are reset to zero, defaults to 0
alpha (float) – Smoothing factor: pmi(x, y; alpha) = p_xy / (p_x * (p_y + alpha)), defaults to 0.0
beta (float) – Smoothing factor: pmi(x, y) = log(p_xy / (p_x * p_y^beta)), defaults to 1.0
- Returns:
A tuple ‘(pmi, px, py, exp_pmi)’ containing the (word, word) PPMI sparse matrix, the probability of words and of context words, and the exponential PMI matrix. Raises a ValueError if beta is not in the range ]0, 1].
- Return type:
list[scipy.sparse.csr_matrix, numpy.ndarray, numpy.ndarray, scipy.sparse.csr_matrix]
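An illustrative sketch of computing PMI on a small symmetric count matrix; the counts are made up, and the unpacking follows the documented return types:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sinr.text.pmi import pmi

    # Toy (word, word) cooccurrence counts.
    counts = csr_matrix(np.array([[0, 2, 1],
                                  [2, 0, 4],
                                  [1, 4, 0]], dtype=np.float64))
    pmi_mat, px, py, exp_pmi = pmi(counts, min_pmi=0, beta=1)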
- sinr.text.pmi.pmi_filter(X, py=None, min_pmi=0, alpha=0.0, beta=1)
Filter a (word, word) matrix by computing the PMI and excluding the records for which the PMI is lower than the threshold min_pmi.
- Parameters:
X (scipy.sparse.csr_matrix) – (word, word) sparse matrix
py (numpy.ndarray) – array of shape (1, word) holding the probability of context words. (Default value = None)
min_pmi (int) – Minimum value of PMI; all values smaller than min_pmi are reset to zero, defaults to 0
alpha (float) – Smoothing factor: pmi(x, y; alpha) = p_xy / (p_x * (p_y + alpha)), defaults to 0.0
beta (float) – Smoothing factor: pmi(x, y) = log(p_xy / (p_x * p_y^beta)), defaults to 1.0
- Returns:
The (word, word) sparse matrix of PMI values, with entries below min_pmi removed. Raises a ValueError if beta is not in the range ]0, 1].
- Return type:
scipy.sparse.coo_matrix
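pmi_filter exposes the same computation but returns only the filtered matrix; a sketch reusing the toy counts matrix from above:

    from sinr.text.pmi import pmi_filter

    filtered = pmi_filter(counts, min_pmi=0)  # entries with PMI below min_pmi are dropped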
Preprocess Text
- class sinr.text.preprocess.Corpus(register, language, input_path)
Bases:
object
- LANGUAGE_EN = 'en'
- LANGUAGE_FR = 'fr'
- REGISTER_NEWS = 'news'
- REGISTER_WEB = 'web'
- class sinr.text.preprocess.VRTMaker(corpus: Corpus, output_path, n_jobs=1, spacy_size='lg')
Bases:
object
- do_txt_to_vrt(separator='sentence')
Build a VRT-format file and write it to the output filepath.
- Parameters:
separator (str) – Whether to segment the corpus by sentence (default) or by custom documents delimited by the given separator tag, defaults to ‘sentence’
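A sketch of producing a VRT file from a raw text corpus with the two classes above (the path and job count are illustrative):

    from sinr.text.preprocess import Corpus, VRTMaker

    corpus = Corpus(Corpus.REGISTER_WEB, Corpus.LANGUAGE_EN, "my_corpus.txt")
    VRTMaker(corpus, ".", n_jobs=4).do_txt_to_vrt()  # writes the VRT file to the output path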
- sinr.text.preprocess.extract_text(corpus_path, exceptions_path=None, lemmatize=True, stop_words=False, lower_words=True, number=False, punct=False, exclude_pos=[], en='chunking', min_freq=50, alpha=True, exclude_en=[], min_length_word=3, min_length_doc=2, dict_filt=[])
Extracts the text from a VRT corpus file.
- Parameters:
corpus_path (str) – Path to the VRT corpus file
exceptions_path (str) – (Default value = None)
lemmatize (bool) – (Default value = True)
stop_words (bool) – (Default value = False)
lower_words (bool) – (Default value = True)
number (bool) – (Default value = False)
punct (bool) – (Default value = False)
exclude_pos (list) – (Default value = [])
en (str) – how to handle named entities: “chunking”, “tagging” or “deleting” (Default value = “chunking”)
min_freq (int) – (Default value = 50)
alpha (bool) – (Default value = True)
exclude_en (list) – (Default value = [])
min_length_word (int) – (Default value = 3)
min_length_doc (int) – The minimal number of token for a document (or sentence) to be kept (Default value = 2)
dict_filt (list) – List of words to keep only specific vocabulary
- Returns:
text (list(list(str))): A list of documents containing words
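A hedged example of extracting a tokenized corpus from the resulting VRT file (the path and filter values are illustrative):

    from sinr.text.preprocess import extract_text

    docs = extract_text("my_corpus.vrt", lemmatize=True, min_freq=20)
    # docs: list[list[str]], ready to pass to Cooccurrence.fit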
- sinr.text.preprocess.open_corpus(corpus_path)
- Parameters:
corpus_path (str) – Path to the corpus file.
Evaluate
- sinr.text.evaluate.clf_fit(X_train, y_train, clf=XGBClassifier(...))
Fit a classification model according to the given training data.
- Parameters:
X_train (list of vectors) – training data
y_train (numpy.ndarray) – labels
clf (classifier, e.g. xgboost.XGBClassifier or sklearn.svm.SVC) – classifier (defaults to an XGBClassifier)
- Returns:
Fitted classifier
- Return type:
classifier
- sinr.text.evaluate.clf_score(clf, X_test, y_test, scoring='accuracy', params={})
Evaluate classification on given test data.
- Parameters:
clf (classifier, e.g. xgboost.XGBClassifier or sklearn.svm.SVC) – classifier
X_test (list of vectors) – test data
y_test (numpy.ndarray) – labels
scoring (str) – scikit-learn scorer object, defaults to ‘accuracy’
params (dict) – parameters for the scorer object
- Returns:
Score
- Return type:
float
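A sketch chaining the two helpers, assuming document vectors X and labels y already exist (for instance from vectorizer, documented below):

    from sklearn.model_selection import train_test_split
    from sinr.text.evaluate import clf_fit, clf_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = clf_fit(X_train, y_train)            # defaults to an XGBClassifier
    accuracy = clf_score(clf, X_test, y_test)  # scoring='accuracy' by default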
- sinr.text.evaluate.clf_xgb_interpretability(sinr_vec, xgb, interpreter, topk_dim=10, topk=5, importance_type='gain')
Interpretability of the main dimensions used by the xgboost classifier.
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object from which the data was vectorized
xgb (xgboost.XGBClassifier) – fitted xgboost classifier
interpreter (str) – whether stereotypes or descriptors are requested
topk_dim (int) – number of features requested among the main features used by the classifier (Default value = 10)
topk (int) – topk value to consider on each dimension (Default value = 5)
importance_type (str) – ‘weight’: the number of times a feature is used to split the data across all trees; ‘gain’: the average gain across all splits the feature is used in; ‘cover’: the average coverage across all splits the feature is used in; ‘total_gain’: the total gain across all splits the feature is used in; ‘total_cover’: the total coverage across all splits the feature is used in (Default value = ‘gain’)
- Returns:
Interpreters of dimensions, importance of dimensions
- Return type:
list of set of object, list of tuple
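A hedged call sketch; the ‘descriptors’ value for interpreter is an assumption based on the description above, and sinr_vec and clf are assumed to exist:

    from sinr.text.evaluate import clf_xgb_interpretability

    interpreters, importances = clf_xgb_interpretability(
        sinr_vec, clf, interpreter="descriptors",  # "descriptors" or "stereotypes" (assumed values)
        topk_dim=10, topk=5, importance_type="gain")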
- sinr.text.evaluate.dist_ratio(sinr_vec, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)
DistRatio of the model
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)
prctbot (int) – bottom prctbot to pick (defaults to 50)
prcttop (int) – top prcttop to pick (defaults to 10)
nbtopk (int) – number of top words to pick (defaults to 5)
dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity
- Returns:
DistRatio of the model
- Return type:
float
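A usage sketch on a trained model (sinr_vec is assumed to exist):

    from sinr.text.evaluate import dist_ratio

    score = dist_ratio(sinr_vec, prctbot=50, prcttop=10, nbtopk=5)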
- sinr.text.evaluate.dist_ratio_dim(sinr_vec, dim, union=None, prctbot=50, prcttop=10, nbtopk=5, dist=True)
DistRatio for one dimension of the model
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
dim (int) – the index of the dimension for which to get the DistRatio
union (int list) – ids of words that are among the top prct of at least one dimension (defaults to None)
prctbot (int) – bottom prctbot to pick (defaults to 50)
prcttop (int) – top prcttop to pick (defaults to 10)
nbtopk (int) – number of top words to pick (defaults to 5)
dist (boolean) – set to True (default) to use cosine distance and False to use cosine similarity
- Returns:
DistRatio for dimension dim
- Return type:
float
- sinr.text.evaluate.eval_similarity(sinr_vec, dataset, print_missing=True)
Evaluate similarity with Spearman correlation
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
dataset (sklearn.datasets.base.Bunch) – dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
print_missing (bool) – (Default value = True)
- Returns:
Spearman correlation between cosine similarity and human rated similarity
- Return type:
float
- sinr.text.evaluate.fetch_SimLex(which='665')
Fetch SimLex datasets for testing similarity/relatedness
- Parameters:
which (str) – dataset (default value = “665”)
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
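Combining a fetcher with eval_similarity gives a short benchmark; a sketch (sinr_vec is assumed to exist):

    from sinr.text.evaluate import fetch_SimLex, eval_similarity

    simlex = fetch_SimLex(which="665")       # Bunch with word pairs 'X' and scores 'y'
    rho = eval_similarity(sinr_vec, simlex)  # Spearman correlation with human ratings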
- sinr.text.evaluate.fetch_data_MEN()
Fetch MEN dataset for testing similarity/relatedness
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_data_SCWS()
Fetch SCWS dataset for testing similarity/relatedness
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.fetch_data_WS353()
Fetch WS353 dataset for testing similarity/relatedness
- Returns:
dictionary-like object. Keys of interest: ‘X’: matrix of 2 words per column, ‘y’: vector with scores
- Return type:
sklearn.datasets.base.Bunch
- sinr.text.evaluate.similarity_MEN_WS353_SCWS(sinr_vec, print_missing=True)
Evaluate similarity with MEN, WS353 and SCWS datasets
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
print_missing (bool) – (Default value = True)
- Returns:
Spearman correlation for MEN, WS353 and SCWS datasets
- Return type:
dict
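A sketch running all three benchmarks at once (sinr_vec is assumed to exist; the exact keys of the returned dict are not specified here):

    from sinr.text.evaluate import similarity_MEN_WS353_SCWS

    scores = similarity_MEN_WS353_SCWS(sinr_vec)  # dict of Spearman correlations, one per dataset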
- sinr.text.evaluate.vectorizer(sinr_vec, X, y=[])
Vectorize preprocessed documents into SINr embeddings
- Parameters:
sinr_vec (SINrVectors) – SINrVectors object
X (list(list(str))) – preprocessed documents, each a list of words
y (numpy.ndarray) – document labels (Default value = [])
- Returns:
list of vectors
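A sketch vectorizing the output of extract_text with a trained model (sinr_vec and docs are assumed to exist):

    from sinr.text.evaluate import vectorizer

    vectors = vectorizer(sinr_vec, docs)  # one embedding per preprocessed document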
Module contents
Package to preprocess text into word co-occurrence graphs prior to running SINr.