Interface for loading stopwords for a set of 12 languages that are packaged with libmunin.
The stopwords can be used to split text into important and unimportant words. Additionally, the language of a text can be guessed through the guess_language module.
This module contains code for automatic keyword extraction.
The algorithm used is RAKE (Rapid Automatic Keyword Extraction) as described in:
Rose, S., D. Engel, N. Cramer, and W. Cowley (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory.
Paper can be found here:
The original code is based on aneesha’s Python implementation of RAKE, but has been extended with automatic stopword list retrieval, stemming and duplicate keyword filtering:
While adding these features all code was rewritten.
Note
The “Adjoining of Keywords” step mentioned in the paper is not implemented, since this implementation targets short texts (e.g. lyrics) anyway.
Note
Of all functions below, you’ll probably only need extract_keywords().
Generate the actual results dictionary out of the phrases and wordscore.
Parameters: |
|
---|---|
Returns: | a mapping of keyword_sets to their rating |
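As a rough illustration of how such a rating can be built, the sketch below follows the scoring scheme from the RAKE paper: each word is scored as its degree divided by its frequency, and a phrase is rated by the sum of its word scores. The function names `score_words` and `rate_phrases` are hypothetical and are not libmunin's own; the actual implementation may differ in detail.

```python
from collections import defaultdict


def score_words(phrases):
    """Score each word as degree / frequency (RAKE paper scheme)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # The degree counts co-occurrences inside the phrase,
            # including the word itself.
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


def rate_phrases(phrases, word_scores):
    """Map each phrase (as a frozenset of words) to the sum of its word scores."""
    return {
        frozenset(phrase): sum(word_scores[word] for word in phrase)
        for phrase in phrases
    }


# Hypothetical input: phrases as lists of words.
phrases = [["keyword", "extraction"], ["keyword"], ["short", "text"]]
scores = score_words(phrases)
ratings = rate_phrases(phrases, scores)
```

Words that appear mostly inside longer phrases end up with a higher score than words that often stand alone, which is why "extraction" outranks "keyword" in this input.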
Return the longer of two sets, or if they have the same size, the one with the shorter (and thus more comparable) words.
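The selection rule can be sketched as follows. The name `pick_longer` is hypothetical, and measuring "shorter words" by the total character count of a set is an assumption made for this example.

```python
def pick_longer(set_a, set_b):
    """Return the larger set; on a tie, the one with the shorter words."""
    if len(set_a) != len(set_b):
        return max(set_a, set_b, key=len)
    # Assumption: "shorter words" means the smaller total character count.
    return min(set_a, set_b, key=lambda words: sum(len(w) for w in words))


larger = pick_longer({"deep", "blue", "sea"}, {"ocean"})
tie = pick_longer({"run"}, {"running"})
```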
Extract the keywords from a certain text.
Parameters: | use_stemmer – If True a Snowball Stemmer will be used for all words. |
---|---|
Returns: | A sorted mapping between a set of keywords and their rating. |
Return type: | collections.OrderedDict |
Extract the phrases from all sentences.
A phrase is a sequence of words that do not contain a stopword.
Parameters: |
|
---|---|
Returns: | An iterable of phrases. (str) |
Remove keywordsets that are a subset of larger sets.
This modifies its input, but returns it for convenience.
Returns: | keywords, the modified input. |
---|
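A minimal sketch of this filtering step, assuming the input is a dict that maps frozensets of words to ratings and using an exact subset test instead of a fuzzy comparison. The name `remove_subsets` is hypothetical.

```python
def remove_subsets(keywords):
    """Delete every keyword set that is a proper subset of another one.

    Modifies ``keywords`` in place and returns it for convenience.
    """
    # Iterate over a snapshot of the keys so that deleting is safe.
    for candidate in list(keywords):
        if any(candidate < other for other in keywords):
            del keywords[candidate]
    return keywords


keywords = {
    frozenset({"keyword"}): 1.5,
    frozenset({"keyword", "extraction"}): 3.5,
    frozenset({"lyrics"}): 1.0,
}
result = remove_subsets(keywords)
```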
Compare two sets of strings, return True if b is a subset of a.
Strings are compared using the Levenshtein distance.
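One way such a fuzzy subset test could look is sketched below. The names `levenshtein` and `is_fuzzy_subset` and the `max_distance` threshold are assumptions made for this example; the source does not say which threshold or normalization is used.

```python
def levenshtein(s, t):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def is_fuzzy_subset(a, b, max_distance=1):
    """True if every string in b has a close enough match in a."""
    # max_distance is an assumed threshold, not taken from the source.
    return all(
        any(levenshtein(word_b, word_a) <= max_distance for word_a in a)
        for word_b in b
    )


match = is_fuzzy_subset({"keyword", "extraction"}, {"keywords"})
no_match = is_fuzzy_subset({"keyword"}, {"lyrics"})
```

Comparing by edit distance lets near-duplicates such as singular and plural forms count as the same word.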
Splits a sentence into phrases (sequences of stopword-free words).
Parameters: |
|
---|---|
Return type: | [str] |
Returns: | An iterator that yields a list of words. |
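The splitting can be sketched as a small generator that starts a new phrase whenever it meets a stopword. The name `split_phrases`, its parameters and the lowercase comparison are assumptions made for this example.

```python
def split_phrases(words, stopwords):
    """Yield lists of consecutive words that contain no stopword."""
    phrase = []
    for word in words:
        if word.lower() in stopwords:
            # A stopword closes the current phrase (if there is one).
            if phrase:
                yield phrase
            phrase = []
        else:
            phrase.append(word)
    if phrase:
        yield phrase


# Hypothetical stopword set; a real one would come from a stopword list.
stopwords = {"the", "over"}
words = "the quick brown fox jumps over the lazy dog".split()
found = list(split_phrases(words, stopwords))
```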
Separate a text (or rather a sentence) into words.
Every non-alphanumeric character except + - / is treated as a word delimiter. Words that look like numbers are ignored as well.
Returns: | an iterable of words. |
---|
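The delimiter rule above could be expressed with a regular expression, as in the following sketch. The name `split_words` is hypothetical, and treating every token without a letter as "number-like" is an assumption made for this example.

```python
import re

# Delimiters: runs of characters that are neither alphanumeric nor one of + - /
# (the underscore is matched explicitly because \w would otherwise keep it).
DELIMITERS = re.compile(r"(?:[^\w+\-/]|_)+")


def split_words(sentence):
    """Yield the words of a sentence, skipping number-like tokens."""
    for word in DELIMITERS.split(sentence):
        # Assumption: a token without any letter counts as a number.
        if word and any(char.isalpha() for char in word):
            yield word


words = list(split_words("AC/DC played 2 well-known songs, 3.5 hours!"))
```

Keeping + - / inside words preserves names such as "AC/DC" and hyphenated compounds as single tokens.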