Stopwords and Keyword Extraction

Stopword Retrieval

Overview

Interface for loading stopwords for a set of 12 languages that are packaged along with libmunin.

The stopwords can be used to split text into important and unimportant words. Additionally, the language of a text can be guessed through the guess_language module.
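To illustrate the idea of splitting text into important and unimportant words, here is a minimal sketch. It does not use libmunin itself: the STOPWORDS set and the partition_words() helper are hypothetical stand-ins, and real stopword lists are much longer.

```python
# Hypothetical stopword set for illustration only; libmunin ships its own lists.
STOPWORDS = frozenset({"the", "a", "is", "of", "and", "in"})


def partition_words(text, stopwords=STOPWORDS):
    """Split text into (important, unimportant) word lists."""
    important, unimportant = [], []
    for word in text.lower().split():
        # Stopwords are the "unimportant" words, everything else is kept.
        (unimportant if word in stopwords else important).append(word)
    return important, unimportant


important, unimportant = partition_words("The sound of silence is golden")
print(important)    # ['sound', 'silence', 'golden']
print(unimportant)  # ['the', 'of', 'is']
```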

Reference

munin.stopwords.load_stopwords(language_code)[source]

Load a stopword list from the data directory.

Returns a frozenset with all stopwords or an empty set if the language_code was not recognized.

Parameters: language_code – An ISO 639 alpha-2 language code.
Returns: A frozenset of words.
munin.stopwords.parse_stopwords(handle)[source]

Parse a file containing stopwords and yield them one at a time.

Parameters: handle – A readable file handle.
Returns: An iterator that will yield stopwords.
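The contract described above (a readable handle in, an iterator of stopwords out) can be sketched as follows. This is not the library's code: parse_stopwords_sketch() is a hypothetical name, and the one-word-per-line file layout is an assumption.

```python
import io


def parse_stopwords_sketch(handle):
    """Yield one stopword per non-empty line of a readable file handle."""
    for line in handle:
        word = line.strip()
        # Assumption: blank lines carry no stopword and are skipped.
        if word:
            yield word


# An in-memory handle stands in for a real stopword file.
handle = io.StringIO("the\nand\n\nof\n")
stopwords = frozenset(parse_stopwords_sketch(handle))
print(sorted(stopwords))  # ['and', 'of', 'the']
```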

Keyword Extraction with RAKE

Overview

This module contains code for automatic keyword extraction.

The algorithm used is RAKE (Rapid Automatic Keyword Extraction) as described in:

Rose, S., D. Engel, N. Cramer, and W. Cowley (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory.

Paper can be found here:

The original code is based on aneesha’s Python implementation of RAKE, but has been extended with automatic stopword list retrieval, stemming and duplicate keyword filtering:

While adding these features, all code was rewritten.

Note

The “Adjoining of Keywords” step mentioned in the paper is not implemented, since this implementation targets short texts (such as lyrics) anyway.

Note

Of all functions below, you’ll probably only need extract_keywords().

Reference

class munin.rake.DummyStemmer[source]

Stemmer class that does not modify its input.

munin.rake.candidate_keywordscores(phrases, wordscore)[source]

Generate the actual results dictionary out of the phrases and wordscore.

Parameters:
Returns: A mapping of keyword sets to their rating.
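In RAKE, the rating of a candidate keyword is the sum of the scores of its member words. A minimal sketch of that step follows; candidate_scores_sketch() is a hypothetical name, the use of frozenset keys is an assumption about what a "keyword set" looks like, and the word scores are made-up sample values.

```python
def candidate_scores_sketch(phrases, wordscore):
    """Map each phrase (as a frozenset of words) to the sum of its word scores."""
    candidates = {}
    for phrase in phrases:
        # Assumption: a keyword set is keyed by the frozenset of its words.
        candidates[frozenset(phrase)] = sum(wordscore[word] for word in phrase)
    return candidates


# Sample word scores, invented for illustration.
wordscore = {
    "linear": 2.5, "diophantine": 3.0, "equations": 3.0,
    "constraints": 2.0, "systems": 1.0,
}
phrases = [
    ["linear", "diophantine", "equations"],
    ["linear", "constraints"],
    ["systems"],
]
ratings = candidate_scores_sketch(phrases, wordscore)
print(ratings[frozenset({"linear", "constraints"})])  # 4.5
```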

munin.rake.decide_which_to_delete(set_a, set_b)[source]

Return the longer of two sets, or if they have the same size, the one with the shorter (and thus more comparable) words.
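The selection rule can be sketched like this. choose_set_sketch() is a hypothetical name, and measuring "shorter words" by the total number of characters in a set is an assumption; the actual implementation may compare word lengths differently.

```python
def choose_set_sketch(set_a, set_b):
    """Return the larger set; on a tie, the one whose words are shorter."""
    if len(set_a) != len(set_b):
        return set_a if len(set_a) > len(set_b) else set_b
    # Assumption: "shorter words" means fewer characters in total.
    chars_a = sum(len(word) for word in set_a)
    chars_b = sum(len(word) for word in set_b)
    return set_a if chars_a <= chars_b else set_b


big, small = {"rapid", "keyword", "extraction"}, {"keyword", "extraction"}
short, long_ = {"rapid", "keyword"}, {"automatic", "extraction"}
print(choose_set_sketch(big, small) is big)      # True
print(choose_set_sketch(short, long_) is short)  # True
```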

munin.rake.extract_keywords(text, use_stemmer=True)[source]

Extract the keywords from a given text.

Parameters: use_stemmer – If True, a Snowball stemmer will be used for all words.
Returns: A sorted mapping between a set of keywords and their rating.
Return type: collections.OrderedDict
munin.rake.extract_phrases(sentences, language_code, use_stemmer)[source]

Extract the phrases from all sentences.

A phrase is a sequence of words that does not contain a stopword.

Parameters:
  • sentences – An iterable of sentences (str)
  • language_code – an ISO 639 language code
  • use_stemmer – If True words in the phrases are also stemmed.
Returns: An iterable of phrases (str).

munin.rake.filter_subsets(keywords)[source]

Remove keyword sets that are subsets of larger sets.

This modifies its input, but returns it for convenience.

Returns: keywords, the modified input.
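A minimal sketch of in-place subset filtering, assuming that keywords is a dict keyed by frozensets of words. filter_subsets_sketch() is a hypothetical name, and this version uses exact subset tests rather than fuzzy string comparison.

```python
def filter_subsets_sketch(keywords):
    """Delete every key that is a proper subset of another key, in place."""
    # Iterate over a snapshot of the keys so deleting entries is safe.
    for candidate in list(keywords):
        if any(candidate < other for other in keywords):
            del keywords[candidate]
    return keywords


keywords = {
    frozenset({"linear"}): 2.5,
    frozenset({"linear", "constraints"}): 4.5,
    frozenset({"systems"}): 1.0,
}
result = filter_subsets_sketch(keywords)
print(result is keywords)  # True: the input itself was modified
print(len(result))         # 2
```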
munin.rake.issubset_levenshtein(set_a, set_b, threshold=0.4)[source]

Compare two sets of strings and return True if set_b is a subset of set_a.

Strings are compared using the Levenshtein distance.
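One way such a fuzzy subset test could work is sketched below. The meaning of threshold is an assumption here: two strings count as equal when their edit distance, divided by the length of the longer string, is at most threshold. All function names are hypothetical, and the library may rely on an external Levenshtein package instead of a hand-written one.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def issubset_fuzzy(set_a, set_b, threshold=0.4):
    """True if every string in set_b has a close enough match in set_a."""
    return all(
        any(normalized_distance(word_a, word_b) <= threshold for word_a in set_a)
        for word_b in set_b
    )


print(issubset_fuzzy({"keyword", "extraction", "rapid"}, {"keywords", "extract"}))  # True
print(issubset_fuzzy({"keyword"}, {"stemming"}))                                    # False
```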

munin.rake.phrase_iter(sentence, stopwords, stemmer)[source]

Splits a sentence into phrases (sequences of stopword-free words).

Parameters:
  • sentence – The sentence to split into phrases.
  • stopwords – A set of stopwords.
  • stemmer – A stemmer class to be used (language aware)
Return type: [str]
Returns: An iterator that yields a list of words.
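The phrase splitting described above can be sketched as follows. To stay self-contained, this version takes an already tokenized sentence and omits stemming; phrase_iter_sketch() is a hypothetical name, and lowercasing the words is an assumption.

```python
def phrase_iter_sketch(words, stopwords):
    """Yield lists of consecutive non-stopwords from a list of words."""
    phrase = []
    for word in words:
        word = word.lower()
        if word in stopwords:
            # A stopword ends the current phrase (if there is one).
            if phrase:
                yield phrase
            phrase = []
        else:
            phrase.append(word)
    # Emit the trailing phrase that was not closed by a stopword.
    if phrase:
        yield phrase


words = "Compatibility of systems of linear constraints".split()
phrases = list(phrase_iter_sketch(words, stopwords={"of"}))
print(phrases)  # [['compatibility'], ['systems'], ['linear', 'constraints']]
```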

munin.rake.separate_words(text)[source]

Separate a text (or rather sentence) into words.

Every non-alphanumeric character except + - / is considered a word delimiter. Words that look like numbers are ignored too.

Returns: An iterable of words.
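The delimiter rule can be expressed as a regular expression. The sketch below is an approximation under two assumptions: a word "looks like a number" when it consists only of digits with an optional sign, and the original casing of words is kept. separate_words_sketch() is a hypothetical name.

```python
import re

# Any run of characters that is neither alphanumeric nor one of + - /
# acts as a delimiter. \w also matches "_", so it is excluded explicitly.
WORD_DELIMITER = re.compile(r"(?:[^\w+\-/]|_)+")

# Assumption: only plain (optionally signed) integers count as numbers.
NUMBER = re.compile(r"[+-]?\d+")


def separate_words_sketch(text):
    """Yield the words of a text, skipping empty tokens and numbers."""
    for word in WORD_DELIMITER.split(text):
        if word and not NUMBER.fullmatch(word):
            yield word


words = list(separate_words_sketch("AC/DC released 17 hi-fi albums, C++ fans!"))
print(words)  # ['AC/DC', 'released', 'hi-fi', 'albums', 'C++', 'fans']
```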
munin.rake.split_sentences(text)[source]

Split a text into individual sentences.

A newline is not considered to start a new sentence.

Returns: An iterable of strings.
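A minimal sketch of sentence splitting in which newlines do not end a sentence. Which characters terminate a sentence is an assumption (here: ".", "!" and "?"), and split_sentences_sketch() is a hypothetical name.

```python
import re

# Assumption: these characters end a sentence; a newline does not.
SENTENCE_END = re.compile(r"[.!?]+")


def split_sentences_sketch(text):
    """Yield sentences, treating newlines like ordinary whitespace."""
    for sentence in SENTENCE_END.split(text):
        # Collapse all whitespace (including newlines) into single spaces.
        sentence = " ".join(sentence.split())
        if sentence:
            yield sentence


text = "Hello darkness,\nmy old friend. I've come to talk!"
sentences = list(split_sentences_sketch(text))
print(sentences)  # ['Hello darkness, my old friend', "I've come to talk"]
```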
munin.rake.word_scores(phrases)[source]

Calculate the scores of each individual word, depending on the phrase length.

Parameters: phrases – An iterable of phrases
Returns: A mapping from a word to its score (degree(w) / freq(w))
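The score degree(w) / freq(w) can be computed as sketched below, following the definitions in the RAKE paper: freq(w) counts the occurrences of w, and degree(w) adds up the lengths of the phrases w occurs in. word_scores_sketch() is a hypothetical name, and representing each phrase as a list of words is an assumption.

```python
from collections import defaultdict


def word_scores_sketch(phrases):
    """Map each word to degree(w) / freq(w)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Each occurrence co-occurs with every word of its phrase,
            # itself included, so the phrase length is added.
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


phrases = [
    ["linear", "diophantine", "equations"],
    ["linear", "constraints"],
    ["systems"],
]
scores = word_scores_sketch(phrases)
print(scores["linear"])   # 2.5  (degree 5, freq 2)
print(scores["systems"])  # 1.0  (degree 1, freq 1)
```

Words that mostly appear inside long phrases therefore score higher than words that tend to stand alone.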
