A corpus is a large collection of written or spoken texts that is used for language research (Collins Dictionary - Corpus). Given its size, it constitutes a representative sample of language, while its machine-readable format allows annotation, as well as various types of analysis based on the criteria set and the tools used (e.g., part-of-speech, frequencies, key-word-in-context, etc.) (ICI Global - Corpus).
Advantages of the NLTK Corpus Package.
- The nltk.corpus package contains a set of corpus reader classes that can retrieve the contents of various corpora.
- The package contains utilities for reading corpus files in various formats.
- These utilities can read both the corpus files distributed with NLTK and external corpus files.
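As a rough illustration of reading an external corpus, the sketch below points NLTK's PlaintextCorpusReader at a temporary directory. The helper name load_external_corpus, the file name sample.txt, and the sample sentence are made up for this example; only words() is called because, as far as I know, it needs no additional NLTK data downloads.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader


def load_external_corpus(root):
    """Build a reader over every .txt file under `root` (hypothetical helper)."""
    return PlaintextCorpusReader(root, r".*\.txt")


# Create a tiny throwaway corpus so the sketch runs without downloading anything.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "sample.txt"), "w", encoding="utf-8") as fh:
    fh.write("NLTK reads corpora. It can read your own files too.")

reader = load_external_corpus(corpus_dir)
file_ids = reader.fileids()                 # file names matched by the pattern
tokens = list(reader.words("sample.txt"))   # word and punctuation tokens

print(file_ids)
print(tokens)
```

The same reader class should work on any directory of plain-text files; only the root path and the file-name pattern would change.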
Practical uses of the NLTK Corpus Package.
The corpus reader classes address the challenges that arise when building a new corpus reader object.
Commonly used NLTK Corpus Package methods.
NLTK provides the following methods to work with a corpus:
- words(): The return type of the words function is a list of str.
- sents(): The return type of the sents function is a list of (list of str).
- paras(): The return type of the paras function is a list of (list of (list of str)).
- tagged_words(): The return type of the tagged_words function is a list of (str, str) tuples.
- tagged_sents(): The return type of the tagged_sents function is a list of (list of (str, str)).
- tagged_paras(): The return type of the tagged_paras function is a list of (list of (list of (str, str))).
- chunked_sents(): The return type of the chunked_sents function is a list of (Tree with (str, str) leaves).
- parsed_sents(): The return type of the parsed_sents function is a list of (Tree with str leaves).
- xml(): The return type of the xml function is a single XML ElementTree.
- raw(): The return type of the raw function is a str holding the unprocessed content of the corpus.
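To make these return types concrete, here is a minimal sketch using NLTK's TaggedCorpusReader on a small hand-written corpus. The file name tagged.txt and the sample sentences are invented for illustration, and the example assumes the reader's default conventions (word/TAG tokens, one sentence per line, blank lines between paragraphs). Not every reader supports every method; chunked_sents(), parsed_sents(), and xml() belong to other reader classes and are not shown.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical two-paragraph corpus in word/TAG format, one sentence per line.
SAMPLE = (
    "The/DT cat/NN sat/VBD ./.\n"
    "It/PRP purred/VBD ./.\n"
    "\n"
    "Dogs/NNS bark/VBP ./.\n"
)

corpus_dir = tempfile.mkdtemp()
path = os.path.join(corpus_dir, "tagged.txt")
with open(path, "w", encoding="utf-8", newline="\n") as fh:
    fh.write(SAMPLE)

reader = TaggedCorpusReader(corpus_dir, r".*\.txt")

# The readers return lazy views; list() materialises them for inspection.
words = list(reader.words())                # list of str
sents = list(reader.sents())                # list of (list of str)
paras = list(reader.paras())                # list of (list of (list of str))
tagged_words = list(reader.tagged_words())  # list of (str, str)
tagged_sents = list(reader.tagged_sents())  # list of (list of (str, str))
raw_text = reader.raw()                     # str with the file content

print(words[:4])
print(sents[1])
print(len(paras))
print(tagged_words[0])
```

Wrapping each call in list() is only for display; on a large corpus it is usually better to iterate over the returned view directly.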