GENSIM-01--Core Concepts

[1] Document

In Gensim, a document is an object of the text sequence type (str in Python 3). A document can be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book.

[2] Corpus

A corpus is a collection of Document objects. Corpora serve two roles in Gensim:

(1) Input for training a model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal parameters. Gensim focuses on unsupervised models, so no human intervention, such as costly annotation or tagging documents by hand, is required.

(2) Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, etc.

Here is an example corpus. It consists of 9 documents, where each document is a string consisting of a single sentence.
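The listing that followed this sentence is missing from this copy. As a stand-in, the sketch below uses the nine-sentence corpus that commonly accompanies Gensim tutorials. The exact sentences and the variable name text_corpus are assumptions, not something the text above confirms.

```python
# Assumed example corpus: nine documents, each a plain str holding one sentence.
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# A corpus here is simply an iterable of documents.
for number, document in enumerate(text_corpus, start=1):
    print(number, document)
```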

[3] Vector

To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector of features. For example, a single feature may be thought of as a question-answer pair:

How many times does the word splonge appear in the document? Zero.

How many paragraphs does the document consist of? Two.

How many fonts does the document use? Five.

The question is usually represented only by its integer id (such as 1, 2 and 3). The representation of this document then becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). This is known as a dense vector, because it contains an explicit answer to each of the above questions.

If we know all the questions in advance, we may leave them implicit and simply represent the document as (0, 2, 5). This sequence of answers is the vector for our document (in this case a 3-dimensional dense vector). For practical purposes, only questions to which the answer is (or can be converted to) a single floating point number are allowed in Gensim.
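As a rough illustration of the two notations above, the sketch below stores the three answers first as (id, value) pairs and then as an implicit dense vector. The helper name pairs_to_dense is hypothetical and not part of Gensim; ids are 1-based here only because the text numbers the questions 1, 2 and 3.

```python
def pairs_to_dense(pairs, num_features):
    """Expand (feature_id, value) pairs into a dense list (hypothetical helper).

    Feature ids are treated as 1-based to match the numbering in the text.
    """
    dense = [0.0] * num_features
    for feature_id, value in pairs:
        dense[feature_id - 1] = float(value)
    return dense


# Answers to the three questions: splonge count, paragraph count, font count.
pairs = [(1, 0.0), (2, 2.0), (3, 5.0)]

dense = pairs_to_dense(pairs, num_features=3)
print(dense)  # [0.0, 2.0, 5.0]

# Leaving out the zero-valued answers gives a more compact list of pairs;
# any id that is absent is then understood to have the value 0.0.
nonzero_pairs = [(feature_id, value) for feature_id, value in pairs if value != 0.0]
print(nonzero_pairs)  # [(2, 2.0), (3, 5.0)]
```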

The distinction between a document and a vector is that the former is text, and the latter is a mathematically convenient representation of the text. Sometimes, people will use the terms interchangeably: for example, given some arbitrary document D, instead of saying “the vector that corresponds to document D”, they will just say “the vector D” or “the document D”. This achieves brevity at the cost of ambiguity. As long as you remember that documents exist in document space and that vectors exist in vector space, this ambiguity is acceptable.

[4] Model

Now that we have vectorized our corpus, we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In Gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here's a simple example. Let's initialize the tf-idf model, train it on our corpus, and transform the string "system minors":
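The code for this example is missing from this copy. In Gensim these steps would normally go through gensim.corpora.Dictionary and gensim.models.TfidfModel; the sketch below instead reimplements the idea with the standard library only, so it runs without Gensim installed. Several things are assumptions: the nine-sentence corpus, the plain whitespace tokenization with no stop-word removal, all helper names, and the weighting itself (term count times log2(N / document frequency), then scaling to unit length), which is meant to mirror Gensim's documented default rather than reproduce its implementation.

```python
import math
from collections import Counter

# Assumed corpus (not confirmed by the text above).
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]


def tokenize(document):
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return document.lower().split()


tokenized = [tokenize(document) for document in text_corpus]

# Map every distinct word to an integer id, as described in the Vector section.
vocabulary = sorted({word for tokens in tokenized for word in tokens})
token2id = {word: word_id for word_id, word in enumerate(vocabulary)}


def to_bow(tokens):
    """Bag-of-words vector: sorted (word_id, count) pairs for known words."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# "Training": count in how many documents each word id occurs.
num_docs = len(tokenized)
doc_freq = Counter(word_id for tokens in tokenized for word_id, _ in to_bow(tokens))


def tfidf(bow):
    """Weight counts by log2(N / df) and scale the result to unit length."""
    weights = [
        (word_id, count * math.log2(num_docs / doc_freq[word_id]))
        for word_id, count in bow
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weights))
    if norm == 0.0:
        return []
    return [(word_id, weight / norm) for word_id, weight in weights]


query = to_bow(tokenize("system minors"))
for word_id, weight in tfidf(query):
    print(word_id, vocabulary[word_id], round(weight, 4))
```

Under these assumptions, "system" occurs in three of the nine documents and "minors" in two, so "minors" is the rarer word and receives the larger weight (roughly 0.81 against roughly 0.59). The integer ids printed here come from this sketch's own alphabetical vocabulary and will generally differ from the ids a Gensim dictionary would assign.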