Text similarity calculation based on Gensim

Gensim is a Python library for natural language processing. Using algorithms such as TF-IDF (Term Frequency–Inverse Document Frequency), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Random Projections, it discovers the semantic structure of documents by examining the statistical co-occurrence patterns of words within the documents of a training corpus, and converts the documents into vector representations for further processing. Gensim also implements word2vec, which converts words into word vectors.

A corpus is a collection of raw texts used for unsupervised training of the latent topic structure of text. The corpus needs no manually annotated information. In Gensim, a corpus is usually an iterable object (such as a list), and each iteration yields a sparse vector that represents one document.
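The idea of a corpus as an iterable of sparse vectors can be sketched in plain Python. This is only an illustration, not Gensim's own code: the class name `StreamedCorpus`, the whitespace tokenization, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus: iterating yields one sparse vector per document."""

    def __init__(self, documents):
        self.documents = documents
        # Assign each distinct token an integer id in order of first appearance.
        self.token2id = {}
        for doc in documents:
            for token in doc.lower().split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Each document becomes a sorted list of (token_id, count) pairs.
        for doc in self.documents:
            counts = Counter(self.token2id[t] for t in doc.lower().split())
            yield sorted(counts.items())


documents = [
    "human machine interface",
    "machine learning for machine translation",
]
corpus = StreamedCorpus(documents)
vectors = list(corpus)
for vector in vectors:
    print(vector)
```

Because the vectors are produced inside `__iter__`, the corpus can be traversed several times, and the documents could just as well be read from disk one at a time instead of being held in a list.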

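A vector is a list of text features. It is the internal representation of a piece of text in Gensim. In its sparse form, each entry is a (feature id, value) pair, and features with a value of zero are simply left out.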
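To make the sparse form concrete, the following sketch expands a list of (feature id, value) pairs into an ordinary dense list. The helper name `to_dense` and the sample values are made up for illustration and are not part of Gensim.

```python
def to_dense(sparse_vector, num_features):
    """Expand (feature_id, value) pairs into a dense list of length num_features."""
    dense = [0.0] * num_features
    for feature_id, value in sparse_vector:
        dense[feature_id] = value
    return dense


# Features 0 and 3 are present; every other feature is implicitly zero.
sparse = [(0, 1.0), (3, 2.0)]
dense = to_dense(sparse, 5)
print(dense)
```

The sparse form matters because a real vocabulary may hold tens of thousands of features while a single document uses only a handful of them.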

A dictionary is the collection of all words across all documents. It also records information such as the number of occurrences of each word.
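A minimal sketch of such a dictionary, in plain Python, might look like the following. The function `build_dictionary` and the sample texts are hypothetical. In Gensim itself this role is played by `gensim.corpora.Dictionary`, whose internals are not reproduced here.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Return (token2id, term_freq, doc_freq) for a list of token lists."""
    token2id = {}
    term_freq = Counter()  # total occurrences of each token
    doc_freq = Counter()   # number of documents containing each token
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    return token2id, term_freq, doc_freq


texts = [
    ["human", "machine"],
    ["machine", "learning", "machine"],
]
token2id, term_freq, doc_freq = build_dictionary(texts)
print(token2id)
print(term_freq["machine"], doc_freq["machine"])
```

Keeping both counts is a design choice for the example: total occurrences describe how common a word is overall, while document counts are what weighting schemes such as TF-IDF need.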

Model is an abstract term. It defines a transformation between two vector spaces (that is, from one vector representation of text to another).
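As one example of such a transformation, the sketch below maps bag-of-words count vectors to TF-IDF weighted vectors. The class `TinyTfidf` is hypothetical, and the weighting it uses (count times log2 of N/df, followed by L2 normalization) is an assumption chosen for the illustration. It borrows the `model[vector]` indexing style that Gensim models use, but it is not Gensim's `TfidfModel`.

```python
import math


class TinyTfidf:
    """Hypothetical model: transforms count vectors into TF-IDF vectors."""

    def __init__(self, bow_corpus):
        bow_corpus = list(bow_corpus)
        self.num_docs = len(bow_corpus)
        # Document frequency: in how many documents each term id appears.
        self.doc_freq = {}
        for bow in bow_corpus:
            for term_id, _ in bow:
                self.doc_freq[term_id] = self.doc_freq.get(term_id, 0) + 1

    def __getitem__(self, bow):
        # Weight each count by log2(N / df); unseen term ids are skipped.
        weighted = [
            (term_id, count * math.log2(self.num_docs / self.doc_freq[term_id]))
            for term_id, count in bow
            if term_id in self.doc_freq
        ]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        if norm == 0:
            return []
        # Drop zero weights and scale the vector to unit length.
        return [(term_id, w / norm) for term_id, w in weighted if w > 0]


bow_corpus = [
    [(0, 1), (1, 1)],
    [(1, 1), (2, 1)],
]
model = TinyTfidf(bow_corpus)
tfidf_vector = model[[(0, 1), (1, 1)]]
print(tfidf_vector)
```

Term 1 occurs in every document, so its weight is zero and it disappears from the output. The input and output are both sparse vectors, but they live in different spaces: raw counts on one side, TF-IDF weights on the other.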