Text similarity calculation based on Gensim

Gensim is a Python natural language processing library. It uses algorithms such as TF-IDF (Term Frequency–Inverse Document Frequency), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Random Projections to discover the semantic structure of documents by examining the statistical co-occurrence patterns of words within the documents of a training corpus, and it converts the documents into vector representations for further processing. Gensim also implements word2vec, which converts words into word vectors.

A corpus is a collection of raw texts used for unsupervised training of the latent topic structure of those texts. The corpus needs no manually annotated information. In Gensim, a corpus is usually an iterable object (such as a list); each iteration returns a sparse vector that represents one text.

A vector is a list of text features. It is Gensim's internal representation of a piece of text.

A dictionary is the collection of all words across all documents; it also records information such as the number of occurrences of each word.
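To make the dictionary and sparse-vector ideas concrete, here is a minimal plain-Python sketch. It is not Gensim's implementation (in Gensim these roles are played by `gensim.corpora.Dictionary` and its `doc2bow` method); the helper names `build_dictionary` and `doc_to_bow` and the toy documents are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy documents, already lower-cased and split on whitespace.
documents = [
    "can you make money by buying a fund".split(),
    "net value of a fund".split(),
    "is a fund a safe investment".split(),
]


def build_dictionary(docs):
    """Assign an integer id to every distinct token and count, per id,
    the number of documents in which the token appears."""
    token2id = {}
    doc_freq = Counter()
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
        # Count each token at most once per document.
        for token in set(doc):
            doc_freq[token2id[token]] += 1
    return token2id, doc_freq


def doc_to_bow(doc, token2id):
    """Turn a tokenized document into a sparse vector: a sorted list of
    (token_id, count) pairs. Tokens missing from the dictionary are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


token2id, doc_freq = build_dictionary(documents)

# The corpus is an iterable in which every item is one sparse vector.
corpus = [doc_to_bow(doc, token2id) for doc in documents]
```

Only tokens that actually occur in a document appear in its vector, which is why the representation is called sparse.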

Model is an abstract term: it defines a transformation between two vector spaces (that is, from one vector representation of a text to another).
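As an illustration of a model as a vector-space transformation, the sketch below maps bag-of-words vectors to TF-IDF vectors. It uses one common formulation (count times log2(N / document frequency), followed by L2 normalization) and is not Gensim's own code; the class name `TfidfSketch` and the toy corpus are hypothetical. In Gensim itself the corresponding class is `gensim.models.TfidfModel`.

```python
import math

# Hypothetical toy corpus in bag-of-words form:
# each document is a list of (token_id, count) pairs.
bow_corpus = [
    [(0, 1), (1, 1)],
    [(1, 2), (2, 1)],
    [(1, 1), (3, 1)],
]


class TfidfSketch:
    """Learns document frequencies from a corpus, then transforms a
    bag-of-words vector into a TF-IDF vector of unit length."""

    def __init__(self, corpus):
        corpus = list(corpus)
        self.num_docs = len(corpus)
        self.doc_freq = {}
        for doc in corpus:
            for token_id, _count in doc:
                self.doc_freq[token_id] = self.doc_freq.get(token_id, 0) + 1

    def __getitem__(self, bow):
        weighted = []
        for token_id, count in bow:
            df = self.doc_freq.get(token_id)
            if not df:
                continue  # token never seen during training
            idf = math.log2(self.num_docs / df)
            if idf > 0:  # tokens present in every document carry no weight
                weighted.append((token_id, count * idf))
        norm = math.sqrt(sum(w * w for _, w in weighted))
        if norm == 0:
            return []
        return [(token_id, w / norm) for token_id, w in weighted]


tfidf = TfidfSketch(bow_corpus)

# Transform one vector from the count space into the TF-IDF space.
vector = tfidf[[(0, 1), (2, 1)]]
```

Token 1 appears in every toy document, so its weight is zero and it is dropped from any transformed vector.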

Is it up or down after the reduction?

What platform is the three rural financial poverty alleviation fund? Do you still need investment?

Will the travel expenses for training project experts be borne by the National Art Foundation?

How much farmland social security can an acre of land in a town get?

Is CLP Zhiyuan Co., Ltd. a central enterprise?

Can you make money by buying a fund?

What impact will universal second-child policy have on social insurance?

The difference between the secret room and the ordinary mystery story

Why can't private equity funds be listed on the New Third Board?

Net value of Yin Hua Quality Fund
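Assuming the sentences above serve as a small test corpus, the following plain-Python sketch ranks a few of them against a query by cosine similarity over raw word counts. It is only an illustration of the idea, not Gensim's similarity index; the query text, the regex tokenizer, and the function names are assumptions, and the English wording is used as shown here (the original language of the sentences would need its own tokenizer).

```python
import math
import re
from collections import Counter

# A few of the sample sentences listed above.
sentences = [
    "Can you make money by buying a fund?",
    "Why can't private equity funds be listed on the New Third Board?",
    "Net value of Yin Hua Quality Fund",
    "Is CLP Zhiyuan Co., Ltd. a central enterprise?",
]


def tokenize(text):
    """Lower-case the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, index) pairs sorted from most to least similar."""
    q = Counter(tokenize(query))
    scores = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)


# Hypothetical query used only for this illustration.
ranking = rank("What is the net value of a fund?", sentences)
best_score, best_index = ranking[0]
print(sentences[best_index])
```

Raw counts let frequent function words such as "a" and "the" contribute to the score; applying a weighting such as TF-IDF before computing the cosine would reduce their influence.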