WPS Office from Kingsoft is a productivity suite that serves more than 150 million users worldwide. The AI department of Kingsoft independently developed the WPS intelligent writing assistant, a lightweight yet powerful mobile app. The writing assistant uses semantic matching algorithms, such as intent recognition and text clustering, to help users create documents. Its key features include official document templates, material recommendations, and auxiliary text generation. It has also collected a massive corpus of official documents, reaching tens of millions of articles, outlines, and paragraphs.
As a core component of the WPS writing assistant, the material recommendation module is built around Milvus-based vector processing. Its goal is to efficiently extract and store high-quality official documents from a massive number of texts and accurately recommend them to a targeted audience.
The material recommendation service consists of three parts: the data processing module, the encoding and storage module, and the query and recommendation module. The Milvus vector database underpins both the encoding and storage module and the query and recommendation module.
The data processing module mainly performs data cleansing and outline and paragraph extraction, sorting outline and paragraph data out of the massive raw corpus.
The encoding and storage module involves two parts: text encoding and vector storage. The procedure of text encoding is as follows:
- Obtain 256-dimensional vectors through deep learning methods.
- Insert the vectors and the IDs of the original texts into the Milvus database, and build an IVF_FLAT index for each collection.
The procedure of query and recommendation is as follows:
- Encode the query criteria, such as user input, into a search vector.
- Use the similarity metric provided by Milvus (L2 distance) to perform a nearest neighbor search, and return the coarsely recalled vectors and the IDs of their original texts.
- Produce accurately ranked recommendations using user profiles and ranking models.
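The query steps above can be sketched as a brute-force L2 nearest neighbor search in NumPy. This is only an illustration of the coarse-recall logic; in production the search is delegated to Milvus, and the encoder, corpus vectors, and query vector below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "encoded" corpus: 1,000 texts as 256-dim vectors with IDs.
corpus_vectors = rng.standard_normal((1000, 256)).astype(np.float32)
corpus_ids = np.arange(1000)

def l2_recall(query_vec, vectors, ids, top_k=10):
    """Coarse recall: return the top_k nearest vectors by L2 distance."""
    dists = np.linalg.norm(vectors - query_vec, axis=1)
    order = np.argsort(dists)[:top_k]
    return ids[order], dists[order]

# Encode the user query (here: a random placeholder vector).
query = rng.standard_normal(256).astype(np.float32)

recalled_ids, recalled_dists = l2_recall(query, corpus_vectors, corpus_ids)
# Downstream, these coarse candidates are re-ranked with user profiles and models.
```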
The data processing module mainly uses feature engineering, regular matching, and NLP model scoring.
A document is a human representation of complex semantics. A large amount of semantic information is distributed at multiple levels from words to phrases, sentences, paragraphs, or articles. Establishing feature engineering for documents is an efficient way to maximize the extraction of semantic features.
This case combines the actual text data and works at multiple levels, such as vocabulary and sentence, to establish the vocabulary features and sentence features of the document.
At the vocabulary feature level, a corpus is established through word segmentation, and keyword weights are then calculated with the TF-IDF algorithm.
TF (term frequency): tf(t, d) = n(t, d) / Σₖ n(k, d), the number of occurrences of term t in document d divided by the total number of terms in d.
IDF (inverse document frequency): idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
TF-IDF (term frequency – inverse document frequency): tfidf(t, d) = tf(t, d) × idf(t).
Document keywords are extracted after sorting the TF-IDF values.
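The keyword extraction described above can be sketched in a few lines of plain Python (the three tokenized documents are made-up toy data):

```python
import math
from collections import Counter

# Toy corpus: three pre-tokenized documents.
docs = [
    "official document template recommendation".split(),
    "document outline extraction and recommendation".split(),
    "poetry creation and letter template".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of term / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Keywords of the first document, sorted by descending TF-IDF weight.
keywords = sorted(set(docs[0]), key=lambda t: tfidf(t, docs[0], docs), reverse=True)
```

Terms that appear in only one document (here, "official") get the highest IDF and therefore rank first as keywords.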
Text objects contain a large number of entity words, such as person and institution names. These entities play a significant role in recall and ranking (especially in precise ranking), so the deep learning model BLSTM-CNNs-CRF is used to extract them from documents.
The BLSTM-CNNs-CRF model consists of three parts:
- A CNN computes a character-level representation of each word from character embeddings.
- This representation is concatenated with the word embedding and fed into a BLSTM (bidirectional long short-term memory) network.
- The output of the BLSTM is fed to a CRF (conditional random field), which decodes the best label sequence.
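The final CRF decoding step is typically implemented with the Viterbi algorithm. Below is a minimal NumPy sketch of that step alone; the emission scores (which would come from the BLSTM) and the transition scores are made-up inputs:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence.

    emissions: (seq_len, num_labels) per-token label scores (e.g. from a BLSTM).
    transitions: (num_labels, num_labels) score of moving from label i to j.
    """
    seq_len, num_labels = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j]: best score of ending in label j at step t via label i.
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # Follow the backpointers from the best final label.
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy example: emissions strongly favor the label sequence [0, 1, 0].
emissions = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
transitions = np.zeros((2, 2))
best_path = viterbi_decode(emissions, transitions)
```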
At the sentence feature level, this case uses TextRank to extract summary sentences as the key sentences of a document. As an extractive, unsupervised text summarization method, the TextRank algorithm draws on PageRank, which is used to rank web pages in online search results. It performs text segmentation, text vectorization, and graph model building, and then ranks sentences via a transition probability matrix to extract the key sentences from the document.
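The ranking step of TextRank can be sketched as PageRank-style power iteration over a sentence similarity graph. The similarity matrix below is a made-up toy example; in practice it would come from text vectorization:

```python
import numpy as np

def textrank_scores(sim, d=0.85, iters=50):
    """Rank sentences by running PageRank over a similarity matrix."""
    n = sim.shape[0]
    # Row-normalize similarities into a transition probability matrix.
    m = sim / sim.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * (m.T @ scores)
    return scores

# Toy symmetric similarity matrix for 3 sentences (made-up values):
# sentence 0 is strongly similar to both others, so it should rank first.
sim = np.array([
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.2],
    [0.6, 0.2, 0.0],
])
scores = textrank_scores(sim)
key_sentence = int(scores.argmax())  # index of the top-ranked sentence
```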
At the same time, this case also trains a TextCNN model to extract high-quality paragraphs and sentences from the document, treating the extraction task as a classification task. To better capture the local correlation between sentences and words, this case uses pre-trained word vectors (Word2Vec) and multiple convolution kernels of different sizes.
TextCNN mainly consists of an embedding layer, a convolution layer, a max pooling layer, and a fully connected softmax layer. As one of the most commonly used text classification algorithms, TextCNN features a simple structure, effective results, and high scalability.
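A minimal forward pass through those four layers can be sketched in NumPy. All weights and the input token sequence below are random placeholders; a real TextCNN would be trained with a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_forward(tokens, emb, kernels, w_out):
    """Minimal TextCNN forward pass: embed -> convolve -> max-pool -> softmax."""
    x = emb[tokens]                       # (seq_len, emb_dim)
    pooled = []
    for k in kernels:                     # k: (width, emb_dim, n_filters)
        width = k.shape[0]
        convs = np.stack([
            np.tensordot(x[i:i + width], k, axes=([0, 1], [0, 1]))
            for i in range(len(x) - width + 1)
        ])                                # (positions, n_filters)
        pooled.append(np.maximum(convs, 0).max(axis=0))  # ReLU + max-pool
    h = np.concatenate(pooled)            # pooled features from all kernel sizes
    logits = h @ w_out                    # fully connected layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax class probabilities

vocab, emb_dim, n_filters, n_classes = 50, 16, 4, 2
emb = rng.standard_normal((vocab, emb_dim))
kernels = [rng.standard_normal((w, emb_dim, n_filters)) for w in (2, 3, 4)]
w_out = rng.standard_normal((3 * n_filters, n_classes))

probs = textcnn_forward(rng.integers(0, vocab, size=10), emb, kernels, w_out)
```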
The encoding and storage module mainly uses a semantic understanding model for embedding and the index component of Milvus.
In the encoding part, traditional sentence embedding usually relies on unsupervised methods, but these methods are not robust enough to handle long sentences. This case uses the InferSent model to generate general-purpose sentence representations. In InferSent, sentence embeddings are learned through supervised training.
As a supervised model, InferSent:
- Selects SNLI as the classification task, and encodes each sentence pair (premise, hypothesis) through an encoder to obtain the corresponding feature vectors U and V;
- Builds features from the concatenation, absolute difference, and element-wise product of U and V;
- Outputs the corresponding judgment after a fully connected layer and a softmax layer.
After training completes, the encoded sentence vectors are obtained from the encoder.
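The feature-combination step described above is straightforward to sketch. The two embeddings here are tiny made-up vectors rather than real encoder outputs:

```python
import numpy as np

def pair_features(u, v):
    """Combine sentence embeddings U and V into classifier features:
    concatenation, absolute difference, and element-wise product."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

# Toy 3-dim "sentence embeddings"; a real encoder output would be much larger.
u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, 2.0, -1.0])
features = pair_features(u, v)  # dimension 4 * 3 = 12
```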
In the index part, IndexFlatL2 is a brute-force index designed for Euclidean distance calculation. Considering the real-world scenario, however, this case uses the IVF_FLAT index, which adds clustering on top of IndexFlatL2. By partitioning the search space and probing only selected clusters during a query, it greatly accelerates the overall search.
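The idea behind IVF_FLAT can be illustrated with a small NumPy sketch: first find the nearest cluster centroids, then brute-force search only within those clusters. The vectors, centroids, and cluster count below are toy assumptions, and real IVF indexes learn centroids via k-means training:

```python
import numpy as np

rng = np.random.default_rng(0)

def ivf_flat_search(query, vectors, centroids, assignments, nprobe=2, top_k=5):
    """IVF_FLAT-style search: probe the nprobe nearest clusters,
    then do an exact L2 search only within those clusters."""
    c_dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(c_dists)[:nprobe]           # clusters to visit
    candidate_ids = np.flatnonzero(np.isin(assignments, probe))
    d = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    order = np.argsort(d)[:top_k]
    return candidate_ids[order]

# Toy data: 1,000 vectors assigned to 8 clusters by nearest centroid.
vectors = rng.standard_normal((1000, 32)).astype(np.float32)
centroids = rng.standard_normal((8, 32)).astype(np.float32)
assignments = np.array([
    np.argmin(np.linalg.norm(centroids - v, axis=1)) for v in vectors
])

result = ivf_flat_search(rng.standard_normal(32).astype(np.float32),
                         vectors, centroids, assignments)
```

Only the vectors in the probed clusters are compared against the query, which is what makes IVF_FLAT much faster than a flat scan at a small cost in recall.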
This case also uses the partitioning function of Milvus to divide the data into different categories, making queries faster and more accurate.
The online service mainly runs on a shared Kubernetes (K8s) cluster and uses MySQL, rather than the default SQLite, to store metadata.
The currently deployed Milvus vector database (version 0.6.0, CPU edition) holds about 2 million texts and supports the WPS writing assistant. It can also scale to process a corpus of tens of millions of entries.
This case uses a shared cluster, which means the computing resources are shared with other applications, so the figures here are for reference only. In the current version, the overall average response time of a single service is 0.2 s.
The one-click document generation function of the WPS writing assistant generates a complete article for the user based on its understanding of the title and keywords, combined with recommendation algorithms that select appropriate outlines and paragraphs.
The intelligent text generation function initially provides users with multiple interchangeable outline paragraphs. Then, as the user modifies or creates a document, text paragraphs for the selected field are generated by an AI generation algorithm for the user to quote or reference. In this way, the function achieves human-oriented, machine-assisted text creation.
Building on these document generation functions, the WPS writing assistant has developed practical features such as poetry creation and letter template recommendation. It has also introduced a writing community to help users broaden their views, share their work, make friends, and further improve their writing experience.
Thank you so much for reading, and I hope you now have a better understanding of AI-assisted writing, how it works, and how you can apply these technologies to similar projects. If you are interested in using Milvus in your NLP projects, I encourage you to check out Milvus on GitHub.
This article is co-written by Lang Wang and Qixian Chen from WPS. The WPS team also uses other AI technologies such as TensorFlow; here is a blog post if you are interested in learning more.