AUTOMATED TEMPLATE GENERATION BASED ON WORD EMBEDDINGS
Authors: Manatuica Maria, Dascalu Mihai, Trausan-Matu Stefan, Ruseti Stefan
2018 | Volume 2 | DOI: 10.12753/2066-026X-18-124 | Pages: 393-399
Abstract
Extracting document templates and generalizing the structure of similar documents from specific domains can significantly increase learner productivity when creating new documents. More generally, manually creating a draft document can be a difficult and time-consuming task, whether the goal is to obtain a general form for a specific type of document or to identify the main ideas of a set of scientific papers on the same subject. Thus, instead of starting from a blank page, which can be frustrating in most cases, we propose an automated method for grouping semantically similar documents and identifying potential templates. This paper introduces the first steps towards building an automated method that relies on advanced Natural Language Processing techniques to generate templates from large collections by identifying patterns between and within documents. The underlying semantic model is word2vec, a two-layer neural network that builds word embeddings, trained on the general-purpose TASA corpus. The generated word vectors were then used to compute document representations that take normalized word occurrences into account; afterwards, an agglomerative clustering algorithm was applied. Each cluster produced one template formed of paragraphs chosen from the original collection. To evaluate the proposed method, several experiments were conducted on collections from multiple domains. The results were analysed using charts of document and paragraph similarity on the one hand, and evolution graphs of the agglomerative clustering process on the other. Overall, our automated process was efficient, and the results were encouraging in terms of proposing initial document templates. Further research paths include the anonymization of named entities and more in-depth comparisons of document structure and syntax, in addition to semantic relatedness.
Keywords
Natural Language Processing; Template generation; Word embeddings; Agglomerative Clustering
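
The pipeline outlined in the abstract (word vectors, document representations weighted by normalized word occurrences, agglomerative clustering, one representative per cluster) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the two-dimensional `TOY_VECTORS` stand in for word2vec embeddings trained on TASA, cosine similarity with average linkage is one possible clustering setup, and `representative` picks whole documents rather than paragraphs.

```python
import math
from collections import Counter

# Hypothetical 2-d "embeddings"; the paper uses word2vec trained on TASA.
TOY_VECTORS = {
    "invoice": (1.0, 0.1), "payment": (0.9, 0.2), "total": (0.8, 0.0),
    "neuron": (0.1, 1.0), "network": (0.2, 0.9), "training": (0.0, 0.8),
}
DIMS = 2


def doc_vector(text):
    """Average the word vectors, weighted by normalized word occurrences."""
    counts = Counter(w for w in text.lower().split() if w in TOY_VECTORS)
    total = sum(counts.values())
    if total == 0:
        return (0.0,) * DIMS
    return tuple(
        sum(TOY_VECTORS[w][i] * c / total for w, c in counts.items())
        for i in range(DIMS)
    )


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering (an assumed linkage choice)."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        pairs = [
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        # Merge the two most similar clusters.
        a, b = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


def representative(cluster, vectors):
    """Pick the member most similar to the rest of its cluster (assumed rule)."""
    return max(
        cluster,
        key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in cluster),
    )


docs = [
    "invoice total payment",
    "payment invoice invoice",
    "neuron network training",
    "network training training",
]
vectors = [doc_vector(d) for d in docs]
clusters = agglomerate(vectors, n_clusters=2)
for cluster in clusters:
    print(sorted(cluster), "->", docs[representative(cluster, vectors)])
```

In a realistic setting the toy dictionary would be replaced by trained embeddings, and the selection step would operate on paragraphs within each cluster, as the abstract describes.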