[box type=”note” align=”” class=”” width=””]The following excerpt is taken from the book Mastering Text Mining with R, co-authored by Ashish Kumar and Avinash Paul. This book lists various techniques to extract useful and high-quality information from your textual data.[/box]
There is a wide range of packages available in R for natural language processing and text mining. In the article below, we present some of the popular and widely used R packages for NLP:
OpenNLP is an R package which provides an interface, Apache OpenNLP, which is a machine-learning-based toolkit written in Java for natural language processing activities. Apache OpenNLP is widely used for most common tasks in NLP, such as tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and so on. It provides wrappers for Maxent entropy models using the Maxent Java package.
It provides functions for sentence annotation, word annotation, POS tag annotation, and annotation parsing using the Apache OpenNLP chunking parser. The Maxent Chunk annotator function computes the chunk annotation using the Maxent chunker provided by OpenNLP.
The Maxent entity annotator function in R package utilizes the Apache OpenNLP Maxent name finder for entity annotation. Model files can be downloaded from
These language models can be effectively used in R packages by installing the OpenNLPmodels.language package from the repository at http://datacube.wu.ac.at.
Get the OpenNLP package here.
The RWeka package in R provides an interface to Weka. Weka is an open source software developed by a machine learning group at the University of Wakaito, which provides a wide range of machine learning algorithms which can either be directly applied to a dataset or it can be called from a Java code. Different data-mining activities, such as data processing, supervised and unsupervised learning, association mining, and so on, can be performed using the RWeka package. For natural language processing, RWeka provides tokenization and stemming functions. RWeka packages provide an interface to Alphabetic,
wordTokenizer functions, which can efficiently perform tokenization for contiguous alphabetic sequence, string-split to n-grams, or simple word tokenization, respectively.
Get started with Rweka here.
RcmdrPlugin.temis package in R provides a graphical integrated text-mining solution. This package can be leveraged for many text-mining tasks, such as importing and cleaning a corpus, terms and documents count, term co-occurrences, correspondence analysis, and so on. Corpora can be imported from different sources and analysed using the
importCorpusDlg function. The package provides flexible data source options to import corpora from different sources, such as text files, spreadsheet files, XML, HTML files, Alceste format and Twitter search. The Import function in this package processes the corpus and generates a term-document matrix. The package provides different functions to summarize and visualize the corpus statistics. Correspondence analysis and hierarchical clustering can be performed on the corpus. The
corpusDissimilarity function helps analyse and create a
crossdissimilarity table between term-documents present in the corpus.
This package provides many functions to help the users explore the corpus. For example,
frequentTerms to list the most frequent terms of a corpus,
specificTerms to list terms most associated with each document,
subsetCorpusByTermsDlg to create a subset of the corpus. Term frequency, term co-occurrence, term dictionary, temporal evolution of occurrences or term time series, term metadata variables, and corpus temporal evolution are among the other very useful functions available in this package for text mining.
Download the package from CRAN page.
The tm package is a text-mining framework which provides some powerful functions which will aid in text-processing steps. It has methods for importing data, handling corpus, metadata management, creation of term document matrices, and preprocessing methods. For managing documents using the tm package, we create a corpus which is a collection of text documents. There are two types of implementation, volatile corpus (VCorpus) and permanent corpus (PCropus). VCorpus is completely held in memory and when the R object is destroyed the corpus is gone.
PCropus is stored in the filesystem and is present even after the R object is destroyed; this corpus can be created by using the VCorpus and PCorpus functions respectively. This package provides a few predefined sources which can be used to import text, such as DirSource, VectorSource, or DataframeSource. The getSources method lists available sources, and users can create their own sources. The tm package ships with several reader options: readPlain, readPDF, and readDOC. We can execute the getReaders method for an up-to-date list of available readers. To write a corpus to the filesystem, we can use writeCorpus.
For inspecting a corpus, there are methods such as inspect and print. For transformation of text, such as stop-word removal, stemming, whitespace removal, and so on, we can use the
tm_map, content_transformer, tolower, stopwords("english") functions. For metadata management, meta comes in handy. The tm package provides various quantitative function for text analysis, such as
Download the tm package here.
languageR provides data sets and functions for statistical analysis on text data. This package contains functions for vocabulary richness, vocabulary growth, frequency spectrum, also mixed-effects models and so on. There are simulation functions available: simple regression, quasi-F factor, and Latin-square designs. Apart from that, this package can also be used for correlation, collinearity diagnostic, diagnostic visualization of logistic models, and so on.
The koRpus package is a versatile tool for text mining which implements many functions for text readability and lexical variation. Apart from that, it can also be used for basic level functions such as tokenization and POS tagging.
You can find more information about its current version and dependencies here.
The RKEA package provides an interface to KEA, which is a tool for keyword extraction from texts. RKEA requires a keyword extraction model, which can be created by manually indexing a small set of texts, using which it extracts keywords from the document.
The maxent package in R provides tools for low-memory implementation of multinomial logistic regression, which is also called the maximum entropy model. This package is quite helpful for classification processes involving sparse term-document matrices, and low memory consumption on huge datasets.
Download and get started with maxent.
Truncated singular vector decomposition can help overcome the variability in a term-document matrix by deriving the latent features statistically. The lsa package in R provides an implementation of latent semantic analysis.
The ease of use and efficiency of R packages can be very handy when carrying out even the trickiest of text mining task. As a result, they have grown to become very popular in the community.
If you found this post useful, you should definitely refer to our book Mastering Text Mining with R. It will give you ample techniques for effective text mining and analytics using the above mentioned packages.