
This article is written by Breck Baldwin and Krishna Dayanidhi, the authors of Natural Language Processing with Java and LingPipe Cookbook. In this article, we will cover logistic regression.


Logistic regression is probably responsible for the majority of industrial classifiers, with the possible exception of naïve Bayes classifiers. It almost certainly is one of the best performing classifiers available, albeit at the cost of slow training and considerable complexity in configuration and tuning.

Logistic regression is also known as maximum entropy classification, classification with a single-neuron neural network, and by other names as well. The classifiers covered so far have been based on the underlying characters or tokens, but logistic regression uses unrestricted feature extraction, which allows arbitrary observations of the situation to be encoded in the classifier.

This article closely follows a more complete tutorial at http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html.

How logistic regression works

All that logistic regression does is take a vector of feature weights over the data, apply a vector of coefficients, and do some simple math, which results in a probability for each class encountered in training. The complicated bit is in determining what the coefficients should be.

The following are some of the features produced by our training example for 21 tweets annotated for English e and non-English n. There are relatively few features because feature weights are being pushed to 0.0 by our prior, and once a weight is 0.0, the feature is removed. Note that one category, n, has all of its features set to 0.0; this is a property of the logistic regression process, which fixes the features of one category to 0.0 and adjusts the features of the remaining n-1 categories with respect to it:

FEATURE e n
I : 0.37 0.0
! : 0.30 0.0
Disney : 0.15 0.0
" : 0.08 0.0
to : 0.07 0.0
anymore : 0.06 0.0
isn : 0.06 0.0
' : 0.06 0.0
t : 0.04 0.0
for : 0.03 0.0
que : -0.01 0.0
moi : -0.01 0.0
_ : -0.02 0.0
, : -0.08 0.0
pra : -0.09 0.0
? : -0.09 0.0

Take the string, I luv Disney, which will only have two non-zero features: I=0.37 and Disney=0.15 for e and zeros for n. Since there is no feature that matches luv, it is ignored. The probability that the tweet is English breaks down to:

vectorMultiply(e,[I,Disney]) = exp(.37*1 + .15*1) = 1.68
vectorMultiply(n,[I,Disney]) = exp(0*1 + 0*1) = 1

We will rescale these to probabilities by summing the outcomes and dividing each by the sum:

p(e|[I,Disney]) = 1.68/(1.68 + 1) = 0.62
p(n|[I,Disney]) = 1/(1.68 + 1) = 0.38
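
To make the arithmetic concrete, here is a minimal Java sketch that recomputes the worked example. The weights are hard-coded from the (rounded) table above, so the result differs slightly in the later decimal places from what the full model produces:

double[] weightsE = {0.37, 0.15}; // rounded coefficients for I and Disney in category e
double[] weightsN = {0.0, 0.0};   // category n is fixed at 0.0
double[] features = {1.0, 1.0};   // both features occur once in "I luv Disney"
double dotE = 0.0;
double dotN = 0.0;
for (int i = 0; i < features.length; ++i) {
    dotE += weightsE[i] * features[i];
    dotN += weightsN[i] * features[i];
}
double expE = Math.exp(dotE); // approximately 1.68
double expN = Math.exp(dotN); // exactly 1.0
System.out.printf("p(e|input)=%.4f%n", expE / (expE + expN)); // approximately 0.6271
System.out.printf("p(n|input)=%.4f%n", expN / (expE + expN)); // approximately 0.3729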

This is how the math works on running a logistic regression model. Training is another issue entirely.

Getting ready

This example assumes the same framework that we have been using all along to get training data from .csv files, train the classifier, and run it from the command line.

Setting up to train the classifier is a bit complex because of the number of parameters and objects used in training.

The main() method starts with what should be familiar classes and methods:

public static void main(String[] args) throws IOException {
String trainingFile = args.length > 0 ? args[0]
: "data/disney_e_n.csv";
List<String[]> training
= Util.readAnnotatedCsvRemoveHeader(new File(trainingFile));
int numFolds = 0;
XValidatingObjectCorpus<Classified<CharSequence>> corpus
= Util.loadXValCorpus(training,numFolds);
TokenizerFactory tokenizerFactory
= IndoEuropeanTokenizerFactory.INSTANCE;

Note that we are using XValidatingObjectCorpus when a simpler implementation such as ListCorpus would do. We will not take advantage of any of its cross-validation features, because setting the numFolds parameter to 0 means that training will visit the entire corpus. We are trying to keep the number of novel classes to a minimum, and we tend to always use this implementation in real-world gigs anyway.

Now, we will start to build the configuration for our classifier. The FeatureExtractor<E> interface provides a mapping from data to features; this will be used to train and run the classifier. In this case, we are using a TokenFeatureExtractor, which creates features based on the tokens found by the tokenizer supplied at construction. This is similar to what naïve Bayes reasons over:

FeatureExtractor<CharSequence> featureExtractor
= new TokenFeatureExtractor(tokenizerFactory);

The minFeatureCount item is usually set to a number higher than 1, but with small training sets, a value of 1 is needed to get any performance at all. The thought behind filtering low-count features is that logistic regression tends to overfit features that, just by chance, exist in only one category of the training data. As the training data grows, the minFeatureCount value is usually adjusted by paying attention to cross-validation performance:

int minFeatureCount = 1;

The addInterceptFeature Boolean controls whether a category feature exists that models the prevalence of the category in training. The default name of the intercept feature is *&^INTERCEPT%$^&**, and you will see it in the weight vector output if it is being used. By convention, the intercept feature is set to 1.0 for all inputs. The idea is that if a category is just very common or very rare, there should be a feature that captures just this fact, independent of other features that might not be as cleanly distributed. This plays a role somewhat like the category probability in naïve Bayes, but the logistic regression algorithm will decide how useful the intercept is, just as it does with all other features:

boolean addInterceptFeature = true;
boolean noninformativeIntercept = true;

These Booleans control what happens to the intercept feature if it is used. Priors, set up in the following code, are typically not applied to the intercept feature; that is what happens when noninformativeIntercept is true. Set this Boolean to false, and the prior will be applied to the intercept as well.
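
To make the intercept concrete, the following hypothetical sketch shows roughly what a feature map for I luv Disney looks like once the intercept is added; the token values are illustrative counts, and only the intercept name is taken from the default quoted above:

// Requires java.util.HashMap and java.util.Map.
Map<String,Double> featureMap = new HashMap<String,Double>();
featureMap.put("I", 1.0);       // token features from the tokenizer
featureMap.put("luv", 1.0);
featureMap.put("Disney", 1.0);
featureMap.put("*&^INTERCEPT%$^&**", 1.0); // the intercept is 1.0 for every input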

Next is the RegressionPrior instance, which controls how the model is fit. What you need to know is that priors help prevent logistic regression from overfitting the data by pushing coefficients towards 0. There is also a non-informative prior that does not do this, with the consequence that if a feature applies to just one category, its coefficient will be scaled towards infinity, because the model keeps fitting better as that coefficient is increased during numeric estimation. Priors, in this context, are a way of not being overconfident in observations about the world.

Another dimension of the RegressionPrior instance is the expected variance of the coefficients. A low variance will push coefficients to zero more aggressively. The prior returned by the static laplace() method tends to work well for NLP problems. There is a lot going on, but it can be managed without a deep theoretical understanding.

double priorVariance = 2;
RegressionPrior prior
= RegressionPrior.laplace(priorVariance,
noninformativeIntercept);
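
That is all the prior configuration LingPipe needs. As a purely conceptual sketch (not LingPipe's actual implementation), the following shows why a Laplace prior shrinks coefficients: its penalty gradient has a constant magnitude of 1/scale no matter how small the coefficient already is, so weakly supported features get pulled all the way to 0.0. The learningRate and coefficient values here are only illustrative:

// Conceptual sketch of one shrinkage step under a Laplace (L1-style) prior.
// For a Laplace distribution, variance = 2 * scale^2, so scale = sqrt(variance / 2).
double scale = Math.sqrt(priorVariance / 2.0);
double learningRate = 0.00025;  // illustrative value
double coefficient = 0.37;      // e.g., the weight learned for the feature I
// The penalty pulls the coefficient a fixed amount towards zero each step.
coefficient -= learningRate * Math.signum(coefficient) / scale;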

Next, we will control how the algorithm searches for an answer.

AnnealingSchedule annealingSchedule
= AnnealingSchedule.exponential(0.00025,0.999);
double minImprovement = 0.000000001;
int minEpochs = 100;
int maxEpochs = 2000;

AnnealingSchedule is best understood by consulting the Javadoc, but what it does is change how much the coefficients are allowed to vary when fitting the model. The minImprovement parameter sets the amount by which the model fit must improve each period; otherwise, the search terminates because the algorithm has converged. The minEpochs parameter sets a minimum number of iterations, and maxEpochs sets an upper limit if the search does not converge as determined by minImprovement.
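
The exponential schedule above starts the learning rate at 0.00025 and multiplies it by 0.999 each epoch, which matches the learning rates reported in the training output later in this article. Here is a small sketch of that decay, assuming the formula rate = initialRate * base^epoch:

// Sketch of exponential annealing: the learning rate shrinks by the base every epoch.
double initialLearningRate = 0.00025;
double base = 0.999;
for (int epoch = 0; epoch < 2000; ++epoch) {
    double learningRate = initialLearningRate * Math.pow(base, epoch);
    if (epoch < 2 || epoch >= 1998) {
        System.out.printf("epoch=%d lr=%.9f%n", epoch, learningRate);
    }
}
// Prints 0.000250000, 0.000249750, ..., 0.000033868, 0.000033834,
// the same learning rates that appear in the convergence log below.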

Next is some code that allows for basic reporting/logging. LogLevel.INFO will report a great deal of information about the progress of the classifier as it tries to converge:

PrintWriter progressWriter = new PrintWriter(System.out,true);
progressWriter.println("Reading data.");
Reporter reporter = Reporters.writer(progressWriter);
reporter.setLevel(LogLevel.INFO);

Here ends the Getting ready section of one of our most complex classes—next, we will train and run the classifier.

How to do it…

It has been a bit of work setting up to train and run this class. We will just go through the steps to get it up and running:

  1. Note that there is a more complex 14-argument train() method, as well as one that extends configurability further. This is the 10-argument version:
    LogisticRegressionClassifier<CharSequence> classifier
        = LogisticRegressionClassifier.<CharSequence>train(corpus,
            featureExtractor,
            minFeatureCount,
            addInterceptFeature,
            prior,
            annealingSchedule,
            minImprovement,
            minEpochs,
            maxEpochs,
            reporter);
  2. The train() method, depending on the LogLevel constant, will produce anything from no output with LogLevel.NONE to prodigious output with LogLevel.ALL.
  3. While we are not going to use it here, we show how to serialize the trained model to disk (a sketch for reading it back appears after this list):
    AbstractExternalizable.compileTo(classifier,
    new File("models/myModel.LogisticRegression"));
  4. Once trained, we will apply the standard classification loop with:
    Util.consoleInputPrintClassification(classifier);
  5. Run the preceding code in the IDE of your choice or use the command-line command:
    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/
    opencsv-2.4.jar com.lingpipe.cookbook.chapter3.TrainAndRunLogReg
  6. The result is a big dump of information about the training:
    Reading data.
    :00 Feature Extractor class=class com.aliasi.tokenizer.
    TokenFeatureExtractor
    :00 min feature count=1
    :00 Extracting Training Data
    :00 Cold start
    :00 Regression callback handler=null
    :00 Logistic Regression Estimation
    :00 Monitoring convergence=true
    :00 Number of dimensions=233
    :00 Number of Outcomes=2
    :00 Number of Parameters=233
    :00 Number of Training Instances=21
    :00 Prior=LaplaceRegressionPrior(Variance=2.0,
    noninformativeIntercept=true)
    :00 Annealing Schedule=Exponential(initialLearningRate=2.5E-4,
    base=0.999)
    :00 Minimum Epochs=100
    :00 Maximum Epochs=2000
    :00 Minimum Improvement Per Period=1.0E-9
    :00 Has Informative Prior=true
    :00 epoch= 0 lr=0.000250000 ll= -20.9648 lp=
    -232.0139 llp= -252.9787 llp*= -252.9787
    :00 epoch= 1 lr=0.000249750 ll= -20.9406 lp=
    -232.0195 llp= -252.9602 llp*= -252.9602
  7. The epoch reporting goes on until either the number of epochs is met or the search converges. In the following case, the number of epochs was met:
    :00 epoch= 1998 lr=0.000033868 ll= -15.4568 lp=
    -233.8125 llp= -249.2693 llp*= -249.2693
    :00 epoch= 1999 lr=0.000033834 ll= -15.4565 lp=
    -233.8127 llp= -249.2692 llp*= -249.2692
  8. Now, we can play with the classifier a bit:
    Type a string to be classified. Empty string to quit.
    I luv Disney
    Rank Category Score P(Category|Input)
    0=e 0.626898085027528 0.626898085027528
    1=n 0.373101914972472 0.373101914972472
  9. This should look familiar; it is exactly the same result as the worked example at the start.
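
As mentioned in step 3, reading the serialized model back in is short. The following sketch assumes a separate program, and that the compiled object deserializes to a LogisticRegressionClassifier written by compileTo() as shown above; if your LingPipe version returns a different runtime type, cast to the classifier interface you need:

// Sketch: load a previously serialized model and reuse the console loop.
// Note that readObject() also throws ClassNotFoundException, which must be
// declared or caught in addition to IOException.
@SuppressWarnings("unchecked")
LogisticRegressionClassifier<CharSequence> classifier
    = (LogisticRegressionClassifier<CharSequence>)
        AbstractExternalizable.readObject(
            new File("models/myModel.LogisticRegression"));
Util.consoleInputPrintClassification(classifier);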

That’s it! You have trained up and used the world’s most relevant industrial classifier. However, there’s a lot more to harnessing the power of this beast.

Summary

In this article, we learned how logistic regression works and how to configure, train, and run a logistic regression classifier.
