Lise Getoor is a professor in the Computer Science Department, at the University of California, Santa Cruz. She has a PhD in Computer Science from Stanford University. She has spent a lot of time studying machine learning, reasoning under uncertainty, databases, data science for social good, artificial intelligence

This article attempts to bring our readers to Lisa’s Keynote speech at NIPS 2017. It highlights how structures can be unreasonably effective and the ways to leverage structures in Machine learning problems. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Lisa’s presentation slides and do not belong to us.

Our ability to collect, manipulate, analyze, and act on vast amounts of data is having a profound impact on all aspects of society. Much of this data is heterogeneous in nature and interlinked in a myriad of complex ways. This Data is **Multimodal** (it has different kinds of entities), **Multi-relational **(it has different links between things), and **Spatio-Temporal **(it involves space and time parameters). This keynote explores how we can exploit the structure that’s in the input as well as the output of machine learning algorithms.

A large number of structured problems exists in the fields of NLP and computer vision, computational biology, computational social sciences, knowledge graph extraction and so on. According to Dan Roth, all interesting decisions are structured i.e. there are dependencies between the predictions. Most ML algorithms take this nicely structured data and flatten it put it in a matrix form, which is convenient for our algorithms. However, there is a bunch of different issues with it.

The most fundamental issue with the matrix form is that it assumes incorrect independence. Further, in the context of structure and outputs, we’re unable to apply the collective reasoning about the predictions we made for different entries in this matrix. Therefore we need to have ways where we can declaratively talk about how to transform the structure into features. This talk provides us with patterns, tools, and templates for dealing with structures in both inputs and outputs.

Lisa has covered three topics for solving structured problems: Patterns, Tools, and Templates. Patterns are used for simple structured problems. Tools help in getting patterns to work and in creating tractable structured problems. Templates build on patterns and tools to solve bigger computational problems.

# Patterns

They are used for naively simple structured problems. But on encoding them repeatedly, one can increase performance by 5 or 10%.

We use Logical Rules to capture structure in patterns. These logical structures capture structure, i.e. they give an easy way of talking about entities and links between entities. They also tend to be interpretable.

There are three basic patterns for structured prediction problems: **Collective Classification, Link Prediction, Entity Resolution.**

### To learn more about Patterns, open this section

### Collective Classification

Collective classification is used for inferring the labels of nodes in a graph. The pattern for expressing this in logical rules is

**
**

**
local – predictor (x, l) →
**

**label (x, l)**

**label (x, l) & link(x,y) **→ **label (y,l)
**

It is called as collective classification as the thing to predict i.e. the label, occurs on both sides of the rule.

Let us consider a toy problem:

We have to predict unknown labels here (marked in grey) as to what political party the unknown person will vote for. We apply logical rules to the problem.

Local rules:

**If X donates to part P, X votes for P”**

**“If X tweets party P slogans, X votes for P”**

Relational rules:

**“If X is linked to Y, and X votes for P, Y votes for P”**

**Votes (X,P) & Friends (X,Y) **→ **Votes (Y, P)**

**Votes (X,P) & Spouse (X,Y) **→ **Votes (Y, P)
**

The above example shows the local and relational rules applied to the problem based on collective classification. Adding a collective classifier like this to other problems yields significant improvement.

### Link Prediction

Link Prediction is used for predicting links or edges in a graph. The pattern for expressing this in logical rules is :

**
**

**
link (x,y) & similar (y,z)
**

**→**

**link (x,y)**

For example, consider a basic recommendation system. We apply logical rules of link prediction to express likes and similarities. So, how you infer one link is gonna give you information about another link.

Rules express:

**
**

**
“If user U likes item1, and item2 is similar to item1, user U likes item2”
**

**Likes (U, I1) & SimilarItem (I1, I2) ****→ ****Likes(U, I2)**

**“If user1 likes item I, and user2 is similar to user1, user2 likes item I”**

**Likes (U1, I) & SimilarUser (U1, U2) ****→ ****Likes(U2, I)
**

### Entity Resolution

Entity Resolution is used for determining which nodes refer to the same underlying entity. Here we use local rules between how similar things are, for instance, how similar their names or links are

**
**

**
similar – name (x,y)
**

**→**

**same (x,y)**

**similar – links (x,y) ****→**** same (x,y)
**

There are two collective rules. One is based on transitivity.

**
**

**
similar – name (x,y)
**

**→**

**same (x,y)**

**similar – links (x,y) ****→**** same (x,y)**

**same (x,y) && same(y,z) ****→**** same (x,z)
**

The other is based on matching i.e. dependence on both sides of the rule.

**
**

**
similar – name (x,y)
**

**→**

**same (x,y)**

**similar – links (x,y) ****→**** same (x,y)**

**same (x,y) & ! same (y,z) ****→**** ! same (x,z)
**

The logical rules as described above though being quite helpful, have certain disadvantages. They are intractable, can’t handle inconsistencies, and can’t represent degrees of similarity.

# Tools

Tools help in making the structured kind of problems tractable and in getting patterns to work. The tools come from the Statistical Relational Learning community. Lise adds another one to this mix of languages – PSL.

PSL is probabilistic logical programming, a declarative language for expressing collective inference problems. To know more: psl.linqs.org

Predicate = relationship or property

Ground Atom = (continuous) random variable

Weighted Rules = capture dependency or constraint

PSL Program = Rules + Input DB

**PSL makes reasoning scalable by mapping Logical inference to Convex optimization.**

The language takes logical rules and assign weights to them and then uses it to define a distribution for the unknown variables. One of the striking features here is that the random variables have continuous values. The work done pertaining to the PSL language turns the disadvantages of logical rules into advantages. So they are tractable, can handle inconsistencies, and can represent similarity.

The key idea is to convert the clauses to concave functions. To be tractable, we relax it to a **concave maximization**.

PSL has semantics from three different worlds: Randomized algorithms from the Computer science community, Probabilistic graphical models from the Machine Learning community, and Soft Logic from the AI community.

### To learn more about PSL, open this section

### Randomized Algorithm

In this setting, we have a weighted rule. We have nonnegative weights and then a set of weighted logical rules in clausal form.

Weighted Max SAT is a classical problem where we attempt to find the assignment to the random variables that maximize the weights of the satisfied rules.

However, this problem is NP-HARD (which is a computational complexity theory for non-deterministic polynomial-time hardness). To overcome this, the randomized community converts this combinatorial optimization to a continuous optimization by introducing random variables which denote rounding probabilities.

### Probabilistic Graphic Models

Graph models represent problems as a factor graph where we have random variables and rules that are essentially the potential function.

However, this problem is also NP-Hard. We use **Variational Inference** approximation technique to solve this. Here we introduce marginal distributions (μ) for the variables. We can then express a solution if we can find a set of globally consistent assignment for these marginal distributions. The problem here is, although we can express it as a linear program, there is an exponential number of constraints.

We will use techniques from the graphical model’s community, particularly **Local Consistency Relaxation** to convert this to a simpler problem. The simple idea is to relax search over consistent marginals to simpler set. We introduce local pseudo marginals over joint potential states.

Using KKT conditions we can optimize out the θ to derive simplified projected LCR over μ. This approach shows 16% improvement over canonical dual decomposition (MPLP)

### Soft Logic

In the Soft Logic technique for convex optimizations, we have random variables that denote a degree of truth or similarity.

We are essentially trying to minimize the amount of dissatisfaction in the rules.

Hence with three different interpretations i.e. Randomized Algorithms, Graphical Models, and Soft Logic, we get the same convex optimizations. A PSL essentially takes a PSL program, takes some input data and defines a convex optimization.

- PSL is open-source. The code, data, tutorials are available online at psl.linqs.org
- MAP inference in PSL translates into convex optimization problem
- Inference is further enhanced with state-of-the-art optimization and distributed graph processing paradigms
- Learning methods for rule weights and latent variables

Using PSL gives fast as well as accurate results on comparison with other approaches.

# Templates

Templates build on patterns to solve problems in bigger areas such as computational social sciences, knowledge discovery, and responsible data science and machine learning.

### To learn about some use cases of PSL and Templates for pattern recognition, open this section.

### Computational Social Sciences

For exploring this area we will apply a PSL model to **Debate stance classification**. Let us consider a scenario of an online debate. The topic of the debate is climate change. We can use information in the text to figure out if the people participating in the debate are pro or anti the topic.

We can also use information about the dialogue in the discourse. And we can build this on a PSL model. This is based on the collective classification problem we saw earlier in the post.

We get a significant rise in accuracy by using a PSL program. Here are the results

### Knowledge Discovery

Using a structure and making use of patterns in Knowledge discovery really pays off.

Although we have Information extractors which can extract information from the web and other sources such as facts about entities, relationships, they are usually noisy. So it gets difficult to reason about them collectively to figure out which facts we actually wanna add to our knowledge base.

We can add structure to the **knowledge graph construction** by

- Performing collective classification, link prediction, and entity resolution
- Enforcing ontological constraints
- Integrate knowledge source confidences
- Using PSL to make it scalable

Here’s the PSL program for knowledge graph identification.

These were evaluated on three real-world knowledge graphs. NELL, MusicBrainz, and Freebase.

As shown in the above image, both statistical features and semantic constraints help but combining them always wins.

### Responsible Machine Learning

Understanding structure can be key to mitigating negative effects and lead to responsible Machine Learning. The perils of ignoring structure in the machine learning space include overlooking Privacy. For instance, many approaches consider only individual’s attribute data. Some don’t take into account what can be inferred from relational context.

The other area is around Fairness. The structure here is often outside the data. It can be in the organization or the socio-economic structure. To enable fairness we need to implement impartial decision making without bias and need to take into account structural patterns.

Algorithmic Discrimination is another area which can make use of a structure. The fundamental structural pattern here is a feedback loop. Having a way of encoding this feedback loop is important to eliminate algorithmic discrimination.

# Conclusion

In this article, we saw ways of exploiting structures that can be tractable. It provided some tools and templates for exploiting structure. The keynote also provided opportunities for Machine Learning methods that can mix:

- Structured and unstructured approaches
- Probabilistic and logical inference
- Data-driven and knowledge-driven modeling

AI and Machine Learning developers need to build on the approaches as described above and discover, exploit, and find new structure and create compelling commercial, scientific, and societal applications.