Building a Training Model for Search and Deduplication Using a Convolutional Neural Network

6 min read


Tutorial of the Week

Building a Training Model for Search and Deduplication Using a Convolutional Neural Network (CNN)

By: Rajesh Arumugam & Rajalingappaa Shanmugamani

To build our search and retrieval system, we will use the same dataset for both training and testing. 

Data description 

The data is provided with the fields id, qid1, qid2, question1, question2, and is_duplicate, where id is the ID of the training pair, qid1 and qid2 are the IDs of the individual questions, question1 and question2 contain the full text of each question, and is_duplicate is the target value: 1 if the two questions are duplicates (semantically the same) and 0 if they are not.
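As a quick sanity check, a few rows in this schema can be built with pandas (the column names below match the description above; the sample questions and IDs are made up):

```python
import pandas as pd

# Hypothetical sample rows in the same schema as the question-pairs data.
data = pd.DataFrame({
    "id": [0, 1],
    "qid1": [1, 3],
    "qid2": [2, 4],
    "question1": ["How do I learn Python?", "What is a CNN?"],
    "question2": ["What is the best way to learn Python?", "Who won the match?"],
    "is_duplicate": [1, 0],
})
print(data.columns.tolist())
```

The real dataset is read the same way, typically with pd.read_csv on the provided file.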

Training the model 

import sys 
import os 
import pandas as pd 
import numpy as np 
import string 
import tensorflow as tf 

Following is a function that takes a pandas series of text as input. The series is converted to a list; each item is converted to a string, lower-cased, stripped of surrounding whitespace, and split into a list of characters. The resulting list of lists is converted into a NumPy array and returned: 

def read_x(x): 

    x = np.array([list(str(line).lower().strip()) for line in x.tolist()]) 

    return x 
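For example, applied to a small series of equal-length strings, the function yields a character-level 2-D array (one row per question, one column per character):

```python
import numpy as np
import pandas as pd

def read_x(x):
    # Lower-case, strip, and split each entry into a list of characters.
    x = np.array([list(str(line).lower().strip()) for line in x.tolist()])
    return x

chars = read_x(pd.Series([" Cat ", " dog "]))
print(chars.shape)  # (2, 3): two questions, three characters each
print(chars[0])     # ['c' 'a' 't']
```

Note that strings of different lengths would produce a ragged array, so in practice the questions are padded or truncated to a fixed length before this step.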

Next up is a function that takes a pandas series as input, converts it to a list, and returns it as a NumPy array: 

def read_y(y): 

    return np.asarray(y.tolist()) 
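A minimal illustration of this helper on a made-up label series:

```python
import numpy as np
import pandas as pd

def read_y(y):
    # Convert a pandas Series of labels into a NumPy array.
    return np.asarray(y.tolist())

labels = read_y(pd.Series([1, 0, 1]))
print(labels)  # [1 0 1]
```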

The next function splits the data into training and validation sets. It takes the question pairs and their corresponding labels as input, along with the ratio for the split: 

def split_train_val(x1, x2, y, ratio=0.1): 
    indices = np.arange(x1.shape[0]) 
    np.random.shuffle(indices) 
    num_train = int(x1.shape[0]*(1-ratio)) 
    train_indices = indices[:num_train] 
    val_indices = indices[num_train:] 

The validation ratio is set to 10% by default. The indices are shuffled and then separated for training and validation by slicing the array; because the shuffling happens before the slice, the split is random. The input data consists of two question arrays, x1 and x2, with a label y indicating whether each pair is a duplicate: 

    train_x1 = x1[train_indices, :] 
    train_x2 = x2[train_indices, :] 
    train_y = y[train_indices] 

Similar to the training question pairs and labels, the validation data is sliced from the remaining 10% of the indices: 

    val_x1 = x1[val_indices, :] 
    val_x2 = x2[val_indices, :] 
    val_y = y[val_indices] 

    return train_x1, train_x2, train_y, val_x1, val_x2, val_y 
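The split can be exercised end to end on synthetic arrays. This is a self-contained sketch (the data below is made up; the shuffle step matches the shuffled-indices behaviour described above):

```python
import numpy as np

def split_train_val(x1, x2, y, ratio=0.1):
    # Shuffle the indices, then slice off the last `ratio` fraction for validation.
    indices = np.arange(x1.shape[0])
    np.random.shuffle(indices)
    num_train = int(x1.shape[0] * (1 - ratio))
    train_idx, val_idx = indices[:num_train], indices[num_train:]
    return (x1[train_idx, :], x2[train_idx, :], y[train_idx],
            x1[val_idx, :], x2[val_idx, :], y[val_idx])

# Synthetic data: 10 question pairs, each encoded as 5 features.
x1 = np.arange(50).reshape(10, 5)
x2 = np.arange(50, 100).reshape(10, 5)
y = np.arange(10) % 2

tx1, tx2, ty, vx1, vx2, vy = split_train_val(x1, x2, y, ratio=0.1)
print(tx1.shape, vx1.shape)  # (9, 5) (1, 5)
```

With a ratio of 0.1 and 10 examples, 9 pairs land in the training set and 1 in the validation set; every example appears in exactly one of the two splits.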

The training and validation data are thus selected via the shuffled indices, so the split is random rather than ordered. This how-to was curated from the free eBook Hands-On Natural Language Processing with Python. To explore more, click the button below! 

Read the Book