k-Nearest Neighbors and NER pipeline


What’s happening in Data? 

Here are the updates from AWS SageMaker, SAP HANA, and Teradata.   

AWS SageMaker 

SAP HANA 

Teradata 

  • Escaping the Prison of Forecasting: Creating precise, timely demand insights and sharing them with operations systems and partners up and down the supply chain is helping retail and CPG companies adjust to the new reality of today’s marketplace.
  • The 7 Steps for an Analytics-led Digital Transformation: This blog covers analytics-led digital transformation that delivers its benefits at a low total cost of ownership (TCO) while providing best-in-class security for your most valuable asset: your data.

 

Understanding Machine Learning Algorithms

k-Nearest Neighbors – By Giuseppe Bonaccorso  

KNN is an approach that can be easily employed to solve clustering, classification, and regression problems (although, in this case, we are going to consider only the first technique). The main idea behind the clustering algorithm is very simple. Let’s consider a data-generating process $p_{data}$ and a finite dataset drawn from this distribution:

$X = \{x_1, x_2, \ldots, x_M\}$ where $x_i \in \mathbb{R}^N$

Each sample has a dimensionality equal to N. We can now introduce a distance function $d(x_1, x_2)$, which in the majority of cases can be generalized with the Minkowski distance:

$d_p(x_1, x_2) = \left( \sum_{j=1}^{N} \left| x_1^{(j)} - x_2^{(j)} \right|^p \right)^{1/p}$

When p = 2, $d_p$ represents the classical Euclidean distance, which is normally the default choice. In particular cases, it can be useful to employ other variants, such as p = 1 (the Manhattan distance) or p > 2. Even if all the properties of a metric function remain unchanged, different values of p yield results that can be semantically diverse. The distance never increases with p and converges to the largest absolute component difference, $\max_j |x_1^{(j)} - x_2^{(j)}|$, when p → ∞. Therefore, whenever it’s important to weight all the components in the same way in order to have a consistent metric, small values of p are preferable (for example, p = 1 or 2).
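To make the effect of p concrete, here is a minimal NumPy sketch of the Minkowski distance together with a brute-force nearest-neighbor lookup built on it. The function names (minkowski, k_nearest) and the toy data are made up for illustration and do not come from the book; a production system would more likely use an indexed search from a library.

```python
import numpy as np


def minkowski(x1, x2, p=2.0):
    """Minkowski distance d_p between two vectors; p=np.inf gives the limit case."""
    diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    if np.isinf(p):
        # Limit for p -> infinity: the largest absolute component difference.
        return float(diff.max())
    return float(np.sum(diff ** p) ** (1.0 / p))


def k_nearest(X, query, k=3, p=2.0):
    """Brute-force lookup: indices of the k rows of X closest to `query` under d_p."""
    distances = np.array([minkowski(row, query, p) for row in X])
    return np.argsort(distances, kind="stable")[:k]


x1 = np.array([0.0, 0.0, 0.0])
x2 = np.array([3.0, 4.0, 0.0])

# The distance shrinks from 7 (p=1) through 5 (p=2) toward 4 (p -> infinity).
for p in (1, 2, 4, np.inf):
    print(f"p={p}: {minkowski(x1, x2, p):.4f}")

# Toy dataset (hypothetical): M = 5 samples, N = 2 dimensions.
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0], [4.0, 4.5]])
print(k_nearest(X, query=np.array([0.2, 0.1]), k=2))
```

With large p, only the largest component difference matters, which is why small values of p are preferable when every component should contribute to the result.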


 


It’s clear that when the input dimensionality is very high (d in the bounds below) and p >> 2, the expected value, $E[D_{\max}^{p} - D_{\min}^{p}]$, becomes bounded between two constants, $k_1 = C_p d^{1/p - 1/2}$ and $k_2 = (M-1) C_p d^{1/p - 1/2}$, both of which tend to 0, reducing the actual effect of almost any distance. In fact, given two generic couples of points $(x_1, x_2)$ and $(x_3, x_4)$ drawn from G, a natural consequence of these bounds is that $d_p(x_1, x_2) \approx d_p(x_3, x_4)$ when p → ∞, independently of their relative positions.

This explainer on algorithms was curated from the book Mastering Machine Learning Algorithms. To explore more, click the button below!
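The $d^{1/p - 1/2}$ scaling in the bounds above can be checked empirically. The sketch below is an illustration under stated assumptions, not the book's experiment: it draws points uniformly from the unit hypercube, uses the origin as the reference point, and measures the gap between the farthest and nearest point for a few values of p. The sample size, the dimensions, and the name distance_gap are arbitrary choices.

```python
import numpy as np


def distance_gap(n_points, dim, p, rng):
    """Dmax - Dmin of the Minkowski-p distances from the origin to uniform random points."""
    points = rng.uniform(0.0, 1.0, size=(n_points, dim))
    # Coordinates are non-negative, so |x| ** p is simply x ** p.
    distances = np.sum(points ** p, axis=1) ** (1.0 / p)
    return float(distances.max() - distances.min())


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
gaps = {}
for p in (1, 2, 8):
    for dim in (10, 10_000):
        gaps[(p, dim)] = distance_gap(200, dim, p, rng)
        print(f"p={p:<2} dim={dim:<6} Dmax - Dmin = {gaps[(p, dim)]:.4f}")
```

Under these assumptions, the gap should grow with the dimensionality for p = 1, stay roughly constant for p = 2, and shrink for p = 8. That matches the sign of the exponent 1/p - 1/2 in each case.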

Read Here!

 

Quick Tutorial 

NER (Named Entity Recognition) pipeline – By Svetlana Karslioglu 

NER is an information extraction technique that recognizes entities in text and assigns them to categories such as person, location, and organization. For example, say we have the following phrase: Snap Inc. Announces First Quarter 2021 Financial Results

If you run spaCy’s en_core_web_lg model against this phrase, you will get the following results (entity text, start character, end character, label, and label description):

Snap Inc. – 0 – 9 – ORG – Companies, agencies, institutions, etc. 

First Quarter 2021 – 20 – 38 – DATE – Absolute or relative dates or periods 
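Output like the above could be produced with a few lines of spaCy. The following is a sketch rather than the book's code: the helper names format_entity and extract_entities are hypothetical, and the entities actually returned depend on the installed spaCy and model versions. spaCy is imported inside the function so that the offset check at the bottom runs even without the model installed.

```python
def format_entity(text, start, end, label, description):
    """Render one entity as 'text - start - end - label - description'."""
    return f"{text} - {start} - {end} - {label} - {description}"


def extract_entities(text, model_name="en_core_web_lg"):
    """Run a spaCy pipeline over `text` and return one formatted line per entity."""
    import spacy  # lazy import: only needed when this function is called

    nlp = spacy.load(model_name)  # the model must be downloaded beforehand
    doc = nlp(text)
    return [
        format_entity(ent.text, ent.start_char, ent.end_char, ent.label_,
                      spacy.explain(ent.label_))
        for ent in doc.ents
    ]


phrase = "Snap Inc. Announces First Quarter 2021 Financial Results"

# The reported offsets are plain character positions in the input string.
start = phrase.find("First Quarter 2021")
print(start, start + len("First Quarter 2021"))
print(phrase[0:9])

# Uncomment once spaCy and the en_core_web_lg model are installed:
# for line in extract_entities(phrase):
#     print(line)
```

The end offset is exclusive, so phrase[start:end] returns exactly the entity text.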

Name recognition can be useful in a variety of tasks. In this section, we will use it to retrieve the main characters of The Legend of Sleepy Hollow. Here is what our NER pipeline specification will look like:

pipeline:
  name: ner
description: A NER pipeline
input:
  pfs:
    glob: "/text.txt"
    repo: data-clean
transform:
  cmd:
  - python3
  - "/ner.py"
  image: svekars/nlp-example:1.0

This pipeline performs the following: 

  • Takes the original text of The Legend of Sleepy Hollow from the data-clean repository 
  • Uses the svekars/nlp-example:1.0 Docker image 
  • Runs the ner.py script 
  • Outputs the results to the ner repository 

Now, let’s look at what the ner.py script does. Here is the list of components the script imports: 

import spacy
from spacy import displacy
from contextlib import redirect_stdout

We need spacy to perform NER and the displacy module to visualize the results. redirect_stdout is a handy way to redirect printed output to a file. The rest of the code loads spaCy’s pretrained model called en_core_web_lg.

This quick tutorial was curated from the book Reproducible Data Science with Pachyderm. To explore more, click the button below!
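The full ner.py is not reproduced here, so the following is only a guess at what a script of this kind might do: count the PERSON entities in the text and write them to a file using redirect_stdout. The function names and the output file name are hypothetical. The paths assume Pachyderm's convention of mounting an input repository under /pfs/<repo> and collecting job output from /pfs/out.

```python
from collections import Counter
from contextlib import redirect_stdout
from pathlib import Path

# Assumed locations (see the lead-in); the output file name is hypothetical.
INPUT_PATH = Path("/pfs/data-clean/text.txt")
OUTPUT_PATH = Path("/pfs/out/ner-list.txt")


def find_people(text, model_name="en_core_web_lg"):
    """Count how often each PERSON entity appears in `text`."""
    import spacy  # lazy import: only needed when this function is called

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")


def write_report(counts, path):
    """Print 'name: count' lines, most frequent first, redirected into `path`."""
    with open(path, "w", encoding="utf-8") as handle, redirect_stdout(handle):
        for name, count in counts.most_common():
            print(f"{name}: {count}")


# Only runs inside the pipeline container, where the input file is mounted.
if __name__ == "__main__" and INPUT_PATH.exists():
    write_report(find_people(INPUT_PATH.read_text(encoding="utf-8")), OUTPUT_PATH)
```

Counting mentions is one simple way to rank candidates for main characters; the book's script may use a different criterion.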

 Read the Book

 

Video Tutorial   

Essential Math for Data Science – Mathematical Structures – By Ermin Dedic

Watch the Video

This video gives a detailed explanation of least squares, a commonly used data-fitting technique, covering the necessary concepts and intuition. It also demonstrates classical and Bayesian probability methods!

 

Secret Knowledge: Building your Data Arsenal 

  • pyqtgraph/pyqtgraph: Fast data visualization and GUI tools for scientific / engineering applications 
  • prometheus/haproxy_exporter: Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption 
  • Interana/eventsim: Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.