# What’s happening in Data?

Here are the updates from AWS SageMaker, SAP HANA, and Teradata.

#### AWS SageMaker

- Create Amazon SageMaker model building pipelines and deploy R models using RStudio on Amazon SageMaker: This post explains how to create a SageMaker pipeline using R in the RStudio environment and shows how to deploy an R model to a serverless endpoint on SageMaker using the SageMaker model registry.
- Amazon SageMaker Automatic Model Tuning now supports SageMaker Training Instance Fallbacks: This post explains how to overcome the InsufficientCapacityError by using the HyperParameterTuningResourceConfig parameter, which can be specified in the training job definition.

#### SAP HANA

- Sending S/4HANA Cloud eDocument File Data using BTP Cloud Integration: This blog shows how to quickly develop an iFlow on SAP BTP Cloud Integration that polls for new documents, decodes the files, and sends them to an external system.
- Python with SAP Databases: This post explains how to establish a connection with one or more databases and execute some basic queries with Python.

#### Teradata

- Escaping the Prison of Forecasting: This post is about how creating precise, timely demand insights to share with operations systems and partners up and down the supply chain is helping Retail and CPG companies adjust to the new reality of today’s marketplace.
- The 7 Steps for an Analytics-led Digital Transformation: This blog is about analytics-led digital transformation that delivers all benefits at a low TCO and provides best-in-class security for your most valuable asset: your data.

## Understanding Machine Learning Algorithms

#### k-Nearest Neighbors – By Giuseppe Bonaccorso

KNN is an approach that can be easily employed to solve clustering, classification, and regression problems (even if, in this case, we are going to consider only the first technique). The main idea behind the clustering algorithm is very simple. Let’s consider a data-generating process $p_{data}$ and a finite dataset of M samples drawn from this distribution:

$$X = \{x_1, x_2, \ldots, x_M\} \quad \text{where } x_i \in \mathbb{R}^N$$

Each sample has a dimensionality equal to N. We can now introduce a distance function $d(x_1, x_2)$, which in the majority of cases can be generalized with the Minkowski distance:

$$d_p(x_1, x_2) = \left( \sum_{j=1}^{N} \left| x_1^{(j)} - x_2^{(j)} \right|^p \right)^{1/p}$$

When p = 2, $d_p$ is the classical Euclidean distance, which is normally the default choice. In particular cases, it can be useful to employ other variants, such as p = 1 (the Manhattan distance) or p > 2. Even if all the properties of a metric function remain unchanged, different values of p yield results that can be semantically diverse. The distance decreases monotonically with p and converges to the largest component absolute difference, $\max_j |x_1^{(j)} - x_2^{(j)}|$, when p → ∞. Therefore, whenever it’s important to weight all the components in the same way in order to have a consistent metric, small values of p are preferable (for example, p = 1 or 2).
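The effect of p on the Minkowski distance can be checked with a minimal Python sketch. The function name `minkowski` and the two sample points are illustrative assumptions, not taken from the book; p = ∞ is treated as the largest absolute component difference.

```python
import math
from typing import Sequence


def minkowski(x1: Sequence[float], x2: Sequence[float], p: float) -> float:
    """Minkowski distance of order p between two equally sized vectors."""
    if math.isinf(p):
        # Limit case p -> infinity: largest absolute component difference
        return max(abs(a - b) for a, b in zip(x1, x2))
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


# Illustrative points (assumed for this example only)
x1 = [0.0, 0.0, 0.0]
x2 = [3.0, 4.0, 1.0]

distances = {p: minkowski(x1, x2, p) for p in (1, 2, 4, 16, math.inf)}
for p, d in distances.items():
    print(f"p={p}: {d:.4f}")
```

For these points the distance shrinks from 8.0 at p = 1 (Manhattan) toward 4.0, the largest component difference, as p grows.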

When the input dimensionality is very high and $p \gg 2$, the expected value $E[D_{max}^p - D_{min}^p]$ becomes bounded between two constants, $k_1 = C_p d^{1/p - 1/2}$ and $k_2 = (M-1) C_p d^{1/p - 1/2}$ (here d denotes the input dimensionality). Because the exponent $1/p - 1/2$ is negative for p > 2, both bounds tend to 0 as d grows, reducing the actual effect of almost any distance. In fact, given two generic couples of points $(x_1, x_2)$ and $(x_3, x_4)$ drawn from G, a natural consequence of this bound is that $d_p(x_1, x_2) \approx d_p(x_3, x_4)$ when p → ∞, independently of their relative positions.

This explainer on algorithms was curated from the book Mastering Machine Learning Algorithms.
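The concentration effect described in the explainer can be illustrated empirically. The sketch below is not from the book: it assumes points drawn uniformly from the unit hypercube, p = 4, 200 random pairs, and a fixed seed, and it uses the sample gap between the largest and smallest pairwise distance as a rough stand-in for the expected value in the bound. The helper names `minkowski` and `distance_spread` are hypothetical.

```python
import random
from typing import Sequence


def minkowski(x1: Sequence[float], x2: Sequence[float], p: float) -> float:
    """Minkowski distance of order p (finite p only)."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def distance_spread(dim: int, p: float, n_pairs: int = 200, seed: int = 0) -> float:
    """Max minus min distance over random pairs drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(dim)]
        b = [rng.random() for _ in range(dim)]
        dists.append(minkowski(a, b, p))
    return max(dists) - min(dists)


# Compare the spread of distances at increasing dimensionality for p = 4
spreads = {dim: distance_spread(dim, p=4) for dim in (2, 10, 100, 1000)}
for dim, spread in spreads.items():
    print(f"dim={dim:5d}  max-min distance = {spread:.3f}")
```

Under these assumptions the gap between the farthest and the closest pair is noticeably smaller at 1,000 dimensions than at 2, which is the practical meaning of distances becoming nearly indistinguishable.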

## Quick Tutorial

#### NER (Named Entity Recognition) pipeline – By Svetlana Karslioglu

NER is an information extraction technique that recognizes entities in text and assigns them to categories such as person, location, and organization. For example, take the following phrase: "Snap Inc. Announces First Quarter 2021 Financial Results".

If you use spaCy’s en_core_web_lg against this phrase, you will get the following results:

| Entity | Start | End | Label | Description |
| --- | --- | --- | --- | --- |
| Snap Inc. | 0 | 9 | ORG | Companies, agencies, institutions, etc. |
| First Quarter 2021 | 20 | 38 | DATE | Absolute or relative dates or periods |
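A minimal sketch of how such a listing could be produced with spaCy. It assumes spaCy and the en_core_web_lg model are installed (for example via `python -m spacy download en_core_web_lg`); the helper names `entity_rows` and `main` are illustrative and not part of the tutorial. spaCy is imported inside `main` so that the helper itself has no hard dependency on the model.

```python
from typing import List, Tuple


def entity_rows(doc) -> List[Tuple[str, int, int, str]]:
    """Return (text, start_char, end_char, label) for each entity in a spaCy Doc."""
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]


def main() -> None:
    # Imported here so entity_rows can be used without spaCy installed
    import spacy

    nlp = spacy.load("en_core_web_lg")  # assumes the model has been downloaded
    doc = nlp("Snap Inc. Announces First Quarter 2021 Financial Results")
    for text, start, end, label in entity_rows(doc):
        # spacy.explain returns a short description of the label
        print(text, start, end, label, spacy.explain(label), sep=" - ")
```

Calling `main()` in an environment with the model installed prints one line per detected entity. The start and end values are character offsets into the phrase.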

Name recognition can be useful in a variety of tasks. In this section, we will use it to retrieve the main characters of The Legend of Sleepy Hollow. Here is what our NER pipeline specification will look like:

    pipeline:
      name: ner
    description: A NER pipeline
    input:
      pfs:
        glob: "/text.txt"
        repo: data-clean
    transform:
      cmd:
      - python3
      - "/ner.py"
      image: svekars/nlp-example:1.0

This pipeline performs the following:

- Takes the original text of The Legend of Sleepy Hollow from the data-clean repository
- Uses the svekars/nlp-example:1.0 Docker image
- Runs the ner.py script
- Outputs the results to the ner repository

Now, let’s look at what the ner.py script does. Here is the list of components the script imports:

    import spacy
    from spacy import displacy
    from contextlib import redirect_stdout

We need spacy to perform NER and the displacy module to visualize the results. redirect_stdout is a handy way to redirect printed output to a file. The rest of the code imports spaCy’s pretrained model called en_core_web_lg.

This quick tutorial was curated from the book Reproducible Data Science with Pachyderm.
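To make the role of these imports concrete, here is a hedged sketch of what a script like ner.py might do. It is not the book's actual script. It assumes the usual Pachyderm layout, where the input repository is mounted at /pfs/data-clean and results are written to /pfs/out. The output file names (people.txt, ner.html), the helper names, and the choice to rank characters by number of PERSON mentions are all assumptions made for illustration. spaCy is imported inside `main` here, unlike the top-level imports in the original.

```python
from collections import Counter
from contextlib import redirect_stdout
from pathlib import Path
from typing import List, Tuple


def write_people(doc, out_path, top_n: int = 10) -> List[Tuple[str, int]]:
    """Print the most frequently mentioned PERSON entities into out_path."""
    counts = Counter(ent.text.strip() for ent in doc.ents if ent.label_ == "PERSON")
    top = counts.most_common(top_n)
    # redirect_stdout sends everything printed inside the block to the file
    with open(out_path, "w", encoding="utf-8") as f, redirect_stdout(f):
        for name, count in top:
            print(f"{name}\t{count}")
    return top


def main(in_path: str = "/pfs/data-clean/text.txt", out_dir: str = "/pfs/out") -> None:
    # Imported here so write_people can be exercised without spaCy installed
    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_lg")  # assumes the model is in the Docker image
    doc = nlp(Path(in_path).read_text(encoding="utf-8"))
    write_people(doc, Path(out_dir) / "people.txt")  # hypothetical file name
    # Render the highlighted entities as a standalone HTML page
    html = displacy.render(doc, style="ent", page=True)
    (Path(out_dir) / "ner.html").write_text(html, encoding="utf-8")  # hypothetical file name
```

Counting mentions is only one plausible way to surface the main characters. The script shipped in the svekars/nlp-example image may filter or format its output differently.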

## Video Tutorial

#### Essential Math for Data Science – Mathematical Structures – By Ermin Dedic

This video explains a commonly used data fitting technique called least squares in a detailed way, with all necessary concepts and intuition. It also demonstrates classical and Bayesian probability methods!

## Secret Knowledge: Building your Data Arsenal

- pyqtgraph/pyqtgraph: Fast data visualization and GUI tools for scientific / engineering applications
- prometheus/haproxy_exporter: Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption
- Interana/eventsim: Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.