🌐 Amazon DynamoDB, Custom Data Sink 📊, Advanced Python 🐍 Bundle.


What is happening in Data?

Here are the updates from AWS Machine Learning, Microsoft Azure, and Data Analytics – Google Cloud.

AWS Machine Learning

Microsoft Azure

Data Analytics – Google Cloud

Spatial Clustering on BigQuery – Best Practices – This post describes how BigQuery does spatial clustering out of the box using the S2 indexing system.


Understanding Machine Learning Algorithms

The Machine Learning Workflow – By Stefan Jansen

Developing an ML solution for an algorithmic trading strategy requires a systematic approach to maximize the chances of success while economizing on resources. It is also very important to make the process transparent and replicable in order to facilitate collaboration, maintenance, and later refinements.

The following chart outlines the key steps, from problem definition to the deployment of a predictive solution:

Figure: Key steps of the machine learning workflow

The process is iterative throughout, and the effort required at different stages will vary according to the project. Generally, however, this process should include the following steps:

  1. Frame the problem, identify a target metric, and define success.
  2. Source, clean, and validate the data.
  3. Understand your data and generate informative features.
  4. Pick one or more machine learning algorithms suitable for your data.
  5. Train, test, and tune your models.
  6. Use your model to solve the original problem.

Framing the problem – from goals to metrics

The starting point for any machine learning project is the use case it ultimately aims to address. Sometimes, this goal will be statistical inference in order to identify an association or even a causal relationship between variables. Most frequently, however, the goal will be the prediction of an outcome to yield a trading signal.

Both inference and prediction tasks rely on metrics to evaluate how well a model achieves its objective. Due to their prominence in practice, we will focus on common objective functions and the corresponding error metrics for predictive models. This explainer on ML algorithms was curated from the Book – Machine Learning for Algorithmic Trading – Second Edition. To explore more, click the button below!
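
As a small illustration of error metrics for a predictive model, the sketch below computes two widely used regression metrics: mean absolute error (MAE) and root mean squared error (RMSE). The choice of these two metrics, the class name ErrorMetricsSketch, and the sample numbers in the note that follows are assumptions made for illustration; they are not taken from the book.

```java
/** Illustrative sketch only: two common regression error metrics. */
final class ErrorMetricsSketch {

    /** Mean absolute error: the average of |actual - predicted|. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkInputs(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: the square root of the average squared error. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        checkInputs(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** Both arrays must be non-empty and of equal length. */
    private static void checkInputs(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }
}
```

With actual values {1, 2, 3, 4} and predictions {1, 2, 3, 8}, MAE is 1.0 while RMSE is 2.0: RMSE weights the single large miss more heavily, which is one reason the choice of metric should follow from how the problem was framed.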

Read Here!


Quick Tutorial

Writing a Custom Data Sink – By Jan Lukavský

Compared with a data source, a data sink has much less work to do. In trivial cases, a data sink can be implemented using a plain ParDo object. We have already implemented one of these: PrintElements, located in the util module. The PrintElements transform can be considered a sink to stderr, as we can see from this implementation:


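@Override
public PDone expand(PCollection<T> input) {
  input.apply(ParDo.of(new LogResultsFn<>()));
  return PDone.in(input.getPipeline());
}

private static class LogResultsFn<T> extends DoFn<T, Void> {
  @ProcessElement
  public void process(@Element T elem) {
    // Body reconstructed from the text, which describes a sink to stderr.
    System.err.println(elem);
  }
}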

This sink is very simplistic. A real-life solution would need some of the tools we already know: for example, batching RPCs using bundle life cycles via @StartBundle and @FinishBundle, or even complex, sink-specific processing such as Beam’s KafkaExactlyOnceSink for Kafka, or FileIO, which deals with writing data to files on distributed filesystems. Either way, no new features are generally needed for writing sink functions. They boil down to everything we already know. This tutorial on data pipelines was curated from the Book – Building Big Data Pipelines with Apache Beam. To explore more, click the button below!
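
The bundle-based batching mentioned in this tutorial can be sketched in plain Java. This is a minimal model under stated assumptions, not Beam code: it has no Beam dependency, the class name BatchingSinkSketch and the Consumer standing in for the remote call are hypothetical, and the three methods only mirror the DoFn hooks that Beam marks with @StartBundle, @ProcessElement, and @FinishBundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java model of a sink that batches writes per bundle (no Beam dependency). */
final class BatchingSinkSketch<T> {
    private final int maxBatchSize;
    // Hypothetical stand-in for an RPC that writes a batch to an external system.
    private final Consumer<List<T>> rpc;
    private List<T> buffer;

    BatchingSinkSketch(int maxBatchSize, Consumer<List<T>> rpc) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        this.maxBatchSize = maxBatchSize;
        this.rpc = rpc;
    }

    // In a Beam DoFn this method would carry @StartBundle.
    void startBundle() {
        buffer = new ArrayList<>();
    }

    // In a Beam DoFn this method would carry @ProcessElement.
    void processElement(T element) {
        buffer.add(element);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // In a Beam DoFn this method would carry @FinishBundle.
    void finishBundle() {
        flush();
    }

    /** Sends a copy of the buffered elements as one batch, then empties the buffer. */
    private void flush() {
        if (!buffer.isEmpty()) {
            rpc.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

Feeding the elements 1 to 5 through one bundle with a batch size of 2 results in three calls carrying [1, 2], [3, 4], and [5]. The final flush in finishBundle matters because anything still buffered when the bundle ends would otherwise never be written.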

Read the Book


Secret Knowledge: Building your Data Arsenal

  • DLib – A suite of ML tools designed to be easy to embed in other applications.
  • Intel® oneAPI Data Analytics Library – A high-performance software library developed by Intel and optimized for Intel’s architectures. The library provides algorithmic building blocks for all stages of data analytics.
  • igraph – General purpose graph library.
  • DyNet – A dynamic neural network library working well with networks that have dynamic structures that change for every training instance.