6 min read

The techlash hasn’t died down – it’s just become normalized. Barely a day passes without a new scandal emerging, from questionable surveillance to racist AI algorithms. But it hasn’t all been bad: while negatives get a lot of attention (and so they should – the consequences of tech can be lethal, both societally and literally), there was still plenty to get excited about. And for those working in the data profession – as analysts, scientists, and engineers, there were several important trends that really helped to define where we are now from a purely practical perspective – as well as hinting at where we might go in the future.

With just a few weeks left to go of the year (and the decade!), let’s look at some of the key things that defined this year in the field of data science and data engineering.

The growth of PyTorch

TensorFlow is undoubtedly the most popular deep learning framework. You might even say that its role in popularizing deep learning and artificial intelligence has been understated. But while TensorFlow has held its place for some time, 2019 was the year when things started to change. Look, for example at this Google Trends graph (and yes, I know it’s not in any way scientific):

PyTorch v. TensorFlow Google Trends


As you can see TensorFlow hit its stride pretty early on. It’s only in the last 12 months or so that PyTorch has been narrowing the gap.

One of the reasons for this is the fact that PyTorch 1.0 was released at the end of last year. This has been the foundation that has spurred its growth over the last 12 months, effectively announcing its ‘official’ arrival on the scene. With Facebook (PyTorch’s creator) building on this foundation throughout the year with a few small but important releases. PyTorch 1.3, for example, which was released at the PyTorch Developer Conference in October, included a number of ‘experimental’ new features, including named tensors and PyTorch Mobile.

Another reason for PyTorch’s growth this year is that it is finding traction in the research field. This article provides some hard data that proves that PyTorch is starting to grow in this area, citing the tool’s comparable simplicity, API and performance as the reasons that it’s undermining TensorFlow’s utter dominance of the field.

Find our PyTorch bundle, and other data bundles, here. Grab 5 titles for just $25.

TensorFlow 2.0

While PyTorch has grown significantly in 2019, TensorFlow is nevertheless still holding its place at the top of the deep learning rankings. And TensorFlow 2.0 has undoubtedly cemented its position. With the alpha release getting developers excited since March, the full launch of 2.0 marked an important milestone for the project.

The key difference between TensorFlow 2.0 and 1.0 is ultimately accessibility and ease of use. Despite its massive popularity, TensorFlow 1.0 always had a reputation for being a little more difficult to use than many other deep learning tools. The team were clearly aware of this and have done a lot to make life easier for TensorFlow developers.

“With tight integration of Keras into TensorFlow, eager execution by default, and Pythonic function execution,” the team write in the release notes, “TensorFlow 2.0 makes the experience of developing applications as familiar as possible for Python developers.”

When placed alongside the exciting development of PyTorch, it’s clear that these two tools are going to be defining deep learning in the year – or years – to come.

TensorFlow 2.0 Quick Start Guide cover image

Get up to date with what’s new in TensorFlow 2.0 with TensorFlow 2.0 Quick Start Guide.

Stream processing with Kafka, Flink, and others

Dealing with large quantities of data in real-time is now the cutting-edge of big data. It’s for this reason that this year we’ve started to see stream processing gain headway in the mainstream. Although it’s been an important technique for organizations with data-intensive needs, the use of cloud and hybrid solutions – as well as an overall awareness of the opportunities of real-time data – has become truly mainstream.

In turn, this is giving new prominence to a range of stream-processing platforms. Kafka, Spark, and Flink are just three of the most well-known names in this space, but the market is undoubtedly growing.

Another key driver here is Nvidia – as one of the leading hardware companies, it deserves a lot of credit for helping to make massive processing power accessible to organizations that wouldn’t have had a chance just a few years ago. With CUDA, Nvidia’s parallel programming paradigm for GPUs, the company is helping all sorts of users to leverage stream processing in different ways.

Get started with Apache Kafka with Apache Kafka Quick Start Guide.

Data analysis on the cloud

Although I’ve already mentioned how influential TensorFlow was in popularizing deep learning, today public cloud is going even further. It’s making artificial intelligence and analytics accessible to new roles (thinking here about tools like Azure Machine Learning Studio and Amazon SageMaker), as well as making it easier to build and deploy machine learning models in applications and products.

In recent weeks, Microsoft has made another step in its bid to eat into AWS’s market share with Azure Synapse. Essentially a next generation Azure SQL Warehouse, Synapse is designed to bridge the gap between data lake and data warehouse – so, offering massive scale, and improving analytical speed.

It will be interesting to see how this plays with the wider market. AWS might respond with something similar – but the onus remains on Microsoft to shift mindshare; AWS will want to consolidate its powerful position.

Security

It would be wrong to suggest that security is a new issue in the world of data engineering and analytics. But in 2019 it’s become almost impossible to think about the two domains as separate from one another.

This cuts two different ways: on the one hand the emphasis on securing data and protecting privacy has never been greater. On the other hand, artificial intelligence and machine learning have started to play a critical part in the way that we monitor and identify threats to our systems.

To a certain extent this expresses the double bind that data poses: the amount of data at our disposal is a nightmare from a governance and architectural perspective, but it is, at the same time, a way of mitigating that very nightmare.

All in all, then, a bit of a vicious cycle, but nevertheless a reminder that however big our data gets, and however much we try to automate, there will always be a need for humans to think creatively and strategically about how we actually go about solving problems.

Explore Packt’s security bundles now.

For more technology eBooks and videos to prepare you for 2020, head to the Packt store.