Home Data News Meet Pypeline, a simple python library for building concurrent data pipelines

Meet Pypeline, a simple python library for building concurrent data pipelines

September 25, 2018 - 7:44 am

7091

2 min read

The Python team came out with a new simple and powerful library called Pypeline, last week for creating concurrent data pipelines. Pypeline has been designed for solving simple to medium data tasks that require concurrency and parallelism. It can be used in places where using frameworks such as Spark or Dask feel unnatural.

Pypeline comprises an easy to use familiar and functional API. It enables building data pipelines using Processes, Threads, and asyncio.Tasks via the exact same API. With Pypeline, you also have control over memory and CPU resources which are used at each stage of your pipeline.

Pypeline Basic Usage

Using Pypeline, you can easily create multi-stage data pipelines with the help of functions such as map, flat_map, filter, etc. To do so, you need to define a computational graph specifying the operations which are to be performed at each stage, the number of resources, and the type of workers you want to use. Pypeline comes with 3 main modules, and each of them uses a different type of worker. To build multi-stage data pipelines, you can use 3 type of workers, namely, processes, threads, and tasks.

Processes

You can create a pipeline based on multiprocessing. Process workers with the help of process module. After this, you can specify the numbers of workers at each stage. The maxsize parameter limits the maximum amount of elements that the stage can hold simultaneously.

Threads and Tasks

Create a pipeline using threading.Thread workers by using the thread module. Additionally, in order to create a pipeline based on asyncio.Task workers, use an asyncio_task module.

Apart from being used to create multi-stage data pipelines, it can also help you create pipelines with the help of the pipe | operator.

For more information, check out the official documentation.

Top 6 Cybersecurity Books from Packt to Accelerate Your Career

Your Quick Introduction to Extended Events in Analysis Services from Blog…

Logging the history of my past SQL Saturday presentations from Blog…

Storage savings with Table Compression from Blog Posts – SQLServerCentral

Daily Coping 31 Dec 2020 from Blog Posts – SQLServerCentral

Learning Essential Linux Commands for Navigating the Shell Effectively

Exploring the Strategy Behavioral Design Pattern in Node.js

How to integrate a Medium editor in Angular 8

Implementing memory management with Golang’s garbage collector

How to create sales analysis app in Qlik Sense using DAR…

Meet Pypeline, a simple python library for building concurrent data pipelines

Pypeline Basic Usage

Processes

Threads and Tasks

Read Next

MobilePro

datapro

Programming

Subscribe to our newsletter