Microsoft Dataverse, scikit-learn Pipeline Tutorial, NLP Bundle.

What’s happening in Data?

Here are the updates from PyTorch, Microsoft Dataverse, and AWS Data Exchange.


Microsoft Dataverse

AWS Data Exchange

Weekly Picks

Here are some interesting articles you might enjoy. 

Tutorial of the Week

Creating a scikit-learn pipeline 

By: Dan Meador 

Follow these steps to create a scikit-learn pipeline:

  1. Start by activating the conda environment in which you installed the packages referenced in the Technical requirements section:

conda activate dssw_anaconda

  2. Start your Jupyter notebook via the command line or Navigator. Ensure that you are in the conda environment with the installed packages.

  3. Import the needed packages. In the first cell of your notebook, type in the following code and run that cell:

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn import __version__ as scikit_version
import pickle
import joblib

  4. Next, we'll load the California housing dataset, with the features returned as a pandas DataFrame:

cali_data = fetch_california_housing(as_frame=True) 

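  5. Get the training features and the target values, and then split them into training and test sets:

# Features (a DataFrame) and target (a Series) from the loaded dataset
training_data = cali_data.data
target_value = cali_data.target
# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(training_data, target_value, test_size=0.2, random_state=5)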
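To make the effect of test_size and random_state concrete, here is a minimal sketch on toy data. The arrays X and y are hypothetical stand-ins, not the California housing data, chosen so that each target value can be traced back to its feature row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i is [3i, 3i+1, 3i+2] and its target is i.
X = np.arange(300, dtype=float).reshape(100, 3)
y = np.arange(100, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)
# Repeating the call with the same seed reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=5)

print(X_train.shape, X_test.shape)               # (80, 3) (20, 3)
print(np.array_equal(X_train, X_train_again))    # True
# Rows are shuffled, but each feature row stays paired with its target.
print(np.array_equal(X_train[:, 0] / 3, y_train))  # True
```

With 100 rows and test_size=0.2, 20 rows are held out and 80 remain for training.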

  6. Now, we are ready to create the pipeline. We are going to use a simple strategy of filling in any missing values with the mean of their column. For the scaler, we'll use StandardScaler(), and for the algorithm, linear regression.

Use the following code to create this pipeline: 

pipeline = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('std_scaler', StandardScaler()),
    ('algorithm_regression', LinearRegression())
])


We now have a pipeline ready to use! But we aren’t done yet. 
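To see what the two preprocessing steps do on their own, here is a small sketch that applies them outside a pipeline. The matrix X_toy is a hypothetical example with one missing value, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with one missing value in the second column.
X_toy = np.array([[1.0, 4.0],
                  [3.0, np.nan],
                  [5.0, 8.0]])

# The imputer replaces NaN with the mean of the other values in that column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_toy)
print(X_filled[1, 1])  # 6.0, the mean of 4.0 and 8.0

# The scaler shifts each column to zero mean and rescales it to unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```

Inside the pipeline, the same two transformations run in this order before the data reaches the regression step.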

  7. Now, we need to call the fit() method on our pipeline. This is where the magic happens: behind the scenes, it fits the first transformer, transforms the data with it, and feeds that output into the next step. It repeats this for as many steps as you have and, finally, fits the regression model:

pipeline.fit(X_train, y_train)
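The chaining described in this step can be sketched by doing the same work by hand and comparing the result with a fitted pipeline. The data here is synthetic and the names X_demo, y_demo, and demo_pipeline are hypothetical; this illustrates the idea and is not scikit-learn's internal code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_demo[::10, 1] = np.nan  # remove a few values so the imputer has work to do

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("std_scaler", StandardScaler()),
    ("algorithm_regression", LinearRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# The same work by hand: fit and transform each step, then fit the model.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
scaler = StandardScaler()
model = LinearRegression()
X_step = imputer.fit_transform(X_demo)
X_step = scaler.fit_transform(X_step)
model.fit(X_step, y_demo)

# At prediction time the fitted transformers are reused, not refitted.
manual_pred = model.predict(scaler.transform(imputer.transform(X_demo)))
same = np.allclose(demo_pipeline.predict(X_demo), manual_pred)
print(same)  # True
```

The pipeline saves you from carrying the intermediate arrays around and from accidentally refitting a transformer on test data.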

  8. Lastly, we can score how well the model was trained. We'll pass in the training data and then the test data, and print each score to two decimal places:

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Training set score: {train_score:.2f}")
print(f"Test set score: {test_score:.2f}")
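It may help to know what number score() returns here. For a pipeline that ends in a regressor such as LinearRegression, score() runs the transformers and then reports the coefficient of determination (R²) of the predictions. The sketch below checks this on synthetic data; X_demo, y_demo, and demo_pipeline are hypothetical names, and the imputer is left out for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data with some noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=60)

demo_pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("algorithm_regression", LinearRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# score() on the pipeline matches R² computed from its own predictions.
pipeline_score = demo_pipeline.score(X_demo, y_demo)
manual_r2 = r2_score(y_demo, demo_pipeline.predict(X_demo))
print(np.isclose(pipeline_score, manual_r2))  # True
```

A score of 1.0 means perfect predictions, and a test score well below the training score is a common sign of overfitting.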

This how-to was curated from the book Building Data Science Solutions with Anaconda.