
NumPy has a number of modules inherited from its predecessor, Numeric. Some of these packages have a SciPy counterpart, which may have fuller functionality.

In this article by Ivan Idris, author of the book NumPy: Beginner's Guide – Third Edition, we will cover the following topics:

  • The linalg package
  • The fft package
  • Random numbers
  • Continuous and discrete distributions


Linear algebra

Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things (see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html).

Time for action – inverting matrices

The inverse of a matrix A in linear algebra is the matrix A⁻¹, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as follows:

A A⁻¹ = I

The inv() function in the numpy.linalg package can invert an example matrix with the following steps:

  1. Create the example matrix with the mat() function:
    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print("An", A)

    The A matrix appears as follows:

    A
    [[ 0  1  2]
     [ 1  0  3]
     [ 4 -3  8]]
  2. Invert the matrix with the inv() function:
    inverse = np.linalg.inv(A)
    print("inverse of An", inverse)

    The inverse matrix appears as follows:

    inverse of A
    [[-4.5  7.  -1.5]
     [-2.   4.  -1. ]
     [ 1.5 -2.   0.5]]

    If the matrix is singular, or not square, a LinAlgError is raised. If you want, you can check the result manually with a pen and paper. This is left as an exercise for the reader.

  3. Check the result by multiplying the original matrix with the result of the inv() function:
    print("Checkn", A * inverse)

    The result is the identity matrix, as expected:

    Check
    [[ 1.  0.  0.]
     [ 0.  1.  0.]
     [ 0.  0.  1.]]

What just happened?

We calculated the inverse of a matrix with the inv() function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix (see inversion.py):

from __future__ import print_function
import numpy as np
 
A = np.mat("0 1 2;1 0 3;4 -3 8")
print("An", A)
 
inverse = np.linalg.inv(A)
print("inverse of An", inverse)
 
print("Checkn", A * inverse)

Pop quiz – creating a matrix

Q1. Which function can create matrices?

  1. array
  2. create_matrix
  3. mat
  4. vector

Have a go hero – inverting your own matrix

Create your own matrix and invert it. The inverse is only defined for square matrices; the matrix must also be invertible (nonsingular), otherwise a LinAlgError exception is raised.
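
If you want to see the failure mode before trying your own matrix, here is a short sketch (with a deliberately singular matrix of my own choosing, not from the book) that catches the exception:

import numpy as np

# The second row is twice the first, so this matrix is singular
B = np.mat("1 2;2 4")

try:
    print(np.linalg.inv(B))
except np.linalg.LinAlgError as err:
    print("Inversion failed:", err)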

Solving linear systems

A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve() solves systems of linear equations of the form Ax = b, where A is a matrix, b can be a one-dimensional or two-dimensional array, and x is the unknown. We will also see the dot() function in action; it returns the dot product of two floating-point arrays (see https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces/dot_cross_products/v/vector-dot-product-and-vector-length). For two vectors a and b of length n, the dot product is equal to the following sum:

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ
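
As a quick illustration of this sum (a small example of my own, not part of the book's scripts):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))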

Time for action – solving a linear system

Solve an example of a linear system with the following steps:

  1. Create A and b:
    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print("An", A)
    b = np.array([0, 8, -9])
    print("bn", b)

    A and b appear as follows:

    A
    [[ 1 -2  1]
     [ 0  2 -8]
     [-4  5  9]]
    b
    [ 0  8 -9]

  2. Solve this linear system with the solve() function:
    x = np.linalg.solve(A, b)
    print("Solution", x)

    The solution of the linear system is as follows:

    Solution [ 29.  16.   3.]
  3. Check whether the solution is correct with the dot() function:
    print("Checkn", np.dot(A , x))

    The result is as expected:

    Check
    [[ 0.  8. -9.]]

What just happened?

We solved a linear system using the solve() function from the NumPy linalg module and checked the solution with the dot() function:

from __future__ import print_function
import numpy as np
 
A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print("An", A)
 
b = np.array([0, 8, -9])
print("bn", b)
 
x = np.linalg.solve(A, b)
print("Solution", x)
 
print("Checkn", np.dot(A , x))

Finding eigenvalues and eigenvectors

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix, x is a one-dimensional vector, and a is a scalar. Eigenvectors are the vectors corresponding to eigenvalues (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/eigen_everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors). The eigvals() function in the numpy.linalg package calculates eigenvalues. The eig() function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors

Let’s calculate the eigenvalues of a matrix:

  1. Create a matrix as shown in the following:
    A = np.mat("3 -2;1 0")
    print("An", A)

    The matrix we created looks like the following:

    A
    [[ 3 -2]
     [ 1  0]]
  2. Call the eigvals() function:
    print("Eigenvalues", np.linalg.eigvals(A))

    The eigenvalues of the matrix are as follows:

    Eigenvalues [ 2.  1.]
  3. Determine eigenvalues and eigenvectors with the eig() function. This function returns a tuple, where the first element contains eigenvalues and the second element contains corresponding eigenvectors, arranged column-wise:
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print("First tuple of eig", eigenvalues)
    print("Second tuple of eign", eigenvectors)

    The eigenvalues and eigenvectors appear as follows:

    First tuple of eig [ 2.  1.]
    Second tuple of eig
    [[ 0.89442719  0.70710678]
     [ 0.4472136   0.70710678]]
  4. Check the result with the dot() function by calculating the right and left side of the eigenvalues equation Ax = ax:
    for i, eigenvalue in enumerate(eigenvalues):
         print("Left", np.dot(A, eigenvectors[:,i]))
         print("Right", eigenvalue * eigenvectors[:,i])
         print()

    The output is as follows:

    Left [[ 1.78885438]
    [ 0.89442719]]
    Right [[ 1.78885438]
    [ 0.89442719]]

What just happened?

We found the eigenvalues and eigenvectors of a matrix with the eigvals() and eig() functions of the numpy.linalg module. We checked the result using the dot() function (see eigenvalues.py):

from __future__ import print_function
import numpy as np
 
A = np.mat("3 -2;1 0")
print("An", A)
 
print("Eigenvalues", np.linalg.eigvals(A) )
 
eigenvalues, eigenvectors = np.linalg.eig(A)
print("First tuple of eig", eigenvalues)
print("Second tuple of eign", eigenvectors)
 
for i, eigenvalue in enumerate(eigenvalues):
     print("Left", np.dot(A, eigenvectors[:,i]))
     print("Right", eigenvalue * eigenvectors[:,i])
     print()

Singular value decomposition

Singular value decomposition (SVD) is a type of factorization that decomposes a matrix into a product of three matrices. The SVD is a generalization of the previously discussed eigenvalue decomposition. SVD is very useful for algorithms such as the pseudo inverse, which we will discuss in the next section. The svd() function in the numpy.linalg package can perform this decomposition. This function returns three matrices U, Σ, and V such that U and V are unitary and Σ contains the singular values of the input matrix:

A = U Σ V*

The asterisk denotes the Hermitian conjugate or the conjugate transpose. The complex conjugate changes the sign of the imaginary part of a complex number and is therefore not relevant for real numbers.

A complex square matrix A is unitary if A*A = AA* = I (the identity matrix). We can interpret the SVD as a sequence of three operations: rotation, scaling, and another rotation.

The transpose of a matrix flips it, turning rows into columns, and columns into rows.
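
Both properties are easy to check numerically. Here is a small sketch (my own example, not from the book): the .T attribute transposes a matrix, and a real rotation matrix is unitary (orthogonal), so its conjugate transpose times itself gives the identity:

import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])
print(M.T)  # rows become columns: shape (3, 2)

# A plane rotation is a real unitary (orthogonal) matrix
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(np.dot(Q.conj().T, Q), np.eye(2)))  # True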

Time for action – decomposing a matrix

It’s time to decompose a matrix with the SVD using the following steps:

  1. First, create a matrix as shown in the following:
    A = np.mat("4 11 14;8 7 -2")
    print("An", A)

    The matrix we created looks like the following:

    A
    [[ 4 11 14]
     [ 8  7 -2]]
  2. Decompose the matrix with the svd() function:
    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print("U")
    print(U)
    print("Sigma")
    print(Sigma)
    print("V")
    print(V)

Because of the full_matrices=False specification, NumPy performs a reduced SVD, which is faster to compute. The result is a tuple containing the two unitary matrices U and V on the left and right, respectively, and the singular values of the middle matrix:

    U
    [[-0.9486833  -0.31622777]
     [-0.31622777  0.9486833 ]]
    Sigma
    [ 18.97366596   9.48683298]
    V
    [[-0.33333333 -0.66666667 -0.66666667]
     [ 0.66666667  0.33333333 -0.66666667]]
  3. We do not actually have the middle matrix—we only have the diagonal values. The other values are all 0. Form the middle matrix with the diag() function. Multiply the three matrices as follows:

    print("Product\n", U * np.diag(Sigma) * V)

    The product of the three matrices is equal to the matrix we created in the first step:

    Product
    [[  4.  11.  14.]
     [  8.   7.  -2.]]

What just happened?

We decomposed a matrix and checked the result by matrix multiplication. We used the svd() function from the NumPy linalg module (see decomposition.py):

from __future__ import print_function
import numpy as np
 
A = np.mat("4 11 14;8 7 -2")
print("An", A)
 
U, Sigma, V = np.linalg.svd(A, full_matrices=False)
 
print("U")
print(U)
 
print("Sigma")
print(Sigma)
 
print("V")
print(V)
 
print("Productn", U * np.diag(Sigma) * V)

Pseudo inverse

The Moore-Penrose pseudo inverse of a matrix can be computed with the pinv() function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudo inverse is calculated using the SVD (see previous example). The inv() function only accepts square matrices; the pinv() function does not have this restriction and is therefore considered a generalization of the inverse.
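
Since pinv() is built on the SVD, we can reconstruct it by hand. The following sketch is my own check, and it assumes all singular values are nonzero (pinv() additionally discards singular values that are close to zero):

import numpy as np

A = np.array([[4., 11., 14.],
              [8., 7., -2.]])
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# Pseudo inverse: V * diag(1/s) * U^T (use the conjugate transpose for complex input)
pinv_manual = np.dot(Vh.T, np.dot(np.diag(1.0 / s), U.T))
print(np.allclose(pinv_manual, np.linalg.pinv(A)))  # True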

Time for action – computing the pseudo inverse of a matrix

Let’s compute the pseudo inverse of a matrix:

  1. First, create a matrix:
    A = np.mat("4 11 14;8 7 -2")
    print("An", A)

    The matrix we created looks like the following:

    A
    [[ 4 11 14]
     [ 8  7 -2]]
  2. Calculate the pseudo inverse matrix with the pinv() function:
    pseudoinv = np.linalg.pinv(A)
    print("Pseudo inversen", pseudoinv)

    The pseudo inverse result is as follows:

    Pseudo inverse
    [[-0.00555556  0.07222222]
     [ 0.02222222  0.04444444]
     [ 0.05555556 -0.05555556]]
  3. Multiply the original and pseudo inverse matrices:
    print("Check", A * pseudoinv)

    What we get is not an identity matrix, but it comes close to it:

    Check [[ 1.00000000e+00   0.00000000e+00]
    [ 8.32667268e-17   1.00000000e+00]]

What just happened?

We computed the pseudo inverse of a matrix with the pinv() function of the numpy.linalg module. The check by matrix multiplication resulted in a matrix that is approximately an identity matrix (see pseudoinversion.py):

from __future__ import print_function
import numpy as np
 
A = np.mat("4 11 14;8 7 -2")
print("An", A)
 
pseudoinv = np.linalg.pinv(A)
print("Pseudo inversen", pseudoinv)
 
print("Check", A * pseudoinv)

Determinants

The determinant is a value associated with a square matrix. It is used throughout mathematics; for more details, please refer to http://en.wikipedia.org/wiki/Determinant. For an n × n real-valued matrix, the determinant corresponds to the scaling an n-dimensional volume undergoes when transformed by the matrix. A positive determinant means the volume preserves its orientation (clockwise or anticlockwise in two dimensions), while a negative determinant means the orientation is reversed. The numpy.linalg module has a det() function that returns the determinant of a matrix.
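
The geometric reading is easy to demonstrate with a couple of matrices of my own (not from the book):

import numpy as np

scale = np.diag([2.0, 3.0])  # stretches x by 2 and y by 3
print(np.linalg.det(scale))  # 6.0: areas grow by a factor of 6

reflection = np.array([[0.0, 1.0],
                       [1.0, 0.0]])  # swaps the x and y axes
print(np.linalg.det(reflection))  # -1.0: orientation is reversed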

Time for action – calculating the determinant of a matrix

To calculate the determinant of a matrix, follow these steps:

  1. Create the matrix:
    A = np.mat("3 4;5 6")
    print("An", A)

    The matrix we created appears as follows:

    A
    [[3 4]
     [5 6]]
  2. Compute the determinant with the det() function:
    print("Determinant", np.linalg.det(A))

    The determinant appears as follows:

    Determinant -2.0

What just happened?

We calculated the determinant of a matrix with the det() function from the numpy.linalg module (see determinant.py):

from __future__ import print_function
import numpy as np
 
A = np.mat("3 4;5 6")
print("An", A)
 
print("Determinant", np.linalg.det(A))

Fast Fourier transform

The Fast Fourier transform (FFT) is an efficient algorithm to calculate the discrete Fourier transform (DFT).

The Fourier series represents a signal as a sum of sine and cosine terms.

The FFT improves on more naïve algorithms and is of order O(N log N). The DFT has applications in signal processing, image processing, solving partial differential equations, and more. NumPy has a module called fft that offers FFT functionality. Many functions in this module are paired; for those functions, another function performs the inverse operation. For instance, the fft() and ifft() functions form such a pair.
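
To make the Fourier series idea concrete before we transform anything, here is a brief sketch of my own that approximates a square wave by summing its first few sine terms:

import numpy as np

t = np.linspace(0, 2 * np.pi, 500)

# Partial Fourier series of a square wave: (4/pi) * sum of sin((2k - 1)t) / (2k - 1)
approximation = np.zeros_like(t)
for k in range(1, 6):
    approximation += np.sin((2 * k - 1) * t) / (2 * k - 1)
approximation *= 4 / np.pi

# The sum hovers near +1 and -1, overshooting slightly at the jumps (the Gibbs effect)
print(approximation.min(), approximation.max())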

Time for action – calculating the Fourier transform

First, we will create a signal to transform. Calculate the Fourier transform with the following steps:

  1. Create a cosine wave with 30 points as follows:
    x = np.linspace(0, 2 * np.pi, 30)
    wave = np.cos(x)
  2. Transform the cosine wave with the fft() function:
    transformed = np.fft.fft(wave)
  3. Apply the inverse transform with the ifft() function. It should approximately return the original signal. Check with the following line:
    print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))

    The result appears as follows:

    True
  4. Plot the transformed signal with matplotlib:
    plt.plot(transformed)
    plt.title('Transformed cosine')
    plt.xlabel('Frequency')
    plt.ylabel('Amplitude')
    plt.grid()
    plt.show()

    The following resulting diagram shows the FFT result:

    [Figure: the FFT of the cosine wave]

What just happened?

We applied the fft() function to a cosine wave. After applying the ifft() function, we got our signal back (see fourier.py):

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
 
 
x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)
transformed = np.fft.fft(wave)
print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))
 
plt.plot(transformed)
plt.title('Transformed cosine')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.show()

Shifting

The fftshift() function of the numpy.fft module shifts zero-frequency components to the center of a spectrum. The zero-frequency component corresponds to the mean of the signal. The ifftshift() function reverses this operation.
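
A small example of my own shows what the shift does to the standard FFT frequency ordering:

import numpy as np

freqs = np.fft.fftfreq(8)
print(freqs)                   # [ 0.     0.125  0.25   0.375 -0.5   -0.375 -0.25  -0.125]
print(np.fft.fftshift(freqs))  # the zero frequency moves to the center
print(np.fft.ifftshift(np.fft.fftshift(freqs)))  # the original order is restored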

Time for action – shifting frequencies

We will create a signal, transform it, and then shift the signal. Shift the frequencies with the following steps:

  1. Create a cosine wave with 30 points:
    x = np.linspace(0, 2 * np.pi, 30)
    wave = np.cos(x)
  2. Transform the cosine wave with the fft() function:
    transformed = np.fft.fft(wave)
  3. Shift the signal with the fftshift() function:
    shifted = np.fft.fftshift(transformed)
  4. Reverse the shift with the ifftshift() function. This should undo the shift. Check with the following code snippet:
    print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))

    The result appears as follows:

    True
  5. Plot the signal and transform it with matplotlib:
    plt.plot(transformed, lw=2, label="Transformed")
    plt.plot(shifted, '--', lw=3, label="Shifted")
    plt.title('Shifted and transformed cosine wave')
    plt.xlabel('Frequency')
    plt.ylabel('Amplitude')
    plt.grid()
    plt.legend(loc='best')
    plt.show()

    The following diagram shows the effect of the shift and the FFT:

    [Figure: the shifted and transformed cosine wave]

What just happened?

We applied the fftshift() function to the Fourier transform of a cosine wave. After applying the ifftshift() function, we got the original transform back (see fouriershift.py):

import numpy as np
import matplotlib.pyplot as plt
 
 
x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)
transformed = np.fft.fft(wave)
shifted = np.fft.fftshift(transformed)
print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))
 
plt.plot(transformed, lw=2, label="Transformed")
plt.plot(shifted, '--', lw=3, label="Shifted")
plt.title('Shifted and transformed cosine wave')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.legend(loc='best')
plt.show()

Random numbers

Random numbers are used in Monte Carlo methods, stochastic calculus, and more. True random numbers are hard to generate, so in practice we use pseudorandom numbers, which are random enough for most intents and purposes, except for some very special cases. These numbers appear random, but if you analyze them more closely, you will realize that they follow a certain pattern. The random number functions are in the NumPy random module. The core random number generator is based on the Mersenne Twister algorithm, a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as the size; this results in an array of random numbers with the corresponding shape. Discrete distributions include the geometric, hypergeometric, and binomial distributions.
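
For example, here is a small sketch of my own showing the effect of the size parameter:

import numpy as np

print(np.random.binomial(9, 0.5, size=5))       # a 1-D array of 5 draws
print(np.random.binomial(9, 0.5, size=(2, 3)))  # a 2 x 3 array of draws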

Time for action – gambling with the binomial

The binomial distribution models the number of successes in an integer number of independent trials of an experiment, where the probability of success in each experiment is a fixed number (see https://www.khanacademy.org/math/probability/random-variables-topic/binomial_distribution).

Imagine a 17th century gambling house where you can bet on flipping pieces of eight. Nine coins are flipped. If fewer than five are heads, you lose one piece of eight; otherwise, you win one. Let's simulate this, starting with 1,000 coins in our possession. Use the binomial() function from the random module for that purpose.

To understand the binomial() function, look at the following section:

  1. Initialize an array, which represents the cash balance, to zeros. Call the binomial() function with a size of 10000. This represents 10,000 coin flips in our casino:
    cash = np.zeros(10000)
    cash[0] = 1000
    outcome = np.random.binomial(9, 0.5, size=len(cash))
  2. Go through the outcomes of the coin flips and update the cash array. Print the minimum and maximum of the outcome, just to make sure we don’t have any strange outliers:
    for i in range(1, len(cash)):
       if outcome[i] < 5:
         cash[i] = cash[i - 1] - 1
       elif outcome[i] < 10:
         cash[i] = cash[i - 1] + 1
       else:
         raise AssertionError("Unexpected outcome " + str(outcome[i]))
     
    print(outcome.min(), outcome.max())

    As expected, the values are between 0 and 9. In the following diagram, you can see the cash balance performing a random walk:

    [Figure: the cash balance performing a random walk]

What just happened?

We did a random walk experiment using the binomial() function from the NumPy random module (see headortail.py):

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
 
 
cash = np.zeros(10000)
cash[0] = 1000
np.random.seed(73)
outcome = np.random.binomial(9, 0.5, size=len(cash))
 
for i in range(1, len(cash)):
   if outcome[i] < 5:
     cash[i] = cash[i - 1] - 1
   elif outcome[i] < 10:
     cash[i] = cash[i - 1] + 1
   else:
     raise AssertionError("Unexpected outcome " + str(outcome[i]))
 
print(outcome.min(), outcome.max())
 
plt.plot(np.arange(len(cash)), cash)
plt.title('Binomial simulation')
plt.xlabel('# Bets')
plt.ylabel('Cash')
plt.grid()
plt.show()

Hypergeometric distribution

The hypergeometric distribution models a jar with two types of objects in it. The model tells us how many objects of one type we can get if we take a specified number of items out of the jar without replacing them (see https://en.wikipedia.org/wiki/Hypergeometric_distribution). The NumPy random module has a hypergeometric() function that simulates this situation.

Time for action – simulating a game show

Imagine a game show where, every time the contestants answer a question correctly, they get to pull three balls from a jar and then put them back. Now, there is a catch: one ball in the jar is bad. Every time it is pulled out, the contestants lose six points. If, however, they manage to pull out three of the 25 normal balls, they get one point. So, what is going to happen if we have 100 questions in total? Look at the following section for the solution:

  1. Initialize the outcome of the game with the hypergeometric() function. The first parameter of this function is the number of ways to make a good selection, the second parameter is the number of ways to make a bad selection, and the third parameter is the number of items sampled:
    points = np.zeros(100)
    outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))
  2. Set the scores based on the outcomes from the previous step:
    for i in range(len(points)):
       if outcomes[i] == 3:
         points[i] = points[i - 1] + 1
       elif outcomes[i] == 2:
         points[i] = points[i - 1] - 6
       else:
         print(outcomes[i])

    The following diagram shows how the scoring evolved:

    [Figure: the evolution of the game show score]

What just happened?

We simulated a game show using the hypergeometric() function from the NumPy random module. The game scoring depends on how many good and how many bad balls the contestants pulled out of a jar in each session (see urn.py):

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
 
 
points = np.zeros(100)
np.random.seed(16)
outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))
 
for i in range(len(points)):
   if outcomes[i] == 3:
     points[i] = points[i - 1] + 1
   elif outcomes[i] == 2:
     points[i] = points[i - 1] - 6
   else:
     print(outcomes[i])
 
plt.plot(np.arange(len(points)), points)
plt.title('Game show simulation')
plt.xlabel('# Rounds')
plt.ylabel('Score')
plt.grid()
plt.show()

Continuous distributions

We usually model continuous distributions with probability density functions (PDFs). The probability that a value falls in a certain interval is determined by integration of the PDF (see https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/probability-density-functions). The NumPy random module has functions that represent continuous distributions: beta(), chisquare(), exponential(), f(), gamma(), gumbel(), laplace(), lognormal(), logistic(), multivariate_normal(), noncentral_chisquare(), noncentral_f(), normal(), and others.
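
As a brief check of my own, we can confirm the integration interpretation numerically: integrate the standard normal PDF over an interval with trapezoidal integration and compare the result with the fraction of random draws that land in that interval:

import numpy as np

x = np.linspace(-1, 1, 1001)
pdf = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.trapz(pdf, x))  # about 0.683: the probability of a value between -1 and 1

samples = np.random.normal(size=100000)
print(np.mean((samples > -1) & (samples < 1)))  # close to the integral above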

Time for action – drawing a normal distribution

We can generate random numbers from a normal distribution and visualize their distribution with a histogram (see https://www.khanacademy.org/math/probability/statistics-inferential/normal_distribution/v/introduction-to-the-normal-distribution). Draw a normal distribution with the following steps:

  1. Generate random numbers for a given sample size using the normal() function from the random NumPy module:
    N = 10000
    normal_values = np.random.normal(size=N)
  2. Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1. Use matplotlib for this purpose:
    _, bins, _ = plt.hist(normal_values, int(np.sqrt(N)), density=True, lw=1)
    sigma = 1
    mu = 0
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)), lw=2)
    plt.show()

    In the following diagram, we see the familiar bell curve:

    [Figure: the bell curve of the normal distribution over the histogram]

What just happened?

We visualized the normal distribution using the normal() function from the random NumPy module. We did this by drawing the bell curve and a histogram of randomly generated values (see normaldist.py):

import numpy as np
import matplotlib.pyplot as plt
 
N = 10000

np.random.seed(27)
normal_values = np.random.normal(size=N)
_, bins, _ = plt.hist(normal_values, int(np.sqrt(N)), density=True, lw=1, label="Histogram")
sigma = 1
mu = 0
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)), '--', lw=3, label="PDF")
plt.title('Normal distribution')
plt.xlabel('Value')
plt.ylabel('Normalized Frequency')
plt.grid()
plt.legend(loc='best')
plt.show()

Lognormal distribution

A lognormal distribution is a distribution of a random variable whose natural logarithm is normally distributed. The lognormal() function of the random NumPy module models this distribution.
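
The definition is easy to verify with a small sketch of my own: take the natural logarithm of lognormal draws and check that the result has the mean and standard deviation of a standard normal distribution:

import numpy as np

values = np.random.lognormal(size=100000)  # defaults to mean=0, sigma=1
logs = np.log(values)
print(logs.mean(), logs.std())  # both close to 0 and 1, respectively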

Time for action – drawing the lognormal distribution

Let’s visualize the lognormal distribution and its PDF with a histogram:

  1. Generate random numbers using the lognormal() function from the random NumPy module:
    N = 10000
    lognormal_values = np.random.lognormal(size=N)
  2. Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1:
    _, bins, _ = plt.hist(lognormal_values, int(np.sqrt(N)), density=True, lw=1)
    sigma = 1
    mu = 0
    x = np.linspace(min(bins), max(bins), len(bins))
    pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))
    plt.plot(x, pdf, lw=3)
    plt.show()

    The fit of the histogram and theoretical PDF is excellent, as you can see in the following diagram:

    [Figure: the lognormal histogram and the theoretical PDF]

What just happened?

We visualized the lognormal distribution using the lognormal() function from the random NumPy module. We did this by drawing the curve of the theoretical PDF and a histogram of randomly generated values (see lognormaldist.py):

import numpy as np
import matplotlib.pyplot as plt
 
N = 10000
np.random.seed(34)
lognormal_values = np.random.lognormal(size=N)
_, bins, _ = plt.hist(lognormal_values, int(np.sqrt(N)), density=True, lw=1, label="Histogram")
sigma = 1
mu = 0
x = np.linspace(min(bins), max(bins), len(bins))
pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))
plt.xlim([0, 15])
plt.plot(x, pdf, '--', lw=3, label="PDF")
plt.title('Lognormal distribution')
plt.xlabel('Value')
plt.ylabel('Normalized frequency')
plt.grid()
plt.legend(loc='best')
plt.show()

Bootstrapping in statistics

Bootstrapping is a method used to estimate variance, accuracy, and other metrics of sample estimates, such as the arithmetic mean. The simplest bootstrapping procedure consists of the following steps:

  1. Generate a large number of new samples from the original data sample, each having the same size N. You can think of the original data as a jar containing numbers. We create the new samples by randomly picking a number from the jar N times. Each time, we return the number to the jar, so a number can occur multiple times in a generated sample.
  2. With the new samples, we calculate the statistical estimate under investigation for each sample (for example, the arithmetic mean). This gives us a sample of possible values for the estimator.

Time for action – sampling with numpy.random.choice()

We will use the numpy.random.choice() function to perform bootstrapping.

  1. Start the IPython or Python shell and import NumPy:
    $ ipython
    In [1]: import numpy as np
  2. Generate a data sample following the normal distribution:
    In [2]: N = 500
     
    In [3]: np.random.seed(52)
     
    In [4]: data = np.random.normal(size=N)
     
  3. Calculate the mean of the data:
    In [5]: data.mean()
    Out[5]: 0.07253250605445645

    Generate 100 samples from the original data and calculate their means (of course, more samples may lead to a more accurate result):

    In [6]: bootstrapped = np.random.choice(data, size=(N, 100))
     
    In [7]: means = bootstrapped.mean(axis=0)
     
    In [8]: means.shape
    Out[8]: (100,)
  4. Calculate the mean, variance, and standard deviation of the arithmetic means we obtained:
    In [9]: means.mean()
    Out[9]: 0.067866373318115278
     
    In [10]: means.var()
    Out[10]: 0.001762807104774598
     
    In [11]: means.std()
    Out[11]: 0.041985796464692651

    If we are assuming a normal distribution for the means, it may be relevant to know the z-score, which is defined as follows:

    z = (mean of the data − mean of the bootstrap means) / (standard deviation of the bootstrap means)

    In [12]: (data.mean() - means.mean())/means.std()
    Out[12]: 0.11113598238549766

    From the z-score value, we get an idea of how far the original sample mean lies from the center of the bootstrap distribution, measured in standard deviations of that distribution.

What just happened?

We bootstrapped a data sample by generating samples and calculating the means of each sample. Then we computed the mean, standard deviation, variance, and z-score of the means. We used the numpy.random.choice() function for bootstrapping.

Summary

You learned a lot in this article about NumPy modules. We covered linear algebra, the Fast Fourier transform, continuous and discrete distributions, and random numbers.
