The pandas Data Structures


In this article, Femi Anthony, author of the book Mastering pandas, starts by taking a tour of NumPy ndarrays, a data structure that belongs to NumPy rather than pandas. Knowledge of NumPy ndarrays is useful because they form the foundation of the pandas data structures. Another key benefit of NumPy arrays is that they support vectorized operations: operations that would otherwise require looping over a Python sequence element by element are executed on the whole array at once, which is much faster.

In this article, I will present the material via numerous examples using IPython, an interactive shell (also available as a browser-based notebook) that allows the user to type commands interactively to the Python interpreter.


NumPy ndarrays

The NumPy library is a very important package used for numerical computing with Python. Its primary features include the following:

The type numpy.ndarray, a homogeneous multidimensional array
Access to numerous mathematical functions – linear algebra, statistics, and so on
Ability to integrate C, C++, and Fortran code

For more information about NumPy, see http://www.numpy.org.

The primary data structure in NumPy is the array class ndarray. It is a homogeneous multi-dimensional (n-dimensional) table of elements, which are indexed by integers just like a normal array. Note that numpy.ndarray (usually created through the numpy.array function) is different from the standard Python array.array class, which offers much less functionality. More information on the various operations is provided at http://scipy-lectures.github.io/intro/numpy/array_object.html.

NumPy array creation

NumPy arrays can be created in a number of ways via calls to various NumPy functions.

NumPy arrays via numpy.array

NumPy arrays can be created via the numpy.array constructor directly:

In [1]: import numpy as np
In [2]: ar1=np.array([0,1,2,3]) # 1 dimensional array
In [3]: ar2=np.array([[0,3,5],[2,8,7]]) # 2D array
In [4]: ar1
Out[4]: array([0, 1, 2, 3])
In [5]: ar2
Out[5]: array([[0, 3, 5],
               [2, 8, 7]])

The shape of the array is given via ndarray.shape:

In [6]: ar2.shape
Out[6]: (2, 3)

The number of dimensions is obtained using ndarray.ndim:

In [7]: ar2.ndim
Out[7]: 2
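As a quick check of the attributes above, the following minimal sketch also reads ndarray.size, which is not used in the original session but is a standard attribute giving the total element count:

```python
import numpy as np

ar2 = np.array([[0, 3, 5], [2, 8, 7]])  # 2 rows, 3 columns

print(ar2.shape)  # (2, 3)
print(ar2.ndim)   # 2
print(ar2.size)   # 6, the product of the entries in shape
```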

NumPy array via numpy.arange

numpy.arange is the NumPy version of Python’s range function:

In [10]: # produces the integers from 0 to 11, not inclusive of 12
         ar3=np.arange(12); ar3
Out[10]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [11]: # start, end (exclusive), step size
       ar4=np.arange(3,10,3); ar4
Out[11]: array([3, 6, 9])

NumPy array via numpy.linspace

numpy.linspace generates evenly spaced elements between the start and the end, both inclusive:

In [13]: # args - start element, end element, number of elements
         ar5=np.linspace(0,2.0/3,4); ar5
Out[13]: array([ 0., 0.22222222, 0.44444444, 0.66666667])
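The difference between the two functions is easy to mix up: arange takes a step and excludes the stop value, while linspace takes a count and includes the end value by default. A small sketch, with values chosen only for illustration:

```python
import numpy as np

# arange excludes the stop value, like Python's range.
a = np.arange(0, 10, 5)                     # values 0, 5

# linspace includes the end value by default and takes a count, not a step.
b = np.linspace(0, 10, 3)                   # values 0.0, 5.0, 10.0

# endpoint=False drops the end value.
c = np.linspace(0, 10, 2, endpoint=False)   # values 0.0, 5.0

print(a, b, c)
```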

NumPy array via various other functions

These functions include numpy.zeros, numpy.ones, numpy.eye, numpy.diag, numpy.random.rand, numpy.random.randn, numpy.empty, and numpy.tile.

For numpy.zeros, numpy.ones, and numpy.empty, the shape of a multi-dimensional array must be passed as a tuple. For a 1D array, you can just specify the number of elements; there is no need for a tuple.

numpy.ones

The following command line explains the function:

In [14]:# Produces 2x3x2 array of 1's.
       ar7=np.ones((2,3,2)); ar7
Out[14]: array([[[ 1., 1.],
                 [ 1., 1.],
                 [ 1., 1.]],
                [[ 1., 1.],
                 [ 1., 1.],
                 [ 1., 1.]]])

numpy.zeros

The following command line explains the function:

In [15]: # Produces 4x2 array of zeros.
         ar8=np.zeros((4,2)); ar8
Out[15]: array([[ 0., 0.],
                [ 0., 0.],
                [ 0., 0.],
                [ 0., 0.]])

numpy.eye

The following command line explains the function:

In [17]: # Produces identity matrix
         ar9=np.eye(3); ar9
Out[17]: array([[ 1., 0., 0.],
                [ 0., 1., 0.],
                [ 0., 0., 1.]])

numpy.diag

The following command line explains the function:

In [18]: # Create diagonal array
         ar10=np.diag((2,1,4,6)); ar10
Out[18]: array([[2, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 4, 0],
                [0, 0, 0, 6]])

numpy.random.rand

The following command line explains the function:

In [19]: # Using the rand, randn functions
         # rand(m) produces m uniformly distributed random numbers in the range [0, 1)
         np.random.seed(100)   # Set seed
         ar11=np.random.rand(3); ar11
Out[19]: array([ 0.54340494, 0.27836939, 0.42451759])
In [20]: # randn(m) produces m normally distributed (Gaussian) random numbers
         ar12=np.random.randn(5); ar12
Out[20]: array([ 0.35467445, -0.78606433, -0.2318722 ,   0.20797568, 0.93580797])

numpy.empty

Using np.empty to create an uninitialized array is cheaper and faster than using np.ones or np.zeros, because the memory is allocated without being initialized (malloc versus calloc). However, you should only use it if you’re sure that all the elements will be initialized later:

In [21]: ar13=np.empty((3,2)); ar13
Out[21]: array([[ -2.68156159e+154,   1.28822983e-231],
               [ 4.22764845e-307,   2.78310358e-309],
               [ 2.68156175e+154,   4.17201483e-309]])
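The pattern that makes np.empty safe is to write every element before reading any of them. The following is a minimal sketch of that pattern; the name buf and the fill values are illustrative only:

```python
import numpy as np

buf = np.empty((3, 2))       # contents are arbitrary at this point
for i in range(buf.shape[0]):
    buf[i, :] = [i, i * i]   # every row is written before it is read

print(buf)                   # rows: [0, 0], [1, 1], [2, 4] as floats
```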

numpy.tile

The np.tile function allows one to construct an array from a smaller array by repeating it several times on the basis of a parameter:

In [334]: np.array([[1,2],[6,7]])
Out[334]: array([[1, 2],
                 [6, 7]])
In [335]: np.tile(np.array([[1,2],[6,7]]),3)
Out[335]: array([[1, 2, 1, 2, 1, 2],
                 [6, 7, 6, 7, 6, 7]])
In [336]: np.tile(np.array([[1,2],[6,7]]),(2,2))
Out[336]: array([[1, 2, 1, 2],
                 [6, 7, 6, 7],
                 [1, 2, 1, 2],
                 [6, 7, 6, 7]])

NumPy datatypes

We can specify the type of contents of a numeric array by using the dtype parameter:

In [50]: ar=np.array([2,-1,6,3],dtype='float'); ar
Out[50]: array([ 2., -1., 6., 3.])
In [51]: ar.dtype
Out[51]: dtype('float64')
In [52]: ar=np.array([2,4,6,8]); ar.dtype
Out[52]: dtype('int64')
In [53]: ar=np.array([2.,4,6,8]); ar.dtype
Out[53]: dtype('float64')

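When the element type cannot be inferred from the input data, as with np.zeros or np.ones, the default dtype in NumPy is float. In the case of strings, the dtype encodes the length of the longest string in the array: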

In [56]: sar=np.array(['Goodbye','Welcome','Tata','Goodnight']); sar.dtype
Out[56]: dtype('S9')

You cannot create variable-length strings in NumPy, since NumPy needs to know how much space to allocate for the string. dtypes can also be Boolean values, complex numbers, and so on:

In [57]: bar=np.array([True, False, True]); bar.dtype
Out[57]: dtype('bool')
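Returning to the fixed-width string point above, one practical consequence is that assigning a longer string into an existing string array silently truncates it. The sketch below assumes Python 3, where the dtype is reported as a Unicode type such as '<U7' rather than 'S7':

```python
import numpy as np

sar = np.array(['Tata', 'Welcome'])   # width is 7, the longest string
sar[0] = 'Goodnight'                  # 9 characters do not fit

print(sar.dtype)                      # a fixed-width string dtype of length 7
print(sar[0])                         # Goodnig, silently truncated
```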

The datatype of an ndarray can be changed in much the same way as we cast in other languages such as Java or C/C++, for example, from float to int. The mechanism to do this is the numpy.ndarray.astype() method. Here is an example:

In [3]: f_ar = np.array([3,-2,8.18])
       f_ar
Out[3]: array([ 3. , -2. , 8.18])
In [4]: f_ar.astype(int)
Out[4]: array([ 3, -2, 8])

More information on casting can be found in the official documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html.
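Two details of astype are worth seeing in code: the float-to-int cast truncates toward zero rather than rounding, and astype returns a new array instead of modifying the original. A short sketch with illustrative values:

```python
import numpy as np

f_ar = np.array([3.7, -2.7, 8.18])

truncated = f_ar.astype(int)           # values 3, -2, 8: truncated toward zero
rounded = np.rint(f_ar).astype(int)    # values 4, -3, 8: rounded first

print(truncated, rounded)
print(f_ar.dtype)                      # float64, the original is unchanged
```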

NumPy indexing and slicing

Array indices in NumPy start at 0, as in languages such as Python, Java, and C++, and unlike in Fortran, Matlab, and Octave, which start at 1. Arrays can be indexed in the standard way as we would index into any other Python sequence:

# print entire array, element 0, element 1, last element.
In [36]: ar = np.arange(5); print(ar); ar[0], ar[1], ar[-1]
[0 1 2 3 4]
Out[36]: (0, 1, 4)
# 2nd, last and 1st elements
In [65]: ar=np.arange(5); ar[1], ar[-1], ar[0]
Out[65]: (1, 4, 0)

Arrays can be reversed using the ::-1 idiom as follows:

In [24]: ar=np.arange(5); ar[::-1]
Out[24]: array([4, 3, 2, 1, 0])

Multi-dimensional arrays are indexed using tuples of integers:

In [71]: ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); ar
Out[71]: array([[ 2, 3, 4],
               [ 9, 8, 7],
               [11, 12, 13]])
In [72]: ar[1,1]
Out[72]: 8

Here, we set the entry at row 1 and column 1 to 5:

In [75]: ar[1,1]=5; ar
Out[75]: array([[ 2, 3, 4],
               [ 9, 5, 7],
               [11, 12, 13]])

Retrieve row 2:

In [76]: ar[2]
Out[76]: array([11, 12, 13])
In [77]: ar[2,:]
Out[77]: array([11, 12, 13])

Retrieve column 1:

In [78]: ar[:,1]
Out[78]: array([ 3, 5, 12])

If an index is specified that is out of bounds of the range of an array, IndexError will be raised:

In [6]: ar = np.array([0,1,2])
In [7]: ar[5]
---------------------------------------------------------------------------
IndexError                 Traceback (most recent call last)
<ipython-input-7-8ef7e0800b7a> in <module>()
----> 1 ar[5]
IndexError: index 5 is out of bounds for axis 0 with size 3

Thus, for 2D arrays, the first dimension denotes rows and the second dimension, the columns. The colon (:) denotes selection across all elements of the dimension.

Array slicing

Arrays can be sliced using the following syntax: ar[startIndex: endIndex: stepValue].

In [82]: ar=2*np.arange(6); ar
Out[82]: array([ 0, 2, 4, 6, 8, 10])
In [85]: ar[1:5:2]
Out[85]: array([2, 6])

Note that if we wish to include the endIndex value, we need to go above it, as follows:

In [86]: ar[1:6:2]
Out[86]: array([ 2, 6, 10])

Obtain the first n elements using ar[:n]:

In [91]: ar[:4]
Out[91]: array([0, 2, 4, 6])

The implicit assumption here is that startIndex=0, step=1.

Start at element 4 until the end:

In [92]: ar[4:]
Out[92]: array([ 8, 10])

Slice array with stepValue=3:

In [94]: ar[::3]
Out[94]: array([0, 6])

To illustrate the scope of indexing in NumPy, let us refer to this illustration, which is taken from a NumPy lecture given at SciPy 2013 and can be found at http://bit.ly/1GxCDpC:

Let us now examine the meanings of the expressions in the illustration, which are applied to a 6 × 6 array a whose entry at row i and column j is 10*i + j:

The expression a[0,3:5] selects row 0 and columns 3 and 4; the end index 5 is not included.
In the expression a[4:,4:], the first 4: selects rows 4 onward with all their columns, that is, the array [[40, 41, 42, 43, 44, 45], [50, 51, 52, 53, 54, 55]]. The second 4: then keeps only columns 4 onward to produce the array [[44, 45], [54, 55]].
The expression a[:,2] gives column 2 across all rows.
In the last expression a[2::2,::2], 2::2 indicates that the start is at row 2 and the step value is also 2. This would give us the array [[20, 21, 22, 23, 24, 25], [40, 41, 42, 43, 44, 45]]. Further, ::2 specifies that we retrieve columns in steps of 2, producing the end result array([[20, 22, 24], [40, 42, 44]]).
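The illustration itself is not reproduced here, but the values quoted above are consistent with a 6 × 6 array whose entry at row i and column j is 10*i + j. Under that assumption (the construction of a below is a reconstruction, not taken from the lecture), the four expressions can be checked directly:

```python
import numpy as np

# Assumed reconstruction of the array in the illustration:
# entry (i, j) holds 10*i + j.
a = 10 * np.arange(6)[:, np.newaxis] + np.arange(6)

print(a[0, 3:5])      # row 0, columns 3 and 4
print(a[4:, 4:])      # rows 4-5, columns 4-5
print(a[:, 2])        # column 2 of every row
print(a[2::2, ::2])   # rows 2 and 4, columns 0, 2 and 4
```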

Assignment and slicing can be combined as shown in the following code snippet:

In [96]: ar
Out[96]: array([ 0, 2, 4, 6, 8, 10])
In [100]: ar[:3]=1; ar
Out[100]: array([ 1, 1, 1, 6, 8, 10])
In [110]: ar[2:]=np.ones(4);ar
Out[110]: array([1, 1, 1, 1, 1, 1])

Array masking

Here, NumPy arrays can be used as masks to select or filter out elements of the original array. For example, see the following snippet:

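In [146]: np.random.seed(10)
         # random_integers is deprecated in newer NumPy releases;
         # np.random.randint(0, 26, 10) covers the same inclusive range
         ar=np.random.random_integers(0,25,10); ar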
Out[146]: array([ 9, 4, 15, 0, 17, 25, 16, 17, 8, 9])
In [147]: evenMask=(ar % 2==0); evenMask
Out[147]: array([False, True, False, True, False, False, True, False, True, False], dtype=bool)
In [148]: evenNums=ar[evenMask]; evenNums
Out[148]: array([ 4, 0, 16, 8])

In the preceding example, we randomly generate an array of 10 integers between 0 and 25. Then, we create a Boolean mask array that is used to select only the even numbers. This masking feature can be very useful, for example, if we wish to eliminate missing values by replacing them with a default value. In the following snippet, the missing value '' (an empty string) is replaced by 'USA' as the default country:

In [149]: ar=np.array(['Hungary','Nigeria', 
                       'Guatemala','','Poland',
                       '','Japan']); ar
Out[149]: array(['Hungary', 'Nigeria', 'Guatemala',
                 '', 'Poland', '', 'Japan'],
                 dtype='|S9')
In [150]: ar[ar=='']='USA'; ar
Out[150]: array(['Hungary', 'Nigeria', 'Guatemala',
                 'USA', 'Poland', 'USA', 'Japan'], dtype='|S9')
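The masked assignment above modifies the array in place. When the original should be kept, np.where offers a non-destructive alternative; this is a swapped-in technique rather than the one used in the session above, and the shortened country list is illustrative:

```python
import numpy as np

ar = np.array(['Hungary', '', 'Poland', ''])

# Where the mask is True take 'USA', elsewhere keep the original entry.
filled = np.where(ar == '', 'USA', ar)

print(filled)   # Hungary, USA, Poland, USA
print(ar)       # the original still contains the empty strings
```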

Arrays of integers can also be used to index an array to produce another array. Note that this produces multiple values; hence, the output must be an array of type ndarray. This is illustrated in the following snippet:

In [173]: ar=11*np.arange(0,10); ar
Out[173]: array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99])
In [174]: ar[[1,3,4,2,7]]
Out[174]: array([11, 33, 44, 22, 77])

In the preceding code, the selection object is a list and elements at indices 1, 3, 4, 2, and 7 are selected. Now, assume that we change it to the following:

In [175]: ar[1,3,4,2,7]

We get an IndexError since the array is 1D and we’re specifying too many indices to access it:

IndexError         Traceback (most recent call last)
<ipython-input-175-adbcbe3b3cdc> in <module>()
----> 1 ar[1,3,4,2,7]
 
IndexError: too many indices

Assignment is also possible with array indexing, as follows:

In [176]: ar[[1,3]]=50; ar
Out[176]: array([ 0, 50, 22, 50, 44, 55, 66, 77, 88, 99])

When a new array is created from another array by using a list or array of indices, the new array has the same shape as the index array.
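To make the shape rule concrete, the sketch below indexes a 1D array with a 2 × 2 array of indices (idx and picked are illustrative names). It also shows that, unlike slicing, indexing with an integer array returns a copy:

```python
import numpy as np

ar = 11 * np.arange(10)
idx = np.array([[1, 3], [4, 2]])   # 2x2 array of indices

picked = ar[idx]
print(picked.shape)   # (2, 2), the shape of idx
print(picked)         # values 11, 33 / 44, 22

picked[0, 0] = -1     # integer-array indexing returned a copy
print(ar[1])          # 11, the original is untouched
```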

Complex indexing

Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one:

In [188]: ar=np.arange(15); ar
Out[188]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
 
In [193]: ar2=np.arange(0,-10,-1)[::-1]; ar2
Out[193]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0])

Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows:

In [194]: ar[:10]=ar2; ar
Out[194]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 10, 11, 12, 13, 14])

Copies and views

A view on a NumPy array is just a particular way of portraying the data it contains. Creating a view does not result in a new copy of the array; rather, the data it contains may be presented in a specific order, or only certain rows may be shown. Thus, if data is replaced in the underlying array, this will be reflected in the view whenever the data is accessed via indexing.

Slicing does not copy the initial array in memory, which makes it more efficient. The np.may_share_memory function can be used to check whether two arrays might share the same memory block. However, it should be used with caution as it may produce false positives. Modifying a view modifies the original array:

In [118]:ar1=np.arange(12); ar1
Out[118]:array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
 
In [119]:ar2=ar1[::2]; ar2
Out[119]: array([ 0, 2, 4, 6, 8, 10])
 
In [120]: ar2[1]=-1; ar1
Out[120]: array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11])

To force NumPy to copy an array, we use the np.copy function or the equivalent ndarray.copy method. As we can see in the following snippet, the original array remains unaffected when the copied array is modified:

In [124]: ar=np.arange(8);ar
Out[124]: array([0, 1, 2, 3, 4, 5, 6, 7])
 
In [126]: arc=ar[:3].copy(); arc
Out[126]: array([0, 1, 2])
 
In [127]: arc[0]=-1; arc
Out[127]: array([-1, 1, 2])
 
In [128]: ar
Out[128]: array([0, 1, 2, 3, 4, 5, 6, 7])
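Whether an array is a view or a copy can also be checked programmatically. The sketch below uses np.shares_memory, an exact (and potentially slower) counterpart to np.may_share_memory, together with the base attribute; neither appears in the session above:

```python
import numpy as np

ar = np.arange(8)
view = ar[:3]            # basic slicing gives a view
copy = ar[:3].copy()     # an explicit copy owns its data

print(np.shares_memory(ar, view))   # True
print(np.shares_memory(ar, copy))   # False
print(view.base is ar)              # True: the view points back at ar
print(copy.base is None)            # True: the copy has no base array
```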

Operations

Here, we present various operations in NumPy.

Basic operations

Basic arithmetic operations work element-wise with scalar operands. They are +, -, *, /, and **.

In [196]: ar=np.arange(0,7)*5; ar
Out[196]: array([ 0, 5, 10, 15, 20, 25, 30])
 
In [198]: ar=np.arange(5) ** 4 ; ar
Out[198]: array([ 0,   1, 16, 81, 256])
 
In [199]: ar ** 0.5
Out[199]: array([ 0.,   1.,   4.,   9., 16.])

Operations also work element-wise when another array is the second operand as follows:

In [209]: ar=3+np.arange(0, 30,3); ar
Out[209]: array([ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30])
 
In [210]: ar2=np.arange(1,11); ar2
Out[210]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Here, in the following snippet, we see element-wise subtraction, division, and multiplication:

In [211]: ar-ar2
Out[211]: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
 
In [212]: ar/ar2   # integer division under Python 2; Python 3 returns floats
Out[212]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
 
In [213]: ar*ar2
Out[213]: array([ 3, 12, 27, 48, 75, 108, 147, 192, 243, 300])

It is much faster to do this using NumPy rather than pure Python. The %timeit function in IPython is known as a magic function and uses the Python timeit module to time the execution of a Python statement or expression, explained as follows:

In [214]: ar=np.arange(1000)
         %timeit ar**3
         100000 loops, best of 3: 5.4 µs per loop
 
In [215]:ar=range(1000)
         %timeit [ar[i]**3 for i in ar]
         1000 loops, best of 3: 199 µs per loop
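Outside IPython, the same comparison can be approximated with the standard timeit module. This is a rough sketch: the repetition count of 1000 is arbitrary, and the absolute timings depend entirely on the machine, so no expected numbers are given:

```python
import timeit
import numpy as np

ar = np.arange(1000)
lst = list(range(1000))

# Time 1000 repetitions of each approach.
t_numpy = timeit.timeit(lambda: ar ** 3, number=1000)
t_python = timeit.timeit(lambda: [x ** 3 for x in lst], number=1000)
print("numpy: %.4fs, pure Python: %.4fs" % (t_numpy, t_python))

# Both approaches compute the same values.
same = (ar ** 3).tolist() == [x ** 3 for x in lst]
print(same)
```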

Array multiplication is not the same as matrix multiplication; it is element-wise, meaning that the corresponding elements are multiplied together. For matrix multiplication, use the dot method or the np.dot function. For more information refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html.

In [228]: ar=np.array([[1,1],[1,1]]); ar
Out[228]: array([[1, 1],
                 [1, 1]])
 
In [230]: ar2=np.array([[2,2],[2,2]]); ar2
Out[230]: array([[2, 2],
                 [2, 2]])
 
In [232]: ar.dot(ar2)
Out[232]: array([[4, 4],
                 [4, 4]])
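Because the arrays above contain identical entries, the difference between the two products is easier to see with asymmetric values. In the following sketch, the @ operator is shown as an equivalent spelling that assumes Python 3.5 or later:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b     # [[0, 2], [3, 0]]
matrix = a.dot(b)       # [[2, 1], [4, 3]]
matrix_op = a @ b       # same result as a.dot(b)

print(elementwise)
print(matrix)
```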

Comparisons and logical operations are also element-wise:

In [235]: ar=np.arange(1,5); ar
Out[235]: array([1, 2, 3, 4])
 
In [238]: ar2=np.arange(5,1,-1);ar2
Out[238]: array([5, 4, 3, 2])
 
In [241]: ar < ar2
Out[241]: array([ True, True, False, False], dtype=bool)
 
In [242]: l1 = np.array([True,False,True,False])
         l2 = np.array([False,False,True, False])
         np.logical_and(l1,l2)
Out[242]: array([False, False, True, False], dtype=bool)

Other NumPy operations such as log, sin, cos, and exp are also element-wise:

In [244]: ar=np.array([np.pi, np.pi/2]); np.sin(ar)
Out[244]: array([ 1.22464680e-16,   1.00000000e+00])

Note that for element-wise operations on two NumPy arrays, the two arrays must have the same shape or shapes that are compatible under the broadcasting rules described later in this article; otherwise, an error will result, since the operation must be able to pair up corresponding elements in the two arrays:

In [245]: ar=np.arange(0,6); ar
Out[245]: array([0, 1, 2, 3, 4, 5])
 
In [246]: ar2=np.arange(0,8); ar2
Out[246]: array([0, 1, 2, 3, 4, 5, 6, 7])
 
In [247]: ar*ar2
         ---------------------------------------------------------------------------
         ValueError                              Traceback (most recent call last)
         <ipython-input-247-2c3240f67b63> in <module>()
         ----> 1 ar*ar2
         ValueError: operands could not be broadcast together with shapes (6) (8)

Further, NumPy arrays can be transposed as follows:

In [249]: ar=np.array([[1,2,3],[4,5,6]]); ar
Out[249]: array([[1, 2, 3],
                 [4, 5, 6]])
 
In [250]:ar.T
Out[250]:array([[1, 4],
               [2, 5],
               [3, 6]])
 
In [251]: np.transpose(ar)
Out[251]: array([[1, 4],
                [2, 5],
                 [3, 6]])

Suppose we wish to compare arrays not element-wise, but array-wise. We could achieve this as follows by using the np.array_equal function:

In [254]: ar=np.arange(0,6)
         ar2=np.array([0,1,2,3,4,5])
         np.array_equal(ar, ar2)
Out[254]: True

Here, we see that a single Boolean value is returned instead of a Boolean array. The value is True only if all the corresponding elements in the two arrays match. The preceding expression is equivalent to the following:

In [24]: np.all(ar==ar2)
Out[24]: True

Reduction operations

Functions such as np.sum and np.prod perform reductions on arrays; that is, they combine several elements into a single value:

In [257]: ar=np.arange(1,5)
         ar.prod()
Out[257]: 24

In the case of multi-dimensional arrays, we can specify whether we want the reduction operator to be applied row-wise or column-wise by using the axis parameter:

In [259]: ar=np.array([np.arange(1,6),np.arange(1,6)]);ar
Out[259]: array([[1, 2, 3, 4, 5],
                [1, 2, 3, 4, 5]])
# Columns
In [261]: np.prod(ar,axis=0)
Out[261]: array([ 1, 4, 9, 16, 25])
# Rows
In [262]: np.prod(ar,axis=1)
Out[262]: array([120, 120])
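A useful way to remember the axis parameter is that the named axis is the one that disappears from the result. The sketch below shows this with np.sum on an illustrative 2 × 3 array, and uses the keepdims option to retain the reduced axis with length 1:

```python
import numpy as np

ar = np.array([[1, 2, 3], [4, 5, 6]])    # shape (2, 3)

col_sums = ar.sum(axis=0)                # axis 0 collapses: shape (3,)
row_sums = ar.sum(axis=1)                # axis 1 collapses: shape (2,)
kept = ar.sum(axis=1, keepdims=True)     # shape (2, 1)

print(col_sums)   # 5, 7, 9
print(row_sums)   # 6, 15
print(kept.shape)
```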

In the case of multi-dimensional arrays, not specifying an axis results in the operation being applied to all elements of the array as explained in the following example:

In [268]: ar=np.array([[2,3,4],[5,6,7],[8,9,10]]); ar.sum()
Out[268]: 54
 
In [269]: ar.mean()
Out[269]: 6.0
In [271]: np.median(ar)
Out[271]: 6.0

Statistical operators

These operators are used to apply standard statistical operations to a NumPy array. The names are self-explanatory: np.std(), np.mean(), np.median(), and np.cumsum().

In [309]: np.random.seed(10)
         ar=np.random.randint(0,10, size=(4,5));ar
Out[309]: array([[9, 4, 0, 1, 9],
                 [0, 1, 8, 9, 0],
                 [8, 6, 4, 3, 0],
                 [4, 6, 8, 1, 8]])
In [310]: ar.mean()
Out[310]: 4.4500000000000002
 
In [311]: ar.std()
Out[311]: 3.4274626183227732
 
In [312]: ar.var(axis=0) # variance of each column, computed across rows
Out[312]: array([ 12.6875,   4.1875, 11.   , 10.75 , 18.1875])
 
In [313]: ar.cumsum()
Out[313]: array([ 9, 13, 13, 14, 23, 23, 24, 32, 41, 41, 49, 55, 
                 59, 62, 62, 66, 72, 80, 81, 89])

Logical operators

Logical operators can be used for array comparison/checking. They are as follows:

np.all(): This computes the element-wise AND of all of the elements
np.any(): This computes the element-wise OR of all of the elements

Generate a random 4 × 4 array of ints and check if any element is divisible by 7 and if all elements are less than 11:

In [320]: np.random.seed(100)
         ar=np.random.randint(1,10, size=(4,4));ar
Out[320]: array([[9, 9, 4, 8],
                 [8, 1, 5, 3],
                 [6, 3, 3, 3],
                 [2, 1, 9, 5]])
 
In [318]: np.any((ar%7)==0)
Out[318]: False
 
In [319]: np.all(ar<11)
Out[319]: True
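Like the reduction functions, np.any and np.all accept an axis argument, so the check can be made per row or per column. The sketch below types in the same 4 × 4 values by hand rather than relying on the random seed; the two conditions are chosen only for illustration:

```python
import numpy as np

ar = np.array([[9, 9, 4, 8],
               [8, 1, 5, 3],
               [6, 3, 3, 3],
               [2, 1, 9, 5]])

# Does each row contain a multiple of 4?
row_has_mult4 = np.any(ar % 4 == 0, axis=1)   # True, True, False, False

# Is every entry in each column greater than 1?
col_all_gt1 = np.all(ar > 1, axis=0)          # True, False, True, True

print(row_has_mult4, col_all_gt1)
```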

Broadcasting

In broadcasting, we make use of NumPy’s ability to combine arrays that don’t have the same exact shape. Here is an example:

In [357]: ar=np.ones([3,2]); ar
Out[357]: array([[ 1., 1.],
                 [ 1., 1.],
                 [ 1., 1.]])
 
In [358]: ar2=np.array([2,3]); ar2
Out[358]: array([2, 3])
 
In [359]: ar+ar2
Out[359]: array([[ 3., 4.],
                 [ 3., 4.],
                 [ 3., 4.]])

Thus, we can see that ar2 is broadcast across the rows of ar by adding it to each row of ar, producing the preceding result. Here is another example, showing that broadcasting works across dimensions:

In [369]: ar=np.array([[23,24,25]]); ar
Out[369]: array([[23, 24, 25]])
In [368]: ar.T
Out[368]: array([[23],
                 [24],
                 [25]])
In [370]: ar.T+ar
Out[370]: array([[46, 47, 48],
                 [47, 48, 49],
                 [48, 49, 50]])

Here, both the row and column arrays were broadcast, and we ended up with a 3 × 3 array.
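The rule behind these examples is that shapes are compared from the trailing dimension backwards, and two sizes are compatible when they are equal or when one of them is 1. A minimal sketch of the compatible and incompatible cases, using illustrative shapes:

```python
import numpy as np

a = np.ones((3, 2))
b = np.array([2, 3])                  # shape (2,) is treated as (1, 2)
print((a + b).shape)                  # (3, 2)

col = np.arange(3)[:, np.newaxis]     # shape (3, 1)
row = np.arange(4)                    # shape (4,)
print((col + row).shape)              # (3, 4)

try:
    np.arange(6) * np.arange(8)       # sizes 6 and 8 are incompatible
except ValueError as exc:
    print("broadcast failed:", exc)
```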

Array shape manipulation

There are several ways to manipulate the shape of arrays.

Flattening a multi-dimensional array

The np.ravel() function allows you to flatten a multi-dimensional array as follows:

In [385]: ar=np.array([np.arange(1,6), np.arange(10,15)]); ar
Out[385]: array([[ 1, 2, 3, 4, 5],
                 [10, 11, 12, 13, 14]])
 
In [386]: ar.ravel()
Out[386]: array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14])
 
In [387]: ar.T.ravel()
Out[387]: array([ 1, 10, 2, 11, 3, 12, 4, 13, 5, 14])

You can also use the ndarray.flatten method, which does the same thing, except that it always returns a copy, while np.ravel returns a view whenever possible.
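The practical difference between ravel and flatten shows up when the flattened result is modified. In this sketch, ar is contiguous, so ravel can return a view; the variable names are illustrative:

```python
import numpy as np

ar = np.array([[1, 2], [3, 4]])
r = ar.ravel()      # a view here, because ar is contiguous
f = ar.flatten()    # always a copy

r[0] = 100          # writes through to ar
f[1] = -5           # affects only the copy

print(ar)           # 100, 2 / 3, 4: only the write through r is visible
```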

Reshaping

The reshape function can be used to change the shape of or unflatten an array:

In [389]: ar=np.arange(1,16);ar
Out[389]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
In [390]: ar.reshape(3,5)
Out[390]: array([[ 1, 2, 3, 4, 5],
                 [ 6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15]])

The np.reshape function returns a view of the data where possible, meaning that the data is not copied and the shape of the original array remains unchanged. In special cases, however, the shape cannot be changed without the data being copied. For more details on this, see the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html.
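The sketch below illustrates both cases: a reshape that yields a view, and one where a copy is unavoidable because the transposed data is no longer laid out contiguously. It also uses -1 as a dimension, which asks NumPy to infer that size:

```python
import numpy as np

ar = np.arange(1, 16)
m = ar.reshape(3, -1)     # -1 lets NumPy infer the 5
print(m.shape)            # (3, 5)

m[0, 0] = 99              # m is a view, so ar changes too
print(ar[0])              # 99

t = m.T.reshape(-1)       # the transposed layout forces a copy
print(np.shares_memory(m, t))   # False
```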

Resizing

There are two resize functions: numpy.ndarray.resize, which is an ndarray method that resizes in place, and numpy.resize, which returns a new array with the specified shape. Here, we illustrate the numpy.ndarray.resize method:

In [408]: ar=np.arange(5); ar.resize((8,));ar
Out[408]: array([0, 1, 2, 3, 4, 0, 0, 0])

Note that this method only works if there are no other references to this array; otherwise, a ValueError results:

In [34]: ar=np.arange(5); 
         ar
Out[34]: array([0, 1, 2, 3, 4])
In [35]: ar2=ar
In [36]: ar.resize((8,));
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-394f7795e2d1> in <module>()
----> 1 ar.resize((8,));
 
ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function

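The way around this is to use the numpy.resize function instead. Note that numpy.resize fills the extra slots with repeated copies of the original array rather than with zeros: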

In [38]: np.resize(ar,(8,))
Out[38]: array([0, 1, 2, 3, 4, 0, 1, 2])
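The two resize variants fill the new slots differently, which the following sketch puts side by side. Passing refcheck=False skips the reference check described above; it is used here only so that the example runs in any environment, and it should be applied with care in real code:

```python
import numpy as np

ar = np.arange(5)
grown = np.resize(ar, (8,))           # new array, repeats ar: 0 1 2 3 4 0 1 2

owned = np.arange(5)
owned.resize((8,), refcheck=False)    # in place, padded with zeros

print(grown)
print(owned)
print(ar.shape)                       # (5,), np.resize left ar alone
```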

Adding a dimension

The np.newaxis object, used inside an index expression, adds an additional dimension to an array:

In [377]: ar=np.array([14,15,16]); ar.shape
Out[377]: (3,)
In [378]: ar
Out[378]: array([14, 15, 16])
In [379]: ar=ar[:, np.newaxis]; ar.shape
Out[379]: (3, 1)
In [380]: ar
Out[380]: array([[14],
                 [15],
                 [16]])
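np.newaxis is simply an alias for None, and the position at which it appears in the index decides where the new axis goes. A short sketch, with reshape shown as an alternative way to obtain the same column shape:

```python
import numpy as np

ar = np.array([14, 15, 16])

print(np.newaxis is None)          # True
print(ar[np.newaxis, :].shape)     # (1, 3): a single row
print(ar[:, None].shape)           # (3, 1): a single column
print(ar.reshape(-1, 1).shape)     # (3, 1): same result via reshape
```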

Array sorting

Arrays can be sorted in various ways.

Sort the array along an axis; first, let’s sort along axis 1, which orders the elements within each row:

In [43]: ar=np.array([[3,2],[10,-1]])
         ar
Out[43]: array([[ 3, 2],
               [10, -1]])
In [44]: ar.sort(axis=1)
         ar
Out[44]: array([[ 2, 3],
               [-1, 10]])

Here, we sort along axis 0, which orders the elements within each column:

In [45]: ar=np.array([[3,2],[10,-1]])
         ar
Out[45]: array([[ 3, 2],
               [10, -1]])
In [46]: ar.sort(axis=0)
         ar
Out[46]: array([[ 3, -1],
               [10, 2]])

Sorting can be done in place with the ndarray.sort method or out of place with the np.sort function, which returns a sorted copy.
Other operations that are available for arrays include the following:
- np.min(): It returns the minimum element in the array
- np.max(): It returns the maximum element in the array
- np.std(): It returns the standard deviation of the elements in the array
- np.var(): It returns the variance of elements in the array
- np.argmin(): It returns the index of the minimum element
- np.argmax(): It returns the index of the maximum element
- np.all(): It returns the element-wise AND of all of the elements
- np.any(): It returns the element-wise OR of all of the elements
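The in-place versus out-of-place distinction, and the index-returning functions in the list above, can be sketched as follows. np.argsort is added here as a related function that is not covered above; it returns the indices that would sort the array:

```python
import numpy as np

ar = np.array([3, 2, 10, -1])

out = np.sort(ar)         # sorted copy; ar keeps its order
order = np.argsort(ar)    # indices that would sort ar: 3, 1, 0, 2
lo, hi = np.argmin(ar), np.argmax(ar)   # positions 3 and 2
print(out, order, lo, hi)

result = ar.sort()        # sorts in place and returns None
print(ar, result)
```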

Summary

In this article, we discussed how numpy.ndarray is the bedrock data structure on which the pandas data structures are based. The pandas data structures at their heart consist of a NumPy ndarray of data and an array or arrays of labels.

There are three main data structures in pandas: Series, DataFrame, and Panel (Panel has since been deprecated and removed in later pandas releases). The pandas data structures are much easier to use and more user-friendly than NumPy ndarrays, since they provide row indexes and column indexes in the case of DataFrame and Panel. The DataFrame object is the most popular and widely used object in pandas.