## Loading data files

When performing a statistical analysis of a particular problem, you often have some data stored in a file. You can save your variables (or the entire workspace) using different file formats and then load them back in again. Octave can, of course, also load data from files generated by other programs. There are certain restrictions when you do this which we will discuss here. In the following matter, we will only consider ASCII files, that is, readable text files.

When you load data from an ASCII file using the *load* command, the data is treated as a two-dimensional array. We can then think of the data as a matrix where lines represent the matrix rows and columns the matrix columns. For this matrix to be well defined, the data must be organized such that all the rows have the same number of columns (and therefore the columns the same number of rows). For example, the content of a file called *series.dat* can be:

Next we to load this into Octave’s workspace:

octave:1> load -ascii series.dat;

whereby the data is stored in the variable named *series*. In fact, Octave is capable of loading the data even if you do not specify the ASCII format. The number of rows and columns are then:

octave:2> size(series)

ans =

4 3

I prefer the file extension *.dat*, but again this is optional and can be anything you wish, say *.txt, .ascii, .data*, or nothing at all.

In the data files you can have:

- Octave comments
- Data blocks separated by blank lines (or equivalent empty rows)
- Tabs or single and multi-space for number separation

Thus, the following data file will successfully load into Octave:

# First block

1 232 334

2 245 334

3 456 342

4 555 321

# Second block

1 231 334

2 244 334

3 450 341

4 557 327

The resulting variable is a matrix with 8 rows and 3 columns. If you know the number of blocks or the block sizes, you can then separate the blocked-data.

Now, the following data stored in the file *bad.dat* will not load into Octave’s workspace:

1 232.1 334

2 245.2

3 456.23

4 555.6

because line 1 has three columns whereas lines 2-4 have two columns. If you try to load this file, Octave will complain:

octave:3> load -ascii bad.dat

error: load: bad.dat: inconsisitent number of columns near line 2

error:load: unable to extract matrix size from file ‘bad.dat’

## Simple descriptive statistics

Consider an Octave function *mcintgr* and its vectorized version *mcintgrv*. This function can evaluate the integral for a mathematical function *f* in some interval *[a; b]* where the function is positive. The Octave function is based on the Monte Carlo method and the return value, that is, the integral, is therefore a stochastic variable. When we calculate a given integral, we should as a minimum present the result as a mean or another appropriate measure of a central value together with an associated statistical uncertainty. This is true for any other stochastic variable, whether it is the height of the pupils in class, length of a plant’s leaves, and so on.

In this section, we will use Octave for the most simple statistical description of stochastic variables.

### Histogram and moments

Let us calculate the integral given in Equation (5.9) one thousand times using the vectorized version of the Monte Carlo integrator:

octave:4> for i=1:1000

> s(i) = mcintgrv(“sin”, 0, pi, 1000);

> endfor

The array s now contains a sequence of numbers which we know are approximately 2. Before we make any quantitative statistical description, it is always a good idea to first plot a histogram of the data as this gives an approximation to the true underlying probability distribution of the variable s. The easiest way to do this is by using Octave’s *hist* function which can be called using:

octave:5> hist(s, 30, 1)

The first argument, s, to *hist* is the stochastic variable, the second is the number of bins that s should be grouped into (here we have used 30), and the third argument gives the sum of the heights of the histogram (here we set it to 1). The histogram is shown in the figure below. If *hist* is called via the command *hist(s)*, s is grouped into ten bins and the sum of the heights of the histogram is equal to *sum(s)*.

From the figure, we see that *mcintgrv* produces a sequence of random numbers that appear to be normal (or Gaussian) distributed with a mean of 2. This is what we expected. It then makes good sense to describe the variable via the sample mean defined as:

where *N* is the number of samples (here 1000) and *s _{i}* the

*i*‘th data point, as well as the sample variance given by:

The variance is a measure of the distribution width and therefore an estimate of the statistical uncertainty of the mean value. Sometimes, one uses the standard deviation instead of the variance. The standard deviation is simply the square root of the variance

To calculate the sample mean, sample variance, and the standard deviation in Octave, you use:

octave:6> mean(s)

ans = 1.9999

octave:7> var(s)

ans = 0.002028

octave:8> std(s)

ans = 0.044976

In the statistical description of the data, we can also include the skewness which measures the symmetry of the underlying distribution around the mean. If it is positive, it is an indication that the distribution has a long tail stretching towards positive values with respect to the mean. If it is negative, it has a long negative tail. The skewness is often defined as:

We can calculate this in Octave via:

octave:9> skewness(s)

ans = -0.15495

This result is a bit surprising because we would assume from the histogram that the data set represents numbers picked from a normal distribution which is symmetric around the mean and therefore has zero skewness. It illustrates an important point—be careful to use the skewness as a direct measure of the distributions symmetry—you need a very large data set to get a good estimate.

You can also calculate the kurtosis which measures the flatness of the sample distribution compared to a normal distribution. Negative kurtosis indicates a relative flatter distribution around the mean and a positive kurtosis that the sample distribution has a sharp peak around the mean. The kurtosis is defined by the following:

It can be calculated by the *kurtosis* function.

octave:10> kurtosis(s)

ans = -0.02310

The kurtosis has the same problem as the skewness—you need a very large sample size to obtain a good estimate.

### Sample moments

As you may know, the sample mean, variance, skewness, and kurtosis are examples of sample moments. The mean is related to the first moment, the variance the second moment, and so forth. Now, the moments are not uniquely defined. One can, for example, define the *k*‘th absolute sample moment p^{k}_{a} and *k*‘th central sample moment p^{k}_{c} as:

Notice that the first absolute moment is simply the sample mean, but the first central sample moment is zero. In Octave, you can easily retrieve the sample moments using the *moment* function, for example, to calculate the second central sample moment you use:

octave:11> moment(s, 2, ‘c’)

ans = 0.002022

Here the first input argument is the sample data, the second defines the order of the moment, and the third argument specifies whether we want the central moment ‘*c*‘ or absolute moment ‘*a*‘ which is the default. Compare the output with the output from Command 7—why is it not the same?