In this article by Tiago Antao, author of Bioinformatics with Python Cookbook, you will process next-generation sequencing datasets using Python.

If you work in the life sciences, you are probably aware of the growing importance of computational methods for analyzing ever larger datasets. There is a massive need for bioinformaticians to process this data, and one of the main tools is, of course, Python. Python is probably the fastest growing language in the field of data science. It includes a rich ecosystem of software libraries to perform complex data analysis. Another major strength of Python is its great community, which is always ready to help and produces excellent documentation and high-quality, reliable software.

In this article, we will use Python to process next-generation sequencing datasets. This is one of many examples of Python's usefulness in bioinformatics; chances are that if you have a biological dataset to analyze, Python can help you. This is certainly the case in population genetics, genomics, phylogenetics, proteomics, and many other fields.

Next-generation Sequencing (NGS) is one of the fundamental technological developments of the decade in the life sciences. Whole Genome Sequencing (WGS), RAD-Seq, RNA-Seq, ChIP-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies, and with good reason: they generate vast amounts of data that need to be processed. NGS is the main reason why computational biology is becoming a "big data" discipline. More than anything else, this is a field that requires strong bioinformatics techniques, and there is very strong demand for professionals with these skillsets.

Here, we will not discuss each individual NGS technique per se (that would be a massive undertaking). Instead, we will use two existing WGS datasets: the Human 1000 genomes project (http://www.1000genomes.org/) and the Anopheles 1000 genomes dataset (http://www.malariagen.net/projects/vector/ag1000g). The code presented will be easily applicable to other genomic sequencing approaches; some of it can also be used for transcriptomic analysis (for example, RNA-Seq). Most of the code is also species-independent, that is, you will be able to apply it to any species for which you have sequenced data.

As this is not an introductory text, you are expected to at least know what FASTA, FASTQ, BAM, and VCF files are. We will also make use of basic genomic terminology without introducing it (things such as exomes, nonsynonymous mutations, and so on). You are required to be familiar with basic Python, and we will leverage that knowledge to introduce the fundamental Python libraries for NGS analysis.

Here, we will concentrate on analyzing VCF files.

Preparing the environment

You will need Python 2.7 or 3.4. You can use many of the available distributions, including the standard one at http://www.python.org, but we recommend Anaconda Python from http://continuum.io/downloads.

We also recommend the IPython Notebook (Project Jupyter) from http://ipython.org/. If you use Anaconda, this and many other packages are available with a simple conda install.

There are some amazing libraries to perform data analysis in Python; here, we will use NumPy (http://www.numpy.org/) and matplotlib (http://matplotlib.org/), which you may already be using in your projects. We will also make use of the less widely used seaborn library (http://stanford.edu/~mwaskom/software/seaborn/).

For bioinformatics, we will use Biopython (http://biopython.org) and PyVCF (https://pyvcf.readthedocs.org).
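If you use Anaconda, something along these lines should cover most of the stack (a sketch; exact package names and channels may vary with your setup, and PyVCF is typically installed with pip):

conda install numpy matplotlib seaborn biopython
pip install pyvcf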

The code used here is available on GitHub at https://github.com/tiagoantao/bioinf-python.

In a realistic pipeline, you will probably also be using other tools, such as bwa, samtools, or GATK, to perform alignment and SNP calling. In our case, tabix and bgzip (http://www.htslib.org/) are needed.

Analyzing variant calls

After running a genotype caller (for example, GATK or samtools mpileup), you will have a Variant Call Format (VCF) file reporting on genomic variations such as SNPs (Single-Nucleotide Polymorphisms), InDels (Insertions/Deletions), and CNVs (Copy Number Variations), among others. In this recipe, we will discuss VCF processing with the PyVCF module, using data from the human 1000 genomes project to analyze SNPs.

Getting ready

Downloading 2 to 20 GB of data for a tutorial is asking too much; although the 1000 genomes' VCF files with realistic annotations are in that order of magnitude, we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow partial download of data. As part of the samtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. For example:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz
tabix -p vcf genotypes.vcf.gz

The first line will perform a partial download of the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 genomes project. Then, bgzip will compress it.

The second line will create an index, which we will need for direct access to a section of the genome.

The preceding code is available at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Working_with_VCF.ipynb.
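The index also enables direct region access from Python. Here is a minimal sketch, assuming you have pysam installed (PyVCF's fetch depends on it; recent PyVCF versions use 0-based, half-open coordinates):

import vcf

v = vcf.Reader(filename='genotypes.vcf.gz')
# Jump straight to a sub-region of chromosome 22 using the tabix index
for rec in v.fetch('22', 16000000, 16100000):
    print(rec.CHROM, rec.POS)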

How to do it…

Take a look at the following steps:

  1. Let’s start by inspecting the information that we can get per record, as shown in the following code:
    import vcf
    v = vcf.Reader(filename='genotypes.vcf.gz')
     
    print('Variant Level information')
    infos = v.infos
    for info in infos:
       print(info)
     
    print('Sample Level information')
    fmts = v.formats
    for fmt in fmts:
       print(fmt)
    •     We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as a SNP, CNV, or InDel, and the state of that variant per sample). At the variant (record) level, we find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth. There are others, but they are mostly specific to the 1000 genomes project (here, we are trying to be as general as possible). Your own dataset might have many more annotations, or none of these.
    •     At the sample level, there are only two annotations in this file: GT: the genotype and DP: the per-sample read depth. Yes, you have both the per-variant (total) read depth and the per-sample read depth; be sure not to confuse the two.
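    •     The infos and formats dictionaries also carry each annotation's metadata from the VCF header. As a minimal sketch (assuming your PyVCF version exposes the type and desc fields on these entries, as recent versions do), you can print a human-readable summary of every annotation:
      for info_id, info in v.infos.items():
         print(info_id, info.type, info.desc)
      for fmt_id, fmt in v.formats.items():
         print(fmt_id, fmt.type, fmt.desc)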
  2. Now that we know which information is available, let’s inspect a single VCF record with the following code:
    v = vcf.Reader(filename='genotypes.vcf.gz')
    rec = next(v)
    print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, 
     rec.QUAL, rec.FILTER)
    print(rec.INFO)
    print(rec.FORMAT)
    samples = rec.samples
    print(len(samples))
    sample = samples[0]
    print(sample.called, sample.gt_alleles, sample.is_het, 
     sample.phased)
    print(int(sample['DP']))
    •     We start by retrieving the standard information: the chromosome, position, identifier, reference base (typically just one), alternative bases (there can be more than one; a common first filtering approach is to accept only a single ALT, that is, only bi-allelic SNPs), quality (PHRED-scaled, as you may expect), and the FILTER status. Regarding the filter, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).
    •     Then, we print the additional variant-level information (AC, AF, NS, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample, checking whether it was called for this variant and, if so, its reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common).
  3. Let’s check the type of variant and the number of nonbiallelic SNPs in a single pass with the following code:
    from collections import defaultdict
    f = vcf.Reader(filename='genotypes.vcf.gz')
     
    my_type = defaultdict(int)
    num_alts = defaultdict(int)
     
    for rec in f:
       my_type[rec.var_type, rec.var_subtype] += 1
       if rec.is_snp:
           num_alts[len(rec.ALT)] += 1
    print(num_alts)
    print(my_type)
    •     We use the Python defaultdict collection type. We find that this dataset has InDels (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds transitions and one-third transversions; a quick way to verify this ratio yourself is sketched after these steps). There is a residual number (79) of triallelic SNPs.
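
The transition/transversion split mentioned in the last step is easy to verify directly. Here is a minimal sketch (is_transition is a property that PyVCF exposes on SNP records):

import vcf

f = vcf.Reader(filename='genotypes.vcf.gz')
ts = tv = 0
for rec in f:
    if not rec.is_snp or len(rec.ALT) > 1:
        continue
    if rec.is_transition:
        ts += 1
    else:
        tv += 1
# float() keeps the division exact on Python 2.7 as well
print('ts: %d, tv: %d, ts/tv: %.2f' % (ts, tv, float(ts) / tv))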

There’s more…

The purpose of this recipe is to get you up to speed with the PyVCF module. At this stage, you should be comfortable with the API. We do not delve much into usage details here because that is the main purpose of the next recipe: using the VCF module to study the quality of your variant calls.

It will probably not be a shocking revelation that PyVCF is not the fastest module on earth. This file format (highly text-based) makes processing a time-consuming task. There are two main strategies for dealing with this problem: parallel processing, or converting to a more efficient format. Note that the VCF developers have specified a binary (BCF) version to deal with part of these problems; refer to http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2.
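
As a rough illustration of the first strategy (parallel processing), a tabix-indexed file can be split into regions that are processed in parallel. The following is a minimal sketch (the region size and worker count are arbitrary, and fetch requires pysam):

import multiprocessing

import vcf

def count_region(interval):
    # Each worker opens its own reader and counts the records in one region
    start, end = interval
    v = vcf.Reader(filename='genotypes.vcf.gz')
    return sum(1 for rec in v.fetch('22', start, end))

if __name__ == '__main__':
    mb = 1000000
    intervals = [(i * mb, (i + 1) * mb) for i in range(17)]
    pool = multiprocessing.Pool(4)
    print(sum(pool.map(count_region, intervals)))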

Studying genome accessibility and filtering SNP data

If you are using NGS data, the quality of your VCF calls may need to be assessed and filtered. Here, we will put in place a framework to filter SNP data. Rather than giving filtering rules (an impossible task to perform in a general way), we give you procedures to assess the quality of your data. With these, you can devise your own filters.

Getting ready

In the best-case scenario, you have a VCF file with proper filters applied; if this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this does not mean that all the proper filters were applied. You have to be sure that your data is properly filtered.

In the second case, which is one of the most common, your file will have unfiltered data but enough annotations, and you can apply hard filters (that is, there is no need for programmatic filtering). If you have a GATK-annotated file, refer, for instance, to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set.

In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, “if read depth > 20, then accept if mapping quality > 30; otherwise, accept if mapping quality > 40”).
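
A minimal sketch of such a flexible filter with PyVCF (the DP and MQ annotation names are assumptions; check your own VCF header) could look like this:

def flexible_filter(rec):
    # Accept a record based on total read depth (DP) and mapping quality (MQ)
    dp = rec.INFO.get('DP', 0)
    mq = rec.INFO.get('MQ', 0)
    if dp > 20:
        return mq > 30
    return mq > 40

You would then keep only the records for which flexible_filter returns True.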

In the fourth case, your VCF file does not have all the necessary annotations, and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you have and create a new VCF file with the needed annotations. Some genotype callers, such as GATK, allow you to specify which annotations you want; you may also want to use extra programs to provide more annotations. For example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, whether they are coding or noncoding).

It is impossible to provide a clear-cut recipe; the approach will vary with the type of your sequencing data, your species of study, and your tolerance for errors, among other variables. What we can do is provide a set of typical analyses that are done for high-quality filtering.

In this recipe, we will not use data from the Human 1000 genomes project; we want “dirty” unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles 1000 genomes project (Anopheles is the mosquito vector involved in the transmission of the parasite causing malaria), which makes available filtered and unfiltered data. You can find more information about this project at http://www.malariagen.net/projects/vector/ag1000g.

We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, followed by a part somewhere in the middle of that chromosome (and index both), as shown in the following code:

tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:1-200000 | bgzip -c > centro.vcf.gz
tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:21000001-21200000 | bgzip -c > standard.vcf.gz
tabix -p vcf centro.vcf.gz
tabix -p vcf standard.vcf.gz

As usual, the code to download this data is available in the notebook at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Filtering_SNPs.ipynb.

Finally, a word of warning about this recipe: the level of Python here will be slightly more complicated than before. The more general code that we write may be easier to reuse in your specific case. We will make extensive use of functional programming techniques (lambda functions) and partial function application.

How to do it…

Take a look at the following steps:

  1. Let’s start by plotting the distribution of variants across the genome in both files as follows:
    %matplotlib inline
    from collections import defaultdict
     
    import seaborn as sns
    import matplotlib.pyplot as plt
     
    import vcf
     
    def do_window(recs, size, fun):
       start = None
       win_res = []
       for rec in recs:
           if not rec.is_snp or len(rec.ALT) > 1:
               continue
           if start is None:
               start = rec.POS
           my_win = 1 + (rec.POS - start) // size
           while len(win_res) < my_win:
               win_res.append([])
           win_res[my_win - 1].extend(fun(rec))
       return win_res
     
    wins = {}
    size = 2000
    vcf_names = ['centro.vcf.gz', 'standard.vcf.gz']
    for vcf_name in vcf_names:
       recs = vcf.Reader(filename=vcf_name)
   wins[vcf_name] = do_window(recs, size, lambda x: [1])
    •     We start by performing the required imports (as usual, remember to remove the first line if you are not on the IPython Notebook). Before explaining the do_window function, note what we will do with it.
    •     For both files, we will compute windowed statistics: we divide our data (which covers 200,000 bp) into windows of size 2,000 (100 windows). Every time we find a bi-allelic SNP, we add a value to the list associated with that window. The window function takes each record that is a bi-allelic SNP (rec.is_snp and len(rec.ALT) == 1), determines the window where that record belongs (by integer division of rec.POS - start by size), and extends the list of results of that window with whatever is returned by the function passed in as the fun parameter (which, in our case, is just [1]).
    •     So, now we have a list of 100 elements (each representing 2,000 base pairs). Each element will be another list, which will have 1 for each bi-allelic SNP found. So, if you have 200 SNPs in the first 2,000 base pairs, the first element of the list will have 200 ones.
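    •     Because fun is a parameter, the same windowing code can collect any per-record value, as we will exploit in the following steps; for example (a sketch), do_window(recs, size, lambda rec: [rec.QUAL]) would gather variant qualities per window instead of counts.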
  2. Let’s continue:
    def apply_win_funs(wins, funs):
       fun_results = []
       for win in wins:
           my_funs = {}
           for name, fun in funs.items():
               try:
                   my_funs[name] = fun(win)
               except:
                   my_funs[name] = None
           fun_results.append(my_funs)
       return fun_results
     
    stats = {}
    fig, ax = plt.subplots(figsize=(16, 9))
    for name, nwins in wins.items():
       stats[name] = apply_win_funs(nwins, {'sum': sum})
       x_lim = [i * size for i in range(len(stats[name]))]
       ax.plot(x_lim, [x['sum'] for x in stats[name]], label=name)
    ax.legend()
    ax.set_xlabel('Genomic location in the downloaded segment')
    ax.set_ylabel('Number of variant sites (bi-allelic SNPs)')
    fig.suptitle('Number of bi-allelic SNPs along the genome',
                 fontsize='xx-large')
    •     Here, we produce a plot that contains statistical information for each of our 100 windows. The apply_win_funs function calculates a set of statistics for every window; in this case, it sums all the numbers in the window. Remember that every time we find an SNP, we add 1 to the list of the corresponding window. This means that if we have 200 SNPs, we will have 200 ones; hence, summing them returns 200.
    •     So, we are able to compute the number of SNPs per window in an apparently convoluted way. Why we are doing things with this strategy will become apparent soon, but for now, let’s check the result of this computation for both files (refer to the following figure):

      Figure 1: The number of bi-allelic SNPs distributed across windows of 2,000 bp in size for an area of 200 kbp near the centromere (blue) and in the middle of the chromosome (green). Both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project

    •     Note that the amount of SNPs in the centromere is smaller than in the middle of the chromosome. This is expected, both because calling variants in centromeres is more difficult than calling them in the middle of the chromosome and because there is probably less genomic diversity in centromeres. If you are used to humans or other mammals, you may find the density of variants remarkably high; that's mosquitoes for you!
  3. Let’s take a look at the sample-level annotations. We will inspect Mapping Quality Zero (refer to https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityZeroBySample.php for details), which is a measure of how well the sequences involved in calling this variant map clearly to this position. Note that there is also an MQ0 annotation at the variant level:
    import functools
     
    import numpy as np
    mq0_wins = {}
    vcf_names = ['centro.vcf.gz', 'standard.vcf.gz']
    size = 5000
    def get_sample(rec, annot, my_type):
       res = []
       samples = rec.samples
       for sample in samples:
           if sample[annot] is None: # ignoring nones
               continue
           res.append(my_type(sample[annot]))
       return res
     
    for vcf_name in vcf_names:
       recs = vcf.Reader(filename=vcf_name)
       mq0_wins[vcf_name] = do_window(recs, size, functools.partial(get_sample, annot='MQ0', my_type=int))
    •     Start by looking at the last for loop; we perform a windowed analysis by getting the MQ0 annotation from each record. We do this by calling the get_sample function, which returns our preferred annotation (in this case, MQ0) cast to a certain type (my_type=int). We use partial function application here: Python allows you to specify some parameters of a function and wait for the remaining parameters to be specified later. Note that the most complicated thing here is the functional programming style. Also, note that it makes it very easy to compute other sample-level annotations; just replace MQ0 with AB, AD, GQ, and so on, and you immediately have a computation for that annotation. If the annotation is not of integer type, no problem; just adapt my_type. This is a difficult programming style if you are not used to it, but you will reap the benefits very soon.
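    •     If partial function application is new to you, here is the idea in isolation (a minimal sketch with a hypothetical get_annotation helper):
      import functools
       
      def get_annotation(sample, annot, my_type):
         return my_type(sample[annot])
       
      # Fix two of the three parameters; the result behaves like a
      # one-argument function
      get_mq0 = functools.partial(get_annotation, annot='MQ0', my_type=int)
      # get_mq0(sample) is now equivalent to get_annotation(sample, 'MQ0', int)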
  4. Let’s now plot the median and the 75th percentile for each window (in this case, with a size of 5,000) as follows:
    stats = {}
    colors = ['b', 'g']
    i = 0
    fig, ax = plt.subplots(figsize=(16, 9))
    for name, nwins in mq0_wins.items():
       stats[name] = apply_win_funs(nwins, {'median': 
     np.median, '75': functools.partial(np.percentile, q=75)})
       x_lim = [j * size for j in range(len(stats[name]))]
       ax.plot(x_lim, [x['median'] for x in stats[name]], 
     label=name, color=colors[i])
       ax.plot(x_lim, [x['75'] for x in stats[name]], '--', 
     color=colors[i])
       i += 1
    ax.legend()
    ax.set_xlabel('Genomic location in the downloaded segment')
    ax.set_ylabel('MQ0')
    fig.suptitle('Distribution of MQ0 along the genome', 
     fontsize='xx-large')
    •     Note that we now compute two different statistics on apply_win_funs: the median and the 75th percentile. Again, we pass function names as parameters (np.median) and perform partial function application (np.percentile). The result can be seen in the following figure:

      Figure 2: Median (continuous line) and 75th percentile (dashed) of MQ0 of sample SNPs in windows of 5,000 bp in size for an area of 200 kbp near the centromere (blue) and in the middle of the chromosome (green); both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project

    •     For the “standard” file, the median MQ0 is 0 (it is plotted at the very bottom and is almost invisible); this is good, as it suggests that most sequences involved in the calling of variants map clearly to this area of the genome. For the centromere, MQ0 is of poor quality. Furthermore, there are areas where the genotype caller could not find any variants at all; hence the incomplete chart.
  5. Let’s compare heterozygosity with the DP sample-level annotation:
    •     Here, we will plot the fraction of heterozygosity calls as a function of the sample read depth (DP) for every SNP. We will first explain the result and only then the code that generates it.
    •     The following screenshot shows the fraction of calls that are heterozygous at a certain depth:

      Figure 3: The continuous line represents the fraction of heterozygote calls computed at a certain depth; in blue is the centromeric area, in green is the “standard” area; the dashed lines represent the number of sample calls per depth; both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project

  6. In the preceding screenshot, there are two considerations to be taken into account:
    •     At a very low depth, the fraction of heterozygote calls is biased low; this makes sense because the number of reads per position does not allow you to make a correct estimate of the presence of both alleles in a sample. So, you should not trust calls at a very low depth.
    •     As expected, the number of calls in the centromere is way lower than calls outside it. The distribution of SNPs outside the centromere follows a common pattern that you can expect in many datasets. Here is the code:
      def get_sample_relation(recs, f1, f2):
          rel = defaultdict(int)
          for rec in recs:
              if not rec.is_snp:
                  continue
              for sample in rec.samples:
                  try:
                      v1 = f1(sample)
                      v2 = f2(sample)
                      if v1 is None or v2 is None:
                          continue  # We ignore Nones
                      rel[(v1, v2)] += 1
                  except:
                      pass  # This is outside the domain (typically None)
          return rel
       
      rels = {}
      for vcf_name in vcf_names:
          recs = vcf.Reader(filename=vcf_name)
          rels[vcf_name] = get_sample_relation(
              recs, lambda s: 1 if s.is_het else 0, lambda s: int(s['DP']))
  7. Let’s start by looking at the for loop. Again, we will use functional programming: the get_sample_relation function will traverse all the SNP records and apply the two functional parameters; the first determines heterozygosity, whereas the second gets the sample DP (remember that there is also a variant DP).
    •     Now, as the code is complex as it is, I opted for a naive data structure to be returned by get_sample_relation: a dictionary where the key is the pair of results (in this case, heterozygosity and DP) and the value is the count of SNPs that share both values. There are more elegant data structures with different trade-offs: SciPy sparse matrices, pandas DataFrames, or maybe you want to consider PyTables. The fundamental point here is to have a framework that is general enough to compute relationships among a couple of sample annotations.
    •     Also, be careful with the dimension space of your annotations; for example, if your annotation is of float type, you might have to round it (if not, the size of your data structure might become too big).
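    •     For example (a sketch, assuming AB is a float annotation between 0 and 1 in your file), lambda s: round(float(s['AB']), 1) would bin allele balance into just 11 distinct keys instead of thousands of distinct float values.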
  8. Now, let’s take a look at all the plotting codes. Let’s perform it in two parts; here is part 1:
    def plot_hz_rel(dps, ax, ax2, name, rel):
       frac_hz = []
       cnt_dp = []
       for dp in dps:
           hz = 0.0
           cnt = 0
           for khz, kdp in rel.keys():
               if kdp != dp:
                   continue
               cnt += rel[(khz, dp)]
               if khz == 1:
                   hz += rel[(khz, dp)]
           frac_hz.append(hz / cnt)
           cnt_dp.append(cnt)
       ax.plot(dps, frac_hz, label=name)
       ax2.plot(dps, cnt_dp, '--', label=name)
    •     This function takes a data structure (as generated by get_sample_relation), expecting that the first element of the key tuple is the heterozygosity state (0 = homozygote, 1 = heterozygote) and the second is the DP. With this, it generates two lines: one with the fraction of samples that are heterozygous at a certain depth and the other with the SNP count.
  9. Let’s now call this function, as shown in the following code:
    fig, ax = plt.subplots(figsize=(16, 9))
    ax2 = ax.twinx()
    for name, rel in rels.items():
       dps = list(set([x[1] for x in rel.keys()]))
       dps.sort()
       plot_hz_rel(dps, ax, ax2, name, rel)
    ax.set_xlim(0, 75)
    ax.set_ylim(0, 0.2)
    ax2.set_ylabel('Quantity of calls')
    ax.set_ylabel('Fraction of Heterozygote calls')
    ax.set_xlabel('Sample Read Depth (DP)')
    ax.legend()
    fig.suptitle('Number of calls per depth and fraction of calls which are Hz',
                 fontsize='xx-large')
    •     Here, we use two axes: on the left-hand side, we have the fraction of heterozygote SNPs, whereas on the right-hand side, we have the number of SNPs. Then, we call our plot_hz_rel for both data files. The rest is standard matplotlib code.
  10. Finally, let’s compare variant DP with the categorical variant-level annotation EFF. EFF is provided by SnpEff and tells us (among many other things) the type of SNP (for example, intergenic, intronic, coding synonymous, and coding nonsynonymous). The Anopheles dataset provides this useful annotation. Let’s start by extracting variant-level annotations, again in the functional programming style, as shown in the following code:
    def get_variant_relation(recs, f1, f2):
       rel = defaultdict(int)
       for rec in recs:
           if not rec.is_snp:
               continue
           try:
               v1 = f1(rec)
               v2 = f2(rec)
               if v1 is None or v2 is None:
                   continue  # We ignore Nones
               rel[(v1, v2)] += 1
           except:
               pass
       return rel
    •     The programming style here is similar to get_sample_relation, but we do not delve into the samples. Now, we define the types of effects that we will work with and convert the effect to an integer, as this allows you to use it as an index, for example, into matrices. Think of it as coding a categorical variable:
      accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING', 'SYNONYMOUS_CODING']
       
      def eff_to_int(rec):
         try:
             for annot in rec.INFO['EFF']:
                 #We use the first annotation
                 master_type = annot.split('(')[0]
                 return accepted_eff.index(master_type)
         except ValueError:
             return len(accepted_eff)
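    •     Note that eff_to_int returns 0 to 3 for the accepted effects and len(accepted_eff) (that is, 4) for everything else; this is why the plotting code in the final step allocates len(accepted_eff) + 1 buckets.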
  11. We will now traverse the file; the style should be clear to you now:
    eff_mq0s = {}
    for vcf_name in vcf_names:
       recs = vcf.Reader(filename=vcf_name)
       eff_mq0s[vcf_name] = get_variant_relation(
           recs, lambda r: eff_to_int(r), lambda r: int(r.INFO['DP']))
  12. Finally, we will plot the distribution of DP using the SNP effect, as shown in the following code:
    fig, ax = plt.subplots(figsize=(16,9))
    vcf_name = 'standard.vcf.gz'
    bp_vals = [[] for x in range(len(accepted_eff) + 1)]
    for k, cnt in eff_mq0s[vcf_name].items():
       my_eff, mq0 = k
       bp_vals[my_eff].extend([mq0] * cnt)
    sns.boxplot(bp_vals, sym='', ax=ax)
    ax.set_xticklabels(accepted_eff + ['OTHER'])
    ax.set_ylabel('DP (variant)')
    fig.suptitle('Distribution of variant DP per SNP type',
                 fontsize='xx-large')
  13. Here, we will just print a box plot for the noncentromeric file (refer to the following screenshot). The results are as expected: SNPs in coding areas will probably have more depth than intergenic SNPs, as they sit in regions that are easier to call:

    Figure 4: Boxplot for the distribution of variant read depth across different SNP effects

There’s more…

The approach you take will depend on the type of sequencing data that you have, the number of samples, and potential extra information (for example, pedigree among samples).

This recipe is very complex as it is, but parts of it are profoundly naive (there is a limit to the complexity that I could force on you in a simple recipe). For example, the window code does not support overlapping windows (a possible extension is sketched below); also, the data structures are simplistic. However, I hope that they give you an idea of the general strategy to process genomic high-throughput sequencing data.
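
For reference, here is a minimal sketch of how the windowing code could support overlapping windows (a hypothetical do_overlapping_window, where windows of length size start every step base pairs and step <= size is assumed):

def do_overlapping_window(recs, size, step, fun):
    # Windows of length `size` start every `step` bp, so one record can
    # fall into several windows (assumes step <= size)
    start = None
    win_res = []
    for rec in recs:
        if not rec.is_snp or len(rec.ALT) > 1:
            continue
        if start is None:
            start = rec.POS
        pos = rec.POS - start
        first = max(0, (pos - size) // step + 1)  # first window containing pos
        last = pos // step  # last window containing pos
        while len(win_res) <= last:
            win_res.append([])
        for win in range(first, last + 1):
            win_res[win].extend(fun(rec))
    return win_res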

See also

  • There are many filtering rules, but I would like to draw your attention to the need for reasonably good coverage (clearly more than 10x); for example, refer to Meynert et al., “Variant detection sensitivity and biases in whole genome and exome sequencing” at http://www.biomedcentral.com/1471-2105/15/247/
  • Brad Chapman is one of the best-known specialists in sequencing analysis and data quality with Python, and the main author of Blue Collar Bioinformatics, a blog that you may want to check out at https://bcbio.wordpress.com/
  • Brad is also the main author of bcbio-nextgen, a Python-based pipeline for high-throughput sequencing analysis. Refer to https://bcbio-nextgen.readthedocs.org
  • Peter Cock is the main author of Biopython and is heavily involved in NGS analysis; be sure to check his blog, “Blasted Bioinformatics!?”, at http://blastedbio.blogspot.co.uk/

Summary

In this article, we prepared the environment, analyzed variant calls, and learned how to study genome accessibility and filter SNP data.
