One Line Bioinformatics - mikael durling

When working with sequence data you often end up with simple questions about properties of your data: How many genes? Average gene length? Sequence count? Many of these questions are easy to answer with a single line at the shell prompt using standard Unix utilities. This page is by no means a complete catalog, and several other pages collect such one-liners. It is also worth considering that, to some extent, it is intrinsic to one-line analysis that you don't keep a collection of them: once you are fluent in shell usage, you write them as needed and adapt them to the specific case at hand. This page is a continuous work in progress.

Working with GFF files

Average gene length

Calculate the gene count and the average distance from the start of the first codon to the end of the last codon. This assumes MAKER annotations (source column set to "maker"), but it is easily adapted to other use cases by changing the if-statement used for selecting feature lines.

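$ awk -F'\t' '{if($2 == "maker" && $3 == "gene") {
        len = $5 - $4 + 1    # GFF coordinates are 1-based and inclusive
        s += len; n++
        print n, s, len      # running count, running total, this gene
    }}
    END{if (n) print "Gene count", n, "Mean length", s/n}' \
    input.gff3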
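
As a sketch of how the selection can be adapted, the feature type can be passed in as a variable instead of being hard-coded. The function name mean_feature_length and the file sample.gff3 are made up for illustration, and the example data is invented. Lengths are counted as end - start + 1, on the assumption that both end coordinates belong to the feature, as in GFF.

```shell
# Mean length of one feature type in a tab-separated GFF3 file.
# Usage: mean_feature_length TYPE FILE   (hypothetical helper name)
mean_feature_length() {
    awk -F'\t' -v type="$1" '
        $3 == type {              # select on the feature-type column only
            s += $5 - $4 + 1      # count both end coordinates
            n++
        }
        END { if (n) print "Count", n, "Mean length", s/n }
    ' "$2"
}

# Tiny invented example file, just to show the call.
printf 'chr1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n'       >  sample.gff3
printf 'chr1\tmaker\texon\t1\t40\t.\t+\t.\tParent=g1\n'    >> sample.gff3
printf 'chr1\tmaker\texon\t61\t100\t.\t+\t.\tParent=g1\n'  >> sample.gff3

mean_feature_length exon sample.gff3
```

Dropping the test on the source column means features from any annotation tool are counted; add `$2 == "maker" &&` back to the condition to restrict it again.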
