This web site is not a blog, even though it is a collection of random pieces of text on different subjects: I prefer more structure to the pages than the continuous flow common to blogs.
When working with sequence data, you often end up with simple questions about properties of your data: How many genes? Average gene length? Sequence count? Many of these questions are easy to answer with a single line at the shell prompt using standard Unix utilities. This page is by no means a complete catalog, and there are several other pages out there collecting such things. It is also worth considering that, to some extent, it is intrinsic to one-line analysis that you don't keep a collection of them: once you are fluent in shell usage, you write them and adapt them to the specific case at hand. This page is a continuous work in progress.
Calculate the gene count and the average gene length, measured from the start of the first codon to the end of the last codon. This assumes MAKER annotations, but it is easily adapted to other use cases by changing the if-statement used for selecting feature lines.
$ awk '{if($2 == "maker" && $3 == "gene")
{len = $5 - $4 + 1;  # GFF3 coordinates are 1-based and inclusive
s += len; n++; print n, s, len}}
END{if(n) print "Gene count", n, "Mean length", s/n}' \
input.gff3
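As a sketch of how the feature selection can be adapted, the snippet below wraps the same idea in a small shell function that takes the source (column 2) and the feature type (column 3) as arguments. The function name feature_stats, its argument order, and the three-line sample file are made up for illustration; they are not part of MAKER or of any standard tool. Lengths are counted as end - start + 1, since GFF3 coordinates are 1-based and inclusive, and the per-feature running output is left out to keep the sketch short.

```shell
# Hypothetical helper: count features of a given source and type in a
# GFF3 file and report their mean length.
# Usage: feature_stats SOURCE TYPE FILE
feature_stats() {
    awk -F'\t' -v src="$1" -v feat="$2" '
        # Skip comment and directive lines, then select on columns 2 and 3.
        !/^#/ && $2 == src && $3 == feat {
            s += $5 - $4 + 1   # GFF3 coordinates are 1-based and inclusive
            n++
        }
        END {
            # Guard against division by zero when nothing matched.
            if (n) { print "Feature count", n, "Mean length", s / n }
            else   { print "Feature count", 0 }
        }' "$3"
}

# Tiny made-up GFF3 file, for illustration only.
tmp=$(mktemp)
printf 'chr1\tmaker\tgene\t101\t400\t.\t+\t.\tID=g1\n'       >  "$tmp"
printf 'chr1\tmaker\texon\t101\t200\t.\t+\t.\tParent=g1\n'   >> "$tmp"
printf 'chr1\tmaker\tgene\t1001\t1500\t.\t-\t.\tID=g2\n'     >> "$tmp"

feature_stats maker gene "$tmp"   # two genes, lengths 300 and 500
feature_stats maker exon "$tmp"   # one exon, length 100
rm -f "$tmp"
```

Passing the source and type with -v keeps the awk program itself unchanged between runs, so switching from genes to exons, or from MAKER to another annotation source, only means changing the arguments.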
Copyright © 2024 Mikael Durling. Published April 22, 2024, updated May 19, 2024.