Scripting in Python

Interactive Python is great but you want to put your workflow in scripts so it is easy to reproduce. The basic format is below. Copy and paste it into a text file.

#!/usr/bin/python
str='hello world'
print(str)
quit() #Can skip this, will exit at end of file

Accepting command line input

#!/usr/bin/python
import sys
print(sys.argv[0])
print(sys.argv[1])
print(len(sys.argv))
quit() 

Then run: “python script.py ‘in string’”

We can expand on this and do something with the command line input. Like print it in reverse order.

#!/usr/bin/python
import sys
str=sys.argv[1]
l=len(str)
for i in range(l,0,-1):
  print(str[i])
  
quit() 

NCBI Entrez

conda install Bio

Biopython and Entrez search tools

Biopython is a big project from which we will use the submodule Entrez which provides tools for searching NCBI databases and the submodule SeqIO for switching between sequence file types.

Take a look at the databases we can access:

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle = Entrez.einfo()
record = Entrez.read(handle)
print(record["DbList"])
## ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']

For the search string check on the NCBI website search first.

https://www.ncbi.nlm.nih.gov/nucleotide/

Look at the search details box to the right after entering a search term.

Let’s put the next few code chunks into a script that accepts a search string as sys.argv[0] and passes it to the Entrez.esearch function.

Tutorial:http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
#search Genbank, returns accession numbers (up to 100)
handle=Entrez.esearch(db='nucleotide', retmax=3, term="rbcl[All Fields]", idtype="acc") 
record = Entrez.read(handle)
print(handle)
## <_io.TextIOWrapper encoding='utf-8'>
print(record)
## {'Count': '263906', 'RetMax': '3', 'RetStart': '0', 'IdList': ['NATU01000384.1', 'NATW01000126.1', 'NATT01000272.1'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '263906', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}
handle.close()
fetch = Entrez.efetch(db='nucleotide', id=",".join(record['IdList']), rettype='gb', retmode='text')
gb=fetch.read()
print(gb[0:1000])
## LOCUS       NATU01000384           20221 bp    DNA     linear   ENV 03-MAY-2018
## DEFINITION  Gamma proteobacterium symbiont of Ctena orbiculata isolate 40D _2,
##             whole genome shotgun sequence.
## ACCESSION   NATU01000384 NATU01000000
## VERSION     NATU01000384.1
## DBLINK      BioProject: PRJNA377790
##             BioSample: SAMN06473048
## KEYWORDS    WGS.
## SOURCE      gamma proteobacterium symbiont of Ctena orbiculata (mollusc
##             metagenome)
##   ORGANISM  gamma proteobacterium symbiont of Ctena orbiculata
##             Bacteria; Proteobacteria; Gammaproteobacteria.
## REFERENCE   1  (bases 1 to 20221)
##   AUTHORS   Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
##   TITLE     Taxonomic and genetic symbiont diversity in the seagrass-associated
##             lucinid bivalves Ctena orbiculata and Stewartia floridana
##   JOURNAL   Unpublished
## REFERENCE   2  (bases 1 to 20221)
##   AUTHORS   Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
##   TITLE     Direct Submission
##   JOURNAL   Submitted

Note that Entrez.efetch() has a limit of 200 records. So you may have to put this in a loop. You can check this as below:

print(record['Count'])
## 263906

Now let’s use SeqIO to switch from the Genbank file format to fasta

target = open('./data/seqs.gb', 'w')     #Open file seqs.gb for writing
target.write(gb)
target.close()
count = SeqIO.convert("./data/seqs.gb", "genbank", "./data/seqs.fasta", "fasta")    #convert file seqs.gb to seqs.fasta
print("Converted %i records to fasta" % count)
## Converted 3 records to fasta

Pubmed

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle=Entrez.esearch(db='pubmed', retmax=3, term="rbcl", retmode='xml') 
record = Entrez.read(handle)
handle.close()
print(record)
## {'Count': '2233', 'RetMax': '3', 'RetStart': '0', 'IdList': ['29733977', '29726460', '29723647'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '2233', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}
handle = Entrez.efetch(db="pubmed", id=record['IdList'])
xml=handle.read()
print(xml[0:2000])
## 
## Pubmed-entry ::= {
##   pmid 29733977,
##   medent {
##     em std {
##       year 2018,
##       month 5,
##       day 8,
##       hour 6,
##       minute 0
##     },
##     cit {
##       title {
##         name "First insights on the biogeographical history of Phlegmariurus
##  (Lycopodiaceae), with a focus on Madagascar."
##       },
##       authors {
##         names std {
##           {
##             name ml "Bauret L",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France; Universite Pierre et Marie Curie, Sorbonne Universites, Institut de
##  Systematique, Evolution, Biodiversite (UMR 7205 CNRS, MNHN, UPMC, EPHE), 57
##  rue Cuvier CP48, F-75005 Paris, France. Electronic address:
##  lu.bauret@gmail.com."
##           },
##           {
##             name ml "Field AR",
##             affil str "Queensland Herbarium, Department of Environment and
##  Science; Australian Tropical Herbarium, James Cook University, POBox 6811,
##  Cairns, Qld 4878, Australia."
##           },
##           {
##             name ml "Gaudeul M",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France."
##           },
##           {
##             name ml "Selosse MA",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France; Department of Plant Taxonomy and Nature Conservation, University of
##  Gdansk, ul. Wita Stwosza 59, 80-308 Gdansk, Poland."
##           },
##           {
##             name ml "Rouhan G",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPM
handle = Entrez.efetch(db="pubmed", id=record['IdList'], rettype="gb", retmode="xml")
record = Entrez.read(handle)
title=list()
for i in range(len(record)+1):
  print(record['PubmedArticle'][i]['MedlineCitation']['Article']['ArticleTitle'])
## First insights on the biogeographical history of Phlegmariurus (Lycopodiaceae), with a focus on Madagascar.
## 3, a new polymorph of rubidium copper trichloride: synthesis, structure and structural complexity.
## Origins of East Asian Campanuloideae (Campanulaceae) diversity.

Follow up

I recommend looking the following for phylogenetic analyses in Python: http://dendropy.org/news.html