Biopython and Entrez search tools

Biopython is a big project from which we will use the submodule Entrez which provides tools for searching NCBI databases and the submodule SeqIO for switching between sequence file types.

Take a look at the databases we can access:

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle = Entrez.einfo()
record = Entrez.read(handle)
print(record["DbList"])

## ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']

For the search string check on the NCBI website search first.

https://www.ncbi.nlm.nih.gov/nucleotide/

Look at the search details box to the right after entering a search term.

Let’s put the next few code chunks into a script that accepts a search string as sys.argv[0] and passes it to the Entrez.esearch function.

Tutorial:http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
#search Genbank, returns accession numbers (up to 100)
handle=Entrez.esearch(db='nucleotide', retmax=3, term="rbcl[All Fields]", idtype="acc") 
record = Entrez.read(handle)
print(handle)

## <_io.TextIOWrapper encoding='utf-8'>

print(record)

## {'Count': '263906', 'RetMax': '3', 'RetStart': '0', 'IdList': ['NATU01000384.1', 'NATW01000126.1', 'NATT01000272.1'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '263906', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}

handle.close()
fetch = Entrez.efetch(db='nucleotide', id=",".join(record['IdList']), rettype='gb', retmode='text')
gb=fetch.read()
print(gb[0:1000])

## LOCUS       NATU01000384           20221 bp    DNA     linear   ENV 03-MAY-2018
## DEFINITION  Gamma proteobacterium symbiont of Ctena orbiculata isolate 40D _2,
##             whole genome shotgun sequence.
## ACCESSION   NATU01000384 NATU01000000
## VERSION     NATU01000384.1
## DBLINK      BioProject: PRJNA377790
##             BioSample: SAMN06473048
## KEYWORDS    WGS.
## SOURCE      gamma proteobacterium symbiont of Ctena orbiculata (mollusc
##             metagenome)
##   ORGANISM  gamma proteobacterium symbiont of Ctena orbiculata
##             Bacteria; Proteobacteria; Gammaproteobacteria.
## REFERENCE   1  (bases 1 to 20221)
##   AUTHORS   Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
##   TITLE     Taxonomic and genetic symbiont diversity in the seagrass-associated
##             lucinid bivalves Ctena orbiculata and Stewartia floridana
##   JOURNAL   Unpublished
## REFERENCE   2  (bases 1 to 20221)
##   AUTHORS   Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
##   TITLE     Direct Submission
##   JOURNAL   Submitted

Note that Entrez.efetch() has a limit of 200 records. So you may have to put this in a loop. You can check this as below:

print(record['Count'])

## 263906

Now let’s use SeqIO to switch from the Genbank file format to fasta

target = open('./data/seqs.gb', 'w')     #Open file seqs.gb for writing
target.write(gb)
target.close()
count = SeqIO.convert("./data/seqs.gb", "genbank", "./data/seqs.fasta", "fasta")    #convert file seqs.gb to seqs.fasta
print("Converted %i records to fasta" % count)

## Converted 3 records to fasta

Pubmed

from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle=Entrez.esearch(db='pubmed', retmax=3, term="rbcl", retmode='xml') 
record = Entrez.read(handle)
handle.close()
print(record)

## {'Count': '2233', 'RetMax': '3', 'RetStart': '0', 'IdList': ['29733977', '29726460', '29723647'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '2233', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}

handle = Entrez.efetch(db="pubmed", id=record['IdList'])
xml=handle.read()
print(xml[0:2000])

## 
## Pubmed-entry ::= {
##   pmid 29733977,
##   medent {
##     em std {
##       year 2018,
##       month 5,
##       day 8,
##       hour 6,
##       minute 0
##     },
##     cit {
##       title {
##         name "First insights on the biogeographical history of Phlegmariurus
##  (Lycopodiaceae), with a focus on Madagascar."
##       },
##       authors {
##         names std {
##           {
##             name ml "Bauret L",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France; Universite Pierre et Marie Curie, Sorbonne Universites, Institut de
##  Systematique, Evolution, Biodiversite (UMR 7205 CNRS, MNHN, UPMC, EPHE), 57
##  rue Cuvier CP48, F-75005 Paris, France. Electronic address:
##  lu.bauret@gmail.com."
##           },
##           {
##             name ml "Field AR",
##             affil str "Queensland Herbarium, Department of Environment and
##  Science; Australian Tropical Herbarium, James Cook University, POBox 6811,
##  Cairns, Qld 4878, Australia."
##           },
##           {
##             name ml "Gaudeul M",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France."
##           },
##           {
##             name ml "Selosse MA",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
##  France; Department of Plant Taxonomy and Nature Conservation, University of
##  Gdansk, ul. Wita Stwosza 59, 80-308 Gdansk, Poland."
##           },
##           {
##             name ml "Rouhan G",
##             affil str "Museum national d'Histoire naturelle, Sorbonne
##  Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
##  CNRS, MNHN, UPM

handle = Entrez.efetch(db="pubmed", id=record['IdList'], rettype="gb", retmode="xml")
record = Entrez.read(handle)
title=list()
for i in range(len(record)+1):
  print(record['PubmedArticle'][i]['MedlineCitation']['Article']['ArticleTitle'])

## First insights on the biogeographical history of Phlegmariurus (Lycopodiaceae), with a focus on Madagascar.
## 3, a new polymorph of rubidium copper trichloride: synthesis, structure and structural complexity.
## Origins of East Asian Campanuloideae (Campanulaceae) diversity.

Challenge: Link multiple sequence alignment to sequence harvesting in your Genbank script.

Align with mafft. Note that Biopython supports several multiple sequence alignment tools (Clustal, Mafft, ).

For install of mafft see: https://mafft.cbrc.jp/alignment/software/source.html

from Bio.Align.Applications import MafftCommandline
in_file = "./data/seqs.fasta"
mafft_cline = MafftCommandline(input=in_file) ## can add extra mafft commands here MafftCommandline(input=in_file, 'args')
print(mafft_cline)
stdout, stderr = mafft_cline() #Catch the output of mafft going to stdout and stderr 
with open("aligned.fasta", "w") as handle:
    handle.write(stdout)  #write stdout to file
print(stdout)

Follow up

I recommend looking the following for phylogenetic analyses in Python: http://dendropy.org/news.html

Scripting NCBI Web-scraping

Rob Harbert

5/9/2018

Scripting in Python

Accepting command line input

NCBI Entrez

Biopython and Entrez search tools

Pubmed

Challenge: Link multiple sequence alignment to sequence harvesting in your Genbank script.

Follow up