Interactive Python is great but you want to put your workflow in scripts so it is easy to reproduce. The basic format is below. Copy and paste it into a text file.
#!/usr/bin/python
str='hello world'
print(str)
quit() #Can skip this, will exit at end of file
#!/usr/bin/python
import sys
print(sys.argv[0])
print(sys.argv[1])
print(len(sys.argv))
quit()
Then run: “python script.py ‘in string’”
We can expand on this and do something with the command line input. Like print it in reverse order.
#!/usr/bin/python
import sys
str=sys.argv[1]
l=len(str)
for i in range(l,0,-1):
print(str[i])
quit()
conda install Bio
Biopython is a big project from which we will use the submodule Entrez which provides tools for searching NCBI databases and the submodule SeqIO for switching between sequence file types.
Take a look at the databases we can access:
from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle = Entrez.einfo()
record = Entrez.read(handle)
print(record["DbList"])
## ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
For the search string check on the NCBI website search first.
https://www.ncbi.nlm.nih.gov/nucleotide/
Look at the search details box to the right after entering a search term.
Let’s put the next few code chunks into a script that accepts a search string as sys.argv[0] and passes it to the Entrez.esearch function.
Tutorial:http://biopython.org/DIST/docs/api/Bio.Entrez-module.html
from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
#search Genbank, returns accession numbers (up to 100)
handle=Entrez.esearch(db='nucleotide', retmax=3, term="rbcl[All Fields]", idtype="acc")
record = Entrez.read(handle)
print(handle)
## <_io.TextIOWrapper encoding='utf-8'>
print(record)
## {'Count': '263906', 'RetMax': '3', 'RetStart': '0', 'IdList': ['NATU01000384.1', 'NATW01000126.1', 'NATT01000272.1'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '263906', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}
handle.close()
fetch = Entrez.efetch(db='nucleotide', id=",".join(record['IdList']), rettype='gb', retmode='text')
gb=fetch.read()
print(gb[0:1000])
## LOCUS NATU01000384 20221 bp DNA linear ENV 03-MAY-2018
## DEFINITION Gamma proteobacterium symbiont of Ctena orbiculata isolate 40D _2,
## whole genome shotgun sequence.
## ACCESSION NATU01000384 NATU01000000
## VERSION NATU01000384.1
## DBLINK BioProject: PRJNA377790
## BioSample: SAMN06473048
## KEYWORDS WGS.
## SOURCE gamma proteobacterium symbiont of Ctena orbiculata (mollusc
## metagenome)
## ORGANISM gamma proteobacterium symbiont of Ctena orbiculata
## Bacteria; Proteobacteria; Gammaproteobacteria.
## REFERENCE 1 (bases 1 to 20221)
## AUTHORS Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
## TITLE Taxonomic and genetic symbiont diversity in the seagrass-associated
## lucinid bivalves Ctena orbiculata and Stewartia floridana
## JOURNAL Unpublished
## REFERENCE 2 (bases 1 to 20221)
## AUTHORS Lim,J.S., Engel,A.S., Anderson,L.C. and Campbell,B.J.
## TITLE Direct Submission
## JOURNAL Submitted
Note that Entrez.efetch() has a limit of 200 records. So you may have to put this in a loop. You can check this as below:
print(record['Count'])
## 263906
Now let’s use SeqIO to switch from the Genbank file format to fasta
target = open('./data/seqs.gb', 'w') #Open file seqs.gb for writing
target.write(gb)
target.close()
count = SeqIO.convert("./data/seqs.gb", "genbank", "./data/seqs.fasta", "fasta") #convert file seqs.gb to seqs.fasta
print("Converted %i records to fasta" % count)
## Converted 3 records to fasta
from Bio import Entrez
from Bio import SeqIO
Entrez.email='your@email.com'
handle=Entrez.esearch(db='pubmed', retmax=3, term="rbcl", retmode='xml')
record = Entrez.read(handle)
handle.close()
print(record)
## {'Count': '2233', 'RetMax': '3', 'RetStart': '0', 'IdList': ['29733977', '29726460', '29723647'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'rbcl[All Fields]', 'Field': 'All Fields', 'Count': '2233', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'rbcl[All Fields]'}
handle = Entrez.efetch(db="pubmed", id=record['IdList'])
xml=handle.read()
print(xml[0:2000])
##
## Pubmed-entry ::= {
## pmid 29733977,
## medent {
## em std {
## year 2018,
## month 5,
## day 8,
## hour 6,
## minute 0
## },
## cit {
## title {
## name "First insights on the biogeographical history of Phlegmariurus
## (Lycopodiaceae), with a focus on Madagascar."
## },
## authors {
## names std {
## {
## name ml "Bauret L",
## affil str "Museum national d'Histoire naturelle, Sorbonne
## Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
## CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
## France; Universite Pierre et Marie Curie, Sorbonne Universites, Institut de
## Systematique, Evolution, Biodiversite (UMR 7205 CNRS, MNHN, UPMC, EPHE), 57
## rue Cuvier CP48, F-75005 Paris, France. Electronic address:
## lu.bauret@gmail.com."
## },
## {
## name ml "Field AR",
## affil str "Queensland Herbarium, Department of Environment and
## Science; Australian Tropical Herbarium, James Cook University, POBox 6811,
## Cairns, Qld 4878, Australia."
## },
## {
## name ml "Gaudeul M",
## affil str "Museum national d'Histoire naturelle, Sorbonne
## Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
## CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
## France."
## },
## {
## name ml "Selosse MA",
## affil str "Museum national d'Histoire naturelle, Sorbonne
## Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
## CNRS, MNHN, UPMC, EPHE), Herbier national, 16 rue Buffon CP39, F-75005 Paris,
## France; Department of Plant Taxonomy and Nature Conservation, University of
## Gdansk, ul. Wita Stwosza 59, 80-308 Gdansk, Poland."
## },
## {
## name ml "Rouhan G",
## affil str "Museum national d'Histoire naturelle, Sorbonne
## Universites, Institut de Systematique, Evolution, Biodiversite (UMR 7205
## CNRS, MNHN, UPM
handle = Entrez.efetch(db="pubmed", id=record['IdList'], rettype="gb", retmode="xml")
record = Entrez.read(handle)
title=list()
for i in range(len(record)+1):
print(record['PubmedArticle'][i]['MedlineCitation']['Article']['ArticleTitle'])
## First insights on the biogeographical history of Phlegmariurus (Lycopodiaceae), with a focus on Madagascar.
## 3, a new polymorph of rubidium copper trichloride: synthesis, structure and structural complexity.
## Origins of East Asian Campanuloideae (Campanulaceae) diversity.
Align with mafft. Note that Biopython supports several multiple sequence alignment tools (Clustal, Mafft, ).
For install of mafft see: https://mafft.cbrc.jp/alignment/software/source.html
from Bio.Align.Applications import MafftCommandline
in_file = "./data/seqs.fasta"
mafft_cline = MafftCommandline(input=in_file) ## can add extra mafft commands here MafftCommandline(input=in_file, 'args')
print(mafft_cline)
stdout, stderr = mafft_cline() #Catch the output of mafft going to stdout and stderr
with open("aligned.fasta", "w") as handle:
handle.write(stdout) #write stdout to file
print(stdout)
I recommend looking the following for phylogenetic analyses in Python: http://dendropy.org/news.html