In [1]:
Copied!
# change log level to INFO
import sys
from loguru import logger
logger.remove()
level = logger.add(sys.stderr, level="WARNING")
# change log level to INFO
import sys
from loguru import logger
logger.remove()
level = logger.add(sys.stderr, level="WARNING")
Basic Usage¶
The Blast
class provides an interface to search protein or nucleotide sequences against a local BLAST database.
In [2]:
Copied!
from pyeed.tools import Blast
# Example protein sequence
sequence = "MSEQVAAVAKLRAKASEAAKEAKAREAAKKLAEAAKKAKAKEAAKRAEAKLAEKAKAAKRAEAKAAKEAKRAAAKRAEAKLAEKAKAAK"
# Initialize BLAST search
blast = Blast(
# service_url="http://localhost:6001/blast",
mode="blastp", # Use blastp for protein sequences
db_path="/usr/local/bin/data/test_db", # Path in Docker container
db_name="protein_db", # Name of your BLAST database
evalue=0.1, # E-value threshold
max_target_seqs=10, # Maximum number of hits to return
)
# Perform search
results = blast.search(sequence)
results
from pyeed.tools import Blast
# Example protein sequence
sequence = "MSEQVAAVAKLRAKASEAAKEAKAREAAKKLAEAAKKAKAKEAAKRAEAKLAEKAKAAKRAEAKAAKEAKRAAAKRAEAKLAEKAKAAK"
# Initialize BLAST search
blast = Blast(
# service_url="http://localhost:6001/blast",
mode="blastp", # Use blastp for protein sequences
db_path="/usr/local/bin/data/test_db", # Path in Docker container
db_name="protein_db", # Name of your BLAST database
evalue=0.1, # E-value threshold
max_target_seqs=10, # Maximum number of hits to return
)
# Perform search
results = blast.search(sequence)
results
Out[2]:
subject_id | identity | alignment_length | mismatches | gap_opens | query_start | query_end | subject_start | subject_end | evalue | bit_score | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | seq7 | 81.818 | 22 | 3 | 1 | 31 | 51 | 11 | 32 | 0.003 | 22.3 |
1 | seq1 | 100.000 | 25 | 0 | 0 | 1 | 25 | 1 | 25 | 0.004 | 22.3 |
2 | seq2 | 61.538 | 26 | 10 | 0 | 20 | 45 | 5 | 30 | 0.038 | 19.2 |
The results are returned as a pandas DataFrame with the following columns:
- subject_id: ID of the matched sequence
- identity: Percentage identity
- alignment_length: Length of the alignment
- mismatches: Number of mismatches
- gap_opens: Number of gap openings
- query_start/end: Start/end positions in query sequence
- subject_start/end: Start/end positions in subject sequence
- evalue: Expectation value
- bit_score: Bit score
Creating a BLAST Database¶
Before using BLAST, you need to create a local database. Here's how to create one from a FASTA file:
# For protein sequences
makeblastdb -in proteins.fasta -dbtype prot -out blast_db/my_proteins
# For nucleotide sequences
makeblastdb -in nucleotides.fasta -dbtype nucl -out blast_db/my_nucleotides
To access the BLAST Docker container shell and create databases:
# Enter the BLAST container shell
docker compose exec blast bash
#
# Navigate to database directory
cd /usr/local/bin/data/blast_db
#
# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_proteins
#
# Create nucleotide database
makeblastdb -in nucleotides.fasta -dbtype nucl -out my_nucleotides
Make sure your FASTA files are mounted in the container's /usr/local/bin/data/blast_db
directory.
Advanced Usage¶
You can customize the BLAST search parameters:
In [3]:
Copied!
# Configure BLAST for sensitive protein search
blast = Blast(
# service_url="http://localhost:6001/blast",
mode="blastp",
db_path="/usr/local/bin/data/test_db",
db_name="protein_db",
evalue=1e-1, # More stringent E-value
max_target_seqs=100, # Return more hits
num_threads=4, # Use 4 CPU threads
)
# Search with longer timeout
results = blast.search(sequence, timeout=7200) # 2 hour timeout
# Filter results
significant_hits = results[results["identity"] > 80] # Only hits with >90% identity
significant_hits
# Configure BLAST for sensitive protein search
blast = Blast(
# service_url="http://localhost:6001/blast",
mode="blastp",
db_path="/usr/local/bin/data/test_db",
db_name="protein_db",
evalue=1e-1, # More stringent E-value
max_target_seqs=100, # Return more hits
num_threads=4, # Use 4 CPU threads
)
# Search with longer timeout
results = blast.search(sequence, timeout=7200) # 2 hour timeout
# Filter results
significant_hits = results[results["identity"] > 80] # Only hits with >90% identity
significant_hits
Out[3]:
subject_id | identity | alignment_length | mismatches | gap_opens | query_start | query_end | subject_start | subject_end | evalue | bit_score | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | seq7 | 81.818 | 22 | 3 | 1 | 31 | 51 | 11 | 32 | 0.003 | 22.3 |
1 | seq1 | 100.000 | 25 | 0 | 0 | 1 | 25 | 1 | 25 | 0.004 | 22.3 |
Thereafter, the ids of the hits can be added to the pyeed database, using the fetch_from_primary_db
function.