BLAST Search¶

Setup¶

The BLAST service runs in a Docker container and requires:

A local BLAST database
The Docker service running

In [1]:

Copied!

# change log level to INFO
import sys

from loguru import logger

logger.remove()
level = logger.add(sys.stderr, level="WARNING")
# change log level to INFO
import sys

from loguru import logger

logger.remove()
level = logger.add(sys.stderr, level="WARNING")

Basic Usage¶

The Blast class provides an interface to search protein or nucleotide sequences against a local BLAST database.

In [2]:

Copied!





from pyeed.tools import Blast

# Example protein sequence
sequence = "MSEQVAAVAKLRAKASEAAKEAKAREAAKKLAEAAKKAKAKEAAKRAEAKLAEKAKAAKRAEAKAAKEAKRAAAKRAEAKLAEKAKAAK"

# Initialize BLAST search
blast = Blast(
    # service_url="http://localhost:6001/blast",
    mode="blastp",  # Use blastp for protein sequences
    db_path="/usr/local/bin/data/test_db",  # Path in Docker container
    db_name="protein_db",  # Name of your BLAST database
    evalue=0.1,  # E-value threshold
    max_target_seqs=10,  # Maximum number of hits to return
)

# Perform search
results = blast.search(sequence)
results
from pyeed.tools import Blast

# Example protein sequence
sequence = "MSEQVAAVAKLRAKASEAAKEAKAREAAKKLAEAAKKAKAKEAAKRAEAKLAEKAKAAKRAEAKAAKEAKRAAAKRAEAKLAEKAKAAK"

# Initialize BLAST search
blast = Blast(
    # service_url="http://localhost:6001/blast",
    mode="blastp",  # Use blastp for protein sequences
    db_path="/usr/local/bin/data/test_db",  # Path in Docker container
    db_name="protein_db",  # Name of your BLAST database
    evalue=0.1,  # E-value threshold
    max_target_seqs=10,  # Maximum number of hits to return
)

# Perform search
results = blast.search(sequence)
results

Out[2]:

	subject_id	identity	alignment_length	mismatches	gap_opens	query_start	query_end	subject_start	subject_end	evalue	bit_score
0	seq7	81.818	22	3	1	31	51	11	32	0.003	22.3
1	seq1	100.000	25	0	0	1	25	1	25	0.004	22.3
2	seq2	61.538	26	10	0	20	45	5	30	0.038	19.2

The results are returned as a pandas DataFrame with the following columns:

subject_id: ID of the matched sequence
identity: Percentage identity
alignment_length: Length of the alignment
mismatches: Number of mismatches
gap_opens: Number of gap openings
query_start/end: Start/end positions in query sequence
subject_start/end: Start/end positions in subject sequence
evalue: Expectation value
bit_score: Bit score

Creating a BLAST Database¶

Before using BLAST, you need to create a local database. Here's how to create one from a FASTA file:

# For protein sequences
makeblastdb -in proteins.fasta -dbtype prot -out blast_db/my_proteins

# For nucleotide sequences
makeblastdb -in nucleotides.fasta -dbtype nucl -out blast_db/my_nucleotides

To access the BLAST Docker container shell and create databases:

# Enter the BLAST container shell
docker compose exec blast bash
# 
# Navigate to database directory
cd /usr/local/bin/data/blast_db
# 
# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_proteins
# 
# Create nucleotide database 
makeblastdb -in nucleotides.fasta -dbtype nucl -out my_nucleotides

Make sure your FASTA files are mounted in the container's /usr/local/bin/data/blast_db directory.

Advanced Usage¶

You can customize the BLAST search parameters:

In [3]:

Copied!





# Configure BLAST for sensitive protein search
blast = Blast(
    # service_url="http://localhost:6001/blast",
    mode="blastp",
    db_path="/usr/local/bin/data/test_db",
    db_name="protein_db",
    evalue=1e-1,  # More stringent E-value
    max_target_seqs=100,  # Return more hits
    num_threads=4,  # Use 4 CPU threads
)

# Search with longer timeout
results = blast.search(sequence, timeout=7200)  # 2 hour timeout

# Filter results
significant_hits = results[results["identity"] > 80]  # Only hits with >90% identity
significant_hits
# Configure BLAST for sensitive protein search
blast = Blast(
    # service_url="http://localhost:6001/blast",
    mode="blastp",
    db_path="/usr/local/bin/data/test_db",
    db_name="protein_db",
    evalue=1e-1,  # More stringent E-value
    max_target_seqs=100,  # Return more hits
    num_threads=4,  # Use 4 CPU threads
)

# Search with longer timeout
results = blast.search(sequence, timeout=7200)  # 2 hour timeout

# Filter results
significant_hits = results[results["identity"] > 80]  # Only hits with >90% identity
significant_hits

Out[3]:

	subject_id	identity	alignment_length	mismatches	gap_opens	query_start	query_end	subject_start	subject_end	evalue	bit_score
0	seq7	81.818	22	3	1	31	51	11	32	0.003	22.3
1	seq1	100.000	25	0	0	1	25	1	25	0.004	22.3

Thereafter, the ids of the hits can be added to the pyeed database, using the fetch_from_primary_db function.