Clustering with MMseqs2
Before clustering, make sure that the pyeed Docker container with MMseqs2 is running.
PyEED provides a convenient interface to MMseqs2 for sequence clustering. This notebook demonstrates how to use the MMSeqs class for:
- Clustering sequences from a dictionary
- Clustering sequences directly from the database
from pyeed import Pyeed
from pyeed.tools.mmseqs import MMSeqs
Direct Sequence Clustering
You can cluster sequences directly by providing a dictionary of sequences:
# Example sequences (seq4 and seq5 are highly similar)
sequence_dict = {
    "seq1": "MGAWQPLIRKSTFNVADCEYLMKQHTGYPWVRESDTAHKLQNVGICFPYQAMTNYLG",
    "seq2": "QVLATPRWDKEFYHMSNCGRILQAVPTKDYFSHGRWMNEKAPTYLQSDVCHAFGKLT",
    "seq3": "DLNQWPKYARHTMGSLVEFACRQNIYTDHKPLWYGVSMEAFDQTCKPRYNLHGIVKT",
    "seq4": "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq5": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
}
# Initialize MMSeqs with custom parameters
mmseqs = MMSeqs(
    min_seq_id=0.8,  # 80% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    cov_mode=0,  # Bidirectional coverage mode
)
# Perform clustering
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
print("Clustering results:")
for cluster in clusters:
    print(f"\nCluster representative: {cluster.representative_id}")
    print(f"Cluster members: {cluster.represented_ids}")
Clustering results:

Cluster representative: seq1
Cluster members: ['seq1']

Cluster representative: seq2
Cluster members: ['seq2']

Cluster representative: seq3
Cluster members: ['seq3']

Cluster representative: seq5
Cluster members: ['seq5', 'seq4']
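To see why seq4 and seq5 end up in the same cluster, it can help to compare them directly. The sketch below is not how MMseqs2 computes identity (MMseqs2 derives it from an alignment); it is a simplified, ungapped position-by-position comparison, which is only meaningful here because the two sequences have equal length. The helper name `positional_identity` is made up for this illustration.

```python
# Illustrative only: ungapped, position-by-position identity.
# MMseqs2 computes identity from an actual alignment; this shortcut is
# only valid for equal-length sequences that differ by substitutions.
seq4 = "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI"
seq5 = "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI"


def positional_identity(a: str, b: str) -> float:
    """Return the fraction of identical residues between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for this shortcut")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


identity = positional_identity(seq4, seq5)
print(f"seq4 vs seq5 identity: {identity:.3f}")
print(f"Above min_seq_id=0.8: {identity >= 0.8}")
```

The two sequences differ at a single position, so their identity is well above the 0.8 threshold used in the example, while seq1, seq2, and seq3 each remain in a cluster of their own.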
Clustering from Database
You can also cluster sequences directly from the PyEED database:
# Connect to database
pyeed = Pyeed(uri="bolt://localhost:7687", user="neo4j", password="12345678")
# Get first 100 protein IDs from database
from pyeed.model import Protein
accession_ids = [protein.accession_id for protein in Protein.nodes.all()][:100]
# Cluster sequences
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
print(f"Found {len(clusters)} clusters")
📡 Connected to database.
Found 34 clusters
Customizing MMSeqs Parameters
MMSeqs offers several parameters to control clustering behavior:
mmseqs = MMSeqs(
    # Sequence identity and coverage
    min_seq_id=0.5,  # 50% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    # Coverage mode (MMseqs2 numbering)
    cov_mode=0,  # 0: bidirectional (query and target), 1: target, 2: query
    # Performance settings
    threads=4,  # Number of CPU threads
    sensitivity=7.5,  # 1.0 (fast) to 9.0 (sensitive)
    # Clustering behavior
    cluster_mode=0,  # 0: set-cover, 1: connected-component, 2: greedy
    seq_id_mode=0,  # 0: alignment length, 1: shorter sequence
    rescore_mode=0,  # How diagonals are rescored (0: Hamming distance)
)
# The parameters affect both clustering methods
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
# or
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
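The interaction between coverage and cov_mode is easier to see with numbers. The following is a simplified sketch of the coverage check, following the MMseqs2 documentation's numbering (0: query and target, 1: target only, 2: query only). The function `passes_coverage` is hypothetical and not part of PyEED or MMseqs2, and it assumes the same aligned length is counted on both sequences (gaps are ignored).

```python
def passes_coverage(
    aligned_len: int,
    query_len: int,
    target_len: int,
    coverage: float = 0.8,
    cov_mode: int = 0,
) -> bool:
    """Simplified coverage check (hypothetical helper, not a PyEED/MMseqs2 API)."""
    query_cov = aligned_len / query_len    # fraction of the query that is aligned
    target_cov = aligned_len / target_len  # fraction of the target that is aligned
    if cov_mode == 0:  # bidirectional: both sequences must be covered
        return query_cov >= coverage and target_cov >= coverage
    if cov_mode == 1:  # only the target must be covered
        return target_cov >= coverage
    if cov_mode == 2:  # only the query must be covered
        return query_cov >= coverage
    raise ValueError(f"cov_mode {cov_mode} is not handled in this sketch")


# A 50-residue fragment (query) fully aligned to a 100-residue protein (target)
for mode in (0, 1, 2):
    ok = passes_coverage(aligned_len=50, query_len=50, target_len=100, cov_mode=mode)
    print(f"cov_mode={mode}: passes={ok}")
```

In this example the fragment is fully covered but the full-length protein is only half covered, so the pair passes only when coverage is required on the query alone. Bidirectional coverage is therefore the stricter choice when fragments should not be merged with full-length sequences.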
Understanding Cluster Results
Each cluster is represented by a Cluster object with two attributes:
- representative_id: The sequence chosen as the cluster representative
- represented_ids: List of all sequences in the cluster (including the representative)
Sequences with high similarity (based on min_seq_id and coverage) will be grouped together.
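A common follow-up step is to turn the list of clusters into a lookup from each sequence ID to its representative. The sketch below uses a stand-in `Cluster` dataclass that only mirrors the two attributes described above; it is an assumption for illustration, not PyEED's actual class. The data reproduces the result of the direct clustering example, so the snippet runs without MMseqs2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Hypothetical stand-in mirroring the two documented attributes."""

    representative_id: str
    represented_ids: List[str] = field(default_factory=list)


# Same grouping as the output of the direct clustering example
clusters = [
    Cluster("seq1", ["seq1"]),
    Cluster("seq2", ["seq2"]),
    Cluster("seq3", ["seq3"]),
    Cluster("seq5", ["seq5", "seq4"]),
]

# Map every member (including the representative) to its representative
member_to_rep: Dict[str, str] = {
    member: cluster.representative_id
    for cluster in clusters
    for member in cluster.represented_ids
}

# Clusters that group more than one sequence
multi_member = [c for c in clusters if len(c.represented_ids) > 1]

print(member_to_rep["seq4"])
print([c.representative_id for c in multi_member])
```

Because represented_ids includes the representative itself, every input sequence appears exactly once in the lookup, which makes it straightforward to keep only one representative per cluster in downstream analyses.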