Clustering with MMSeqs2
For clustering, please make sure that the pyeed Docker container with MMSeqs2 is running.
PyEED provides a convenient interface to MMSeqs2 for sequence clustering. This notebook demonstrates how to use the MMSeqs class for:
- Clustering sequences from a dictionary
- Clustering sequences directly from the database
from pyeed import Pyeed
from pyeed.model import Protein
from pyeed.tools.mmseqs import MMSeqs
Direct Sequence Clustering
You can cluster sequences directly by providing a dictionary of sequences:
# Example sequences (seq4 and seq5 differ at a single position)
sequence_dict = {
    "seq1": "MGAWQPLIRKSTFNVADCEYLMKQHTGYPWVRESDTAHKLQNVGICFPYQAMTNYLG",
    "seq2": "QVLATPRWDKEFYHMSNCGRILQAVPTKDYFSHGRWMNEKAPTYLQSDVCHAFGKLT",
    "seq3": "DLNQWPKYARHTMGSLVEFACRQNIYTDHKPLWYGVSMEAFDQTCKPRYNLHGIVKT",
    "seq4": "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq5": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
}
# Initialize MMSeqs with custom parameters
mmseqs = MMSeqs(
    min_seq_id=0.8,  # 80% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    cov_mode=0,  # Bidirectional coverage mode
)
# Perform clustering
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
print("Clustering results:")
for cluster in clusters:
    print(f"\nCluster representative: {cluster.representative_id}")
    print(f"Cluster members: {cluster.represented_ids}")
Clustering results:

Cluster representative: seq1
Cluster members: ['seq1']

Cluster representative: seq2
Cluster members: ['seq2']

Cluster representative: seq3
Cluster members: ['seq3']

Cluster representative: seq5
Cluster members: ['seq5', 'seq4']
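To see why seq4 and seq5 land in the same cluster at an 80% identity threshold, the following minimal sketch counts matching positions between the two sequences. The helper `ungapped_identity` is hypothetical and not part of PyEED. MMSeqs2 derives identity from an actual alignment, so this position-by-position comparison is only a rough stand-in that works here because both sequences have the same length and no insertions or deletions.

```python
def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences.

    Hypothetical helper for illustration only; it does no alignment.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


seq4 = "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI"
seq5 = "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI"

identity = ungapped_identity(seq4, seq5)
mismatches = sum(x != y for x, y in zip(seq4, seq5))
print(f"seq4 vs seq5: {mismatches} mismatch(es), identity {identity:.3f}")
```

With a single mismatch over the full length, the pair sits well above the 0.8 threshold used in the example, which is consistent with the clustering output shown above.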
Clustering from Database
You can also cluster sequences directly from the PyEED database:
# Connect to database
pyeed = Pyeed(uri="bolt://localhost:7687", user="neo4j", password="12345678")
# Get first 100 protein IDs from database
accession_ids = [protein.accession_id for protein in Protein.nodes.all()][:100]
# Cluster sequences
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
print(f"Found {len(clusters)} clusters")
📡 Connected to database.
Found 34 clusters
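A cluster count alone says little about how the sequences are distributed, so a small size summary can help. The sketch below uses a hypothetical `Cluster` dataclass as a stand-in that only mirrors the two attributes used in this notebook (`representative_id` and `represented_ids`); the real object returned by PyEED may carry more. The `summarize` helper is likewise an illustration, not a PyEED function.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""

    representative_id: str
    represented_ids: list = field(default_factory=list)


def summarize(clusters):
    """Return the cluster count, singleton count, and largest cluster size."""
    sizes = [len(c.represented_ids) for c in clusters]
    return {
        "n_clusters": len(sizes),
        "n_singletons": sum(size == 1 for size in sizes),
        "largest": max(sizes, default=0),
    }


# Stand-in data matching the dictionary clustering result shown earlier
example_clusters = [
    Cluster("seq1", ["seq1"]),
    Cluster("seq2", ["seq2"]),
    Cluster("seq3", ["seq3"]),
    Cluster("seq5", ["seq5", "seq4"]),
]

summary = summarize(example_clusters)
print(summary)
```

The same function should work on the list returned by `cluster_from_db`, assuming its elements expose `represented_ids` as a list.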
Customizing MMSeqs Parameters
MMSeqs offers several parameters to control clustering behavior:
mmseqs = MMSeqs(
    # Sequence identity and coverage
    min_seq_id=0.5,  # 50% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    # Coverage mode (MMseqs2 convention)
    cov_mode=0,  # 0: bidirectional, 1: target, 2: query
    # Performance settings
    threads=4,  # Number of CPU threads
    sensitivity=7.5,  # 1.0 (fast) to 9.0 (sensitive)
    # Clustering behavior
    cluster_mode=0,  # 0: set-cover, 1: connected-component, 2: greedy
    seq_id_mode=0,  # 0: alignment length, 1: shorter sequence
    rescore_mode=0,  # Whether to rescore overlapping alignments
)
# The parameters affect both clustering methods
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
# or
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
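The interaction between coverage and cov_mode is easier to follow with numbers. The sketch below is a simplified model of the coverage check, assuming the MMseqs2 command-line convention (0: both sequences, 1: target only, 2: query only) and assuming the PyEED wrapper passes the value through unchanged. The function `passes_coverage` is hypothetical, and it treats the aligned length as a single number, which ignores gaps.

```python
def passes_coverage(aligned_len, query_len, target_len, coverage=0.8, cov_mode=0):
    """Simplified coverage check (hypothetical helper, not a PyEED or MMseqs2 API).

    cov_mode 0: aligned fraction of the longer sequence (covers query and target)
    cov_mode 1: aligned fraction of the target
    cov_mode 2: aligned fraction of the query
    """
    if cov_mode == 0:
        denominator = max(query_len, target_len)
    elif cov_mode == 1:
        denominator = target_len
    elif cov_mode == 2:
        denominator = query_len
    else:
        raise ValueError(f"unsupported cov_mode in this sketch: {cov_mode}")
    return aligned_len / denominator >= coverage


# A 50-residue query fragment aligned over its full length to a 100-residue target
checks = {mode: passes_coverage(50, 50, 100, coverage=0.8, cov_mode=mode)
          for mode in (0, 1, 2)}
print(checks)
```

In this model, a short fragment only passes when coverage is measured against the query alone, which is why bidirectional coverage tends to keep fragments out of clusters of full-length sequences.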
Understanding Cluster Results
Each cluster is represented by a Cluster object with two attributes:
- representative_id: The ID of the sequence chosen as the cluster representative
- represented_ids: List of the IDs of all sequences in the cluster (including the representative)
Sequences with high similarity (based on min_seq_id and coverage) will be grouped together.
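A common follow-up is to look up which representative stands for a given sequence. The sketch below builds that lookup from the two attributes described above. The `Cluster` dataclass is a hypothetical stand-in so that the example runs without a database or Docker container, and `representative_lookup` is an illustrative helper rather than part of PyEED.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in with the two documented attributes."""

    representative_id: str
    represented_ids: list = field(default_factory=list)


def representative_lookup(clusters):
    """Map every member ID to the ID of its cluster representative."""
    return {
        member_id: cluster.representative_id
        for cluster in clusters
        for member_id in cluster.represented_ids
    }


# Stand-in data matching the dictionary clustering result shown earlier
clusters = [
    Cluster("seq1", ["seq1"]),
    Cluster("seq2", ["seq2"]),
    Cluster("seq3", ["seq3"]),
    Cluster("seq5", ["seq5", "seq4"]),
]

lookup = representative_lookup(clusters)
print(lookup["seq4"])
```

Because represented_ids includes the representative itself, every representative maps to its own ID in the resulting dictionary.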