Clustering with MMSeqs2

For clustering, please make sure that the pyeed Docker container with MMSeqs2 is running.

PyEED provides a convenient interface to MMSeqs2 for sequence clustering. This notebook demonstrates how to use the MMSeqs class for:

Clustering sequences from a dictionary
Clustering sequences directly from the database


from pyeed import Pyeed
from pyeed.model import Protein
from pyeed.tools.mmseqs import MMSeqs

Direct Sequence Clustering

You can cluster sequences directly by providing a dictionary of sequences:

# Example sequences (seq4 and seq5 differ by a single residue)
sequence_dict = {
    "seq1": "MGAWQPLIRKSTFNVADCEYLMKQHTGYPWVRESDTAHKLQNVGICFPYQAMTNYLG",
    "seq2": "QVLATPRWDKEFYHMSNCGRILQAVPTKDYFSHGRWMNEKAPTYLQSDVCHAFGKLT",
    "seq3": "DLNQWPKYARHTMGSLVEFACRQNIYTDHKPLWYGVSMEAFDQTCKPRYNLHGIVKT",
    "seq4": "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq5": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
}

# Initialize MMSeqs with custom parameters
mmseqs = MMSeqs(
    min_seq_id=0.8,  # 80% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    cov_mode=0,  # Bidirectional coverage mode
)

# Perform clustering
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
print("Clustering results:")
for cluster in clusters:
    print(f"\nCluster representative: {cluster.representative_id}")
    print(f"Cluster members: {cluster.represented_ids}")

Clustering results:

Cluster representative: seq1
Cluster members: ['seq1']

Cluster representative: seq2
Cluster members: ['seq2']

Cluster representative: seq3
Cluster members: ['seq3']

Cluster representative: seq5
Cluster members: ['seq5', 'seq4']
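
To see why seq4 and seq5 land in the same cluster, a position-wise comparison is enough here, because all example sequences have the same length. The sketch below is an illustration only: `ungapped_identity` is a hypothetical helper, not part of PyEED or MMSeqs2, and MMSeqs2 itself derives identity from an actual alignment.

```python
def ungapped_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical positions between two equal-length sequences.

    Hypothetical helper for illustration; not part of PyEED or MMSeqs2.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Sequences copied from the example dictionary above
seq1 = "MGAWQPLIRKSTFNVADCEYLMKQHTGYPWVRESDTAHKLQNVGICFPYQAMTNYLG"
seq4 = "AKFVMPDRQWHLYTGNECSKQRLYVWFYDGAPILKTQSDNMGAYRCPLFHVTKNWEI"
seq5 = "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI"

identity_45 = ungapped_identity(seq4, seq5)  # near-identical pair
identity_14 = ungapped_identity(seq1, seq4)  # unrelated pair

print(f"seq4 vs seq5: {identity_45:.3f}")
print(f"seq1 vs seq4: {identity_14:.3f}")
```

Only the seq4/seq5 pair exceeds the 0.8 identity threshold used above, which matches the single two-member cluster in the output.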

Clustering from Database

You can also cluster sequences directly from the PyEED database:

# Connect to database
pyeed = Pyeed(uri="bolt://localhost:7687", user="neo4j", password="12345678")

# Get first 100 protein IDs from database
accession_ids = [protein.accession_id for protein in Protein.nodes.all()][:100]

# Cluster sequences
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
print(f"Found {len(clusters)} clusters")

📡 Connected to database.
Found 34 clusters

Customizing MMSeqs Parameters

MMSeqs offers several parameters to control clustering behavior:

mmseqs = MMSeqs(
    # Sequence identity and coverage
    min_seq_id=0.5,  # 50% sequence identity threshold
    coverage=0.8,  # 80% coverage required
    # Coverage mode
    cov_mode=0,  # 0: bidirectional, 1: target, 2: query
    # Performance settings
    threads=4,  # Number of CPU threads
    sensitivity=7.5,  # 1.0 (fast) to 9.0 (sensitive)
    # Clustering behavior
    cluster_mode=0,  # 0: set-cover, 1: connected-component, 2: greedy
    seq_id_mode=0,  # 0: alignment length, 1: shorter sequence
    rescore_mode=0,  # How ungapped diagonal hits are rescored (0: Hamming distance)
)

# The parameters affect both clustering methods
clusters = mmseqs.cluster_sequence_dict(sequence_dict)
# or
clusters = mmseqs.cluster_from_db(accession_ids, pyeed.db)
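
The effect of seq_id_mode is easiest to see with numbers. The sketch below assumes the two modes named in the comment above (identity relative to the alignment length versus relative to the shorter sequence). The function `identity_for_mode` and the alignment values are hypothetical and only illustrate the arithmetic, not how MMSeqs2 computes identity internally.

```python
def identity_for_mode(
    matches: int,
    aln_len: int,
    query_len: int,
    target_len: int,
    seq_id_mode: int,
) -> float:
    """Sequence identity under a given normalization (illustrative sketch)."""
    if seq_id_mode == 0:
        denominator = aln_len  # normalize by alignment length
    elif seq_id_mode == 1:
        denominator = min(query_len, target_len)  # normalize by shorter sequence
    else:
        raise ValueError("this sketch only covers modes 0 and 1")
    return matches / denominator


# Made-up alignment: 45 identical residues over 50 aligned columns,
# between a 60-residue query and a 100-residue target.
id_mode0 = identity_for_mode(45, 50, 60, 100, seq_id_mode=0)
id_mode1 = identity_for_mode(45, 50, 60, 100, seq_id_mode=1)

min_seq_id = 0.8
print(f"mode 0: {id_mode0:.2f}, passes: {id_mode0 >= min_seq_id}")
print(f"mode 1: {id_mode1:.2f}, passes: {id_mode1 >= min_seq_id}")
```

With min_seq_id=0.8, this hypothetical pair would pass under mode 0 but fail under mode 1, so the choice of mode can change cluster membership for partial alignments.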

Understanding Cluster Results

Each cluster is represented by a Cluster object with two attributes:

representative_id: The sequence chosen as the cluster representative
represented_ids: List of all sequences in the cluster (including the representative)

Sequences that satisfy both the min_seq_id and coverage thresholds are grouped into the same cluster.
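
A common follow-up is to look up which representative stands in for a given sequence. The sketch below runs without a database or MMSeqs2: the `Cluster` dataclass is a stand-in that only mirrors the two attributes described above, and the data is copied from the example output earlier in this notebook. With real results, the same comprehension should work on the list returned by the clustering calls, assuming the attributes behave as described.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Stand-in for the PyEED cluster object (assumption: only these two attributes)."""

    representative_id: str
    represented_ids: List[str] = field(default_factory=list)


# Results copied from the dictionary clustering example
clusters = [
    Cluster("seq1", ["seq1"]),
    Cluster("seq2", ["seq2"]),
    Cluster("seq3", ["seq3"]),
    Cluster("seq5", ["seq5", "seq4"]),
]

# Map every member ID to the representative of its cluster
member_to_representative = {
    member_id: cluster.representative_id
    for cluster in clusters
    for member_id in cluster.represented_ids
}

# Clusters that contain only their representative
singletons = [c.representative_id for c in clusters if len(c.represented_ids) == 1]

# Cluster with the most members
largest = max(clusters, key=lambda c: len(c.represented_ids))

print(member_to_representative["seq4"])
print(singletons)
print(largest.representative_id, len(largest.represented_ids))
```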