Mutation Analysis¶
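This notebook shows how to detect mutations between pairs of protein or DNA sequences stored in a PyEED database, using a standard numbering scheme to keep positions comparable.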
In [1]:
import sys
from loguru import logger
from pyeed import Pyeed
from pyeed.analysis.mutation_detection import MutationDetection
from pyeed.analysis.standard_numbering import StandardNumberingTool
logger.remove()
level = logger.add(sys.stderr, level="WARNING")
- Pyeed: Main class for interacting with the PyEED database
- MutationDetection: Class for identifying differences between sequences (protein or DNA)
- StandardNumberingTool: Ensures consistent position numbering across different sequences
In [2]:
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"
eedb = Pyeed(uri, user=user, password=password)
eedb.db.wipe_database(date="2025-03-14")
📡 Connected to database.
All data has been wiped from the database.
- Sets the connection parameters for the Neo4j database at bolt://129.69.129.130:7687
- Creates a Pyeed instance with these credentials
- Wipes existing database data (the date passed to wipe_database is "2025-03-14")
This ensures we're working with a clean database state.
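The cell above hardcodes the credentials. As a minimal sketch of an alternative, the settings could be read from environment variables. The variable names (PYEED_NEO4J_URI, PYEED_NEO4J_USER, PYEED_NEO4J_PASSWORD), the localhost default, and the helper load_connection_settings are assumptions made for this illustration and are not part of PyEED.

```python
import os


def load_connection_settings(env=None):
    """Collect Neo4j connection settings from a mapping of environment variables.

    The variable names and the localhost default are hypothetical; adapt them
    to your own setup.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("PYEED_NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("PYEED_NEO4J_USER", "neo4j"),
        # No default for the password: a missing value raises KeyError.
        "password": env["PYEED_NEO4J_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment.
settings = load_connection_settings({"PYEED_NEO4J_PASSWORD": "secret"})
print(settings["uri"], settings["user"])
```

The resulting values could then be passed on in the same way as in the cell above, for example Pyeed(settings["uri"], user=settings["user"], password=settings["password"]).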
Sequence Retrieval¶
In [3]:
ids = ["AAM15527.1", "AAF05614.1", "AFN21551.1", "CAA76794.1", "AGQ50511.1"]
eedb.fetch_from_primary_db(ids, db="ncbi_protein")
eedb.fetch_dna_entries_for_proteins()
- Defines five protein sequence IDs to analyze
- Fetches these sequences from NCBI's protein database
- Fetches the DNA entries linked to these proteins
- All sequences are beta-lactamase proteins
- The sequences are automatically parsed and stored in the Neo4j database
- Additional metadata like organism information and CDS (Coding Sequence) details are also stored
Apply Standard Numbering¶
In [4]:
sn_protein = StandardNumberingTool(name="test_standard_numbering_protein")
sn_protein.apply_standard_numbering(
    base_sequence_id="AAM15527.1", db=eedb.db, list_of_seq_ids=ids
)
sn_dna = StandardNumberingTool(name="test_standard_numbering_dna")
sn_dna.apply_standard_numbering(
    base_sequence_id="AF190695.1", db=eedb.db, node_type="DNA"
)
- Creates two StandardNumberingTool instances, named "test_standard_numbering_protein" and "test_standard_numbering_dna"
- Uses AAM15527.1 as the reference sequence for protein numbering and AF190695.1 for DNA numbering
- Performs multiple sequence alignment (MSA) using CLUSTAL
- In CLUSTAL alignment output:
  - Asterisks (*) indicate identical residues
  - Colons (:) indicate conserved substitutions
  - Periods (.) indicate semi-conserved substitutions
- This step is crucial for ensuring mutations are correctly identified relative to consistent positions
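To illustrate the idea of numbering positions relative to a reference, here is a minimal sketch that labels the residues of one aligned sequence using the aligned reference. It is not PyEED's implementation: the function name standard_numbering and the "position.insertion" label format (for example "2.1") are assumptions chosen for this example.

```python
def standard_numbering(aligned_ref: str, aligned_seq: str, gap: str = "-") -> list[str]:
    """Label each residue of aligned_seq with a position relative to aligned_ref.

    Residues aligned to a reference residue get that residue's 1-based index.
    Residues aligned to a gap in the reference are treated as insertions and
    get the label "<previous reference position>.<insertion count>".
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    ref_pos = 0    # index of the last reference residue seen
    insertion = 0  # number of inserted residues since ref_pos
    for ref_char, seq_char in zip(aligned_ref, aligned_seq):
        if ref_char != gap:
            ref_pos += 1
            insertion = 0
            if seq_char != gap:
                labels.append(str(ref_pos))
        elif seq_char != gap:
            insertion += 1
            labels.append(f"{ref_pos}.{insertion}")
    return labels


# The query has an extra "L" after reference position 2 and lacks position 4.
print(standard_numbering("MK-TA", "MKLT-"))  # ['1', '2', '2.1', '3']
```

Because every sequence is labelled against the same reference, two residues with the same label can be compared directly even when their raw indices differ.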
Mutation Detection¶
In [5]:
md = MutationDetection()
seq1 = "AAM15527.1"
seq2 = "AAF05614.1"
name_of_standard_numbering_tool = "test_standard_numbering_protein"
mutations_protein = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool
)
In [6]:
md = MutationDetection()
seq1 = "AF190695.1"
seq2 = "JX042489.1"
name_of_standard_numbering_tool = "test_standard_numbering_dna"
mutations_dna = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool, node_type="DNA"
)
- Creates a MutationDetection instance
- Compares the two sequences using the standard numbering scheme
- Identifies all positions where the monomers differ: amino acids for proteins, nucleotides when node_type="DNA"
- Automatically saves the mutations to the database
- Returns a dictionary containing mutation information
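The comparison step can be sketched in plain Python. This is an illustration under stated assumptions, not the MutationDetection implementation: the helper find_mutations and the label lists are hypothetical, and the sketch assumes each sequence comes with one standard-numbering label per monomer. It returns a dictionary with the same four keys as the result printed in this notebook.

```python
def find_mutations(seq1, labels1, seq2, labels2):
    """Compare two sequences at positions that share a standard-numbering label.

    Returns 1-based raw positions in each sequence plus the differing monomers.
    """
    index_in_seq2 = {label: i for i, label in enumerate(labels2)}
    result = {
        "from_positions": [],
        "to_positions": [],
        "from_monomers": [],
        "to_monomers": [],
    }
    for i, label in enumerate(labels1):
        j = index_in_seq2.get(label)
        if j is None:
            continue  # position missing from seq2, so it is not a substitution
        if seq1[i] != seq2[j]:
            result["from_positions"].append(i + 1)
            result["to_positions"].append(j + 1)
            result["from_monomers"].append(seq1[i])
            result["to_monomers"].append(seq2[j])
    return result


# seq2 lacks the first residue of seq1, so its raw positions are shifted by one.
example = find_mutations(
    "MKTAR", ["1", "2", "3", "4", "5"],
    "KSAR", ["2", "3", "4", "5"],
)
print(example)
```

In this toy example the only difference is T to S at standard position 3, which is raw position 3 in the first sequence and raw position 2 in the second.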
Results¶
In [7]:
print(mutations_protein)
{'from_positions': [241, 272, 125], 'to_positions': [241, 272, 125], 'from_monomers': ['R', 'D', 'V'], 'to_monomers': ['S', 'N', 'I']}
Outputs a detailed mutation map showing:
- from_positions: [241, 272, 125] - Where mutations occur in the first sequence
- to_positions: [241, 272, 125] - Corresponding positions in the second sequence
- from_monomers: ['R', 'D', 'V'] - Original amino acids
- to_monomers: ['S', 'N', 'I'] - Mutated amino acids
This means we found three mutations:
- Position 241: Arginine (R) → Serine (S)
- Position 272: Aspartic acid (D) → Asparagine (N)
- Position 125: Valine (V) → Isoleucine (I)
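For reporting, the printed dictionary can be turned into the conventional substitution notation (original residue, position, new residue). The sketch below copies the dictionary from the output above; the helper name to_mutation_labels is an assumption for this example and is not a PyEED function.

```python
# Result dictionary as printed by the cell above.
mutations_protein = {
    "from_positions": [241, 272, 125],
    "to_positions": [241, 272, 125],
    "from_monomers": ["R", "D", "V"],
    "to_monomers": ["S", "N", "I"],
}


def to_mutation_labels(mutations):
    """Format each mutation as <from><position><to>, sorted by position."""
    rows = zip(
        mutations["from_positions"],
        mutations["from_monomers"],
        mutations["to_monomers"],
    )
    return [f"{old}{pos}{new}" for pos, old, new in sorted(rows)]


labels = to_mutation_labels(mutations_protein)
print(labels)  # ['V125I', 'R241S', 'D272N']
```

Sorting by position is a presentation choice here, since the returned lists are not ordered by position.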
In [8]:
for i in range(len(mutations_dna['from_positions'])):
    print(f"Mutation on position {mutations_dna['from_positions'][i]} -> {mutations_dna['to_positions'][i]} with a nucleotide change of {mutations_dna['from_monomers'][i]} -> {mutations_dna['to_monomers'][i]}")
Mutation on position 682 -> 615 with a nucleotide change of T -> C
Mutation on position 407 -> 340 with a nucleotide change of C -> A
Mutation on position 92 -> 25 with a nucleotide change of C -> A
Mutation on position 162 -> 95 with a nucleotide change of G -> T
Mutation on position 929 -> 862 with a nucleotide change of A -> C
Mutation on position 346 -> 279 with a nucleotide change of A -> G
Mutation on position 87 -> 20 with a nucleotide change of C -> A
Mutation on position 88 -> 21 with a nucleotide change of T -> C
Mutation on position 130 -> 63 with a nucleotide change of C -> T
Mutation on position 175 -> 108 with a nucleotide change of G -> A
Mutation on position 131 -> 64 with a nucleotide change of T -> C
Mutation on position 132 -> 65 with a nucleotide change of A -> T
Mutation on position 914 -> 847 with a nucleotide change of G -> A
Mutation on position 604 -> 537 with a nucleotide change of T -> G
Mutation on position 925 -> 858 with a nucleotide change of G -> A
Mutation on position 226 -> 159 with a nucleotide change of T -> C
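The DNA output can be summarized further. The sketch below copies the 16 mutations from the output above, checks whether the offset between positions in the two sequences is constant, and counts transitions (purine to purine or pyrimidine to pyrimidine) versus transversions. The helper substitution_type is written for this example and is not part of PyEED.

```python
from collections import Counter

# (from_position, to_position, from_nucleotide, to_nucleotide), copied from the output above.
dna_mutations = [
    (682, 615, "T", "C"), (407, 340, "C", "A"), (92, 25, "C", "A"), (162, 95, "G", "T"),
    (929, 862, "A", "C"), (346, 279, "A", "G"), (87, 20, "C", "A"), (88, 21, "T", "C"),
    (130, 63, "C", "T"), (175, 108, "G", "A"), (131, 64, "T", "C"), (132, 65, "A", "T"),
    (914, 847, "G", "A"), (604, 537, "T", "G"), (925, 858, "G", "A"), (226, 159, "T", "C"),
]

PURINES = {"A", "G"}


def substitution_type(old: str, new: str) -> str:
    """Classify a substitution: same base class is a transition, otherwise a transversion."""
    same_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_class else "transversion"


# Difference between the position in the first and in the second sequence.
offsets = {src - dst for src, dst, _, _ in dna_mutations}
counts = Counter(substitution_type(old, new) for _, _, old, new in dna_mutations)

print(offsets)       # {67}
print(dict(counts))  # {'transition': 9, 'transversion': 7}
```

Every mutation shows the same offset of 67 between the two sequences, which suggests that the reported positions are raw indices within each sequence rather than shared standard-numbering labels.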