Protein Embedding Analysis Pipeline¶
This notebook demonstrates how to fetch protein data, calculate embeddings, and visualize protein relationships using t-SNE.
In this end-to-end pipeline we fetch protein sequences from the NCBI protein database into a Neo4j graph database, compute high-dimensional embeddings using a pre-trained model, and then project these embeddings into two dimensions with t-SNE. The final visualization provides insight into the relationships among proteins based on their sequence properties.
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from loguru import logger
from pyeed import Pyeed
from pyeed.analysis.embedding_analysis import EmbeddingTool
logger.remove()
level = logger.add(sys.stderr, level="ERROR")
2025-05-29 12:00:51.520 | INFO | pyeed.embeddings.processor:_initialize_devices:44 - Initialized 3 GPU device(s): [device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2)]
Database Connection and Setup¶
Now we'll connect to our Neo4j database and prepare it for new data by:
- Establishing a connection
- Wiping any existing data
- Removing old constraints
uri = "bolt://129.69.129.130:7688"
user = "neo4j"
password = "12345678"
eedb = Pyeed(uri, user=user, password=password)
eedb.db.wipe_database(date='2025-05-29')
Pyeed Graph Object Mapping constraints not defined. Use _install_labels() to set up model constraints. 📡 Connected to database. All data has been wiped from the database.
et = EmbeddingTool()
Fetching Protein Data¶
We'll now fetch protein sequences from the NCBI protein database using a list of protein IDs.
# load the example protein IDs and their phenotype annotations
df = pd.read_csv("resources/data_example.csv", delimiter=";")
print("The data has the following columns:")
print(df.columns)
# create a dict with protein_id_database as key and phenotype as value
dict_data = dict(zip(df["protein_id_database"], df["phenotype"]))
The data has the following columns:
Index(['protein_name', 'phenotype', 'protein_id', 'protein_id_database'], dtype='object')
# now fetch all of these proteins from the NCBI protein database
eedb.fetch_from_primary_db(df["protein_id_database"].tolist(), db="ncbi_protein")
Calculate Sequence Embeddings¶
Now we'll convert our protein sequences into numerical representations (embeddings) using a pre-trained model.
eedb.calculate_sequence_embeddings(model_name="esmc_300m", embedding_type="final_embeddings", num_gpus=3)
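Once the embeddings have been computed, it can be worth confirming that they were actually written to the graph. Below is a minimal, optional check, assuming the same execute_read helper and p.embedding property used later in this notebook.
# optional sanity check: count Protein nodes that now carry an embedding
query = """
MATCH (p:Protein)
WHERE p.embedding IS NOT NULL
RETURN count(p) AS proteins_with_embedding
"""
print(eedb.db.execute_read(query))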
Visualization with t-SNE¶
Finally, we'll project our high-dimensional embeddings into 2D space using t-SNE and create a visualization. The resulting plot will show how proteins relate to each other in terms of their sequence properties.
# check the dimensionality of the stored embeddings:
# fetch a few proteins and inspect one of them (any protein will do)
query = "MATCH (p:Protein) RETURN p.embedding LIMIT 20"
result = eedb.db.execute_read(query)
dimensions = np.array(result[10]['p.embedding']).shape
print(dimensions)
(960,)
# project the protein embeddings into 2D with t-SNE, using the phenotype dict as labels
embedding_tool = EmbeddingTool()
protein_ids, embeddings_2d, labels, colors = (
embedding_tool.calculate_2d_projection_tsne(
db=eedb.db,
ids_list=df["protein_id_database"].tolist(),
perplexity=1,
n_iter=3000,
ids_list_labels=dict_data,
)
)
print(len(protein_ids))
labels_already_plotted = []
# plot every protein, adding a legend entry only the first time each label appears
for i in range(len(protein_ids)):
if labels[i] not in labels_already_plotted:
plt.scatter(
embeddings_2d[i, 0],
embeddings_2d[i, 1],
alpha=0.7,
s=50,
edgecolor="k",
color=colors[i],
label=labels[i],
)
labels_already_plotted.append(labels[i])
else:
plt.scatter(
embeddings_2d[i, 0],
embeddings_2d[i, 1],
alpha=0.7,
s=50,
edgecolor="k",
color=colors[i],
)
plt.title("2D t-SNE Visualization of Protein Embeddings")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.legend()
plt.tight_layout()
plt.show()
68
Protein Embedding Search Methods¶
There are two main approaches implemented for searching similar proteins using embeddings:
1. Simple Search Method¶
The simple method (find_closest_matches_simple) loads all embeddings into memory to calculate distances. While straightforward, it's best suited for smaller datasets:
# the simple approach loads all protein embeddings into memory to compute distances,
# so it is not well suited to large datasets
# for small datasets like this one it works fine
results = et.find_closest_matches_simple(
start_sequence_id=df["protein_id_database"].tolist()[0],
db=eedb.db,
metric="cosine",
n=10,
)
print(f"Resulst for index {df['protein_id_database'].tolist()[0]} are:")
print(results)
Results for index AAP20891.1 are: [('AAP20891.1', 0.0), ('AGQ50511.1', 0.00016200621801287785), ('ABB97007.1', 0.0001810048295400879), ('AFN21551.1', 0.00018909362988450695), ('CAC67290.1', 0.00021654775310264718), ('ADL13944.1', 0.0002567003210336427), ('AAK30619.1', 0.0002616398020808264), ('AAL29433.1', 0.0002646931927183793), ('ACJ43254.1', 0.0002669990760338914), ('ACB22021.1', 0.0002755243601859636)]
Advantages:¶
- Simple implementation
- Works well for small datasets
- Flexible distance metrics
Limitations:¶
- Memory intensive for large datasets
- Slower for large-scale searches
- Not scalable for production use with big databases
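For intuition, here is a rough sketch of what such an in-memory search boils down to once all embeddings have been pulled from the database: compute cosine distances against the query vector and sort. This is only an illustration of the idea, not pyeed's actual implementation.
import numpy as np
from scipy.spatial.distance import cdist

def closest_matches_in_memory(query_id, ids, embeddings, n=10):
    # ids: list of protein accessions; embeddings: (N, 960) array already
    # loaded from the database; returns the n smallest cosine distances
    query_vec = embeddings[ids.index(query_id)][None, :]
    distances = cdist(query_vec, embeddings, metric="cosine")[0]
    order = np.argsort(distances)[:n]
    return [(ids[i], float(distances[i])) for i in order]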
2. Vector Index Search¶
The vector index method uses Neo4j's native vector indexing capabilities (HNSW algorithm) for efficient similarity search. This is the recommended approach for larger datasets and production use.
Setting up the Vector Index:¶
# the more scalable approach involves creating a vector index over the embeddings
# the vector index math and implementation are provided by Neo4j
# here we drop the vector index
et.drop_vector_index(
db=eedb.db,
index_name="vector_index_Protein_embedding",
)
# here we create the vector index
et.create_embedding_vector_index_neo4j(
db=eedb.db,
index_name="vector_index_Protein_embedding",
similarity_function="cosine",
dimensions=960,
m=16,
ef_construction=512,
)
Parameters Explained:¶
dimensions: The size of your embedding vectors
- Use 960 for the ESM-C model
- Use 1280 for the ESM-2 model

m (default=16): Maximum number of connections per node in the HNSW graph
- Higher values = better accuracy but slower indexing
- Lower values = faster indexing but potentially less accurate
- Recommended range: 5-100

ef_construction (default=512): Controls index quality during construction
- Higher values = better quality but slower indexing
- Lower values = faster indexing but potentially lower quality
- Recommended range: 100-1000
Searching with the Vector Index:¶
# here we use the vector index to find the closest matches
results = et.find_nearest_neighbors_based_on_vector_index(
db=eedb.db,
query_id=df["protein_id_database"].tolist()[0],
index_name="vector_index_Protein_embedding",
number_of_neighbors=10,
)
print(results)
[('AAP20891.1', 1.0), ('AGQ50511.1', 0.9999189376831055), ('ABB97007.1', 0.999909520149231), ('AFN21551.1', 0.9999054670333862), ('CAC67290.1', 0.9998918771743774), ('ADL13944.1', 0.9998717904090881), ('AAK30619.1', 0.9998692274093628), ('AAL29433.1', 0.9998676776885986), ('ACJ43254.1', 0.9998666048049927), ('CBX53726.1', 0.9998624920845032)]
Advantages of Vector Index:¶
- Highly efficient for large datasets
- Optimized for high-dimensional data
- Scales well with database size
- Maintains good performance with growing data
Important Notes:¶
The vector index needs time to populate after creation. The code automatically waits and shows a progress bar during population.
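If you prefer to check the population status yourself instead of relying on the progress bar, Neo4j exposes it through SHOW INDEXES. The sketch below assumes a recent Neo4j 5 release; field names and transaction handling may differ on older versions.
# inspect the state and population progress of the vector index
# note: depending on the Neo4j/driver version, SHOW commands may have to be
# run in an auto-commit session rather than a managed read transaction
query = """
SHOW INDEXES
YIELD name, type, state, populationPercent
WHERE name = 'vector_index_Protein_embedding'
RETURN name, type, state, populationPercent
"""
print(eedb.db.execute_read(query))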
The similarity scores returned by the vector index search are:
- For cosine similarity: scores range from 0 to 1, where 1 is most similar (Neo4j normalizes the raw cosine value, which lies in [-1, 1], into this range)
- For euclidean similarity: Lower values indicate more similarity
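Note that the two result lists above are on different scales: the simple search reports cosine distances (values near 0 mean nearly identical), while the vector index reports similarity scores (values near 1 mean nearly identical). Assuming the cosine score is normalized as (1 + cos) / 2, which is consistent with the numbers shown above, the scores can be converted back to cosine distances like this:
# convert vector-index cosine scores back to cosine distances
# (assumes score = (1 + cos) / 2; compare with the find_closest_matches_simple output above)
for protein_id, score in results:
    cosine_distance = 1 - (2 * score - 1)
    print(f"{protein_id}: score={score:.6f} -> cosine distance={cosine_distance:.6f}")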
The index is persistent in the database and only needs to be created once, unless you want to change its parameters.
Memory usage is optimized as the index handles the heavy lifting instead of loading all embeddings into memory.
For most production use cases, the vector index method is recommended as it provides the best balance of performance and scalability.