Protein Embedding Analysis Pipeline¶
This notebook demonstrates an end-to-end pipeline for protein embedding analysis. We connect to a Neo4j database, fetch protein sequences from the NCBI protein database into it, compute high-dimensional embeddings with a pre-trained model, and project these embeddings into two dimensions using t-SNE. The final visualization shows how the proteins relate to each other based on their sequence properties.
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from loguru import logger
from pyeed import Pyeed
from pyeed.analysis.embedding_analysis import EmbeddingTool
logger.remove()
level = logger.add(sys.stderr, level="INFO")
Database Connection and Setup¶
Now we'll connect to our Neo4j database and prepare it for new data by:
- Establishing a connection
- Wiping any existing data
- Removing old constraints
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"
eedb = Pyeed(uri, user=user, password=password)
eedb.db.wipe_database(date='2025-03-26')
📡 Connected to database. All data has been wiped from the database.
et = EmbeddingTool()
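Hard-coding credentials in a notebook is convenient for a demo but easy to leak. A minimal sketch of reading them from environment variables instead is shown below. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are assumptions for illustration; pyeed does not define them.

```python
import os


def load_neo4j_settings(env=None):
    """Return (uri, user, password) read from an environment-style mapping."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")  # assumed default
    user = env.get("NEO4J_USER", "neo4j")  # assumed default
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise ValueError("NEO4J_PASSWORD is not set")
    return uri, user, password


# An explicit mapping is used here so the sketch runs without a real environment.
uri, user, password = load_neo4j_settings(
    {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_PASSWORD": "change-me"}
)
print(uri, user)
```

The returned values can then be passed to `Pyeed(uri, user=user, password=password)` exactly as in the cell above.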
Fetching Protein Data¶
We'll now fetch protein sequences from the NCBI protein database using a list of protein IDs.
# these are example ids
df = pd.read_csv("resources/data_example.csv", delimiter=";")
print("The data has the following columns:")
print(df.columns)
# create a dict with protein_id_database as key and phenotype as value
dict_data = dict(zip(df["protein_id_database"], df["phenotype"]))
The data has the following columns: Index(['protein_name', 'phenotype', 'protein_id', 'protein_id_database'], dtype='object')
# now fetch all of the proteins from the NCBI protein database
eedb.fetch_from_primary_db(df["protein_id_database"].tolist(), db="ncbi_protein")
2025-03-26 11:37:31.838 | INFO | pyeed.main:fetch_from_primary_db:87 - Found 0 sequences in the database. 2025-03-26 11:37:31.839 | INFO | pyeed.main:fetch_from_primary_db:89 - Fetching 68 sequences from ncbi_protein. 2025-03-26 11:37:31.880 | INFO | pyeed.adapter.primary_db_adapter:execute_requests:140 - Starting requests for 7 batches. 2025-03-26 11:37:32.848 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAP20891.1 in database 2025-03-26 11:37:32.891 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAJ85677.1 in database 2025-03-26 11:37:32.937 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein SAQ02853.1 in database 2025-03-26 11:37:32.957 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CDR98216.1 in database 2025-03-26 11:37:33.001 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein WP_109963600.1 in database 2025-03-26 11:37:33.050 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA41038.1 in database 2025-03-26 11:37:33.068 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein WP_109874025.1 in database 2025-03-26 11:37:33.087 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA46344.1 in database 2025-03-26 11:37:33.107 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein APG33178.1 in database 2025-03-26 11:37:33.159 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AKC98298.1 in database 2025-03-26 11:37:33.212 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein KJO56189.1 in database 2025-03-26 11:37:33.238 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein KLP91446.1 in database 2025-03-26 11:37:33.263 | INFO | 
pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA46346.1 in database 2025-03-26 11:37:33.287 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA74912.2 in database 2025-03-26 11:37:33.311 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AFN21551.1 in database 2025-03-26 11:37:33.334 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein ACB22021.1 in database 2025-03-26 11:37:33.362 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA76794.1 in database 2025-03-26 11:37:33.385 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA76795.1 in database 2025-03-26 11:37:33.440 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CCG28759.1 in database 2025-03-26 11:37:33.464 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein KLG19745.1 in database 2025-03-26 11:37:33.980 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAC32891.1 in database 2025-03-26 11:37:34.008 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA76796.1 in database 2025-03-26 11:37:34.032 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAD24670.1 in database 2025-03-26 11:37:34.055 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein ARF45649.1 in database 2025-03-26 11:37:34.079 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CTA52364.1 in database 2025-03-26 11:37:34.102 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein ADL13944.1 in database 2025-03-26 11:37:34.127 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AGQ50511.1 in database 2025-03-26 11:37:34.152 | INFO | 
pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AKA60778.1 in database 2025-03-26 11:37:34.177 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein APT65830.1 in database 2025-03-26 11:37:34.229 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein HAH6232254.1 in database 2025-03-26 11:37:34.263 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein QDO66746.1 in database 2025-03-26 11:37:34.288 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CBX53726.1 in database 2025-03-26 11:37:34.312 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAC32889.2 in database 2025-03-26 11:37:34.337 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA64682.1 in database 2025-03-26 11:37:34.361 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA71322.1 in database 2025-03-26 11:37:34.386 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA71323.1 in database 2025-03-26 11:37:34.409 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAA71324.1 in database 2025-03-26 11:37:34.433 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AEC32455.1 in database 2025-03-26 11:37:34.456 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAD22538.1 in database 2025-03-26 11:37:34.479 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAD22539.1 in database 2025-03-26 11:37:34.997 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein ABB97007.1 in database 2025-03-26 11:37:35.021 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein ACJ43254.1 in database 2025-03-26 11:37:35.046 | INFO | 
pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAC05975.1 in database 2025-03-26 11:37:35.069 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein BCD58813.1 in database 2025-03-26 11:37:35.093 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAK17194.1 in database 2025-03-26 11:37:35.126 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAD33116.2 in database 2025-03-26 11:37:35.150 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAB92324.1 in database 2025-03-26 11:37:35.175 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAL03985.1 in database 2025-03-26 11:37:35.200 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF19151.1 in database 2025-03-26 11:37:35.224 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF05613.1 in database 2025-03-26 11:37:35.257 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF05614.1 in database 2025-03-26 11:37:35.282 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF05612.1 in database 2025-03-26 11:37:35.307 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF05611.1 in database 2025-03-26 11:37:35.330 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAM15527.1 in database 2025-03-26 11:37:35.354 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAL29433.1 in database 2025-03-26 11:37:35.378 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAL29434.1 in database 2025-03-26 11:37:35.403 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAL29435.1 in database 2025-03-26 11:37:35.427 | INFO | 
pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAL29436.1 in database 2025-03-26 11:37:35.451 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAC43229.1 in database 2025-03-26 11:37:35.475 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAC43230.1 in database 2025-03-26 11:37:35.893 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAG44570.1 in database 2025-03-26 11:37:35.911 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAK14792.1 in database 2025-03-26 11:37:35.928 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAK30619.1 in database 2025-03-26 11:37:35.946 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein BAB16308.1 in database 2025-03-26 11:37:35.964 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF66653.1 in database 2025-03-26 11:37:35.983 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAC85660.1 in database 2025-03-26 11:37:36.004 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAC85661.1 in database 2025-03-26 11:37:36.025 | INFO | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein CAC67290.1 in database
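The log reports 68 sequences fetched in 7 batches. That count is consistent with requests being grouped into batches of 10 IDs, although the batch size pyeed actually uses is not shown in this notebook and is an assumption in the sketch below. The helper `chunk_ids` is hypothetical and only illustrates the batching idea.

```python
def chunk_ids(ids, batch_size):
    """Split a list of IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# 68 placeholder IDs stand in for df["protein_id_database"].tolist()
example_ids = [f"ID_{i}" for i in range(68)]
batches = chunk_ids(example_ids, batch_size=10)  # batch size of 10 is an assumption
print(len(batches), len(batches[-1]))  # 7 8
```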
Calculate Sequence Embeddings¶
Now we'll convert our protein sequences into numerical representations (embeddings) using a pre-trained model.
eedb.calculate_sequence_embeddings(model_name="facebook/esm2_t36_3B_UR50D")
/home/nab/anaconda3/envs/pyeed_niklas_env/lib/python3.10/site-packages/transformers/modeling_utils.py:3437: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead. warnings.warn( Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 3.69it/s] Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t36_3B_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. /home/nab/anaconda3/envs/pyeed_niklas_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1899: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead. warnings.warn( Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
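A per-sequence embedding is typically derived from the per-residue hidden states of the language model. How pyeed pools them is not shown in this notebook. The sketch below illustrates one common choice, masked mean pooling, in plain Python with toy numbers; `mean_pool` is a hypothetical helper and not necessarily what `calculate_sequence_embeddings` does internally.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: three real tokens and one padding token, 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

Whatever the pooling strategy, the result is one fixed-length vector per protein, which is why every stored embedding has the same dimensionality regardless of sequence length.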
Visualization with t-SNE¶
Finally, we'll project our high-dimensional embeddings into 2D space using t-SNE and create a visualization. The resulting plot will show how proteins relate to each other in terms of their sequence properties.
# determine the embedding dimensionality by reading the embedding of a single protein
# any protein will do
query = "MATCH (p:Protein) RETURN p.embedding LIMIT 1"
result = eedb.db.execute_read(query)
dimensions = np.array(result[0]['p.embedding']).shape
print(dimensions)
(2560,)
# get all of the data from embedding
embedding_tool = EmbeddingTool()
protein_ids, embeddings_2d, labels, colors = (
    embedding_tool.calculate_2d_projection_tsne(
        db=eedb.db,
        ids_list=df["protein_id_database"].tolist(),
        perplexity=1,
        n_iter=3000,
        ids_list_labels=dict_data,
    )
)
print(len(protein_ids))
labels_already_plotted = []
# create a scatter plot for each unique label
for i in range(len(protein_ids)):
    if labels[i] not in labels_already_plotted:
        plt.scatter(
            embeddings_2d[i, 0],
            embeddings_2d[i, 1],
            alpha=0.7,
            s=50,
            edgecolor="k",
            color=colors[i],
            label=labels[i],
        )
        labels_already_plotted.append(labels[i])
    else:
        plt.scatter(
            embeddings_2d[i, 0],
            embeddings_2d[i, 1],
            alpha=0.7,
            s=50,
            edgecolor="k",
            color=colors[i],
        )
plt.title("2D t-SNE Visualization of Protein Embeddings")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.legend()
plt.tight_layout()
plt.show()
68
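The plotting loop above draws one point per call and tracks which labels already have a legend entry. An alternative is to group the points by label first and call `plt.scatter` once per group. The helper below is a hypothetical sketch (`group_points_by_label` is not part of pyeed) and uses plain lists so that it runs without the database.

```python
from collections import defaultdict


def group_points_by_label(points, labels, colors):
    """Group 2D points by label, keeping the first color seen for each label."""
    groups = defaultdict(lambda: {"x": [], "y": [], "color": None})
    for (x, y), label, color in zip(points, labels, colors):
        group = groups[label]
        group["x"].append(x)
        group["y"].append(y)
        if group["color"] is None:
            group["color"] = color
    return dict(groups)


# Toy stand-ins for embeddings_2d, labels, and colors.
points = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
labels = ["A", "B", "A"]
colors = ["red", "blue", "red"]
groups = group_points_by_label(points, labels, colors)
print(sorted(groups))  # ['A', 'B']
```

Each group can then be drawn with a single `plt.scatter(group["x"], group["y"], color=group["color"], label=label)` call, which produces one legend entry per label without extra bookkeeping.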
Protein Embedding Search Methods¶
There are two main approaches implemented for searching similar proteins using embeddings:
1. Simple Search Method¶
The simple method (find_closest_matches_simple) loads all embeddings into memory to calculate distances. While straightforward, it's best suited for smaller datasets:
# the simple way to find the closest matches loads all proteins into memory,
# so it is not the best choice for large datasets
# on very small datasets it is still fine and easy to use
results = et.find_closest_matches_simple(
    start_sequence_id=df["protein_id_database"].tolist()[0],
    db=eedb.db,
    metric="cosine",
    n=10,
)
print(f"Results for index {df['protein_id_database'].tolist()[0]} are:")
print(results)
Results for index AAP20891.1 are: [('AAP20891.1', 0.0), ('ADL13944.1', 1.2696941380951898e-05), ('AGQ50511.1', 2.3084859425925863e-05), ('CBX53726.1', 2.3443578533011156e-05), ('AAL29433.1', 3.0809776502382924e-05), ('CAA76796.1', 3.2400445545976986e-05), ('CAC67290.1', 4.856582147116928e-05), ('AFN21551.1', 4.953471590429803e-05), ('CAA74912.2', 5.021707417551813e-05), ('CTA52364.1', 6.113568903631794e-05)]
Advantages:¶
- Simple implementation
- Works well for small datasets
- Flexible distance metrics
Limitations:¶
- Memory intensive for large datasets
- Slower for large-scale searches
- Not scalable for production use with big databases
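Conceptually, a brute-force search of this kind computes the distance from the query embedding to every other embedding and sorts the results. The sketch below shows that idea with cosine distance on toy vectors. It is not pyeed's implementation, and the helper names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def closest_matches(query_id, embeddings, n=10):
    """Rank all IDs by cosine distance to the query embedding (smallest first)."""
    query = embeddings[query_id]
    distances = [(pid, cosine_distance(query, vec)) for pid, vec in embeddings.items()]
    return sorted(distances, key=lambda item: item[1])[:n]


# Toy 2-dimensional embeddings keyed by made-up protein IDs.
toy = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
ranking = closest_matches("P1", toy, n=3)
print([pid for pid, _ in ranking])  # ['P1', 'P3', 'P2']
```

The query itself always comes first with distance 0, which matches the first entry in the output above. Every call touches every embedding, so the cost grows linearly with the number of proteins.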
2. Vector Index Search¶
The vector index method uses Neo4j's native vector indexing capabilities (HNSW algorithm) for efficient similarity search. This is the recommended approach for larger datasets and production use.
Setting up the Vector Index:¶
# the more advanced approach involves creating a vector index over the embeddings
# the vector index math and implementation are provided by Neo4j
# here we drop any existing vector index
et.drop_vector_index(
    db=eedb.db,
    index_name="vector_index_Protein_embedding",
)
# here we create the vector index
# dimensions must match the embedding size: 2560 for facebook/esm2_t36_3B_UR50D (see the shape printed earlier)
et.create_embedding_vector_index_neo4j(
    db=eedb.db,
    index_name="vector_index_Protein_embedding",
    similarity_function="cosine",
    dimensions=2560,
    m=16,
    ef_construction=512,
)
Parameters Explained:¶
- dimensions: The size of your embedding vectors. It must match the model used to compute the embeddings.
  - Use 960 for ESM-C model
  - Use 1280 for the ESM-2 650M model
  - Use 2560 for the ESM-2 3B model (facebook/esm2_t36_3B_UR50D) used in this notebook
- m (default=16): Maximum number of connections per node in the HNSW graph
  - Higher values = better accuracy but slower indexing
  - Lower values = faster indexing but potentially less accurate
  - Recommended range: 5-100
- ef_construction (default=512): Controls index quality during construction
  - Higher values = better quality but slower indexing
  - Lower values = faster indexing but potentially lower quality
  - Recommended range: 100-1000
Searching with the Vector Index:¶
# here we use the vector index to find the closest matches
results = et.find_nearest_neighbors_based_on_vector_index(
    db=eedb.db,
    query_protein_id=df["protein_id_database"].tolist()[0],
    index_name="vector_index_Protein_embedding",
    number_of_neighbors=10,
)
print(results)
If the index dimensions do not match the stored embeddings, the search fails. For example, with an index created with dimensions=960 and the 2560-dimensional embeddings computed above, Neo4j raises:
ClientError: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure `db.index.vector.queryNodes`: Caused by: java.lang.IllegalArgumentException: Index query vector has 2560 dimensions, but indexed vectors have 960.}
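A vector index only accepts query vectors whose length equals its configured `dimensions`, so it can help to read the embedding size from the database (as done earlier with `np.array(...).shape`) and check it before creating or querying the index. `check_dimensions` below is a hypothetical helper, not part of pyeed, and the list of zeros stands in for one stored embedding.

```python
def check_dimensions(embedding, index_dimensions):
    """Raise early if an embedding does not match the configured index size."""
    actual = len(embedding)
    if actual != index_dimensions:
        raise ValueError(
            f"Embedding has {actual} dimensions, but the index expects {index_dimensions}."
        )
    return actual


embedding = [0.0] * 2560  # stand-in for one p.embedding value read from the database
print(check_dimensions(embedding, index_dimensions=2560))  # 2560
```

Calling the same helper with `index_dimensions=960` raises a `ValueError` immediately, which is easier to diagnose than a failed procedure call inside the database.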
Advantages of Vector Index:¶
- Highly efficient for large datasets
- Optimized for high-dimensional data
- Scales well with database size
- Maintains good performance with growing data
Important Notes:¶
The vector index needs time to populate after creation. The code automatically waits and shows a progress bar during population.
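The similarity scores returned by the vector index search are normalized by Neo4j to the range [0, 1], where higher values indicate more similar vectors:
- For cosine similarity: the score is (1 + cosine similarity) / 2, so 1 means the vectors point in the same direction
- For euclidean similarity: the score is 1 / (1 + squared distance), so 1 means the vectors are identical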
The index is persistent in the database and only needs to be created once, unless you want to change its parameters.
Memory usage is optimized as the index handles the heavy lifting instead of loading all embeddings into memory.
For most production use cases, the vector index method is recommended as it provides the best balance of performance and scalability.
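The simple search reports cosine distances (0 means identical direction), whereas the vector index reports similarity scores. Assuming the index score follows the normalization described in the Neo4j documentation for cosine indexes, score = (1 + cosine similarity) / 2, results from the two methods can be compared with a small conversion. Treat this as a sketch and verify the normalization for your Neo4j version; `cosine_score_to_distance` is a hypothetical helper.

```python
def cosine_score_to_distance(score):
    """Convert a normalized cosine score in [0, 1] to a cosine distance in [0, 2]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    cosine_similarity = 2.0 * score - 1.0  # undo the assumed (1 + cos) / 2 normalization
    return 1.0 - cosine_similarity


print(cosine_score_to_distance(1.0))  # 0.0
print(cosine_score_to_distance(0.5))  # 1.0
```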