Evaluating Local Embedding Models for Buffaly Semantic Retrieval

Matt Furnari, CTO
6/9/2026

Buffaly uses semantic retrieval to turn natural-language directives into concrete actions and entities. That sounds simple until you look at what the system is actually searching over: action phrases, entity labels, prototype metadata, tool descriptions, and the history of what agents selected during real turns.

We wanted to know whether a local embedding model running on an Apple Silicon Mac Mini could replace OpenAI embeddings for that short action/entity retrieval workload. The answer, after several rounds of testing, is that mxbai-embed-large is a credible local candidate. In the strongest benchmark we ran, it placed 293 of 300 selected action targets in the top 5% of the candidate corpus, and all 72 filtered selected entity targets in the top 20%.

The more interesting part of the work was not getting a local model to return vectors. That was relatively straightforward. The harder problem was designing a benchmark that made sense for a live agent system whose ontology and tool catalog are constantly changing.

This post walks through the evaluation, the false starts, the final selected-label benchmark, and the database design that lets Buffaly compare multiple embedding models side by side.

Executive summary

Buffaly evaluated whether local embedding models running on an Apple Silicon Mac Mini could replace OpenAI embeddings for semantic action and entity retrieval. The evaluation used real Buffaly data: existing SemanticDB fragments, historical semantic-search queries, and selected action/entity outcomes from actual session turns.

The benchmark evolved through several stages. We started with runtime feasibility and latency checks, then looked at historical OpenAI result replay, current-corpus nearest-neighbor comparison, and query-fragment retrieval. Each stage taught us something useful, but each also exposed a limitation. The biggest limitation was corpus drift: Buffaly's action and entity catalog changes over time, so historical ranks are not stable ground truth.

The final benchmark used selected-label similarity. For each real directive, we measured how strongly each model associated that directive with the action or entity Buffaly actually selected. This avoided requiring local models to mimic OpenAI's historical ranked lists and focused instead on the relationship Buffaly needs to preserve.

The best local candidate was mxbai-embed-large, served through llama.cpp with an OpenAI-compatible embeddings API. nomic-embed-text was also viable, but had weaker percentile concentration and several below-median action cases.

The conclusion is that mxbai-embed-large is a reasonable local drop-in candidate for Buffaly's short action/entity semantic retrieval workload, provided vectors are stored and searched under the correct EmbeddingID. The work also produced a repeatable framework: new embedding models can be added to the same evaluation database, embedded against the same selected-label corpus, and compared with the same metrics.

Evidence base

This article is grounded in a full scan of the Embedding Models session timeline plus the evaluation SQL database. The scan produced several large evidence exports, which were retained outside the article draft so the published post can cite the evidence base without exposing internal file paths.

Publication note: Internal evidence exports, task notes, and scratch files were used to verify the counts and run identifiers below. They are intentionally not linked from the public article because they contain environment-specific paths.

| Topic | Matched turns | First message ID | Last message ID |
|---|---|---|---|
| Problem | 173 | 15925062 | 15955820 |
| Setup | 338 | 15925062 | 15955825 |
| Methodology | 349 | 15925134 | 15955825 |
| Results | 379 | 15925062 | 15955825 |
| Architecture | 455 | 15925140 | 15955825 |

The problem

Semantic retrieval is part of Buffaly's routing layer. When a user or agent asks for something in ordinary language, Buffaly has to resolve that request to a concrete tool, action, entity, prototype, or workflow. The search space includes infinitive-style action phrases, entity names and descriptions, prototype metadata, and session traces showing what was selected in previous turns.

The question was whether local embedding models could replace OpenAI for this workload without losing retrieval quality. The practical deployment target was an Apple Silicon Mac Mini serving embeddings over an OpenAI-compatible HTTP API that Buffaly could call from another host.

A simple benchmark would have been misleading because Buffaly is constantly evolving. Tools are added. Action phrases change. Entity descriptions improve. The ontology grows. A historical OpenAI result list is therefore not timeless ground truth. A selected action that was rank 2 last month might reasonably be rank 8 today because six similar actions were added after the original turn.
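The rank shift described above can be shown with a toy example. This is only an illustrative sketch: the phrases and similarity scores are invented, and `rank_of` is a hypothetical helper, not part of Buffaly.

```python
def rank_of(selected, scores):
    """Return the 1-based rank of `selected` when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(selected) + 1


# Synthetic similarity scores, invented purely for illustration.
last_month = {
    "to upload a file": 0.90,
    "selected action": 0.81,
    "to delete a file": 0.40,
}

# Six similar actions are added later. The selected action's own
# score does not change, but six new candidates now score above it.
today = dict(last_month)
today.update({f"new similar action {i}": 0.82 + 0.01 * i for i in range(6)})

print(rank_of("selected action", last_month))  # 2
print(rank_of("selected action", today))       # 8
```

The selected action is exactly as close to the directive as before. Only the rank moved, which is why a saved rank is a fragile label.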

That meant the evaluation had to do more than compare today's local result list with yesterday's OpenAI result list. It had to preserve the real directive-to-selection relationship while accounting for corpus drift.

Goals

The evaluation had six goals:

  1. Find local embedding candidates that are practical on Apple Silicon.
  2. Serve candidate models through a simple local API.
  3. Benchmark candidates against OpenAI using real Buffaly data.
  4. Account for corpus drift in the action and entity catalog.
  5. Store multiple embedding identities side by side without mixing vector spaces.
  6. Build a repeatable framework for evaluating future models.

How the methodology evolved

The final benchmark was not obvious at the start. We worked through several approaches, and each one helped narrow the problem.

1. Feasibility research and local runtime smoke test

The first question was operational: could a local embedding model run reliably in the target environment at all?

The candidate runtime options were Apple Silicon-friendly tools such as Ollama, llama.cpp, MLX, and similar local runtimes. The model shortlist included mxbai-embed-large, nomic-embed-text, bge-m3, E5-family models, Jina embeddings, Snowflake Arctic embeddings, MiniLM-family models, and other local candidates that looked plausible for semantic retrieval.

The first model installed and served was mxbai-embed-large. Ollama was useful for pulling and storing model artifacts, while llama.cpp became the reliable OpenAI-compatible serving runtime. Buffaly reached the endpoint through a configured network route. The smoke tests used short Buffaly-style strings and checked response shape and vector dimensionality rather than retrieval quality.

The result was encouraging: mxbai-embed-large could be served through llama.cpp with an OpenAI-compatible /v1/embeddings endpoint, and Buffaly-side tests received valid 1024-dimensional vectors.
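A shape-and-dimensionality check of this kind might look like the sketch below. It assumes the standard OpenAI-style response layout (a `data` list whose rows carry `index` and `embedding`). The helper names, base URL, and model string are placeholders, not Buffaly's actual test code.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build a POST request for an OpenAI-compatible /v1/embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_embeddings_response(body, expected_count, expected_dim):
    """Return the vectors in input order if the shape is as expected, else raise."""
    data = body.get("data")
    if not isinstance(data, list) or len(data) != expected_count:
        raise ValueError("unexpected number of embedding rows")
    vectors = []
    for row in sorted(data, key=lambda r: r["index"]):
        vec = row["embedding"]
        if len(vec) != expected_dim:
            raise ValueError(f"expected {expected_dim} dims, got {len(vec)}")
        vectors.append(vec)
    return vectors
```

Against a live server, the request would be sent with `urllib.request.urlopen(req)` and the decoded JSON body passed to `check_embeddings_response(body, len(texts), 1024)` for mxbai-embed-large.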

The smoke test also made the storage issue concrete. OpenAI text-embedding-3-small produces 1536-dimensional vectors, while mxbai produces 1024-dimensional vectors. That did not rule out mxbai, but it meant the evaluation needed a controlled storage strategy and strict model identity filtering.

2. Latency and endpoint comparison

Once the endpoint worked, the next question was whether it was fast enough to matter. Semantic action and entity lookup can happen in the middle of an agent turn, and reindexing may require many embedding calls.

We sent small embedding requests to the local mxbai endpoint and compared response shape and rough responsiveness against the OpenAI path already used by Buffaly. The inputs were short semantic strings like action phrases and entity labels, not long documents, because the target workload was short action/entity retrieval.
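A rough responsiveness check can be as simple as timing repeated calls over short strings. The sketch below is an assumption about how such a check could be written, not the harness we used. `embed` stands for any callable that sends one text to an endpoint and returns a vector.

```python
import statistics
import time


def time_embedding_calls(embed, texts, repeats=5):
    """Call embed(text) repeatedly and return per-call latency stats in milliseconds."""
    samples = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            embed(text)
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "calls": len(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Running the same function once with a local-endpoint callable and once with an OpenAI-path callable gives comparable numbers for the same short inputs.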

The local endpoint returned valid vectors and was responsive enough to justify deeper evaluation. This still did not prove semantic quality, but it removed the basic operational objection: local embeddings could be served and called from Buffaly in a practical way.

3. Historical semantic-search replay

The obvious next move was to use Buffaly's historical semantic search calls. The sessions database contains real calls to ToSearchCandidateActions and ToSearchCandidateEntities, often with semantic query strings and OpenAI-derived candidates or scores.

At first, this looked like the cleanest benchmark: take historical search queries, run them through the local model, and compare the local ranked list with the saved OpenAI ranked list.

The problem was corpus drift. Historical OpenAI ranks were produced against whatever action/entity corpus existed at that time. Today's corpus is different. If a local model finds a semantically better candidate in the current corpus, pure replay can still mark it as wrong because it did not reproduce an old OpenAI list.

This approach confirmed that historical semantic-search data was available and useful, but it was not a reliable final metric. It risked measuring OpenAI mimicry instead of task success.

4. Current-corpus neighborhood comparison

To avoid stale target sets, we moved to the current corpus. The idea was to embed the same queries and compare nearest-neighbor neighborhoods under OpenAI and local models against today's action and entity targets.

This made better use of the data already in SemanticDB. We found about 33,331 untagged vectorized fragments, including about 17,996 query-like fragments starting with "to ". The action target corpus included roughly 1,712 vectorized ProtoScript Action fragments, and entity targets had smaller but still useful coverage.

This stage produced an important piece of infrastructure: a separate evaluation database named buffaly_semanticdb_embedding_eval. Existing OpenAI vectors were copied there, and local vectors could be added under separate embedding identities. Smaller local vectors were right-zero-padded to the existing 1536-dimensional native vector shape for evaluation, while strict EmbeddingID filtering kept vector spaces separate.
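Right-zero-padding is safe for within-model comparison because the appended zeros contribute nothing to dot products or norms. The sketch below illustrates this with cosine similarity. That metric is an assumption here, and `pad_right` is a hypothetical helper rather than the evaluation database's actual code.

```python
import math


def pad_right(vec, stored_dim=1536):
    """Append zeros so a smaller vector fits a fixed-width vector column."""
    if len(vec) > stored_dim:
        raise ValueError("vector is larger than the stored dimension")
    return list(vec) + [0.0] * (stored_dim - len(vec))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two toy 4-dimensional vectors standing in for 1024-dimensional mxbai output.
a = [0.1, 0.3, -0.2, 0.5]
b = [0.2, 0.1, 0.4, 0.3]

# Similarity is the same before and after padding both sides.
print(math.isclose(cosine(a, b), cosine(pad_right(a), pad_right(b))))  # True
```

Padding does not make a 1024-dimensional vector comparable to a 1536-dimensional OpenAI vector. It only lets both be stored in the same table, which is why the EmbeddingID filter is still required.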

Current-corpus comparison solved the stale target problem, but it still did not directly answer the question we cared about. Neighborhood preservation can show whether a local model's vector space resembles OpenAI's vector space. It cannot prove the local model retrieves the action or entity Buffaly actually needed.
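One way to quantify neighborhood preservation is the overlap between each model's top-k target sets for the same query. The sketch below uses Jaccard overlap over cosine-ranked neighbors. The exact overlap measure used during this stage is not specified here, so treat both functions as illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_ids(query_vec, targets, k):
    """Return ids of the k targets closest to query_vec.

    `targets` maps target id -> vector in the SAME embedding space as query_vec.
    """
    ranked = sorted(targets, key=lambda tid: cosine(query_vec, targets[tid]), reverse=True)
    return set(ranked[:k])


def neighborhood_overlap(ids_a, ids_b):
    """Jaccard overlap between two top-k id sets (1.0 means identical sets)."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 1.0
```

Each model ranks targets in its own space, and only the resulting id sets are compared. A high overlap says the two spaces agree, not that either one found the target Buffaly needed.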

5. Saved query fragments against current targets

After finding thousands of query-like fragments, we considered using those saved "to ..." fragments as the benchmark query set. The target side would be the current ProtoScript Action and Entity corpora. Each model would embed the same query fragments and search the same current target set.

This was closer to Buffaly's real query shape, but it still lacked ground truth. A query fragment tells us what someone searched for. It does not tell us which action or entity was actually selected in the turn.

Without the selected target, the benchmark could compare neighborhoods or OpenAI similarity behavior, but it could not score whether the local model preserved the real directive-to-selection relationship.

6. Selected-label extraction from same-turn outcomes

The breakthrough was to treat actual Buffaly behavior as the label.

Instead of starting from every semantic search and trying to infer what happened afterward, we started from selected actions and entities and looked backward within the same TurnKey for the semantic search terms and candidate lists that led to them. This made the unit of evaluation a real directive paired with the target Buffaly actually selected.
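The backward lookup can be sketched as follows. The real extractor is the C# Buffaly.Embeddings project. This Python version only illustrates the idea, and the event fields (`turn_key`, `seq`, `kind`, `text`) are hypothetical names, not the sessions database schema.

```python
def pair_selections_with_searches(events):
    """Pair each selection with the latest earlier search in the same turn.

    Selections with no preceding search in their turn are skipped.
    """
    by_turn = {}
    for event in sorted(events, key=lambda e: (e["turn_key"], e["seq"])):
        by_turn.setdefault(event["turn_key"], []).append(event)

    pairs = []
    for turn_key, turn_events in by_turn.items():
        last_search = None
        for event in turn_events:
            if event["kind"] == "search":
                last_search = event
            elif event["kind"] == "selection" and last_search is not None:
                pairs.append({
                    "turn_key": turn_key,
                    "directive_text": last_search["text"],
                    "selected_target_phrase": event["text"],
                })
    return pairs
```

Starting from the selection guarantees that every emitted pair has a real outcome attached, at the cost of dropping turns where no search trail exists.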

A new Buffaly.Embeddings C# extractor was created for this work. The extractor produced examples containing the directive or search text, the selected target phrase or entity, selected prototype/tool identity, OpenAI candidate information where available, and later local-model comparison data.

The schema was hardened after a possible conflation risk was noticed. The benchmark explicitly stored DirectiveText and query-fragment identity separately from SelectedTargetPhrase and target-fragment identity. That made it auditable that we were comparing query-vector to selected-target-vector, not accidentally comparing a target phrase to itself.
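A minimal sketch of that separation is shown below. The actual schema lives in the evaluation SQL database. The field names here mirror DirectiveText and SelectedTargetPhrase, but the two fragment-id fields and the guard are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectedLabelExample:
    """One benchmark row: a directive and the target that was selected for it."""
    directive_text: str          # query side
    query_fragment_id: int       # hypothetical id of the query fragment
    selected_target_phrase: str  # target side
    target_fragment_id: int      # hypothetical id of the target fragment

    def __post_init__(self):
        # Guard against scoring a fragment against itself.
        if self.query_fragment_id == self.target_fragment_id:
            raise ValueError("query and target fragments must be distinct")
```

Keeping the two identities as separate fields means an audit query can confirm, row by row, which vector played the query role and which played the target role.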

Manual extraction of 10 examples showed that the approach was feasible. A 100-action/100-entity smoke benchmark then gave us a restartable comparison set. Entity extraction was harder than action extraction because entities appear in more varied argument positions, and not every selected entity has a clean same-turn semantic-search trail. Low-confidence labels were filtered out.

This selected-label corpus gave us the right evaluation units, but rank was still brittle because candidate set size and corpus state can change. That led to the final metric.

7. Similarity-only selected-label benchmark

The final benchmark measured the relationship Buffaly needs to preserve: the semantic closeness between a directive and the action or entity actually selected.

For each fixed pair, we compared embedding(directive) with embedding(selected target) in the candidate model's own vector space. Then we placed that selected-target similarity in context by computing its percentile among the directive's similarities to the targets in the current corpus.

This avoided stale historical ranks and avoided requiring local models to reproduce OpenAI's exact ranked lists. It also worked with corpus drift: the selected pair stayed fixed, while the selected target's percentile showed whether it remained strongly associated with the directive under each model.
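The percentile idea can be sketched as below. The definition used here (the fraction of corpus targets scoring strictly below the selected target, under cosine similarity) is an assumption chosen for illustration. The benchmark's exact tie-handling and similarity function may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def percentile_of_selected(directive_vec, selected_vec, corpus_vecs):
    """Fraction of corpus targets less similar to the directive than the selected target.

    All vectors must come from the same embedding model.
    """
    if not corpus_vecs:
        raise ValueError("corpus must not be empty")
    selected_sim = cosine(directive_vec, selected_vec)
    below = sum(1 for vec in corpus_vecs if cosine(directive_vec, vec) < selected_sim)
    return below / len(corpus_vecs)


# Toy 2-dimensional vectors, invented for illustration.
directive = [1.0, 0.0]
selected = [1.0, 0.1]
corpus = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

print(percentile_of_selected(directive, selected, corpus))  # 0.75
```

When new targets are added to the corpus, the pair itself is untouched and the percentile shifts only as far as the new targets actually compete with the selected one.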

The large filtered selected-label benchmark used 300 deduplicated action examples and 72 filtered entity examples. OpenAI remained the baseline. mxbai and nomic vectors were generated for the same evaluation fragments in buffaly_semanticdb_embedding_eval, under separate embedding identities. Searches and metrics always remained within a single embedding space.

The primary mxbai run was mxbai-vs-openai-selected-label-large-001. The nomic comparison run was nomic-vs-openai-selected-label-large-001. Both used the same selected-label corpus, making the comparison stable and repeatable.

Evaluation database design

The evaluation database, buffaly_semanticdb_embedding_eval, was created so live SemanticDB storage would not be mutated. It reused the SemanticDB-style fragment/tag/vector layout and added benchmark tables for runs, selected labels, candidates, and similarity metrics.

The database is built around embedding identities. A fragment can have multiple vectors at the same time: one for OpenAI, one for mxbai, one for nomic, and so on. The fragment text does not have to be duplicated, and the existing OpenAI vector does not have to be overwritten.

Conceptually:

Fragment: "to upload a local file to google drive"
  -> EmbeddingID 2: OpenAI text-embedding-3-small vector
  -> EmbeddingID 3: mxbai-embed-large vector
  -> EmbeddingID 4: nomic-embed-text vector

This only works if vector searches filter by EmbeddingID. Vectors from different models are different spaces and cannot be compared directly. For evaluation, smaller local vectors were right-zero-padded to 1536 stored dimensions so they could live in the same native vector table, while EmbeddingID preserved vector-space separation.
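The filtering rule can be illustrated with an in-memory stand-in for the vector table. This is not the SemanticDB query itself. The tuple layout and the `search` function are hypothetical, and cosine similarity is assumed as the metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, embedding_id, rows, k=5):
    """Rank fragments using only vectors stored under `embedding_id`.

    `rows` is a list of (fragment_id, embedding_id, vector) tuples.
    The query vector must come from the same model as `embedding_id`.
    """
    scored = [
        (fragment_id, cosine(query_vec, vector))
        for fragment_id, row_embedding_id, vector in rows
        if row_embedding_id == embedding_id  # never mix vector spaces
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# The same two fragments, each stored once per embedding identity (toy vectors).
rows = [
    (10, 2, [1.0, 0.0]), (10, 3, [0.0, 1.0]),
    (11, 2, [0.0, 1.0]), (11, 3, [1.0, 0.0]),
]
print([fid for fid, _ in search([1.0, 0.0], 2, rows)])  # [10, 11]
print([fid for fid, _ in search([1.0, 0.0], 3, rows)])  # [11, 10]
```

Without the filter, the same query would be scored against vectors from both models, and the resulting ranking would be meaningless.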

Embedding identities and evaluation runs

| EmbeddingID | Model |
|---|---|
| 1 | OpenAI Ada 2 |
| 2 | text-embedding-3-small |
| 3 | mxbai-embed-large |
| 4 | nomic-embed-text |

| EmbeddingID | Vector rows |
|---|---|
| 2 | 2,208 |
| 3 | 3,138 |
| 4 | 369 |

| EvalRunID | Run name | Baseline EmbeddingID | Candidate EmbeddingID | Status |
|---|---|---|---|---|
| 2 | mxbai-vs-openai-action-pilot-001 | 2 | 3 | mxbai_vectors_ready |
| 3 | mxbai-vs-openai-selected-label-smoke-001 | 2 | 3 | selected_label_smoke |
| 4 | mxbai-vs-openai-selected-label-large-001 | 2 | 3 | selected_label_smoke |
| 5 | mxbai-vs-openai-selected-label-large-unfiltered-001 | 2 | 3 | selected_label_smoke |
| 6 | nomic-vs-openai-selected-label-large-001 | 2 | 4 | selected_label_smoke |

bge-m3 was considered but not completed because the model pull timed out.

Results

| EvalRunID | Candidate model | Label kind | Rows | Avg candidate percentile | Min candidate percentile | Top 5% | Top 20% | Below median |
|---|---|---|---|---|---|---|---|---|
| 3 | mxbai-embed-large | action | 100 | 0.9924 | 0.9334 | 97 | 100 | 0 |
| 3 | mxbai-embed-large | entity | 99 | 0.9619 | 0.3424 | 78 | 97 | 2 |
| 4 | mxbai-embed-large | action | 300 | 0.9941 | 0.8528 | 293 | 300 | 0 |
| 4 | mxbai-embed-large | entity | 72 | 0.9818 | 0.8899 | 63 | 72 | 0 |
| 5 | mxbai-embed-large | action | 300 | 0.9941 | 0.8528 | 293 | 300 | 0 |
| 5 | mxbai-embed-large | entity | 83 | 0.9594 | 0.3424 | 66 | 81 | 2 |
| 6 | nomic-embed-text | action | 300 | 0.9599 | 0.1205 | 245 | 282 | 3 |
| 6 | nomic-embed-text | entity | 72 | 0.9565 | 0.6667 | 52 | 70 | 0 |

mxbai-embed-large was the best evaluated local candidate. In EvalRunID 4, it placed 293 of 300 action labels in the top 5% of the action candidate corpus. All 72 filtered entity labels landed in the top 20%, with no below-median entity cases.

nomic-embed-text was also viable, but weaker by percentile concentration. It placed 245 of 300 action labels in the top 5%, 282 in the top 20%, and had three below-median action cases. Its filtered entity run was reasonable, with 52 of 72 entity labels in the top 5% and 70 of 72 in the top 20%.

The unfiltered mxbai run is useful context: action results were unchanged from the filtered run, while the larger entity set included 83 rows, 66 in the top 5%, 81 in the top 20%, and 2 below median. That reinforced the decision to filter lower-confidence entity labels for the headline comparison.
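For readers who want to reproduce the summary columns from per-example percentiles, the aggregation might look like the sketch below. The thresholds are assumptions about how the columns were derived (top 5% as percentile at or above 0.95, top 20% as at or above 0.80, below median as under 0.50), and the sample values are invented.

```python
def summarize(percentiles):
    """Aggregate per-example percentiles into the summary columns of the results table."""
    if not percentiles:
        raise ValueError("no percentiles to summarize")
    return {
        "rows": len(percentiles),
        "avg_percentile": sum(percentiles) / len(percentiles),
        "min_percentile": min(percentiles),
        "top_5_pct": sum(1 for p in percentiles if p >= 0.95),
        "top_20_pct": sum(1 for p in percentiles if p >= 0.80),
        "below_median": sum(1 for p in percentiles if p < 0.50),
    }


# Invented percentiles for four examples, not benchmark data.
print(summarize([0.99, 0.96, 0.85, 0.40]))
```

Reporting counts alongside the average matters because a handful of very low percentiles, like the below-median nomic cases, barely move the mean but are exactly the failures a routing layer would notice.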

Conclusions

Six conclusions came out of the work:

  1. The strongest benchmark was the final one: similarity-only selected-label evaluation.
  2. mxbai-embed-large is a reasonable local drop-in candidate for Buffaly's short action/entity semantic retrieval workload.
  3. nomic-embed-text is viable but weaker by the final percentile metric.
  4. Historical rank replay is not sufficient because Buffaly's corpus evolves over time.
  5. SemanticDB's multi-embedding design allows OpenAI and local vectors to live side by side under different EmbeddingID values.
  6. The evaluation database provides a repeatable framework for testing future embedding models against the same real-world corpus.

Recommendation

Use mxbai-embed-large as the first local embedding provider candidate for Buffaly action/entity semantic retrieval, served by llama.cpp over an OpenAI-compatible API.

Keep OpenAI as the production-safe default until provider selection, reindexing, and vector coverage are first-class in Buffaly. Action/entity SemanticDB and session semantic DB should also be treated separately. This evaluation covered short action/entity phrases. Long session and log text is a different workload and needs separate validation, chunking, and input guards before moving to a local provider.

Reproducibility appendix

The reproducibility package for this evaluation consists of the full session timeline, topic-scan exports, the evaluation SQL database, the durable task record, scratch notes, plan history, and the architecture design notes used during the benchmark work.

The public article does not publish local filesystem paths for those artifacts. The important reproducibility anchors are the run names, EvalRunID values, EmbeddingID values, vector-row counts, selected-label corpus sizes, and result metrics preserved in the tables above.

Final conclusion

The selected-label similarity benchmark was the strongest approach because it measured the exact relationship Buffaly needs: a real directive and the action or entity actually selected. Under that methodology, mxbai-embed-large is a credible local replacement candidate for the action/entity SemanticDB path, and the evaluation database provides a repeatable framework for testing future models against the same real-world corpus.