Chunking Strategies for Retrieval-Augmented Generation (RAG): A Deep Dive into SemDB’s Approach

Matt Furnari
11/19/2024

In the ever-evolving landscape of AI and natural language processing, Retrieval-Augmented Generation (RAG) has emerged as a cornerstone technology. RAG systems allow large language models (LLMs) to access vast knowledge bases by retrieving relevant snippets of information, or "chunks," to generate coherent and accurate responses. However, creating these chunks is not a trivial task. One of the most critical challenges in RAG is the chunking strategy itself—how we break down complex documents into meaningful, retrievable pieces.

What is Chunking, and Why is it Necessary?

Chunking is the process of dividing large bodies of text into smaller, semantically coherent units. Effective chunking ensures that these segments are meaningful enough to provide context but concise enough to fit within the LLM's context window.

Traditional methods like character-based chunking or recursive splitting on separators often fail to preserve semantic meaning, leaving related information fragmented across chunks. Semantic chunking, by contrast, uses transformer models to embed the text and identify natural breakpoints where the subject matter shifts. This improves information retrieval, enabling tasks like summarization, contextual retrieval, and structured understanding of extensive texts.
[Image by Anthropic]

Traditional Chunking Methods

Traditional chunking methods aim to divide text into smaller segments for processing but often fall short in preserving the semantic integrity of the information. The two primary approaches in this category are character-based chunking and recursive chunking:
  • Character-Based Chunking: This approach splits text into fixed-length segments, typically measured by the number of characters or tokens. While it ensures predictable and uniform chunk sizes, it often disrupts sentences or ideas mid-way, leading to incomplete or nonsensical chunks. For example, a sentence might be split across two chunks, losing coherence and context.
  • Recursive Chunking: Recursive chunking uses natural separators like paragraphs, headings, or punctuation to create chunks. This approach produces more natural divisions compared to character-based methods. However, it doesn’t guarantee that each chunk is semantically coherent, as it relies purely on structural cues rather than the meaning of the content.
While these methods are straightforward to implement, they often result in fragmented or contextually incomplete segments, making them suboptimal for advanced workflows like Retrieval-Augmented Generation.
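The two traditional strategies can be sketched in a few lines. This is a minimal illustration; the chunk sizes, overlap, and separator list are arbitrary choices for the example, not fixed standards:

```python
# Minimal sketches of the two traditional chunking strategies.

def character_chunks(text, size=200, overlap=20):
    """Fixed-length chunking: uniform sizes, but may cut sentences mid-way."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def recursive_chunks(text, max_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Recursive chunking: split on the coarsest separator present,
    regroup pieces under max_size, and recurse into oversized pieces."""
    if len(text) <= max_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, buf = [], ""
            for part in text.split(sep):
                candidate = (buf + sep + part) if buf else part
                if len(candidate) <= max_size:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            return [c for chunk in chunks
                    for c in recursive_chunks(chunk, max_size, separators)]
    return character_chunks(text, max_size)  # no separator left: fall back
```

Note how the character-based version happily splits a word or sentence at an arbitrary byte offset, while the recursive version at least respects paragraph and sentence boundaries; neither looks at meaning.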

Semantic Chunking: A Smarter Approach to Text Segmentation

Semantic chunking is a cutting-edge technique designed to segment text into meaningful, conceptually distinct groups. Unlike traditional methods, which often rely on arbitrary separators or fixed lengths, semantic chunking ensures that each chunk represents a coherent idea, making it an essential tool for workflows like Retrieval-Augmented Generation (RAG) and beyond.

How Semantic Chunking Works

The process begins by breaking text into small initial chunks, often using recursive chunking methods as a foundation. These chunks are then embedded into high-dimensional vectors using transformer-based models, such as OpenAI’s text-embedding-3-small or SentenceTransformers. The embeddings encode the semantic meaning of each chunk, enabling precise comparisons.

The next step involves calculating the cosine distances between embeddings of sequential chunks. Breakpoints are identified where the distances exceed a certain threshold, signaling significant semantic shifts. This approach ensures that the resulting chunks are both coherent within themselves and distinct from one another.
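The breakpoint logic can be sketched as follows. The bag-of-words `bow_embed` is a deterministic toy stand-in for a real transformer embedding, and the 0.5 threshold is an illustrative value, not a recommendation:

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def bow_embed(text, vocab):
    """Toy stand-in for a transformer embedding (plain word counts).
    A real system would call a model such as text-embedding-3-small here."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk wherever the cosine distance between the
    embeddings of neighbouring sentences exceeds the threshold."""
    embs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(embs, embs[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:  # semantic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With real embeddings the distances are far less clear-cut than in a toy example, which is why threshold selection is application-dependent, as discussed below.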

Refinements: Semantic Double Chunk Merging

To enhance this process further, an extension known as semantic double chunk merging has been introduced. This technique performs a second pass to re-evaluate and refine the chunking boundaries. For example, if chunks 1 and 3 are semantically similar but separated by chunk 2 (e.g., a mathematical formula or code block), they can be regrouped into a single coherent unit. This additional step improves the accuracy and utility of the chunking process.
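A minimal version of that second pass might look like this, again with a toy word-count embedding standing in for a real model and an illustrative threshold:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def bow_embed(text, vocab):
    # toy stand-in for a real embedding model
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def double_merge(chunks, embed, threshold=0.3):
    """Second pass: if chunk i and chunk i+2 are semantically close,
    merge i, i+1, and i+2 so the intervening chunk (e.g. a formula
    or code block) stays with its surrounding context."""
    embs = [embed(c) for c in chunks]
    merged, i = [], 0
    while i < len(chunks):
        if i + 2 < len(chunks) and cosine_distance(embs[i], embs[i + 2]) < threshold:
            merged.append(" ".join(chunks[i:i + 3]))
            i += 3
        else:
            merged.append(chunks[i])
            i += 1
    return merged
```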

Applications and Benefits

Semantic chunking proves invaluable in scenarios where understanding the underlying concepts of text is crucial:
  • Retrieval-Augmented Generation (RAG): By creating semantically coherent chunks, RAG systems can retrieve and interpret relevant information more effectively.
  • Text Summarization and Clustering: Large documents, such as books or research articles, can be grouped into clusters of related content, enabling faster insights.
  • Visual Exploration: Dimensionality reduction techniques like UMAP, combined with clustering and labeling via LLMs, allow users to visualize the structure and flow of a document, providing both development insights and practical tools for analysis.
Challenges and Considerations

Despite its advantages, semantic chunking presents challenges. Determining optimal cosine distance thresholds and understanding what each chunk represents are highly application-dependent tasks. Fine-tuning these parameters requires careful consideration of the specific use case and the nature of the text.

Semantic chunking is a powerful advancement in text processing, offering a meaningful way to dissect and interpret large volumes of information. Its ability to group related concepts and isolate distinct ideas makes it a valuable tool in both research and practical applications.

Contextual Retrieval: Enhancing Knowledge Access for AI Models

[Image by Anthropic]
Contextual Retrieval is a technique, introduced by Anthropic, that addresses a common failure of naive chunking: once a chunk is separated from its document, it often lacks the context needed to interpret it. The method enriches each chunk before it is embedded and indexed, using two key techniques: Contextual Embeddings and Contextual BM25.
  • Contextual Embeddings: Before creating embeddings for text chunks, explanatory context is added to each chunk. This context is specific to the chunk and situates it within the broader document, improving its relevance when retrieved. For example, a chunk stating "The company's revenue grew by 3%" might be augmented with the context "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023."
  • Contextual BM25: BM25 is a ranking function that uses lexical matching to find exact term matches. By applying BM25 in conjunction with semantic embeddings, Contextual Retrieval ensures that both exact matches and semantic similarities are used to retrieve the most relevant chunks, improving the overall retrieval accuracy.
This dual approach significantly reduces the number of failed retrievals: in Anthropic's published benchmarks it cut the retrieval failure rate by up to 49%, and by 67% when combined with reranking.
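One common way to merge the lexical and semantic result lists is reciprocal rank fusion (RRF). The sketch below is a generic illustration of rank fusion, not Anthropic's exact method:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of document ids into one ranking.
    k=60 is the conventional damping constant from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with an embedding-similarity ranking.
bm25_ranking = ["a", "b", "c"]
embedding_ranking = ["a", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Each retriever contributes 1/(k + rank) per document, so documents ranked highly by both lists rise to the top without any score normalization between the two systems.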

Implementing Contextual Retrieval

To implement Contextual Retrieval, each chunk in the knowledge base is processed before it is embedded: the full document and the chunk are passed to an LLM (Anthropic uses Claude), which generates a short, chunk-specific context. That context is prepended to the chunk, and the augmented chunk is then embedded and added to the BM25 index. The process is simple to automate and runs once at indexing time.
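The contextualization step can be sketched as follows. Here `generate` is a placeholder for an LLM call (such as one to Claude), and the prompt is paraphrased from the one Anthropic published, not taken from any SemDB code:

```python
# Sketch of the contextualization step performed at indexing time.

CONTEXT_PROMPT = (
    "<document>{document}</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>{chunk}</chunk>\n"
    "Give a short, succinct context to situate this chunk within the "
    "overall document for the purposes of improving search retrieval."
)

def contextualize(document, chunks, generate):
    """Prepend LLM-generated context to each chunk before embedding and indexing."""
    augmented = []
    for chunk in chunks:
        context = generate(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        augmented.append(f"{context}\n{chunk}")
    return augmented
```

Because the whole document is sent once per chunk, prompt caching (or batching) is what keeps this step affordable at scale.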
Why Contextual Retrieval Works

Contextual Retrieval addresses a significant flaw in traditional RAG systems: chunks stripped of their surroundings are easy to misinterpret. With the context restored, the AI model has a better understanding of where each chunk fits in the source material, leading to more accurate and reliable responses.

As knowledge bases grow larger, Contextual Retrieval becomes even more critical, allowing AI systems to scale while maintaining retrieval accuracy. By combining the power of semantic embeddings with lexical matching through BM25, Contextual Retrieval provides a comprehensive solution for improving the performance of AI models in specialized domains.

How SemDB Does It Better

SemDB goes beyond these traditional and emerging techniques by reimagining the chunking process from the ground up.
  • Preprocessing for Context Clarity: Unlike standard systems, SemDB preprocesses the text before chunking or embedding. Pronouns are replaced with explicit references, long-range dependencies are resolved, and sentences are rewritten for clarity. This ensures that each sentence captures its full context independently, leading to more accurate embeddings.
  • Recursive Chunking for Precision: Using recursive semantic chunking, SemDB can isolate highly specific sections without relying on comparisons between sentences. This approach enhances retrieval by ensuring that each chunk is both meaningful and distinct.
  • Combining Multiple Strategies: SemDB doesn’t rely solely on contextual chunking. Its robust pipeline includes:
  • Context Chunking: For preserving local context.
  • Recursive Chunking: To create semantically coherent segments.
  • Ontology-Based Enhancements: Leveraging domain-specific ontologies to enrich understanding and retrieval.
  • Scalability for Large Documents: SemDB is adept at handling massive documents, such as 150+ page financial PDFs, by combining contextual embeddings with recursive chunking. This ensures that even granular details remain accessible while preserving overarching context.
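The preprocessing idea can be illustrated with a toy sketch. This is an illustration of the concept only, not SemDB's actual code; `rewrite` stands in for a model call that resolves pronouns and long-range references against the context accumulated so far:

```python
# Toy illustration of context-clarifying preprocessing: each sentence is
# rewritten so it stands alone before chunking and embedding.

def decontextualize(sentences, rewrite):
    """Rewrite each sentence to be self-contained, feeding the already
    resolved sentences back in as context for the next rewrite."""
    resolved = []
    for sentence in sentences:
        context = " ".join(resolved)
        resolved.append(rewrite(context, sentence))
    return resolved
```

After this pass, a sentence like "It showed 3% growth" carries its subject explicitly, so its embedding no longer depends on neighbouring chunks to be meaningful.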

Conclusion

Chunking is the unsung hero of Retrieval-Augmented Generation, enabling LLMs to process vast amounts of text effectively. While traditional and contextual chunking methods have improved retrieval accuracy, SemDB’s innovative approach redefines the process. By combining advanced preprocessing, recursive chunking, and ontology-driven strategies, SemDB ensures unparalleled precision and scalability.

The result? A system that doesn’t just retrieve information but truly understands it—delivering actionable insights, whether for analyzing financial documents, summarizing journal articles, or navigating complex knowledge bases.

Additional Resources

Introducing Contextual Retrieval at Anthropic
A Visual Exploration of Semantic Text Chunking by Robert Martin-Short at Towards Data Science
