Skip to main content

Indexing & Retrieval

Indexing Process

Dewy's indexing mechanism is customizable and central for its retrieval capabilities. Indexing is primarily configured through collections, which determine the choice of embedding model and distance metric. Specifically, the following indexing configuration exists at the collection level (see addCollection for additional details):

ConfigurationDescriptionDefault
text_embedding_modelText Embedding Modelopenai:text-embedding-ada-002
text_distance_metricThe distance metric used for similarity searchescosine

The process of extracting data from documents is dependent on the document type and is informed by the results of Ragas benchmarks. Although currently limited to text chunks, we plan to expand these capabilities to include images, tables, and other data formats, aiming to offer a more comprehensive indexing solution.

tip

Despite exploring various model-based extraction techniques such as summarization, propositionizing, and extracting questions answered by the text, we have found that the modest performance improvements provided by these techniques doesn't justify the signficant increase in ingestion cost they impose.

Chunking configuration and the extraction of embeddings are also guided by benchmarks. This ensures that the chunking parameters and embedding extraction processes are optimized for retrieval, facilitating a balance between precision and performance.

tip

If you're interested in our benchmarking, we published some of our high-level findings in a blog post titled "Extraction Matters Most"

Retrieval Process

Querying in Dewy involves embedding the user's query and then matching this embedded query to the document embeddings based on vector similarity. The configured distance metric plays a critical role in this matching process, ensuring that the most relevant results are identified.

Queries support several configuration options (see retrieveChunks for additional details):

ConfigurationDescriptionDefault
nThe number of chunks to return10
include_text_chunkIf text chunks should be returned in the responseTrue
include_image_chunksIf image chunks should be returned in the response (not yet implemented)True
include_summaryIf chunk summaries should be returned in the response (not yet implemented)False

Once embeddings that closely match the query are identified, they are resolved back to their corresponding chunks. The system then returns the top N chunks, ranked according to their maximum embedding score, ensuring users receive the most relevant and informative responses.

tip

Looking forward, Dewy is exploring the introduction of re-ranking options, such as Mean Reciprocal Rank (MRR), to refine the retrieval process further. However, experimentation with MRR has shown that it may not always improve performance and can actually reduce performance in some cases, indicating the complexity of optimizing retrieval mechanisms.

Future Directions

Dewy is actively investigating several advanced techniques to enrich its indexing and retrieval framework. These include:

  • The integration of knowledge graphs, like GraphRAG, to leverage structured relationships between data points
  • The extraction of entities and named terms to provide additional context associated with chunks is under consideration, aiming to improve the specificity and relevance of retrieved information.
  • The prospect of multimodal search, encompassing images, tables, and other non-textual data formats.

Through continuous innovation and the integration of advanced techniques, Dewy seeks to remain at the forefront of knowledge base technology, offering a powerful tool for developers and organizations looking to harness the full potential of their unstructured data.