Session 2: RAG, Data Ingestion


Presenters

Masih Moloodian, Yasin Fakhar, Mohammad Amin Dadgar

RAG: Data Ingestion

In this session, we discussed various aspects of data preparation in RAG pipelines and general data processing techniques. The ingestion phase is a critical part of a RAG pipeline: it ensures data is cleaned, organized, and enriched with metadata so that retrieval and embedding work efficiently. Examples include adding timestamps to Telegram messages to improve query relevance and using orchestration tools like Airflow to automate data processing. Careful ingestion also prevents downstream errors, supports timely updates, and enables customized preparation, such as removing irrelevant details or focusing on thematic content.
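As a minimal sketch of metadata enrichment during ingestion, the snippet below attaches a UTC timestamp and a source tag to a raw message before indexing. The record layout (`text`, `date`, `chat_id`) is a hypothetical Telegram export format used for illustration, not the session's actual pipeline.

```python
from datetime import datetime, timezone

def enrich_message(raw: dict) -> dict:
    """Turn a raw message record into a retrieval-ready document.

    `raw` is a hypothetical export record with a `text` field and a
    Unix `date` field; real field names depend on the export tool.
    """
    return {
        "text": raw["text"].strip(),
        "metadata": {
            "source": "telegram",
            "chat_id": raw.get("chat_id"),
            # ISO timestamps let the retriever filter or boost by recency.
            "timestamp": datetime.fromtimestamp(
                raw["date"], tz=timezone.utc
            ).isoformat(),
        },
    }

raw_messages = [
    {"text": "Release v2 is out ", "date": 1714000000, "chat_id": 42},
]
documents = [enrich_message(m) for m in raw_messages if m.get("text")]
print(documents[0]["metadata"]["timestamp"])
```

In a production setting, a step like this would typically run as an Airflow task so that new messages are enriched and re-indexed on a schedule.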

Next, classical data cleaning methods were outlined, starting with lowercasing. Additional techniques include removing duplicates, handling missing values (e.g., through imputation or deletion), managing outliers, and normalizing data. For text-specific cleaning, methods such as removing stop words, tokenization, correcting typos, and trimming whitespace were covered. These strategies keep the data consistent and high-quality, supporting the performance of downstream models.
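The sketch below strings several of these steps together: lowercasing, whitespace trimming, naive whitespace tokenization, stop-word removal, and deduplication. The stop-word list here is a toy placeholder; a real pipeline would use a library list (e.g., from NLTK or spaCy), and typo correction or imputation would need task-specific tooling.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # toy list

def clean_text(text: str) -> str:
    """Apply classical cleaning steps from the session."""
    text = text.lower()                            # lowercasing
    text = re.sub(r"\s+", " ", text).strip()       # trim/collapse whitespace
    tokens = text.split()                          # naive tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

docs = ["  The  quick  brown fox ", "the quick brown fox"]
# dict.fromkeys preserves order while dropping exact duplicates
cleaned = list(dict.fromkeys(clean_text(d) for d in docs))
print(cleaned)  # ['quick brown fox']
```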

Finally, various chunking strategies were explored, which are critical for preparing data for embedding. Fixed-length and sliding-window chunking keep chunks at manageable sizes and overlap them for context retention, while sentence-based or semantic chunking preserves meaning. Other methods, such as token-based or punctuation-based chunking, cater to specific constraints like embedding model input limits or natural breaks in text. Advanced approaches such as thematic or hierarchical chunking adapt to complex documents, while dynamic chunking uses language models for optimal segmentation. These strategies balance the trade-off between context preservation and computational efficiency.
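As an illustration of the fixed-length, sliding-window variant, the sketch below splits text into overlapping windows of whitespace tokens. The `size` and `overlap` values are illustrative defaults; in practice they would be tuned to the embedding model's input limit, and a tokenizer matching that model would replace the whitespace split.

```python
def sliding_window_chunks(text: str, size: int = 200, overlap: int = 50):
    """Fixed-length chunking with overlap, measured in whitespace tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    tokens = text.split()
    step = size - overlap                 # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):   # last window reached the end
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(450))
chunks = sliding_window_chunks(doc, size=200, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [200, 200, 150]
```

The 50-token overlap means each chunk repeats the tail of its predecessor, which is the trade-off the session highlighted: better context retention at the cost of some redundant embedding work.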