Data Ingestion Process

Last updated 10 months ago

After completing the installation, follow these steps to index all content related to a specific use case:

Release 3.0.0

  1. Install Python on the machine where the files need to be ingested.

  2. Clone the Git repository from https://github.com/Sunbird-AIAssistant/sakhi-api-service.

  3. Go to the root directory of the cloned repository and update the .env file with the necessary values, such as the vector store configuration.

  4. Run the following:

Step 1: pip install -r requirements-dev.txt 
Step 2: python3 index_documents.py --folder_path=<PATH_TO_INPUT_FILE_DIRECTORY> --fresh_index --chunk_size=1024 --chunk_overlap=100

# --fresh_index: Create a new index from scratch.
# --chunk_size: Divide the documents into chunks of 1024 characters. Default: 1024
# --chunk_overlap: Overlap each chunk by 100 characters for context. Default: 100
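
To illustrate what --chunk_size and --chunk_overlap mean, here is a minimal sketch of character-based chunking with overlap. It is not the code used by index_documents.py, which may split text differently (for example on sentence or paragraph boundaries). The function name chunk_text is hypothetical and exists only for this illustration.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 100) -> list[str]:
    """Split text into character chunks where consecutive chunks share an overlap.

    Illustrative only: a plain sliding window, assumed here to explain the flags.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2000))
    pieces = chunk_text(sample, chunk_size=1024, chunk_overlap=100)
    print([len(piece) for piece in pieces])
```

With the default values, the last 100 characters of one chunk are repeated as the first 100 characters of the next, so a sentence that straddles a chunk boundary still appears with some surrounding context in at least one chunk.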

Before Release 3.0.0

  1. Install Python on the machine where the files need to be ingested.

  2. Place the files to be indexed in a folder on the machine.

  3. Download the index_documents.py and requirements-dev.txt files from https://github.com/Sunbird-AIAssistant/sakhi-api-service.

  4. Run the following:

Step 1: pip install -r requirements-dev.txt 
Step 2: python3 index_documents.py --marqo_url=<MARQO_URL> --index_name=<MARQO_INDEX_NAME> --folder_path=<PATH_TO_INPUT_FILE_DIRECTORY> --fresh_index

Notes:

  1. Run the commands in the background inside a screen session, as indexing can take a couple of hours.

  2. Use --fresh_index when you run the indexing for the first time, or when you want to delete the existing index and rebuild it from scratch. To append new files to the existing index, run the command without --fresh_index.

  3. When running without --fresh_index, keep the new files in a separate folder and point --folder_path at that folder only.
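
The append workflow described in the notes can be sketched as follows. This is a hedged illustration, not part of the Sakhi API service: the helper names stage_new_files and build_index_command are hypothetical, and the set of already indexed file names is something you would have to track yourself. The command builder only emits --folder_path and --fresh_index, which appear in the commands for both release lines; add the flags specific to your release (for example --marqo_url and --index_name, or --chunk_size and --chunk_overlap) before running it.

```python
import shutil
from pathlib import Path


def stage_new_files(source_dir, staging_dir, already_indexed):
    """Copy files that are not yet indexed into a new, separate folder.

    already_indexed is a set of file names the caller knows are in the index
    (an assumption of this sketch; the indexing script does not provide it).
    """
    source = Path(source_dir)
    staging = Path(staging_dir)
    # Fail if the folder exists, so old files never get mixed in with new ones.
    staging.mkdir(parents=True, exist_ok=False)

    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.name not in already_indexed:
            shutil.copy2(path, staging / path.name)
            copied.append(path.name)
    return copied


def build_index_command(folder_path, fresh_index=False):
    """Build the argument list for index_documents.py with the shared flags only."""
    command = ["python3", "index_documents.py", f"--folder_path={folder_path}"]
    if fresh_index:
        # Only for a first run or a full rebuild of the index.
        command.append("--fresh_index")
    return command


if __name__ == "__main__":
    # Example paths are placeholders for illustration.
    print(" ".join(build_index_command("new_docs")))
```

Building the command as a list makes it straightforward to pass to subprocess.run later, once the release-specific flags have been appended.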

Repository: https://github.com/Sunbird-AIAssistant/sakhi-api-service