Findings and Lessons Learned
The assistant enhances data search and provides an interactive way to explore and learn about the data sources offered by the Copernicus data stores.
The user asks the assistant questions, and the assistant can answer in one of two ways, described below.
First approach (Local Database):
- Fetch data from the various data sources
- Index the data in a Chroma vector database
- Match the stored documents in the database against the user query
- Return the top 5 matching documents
- Generate a response with the LLM using those 5 documents (a sketch of this flow follows the list)
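As a rough illustration, the local-database flow could look like the sketch below. It assumes the Chroma Python client and the Ollama Python bindings; the collection name, storage path, and default model tag are placeholders rather than the project's actual configuration.

```python
import chromadb
import ollama

# Illustrative persistent Chroma store for the pre-fetched Copernicus documents;
# the path and collection name are placeholders.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="copernicus_docs")

def index_documents(docs: list[str]) -> None:
    """Index fetched documents; Chroma embeds them with its default embedding function."""
    collection.add(documents=docs, ids=[f"doc-{i}" for i in range(len(docs))])

def answer_from_local_db(query: str, model: str = "llama3:latest") -> str:
    # Match the user query against the stored documents and keep the top 5 hits.
    hits = collection.query(query_texts=[query], n_results=5)
    context = "\n\n".join(hits["documents"][0])

    # Generate the final answer with the selected LLM, grounded in the retrieved documents.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```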
Second approach (Web search):
- Find web sources relevant to the given user query
- Fetch the data from these sources and index it into a temporary collection in the vector database
- Match the query against the documents in this collection
- Return the top 5 documents
- Generate a response with the LLM using the returned documents (see the sketch after this list)
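The web-search path can be sketched in the same style. How the relevant URLs are found is not detailed in this section, so the search step is abstracted into a `urls` argument; the temporary collection name and model tag are likewise placeholders.

```python
import chromadb
import ollama
import requests
from markdownify import markdownify as md

client = chromadb.Client()  # in-memory client; the temporary collection lives only for one query

def answer_from_web(query: str, urls: list[str], model: str = "llama3:latest") -> str:
    """`urls` stands in for the output of the web-search step, which is not shown here."""
    # Fetch each source, convert it to markdown, and index it into a temporary collection.
    temp = client.get_or_create_collection(name="web_search_temp")
    for i, url in enumerate(urls):
        page = requests.get(url, timeout=30).text
        temp.add(documents=[md(page)], ids=[f"web-{i}"])

    # Match the query against the freshly indexed documents and keep the top 5.
    hits = temp.query(query_texts=[query], n_results=min(5, len(urls)))
    context = "\n\n".join(hits["documents"][0])

    # Generate the response with the selected LLM, then discard the temporary collection.
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    client.delete_collection(name="web_search_temp")
    return reply["message"]["content"]
```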
The diagram below shows the two approaches discussed above and their workflow.
The LLM model can be changed according to the user's choice; the available options are:
- phi4:latest
- llama 3.2:latest
- llama 3.1:8b
- llama3:latest
- deepseek-r1:1.5b
All of the above are open-source models and can be adopted for use.
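Because all of these models are served through Ollama, switching between them only requires changing the model tag passed to the client, as the minimal snippet below illustrates; the default tag shown is just an example.

```python
import ollama

def generate(prompt: str, model: str = "phi4:latest") -> str:
    """Generate a reply with whichever installed Ollama model the user selected,
    e.g. "llama3:latest" or "deepseek-r1:1.5b"."""
    reply = ollama.generate(model=model, prompt=prompt)
    return reply["response"]
```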
The main learnings of the process are outlined below:
- The indexed data needs to be processed into markdown, which makes indexing easier and puts it in a form LLMs can understand, highlighting the need for an LLM-friendly data format (see the first sketch after this list)
- The data needs to be cleaned before it is indexed; irrelevant or extraneous content increases the total number of documents and the time needed to find a match
- The local database search approach re-ranks the retrieved documents using the BM25 algorithm (see the second sketch after this list)
- The response time on a CPU is much higher than on a GPU, so a GPU is a necessity for scaling and for delivering better results
- The prompt given by the user becomes very important in the case of web search
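To illustrate the first two lessons, a fetched page can be converted to markdown and lightly cleaned before indexing. The sketch below uses BeautifulSoup and the markdownify package as one plausible choice; the project may use a different converter or cleaning rules.

```python
import re
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def to_clean_markdown(url: str) -> str:
    """Fetch a page and turn it into compact, LLM-friendly markdown before indexing."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):  # drop non-content elements
        tag.decompose()
    text = md(str(soup))                                    # HTML -> markdown
    return re.sub(r"\n{3,}", "\n\n", text).strip()          # collapse runs of blank lines
```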
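The BM25 re-ranking lesson can be sketched with the rank_bm25 package: the vector store returns a candidate set, and BM25 scores reorder it before the context is passed to the LLM. This is a plausible reconstruction of the step, not necessarily the exact code used.

```python
from rank_bm25 import BM25Okapi

def rerank_with_bm25(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-rank vector-search candidates with BM25 and keep the best top_k."""
    tokenized = [doc.lower().split() for doc in candidates]
    bm25 = BM25Okapi(tokenized)
    return bm25.get_top_n(query.lower().split(), candidates, n=top_k)
```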