Quick recap

I recommend reading Part 1 before proceeding further; at the very least, refer to the problem statement.

[Learnings from codeQA - Part 1](https://sankalp1999.notion.site/Learnings-from-codeQA-Part-1-5eb12ceb948040789d0a0aca1ac23329)

GitHub Link

Demo Link

In the last post, I wrote about making a question-answering system for codebases. I discussed why GPT-4 can't answer questions about your code out of the box, and the limits of the in-context learning approach. I talked about semantic code search with embeddings and why it matters to chunk the codebase properly.
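To make that concrete, here is a minimal sketch of embedding-based code search: embed each chunk once, then rank chunks against the query embedding by cosine similarity. The example chunks, the sentence-transformers model, and the `search` helper are illustrative assumptions, not the project's actual setup.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical code chunks produced by a chunking step (placeholders, not real code).
chunks = [
    "def load_config(path): ...",
    "class UserRepository: ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalized embeddings let us use a plain dot product as cosine similarity.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    """Return the top_k chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q
    best = np.argsort(-scores)[:top_k]
    return [(chunks[i], float(scores[i])) for i in best]

print(search("where is the configuration file read?"))
```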

I got into the details of syntax-level chunking using abstract syntax trees (ASTs) and the tree-sitter library. At the end, I showed how to extract methods, classes, and constructor declarations, and how to find references across the whole codebase.
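As a refresher, a query-based extraction along those lines might look like the sketch below, using the py-tree-sitter bindings with the Java grammar. The grammar path, file name, and exact API calls are assumptions (the py-tree-sitter API has changed across versions; this matches the older pre-0.22 releases), not the exact code from Part 1.

```python
from tree_sitter import Language, Parser

# Build/load a compiled grammar; the shared-library and grammar paths are assumptions.
Language.build_library("build/languages.so", ["vendor/tree-sitter-java"])
JAVA = Language("build/languages.so", "java")

parser = Parser()
parser.set_language(JAVA)

source = open("Example.java", "rb").read()
tree = parser.parse(source)

# A tree-sitter query that captures class, method, and constructor names.
query = JAVA.query(
    """
    (class_declaration name: (identifier) @class.name)
    (method_declaration name: (identifier) @method.name)
    (constructor_declaration name: (identifier) @constructor.name)
    """
)

for node, capture in query.captures(tree.root_node):
    print(capture, source[node.start_byte:node.end_byte].decode())
```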

What to expect in Part 2

If you have ever worked with embeddings, you will know that embedding search on its own is often not enough: you need a fair amount of pre-processing, re-ranking, and post-processing to get good results. In the next few sections, we discuss methods to improve our overall search, along with more implementation details and decisions.

Key topics covered in this part:

Adding LLM comments

Codebase indexing

Initial thoughts: framing codeQA as topK RAG on text