Quick recap

I recommend reading Part 1 before proceeding further; at the very least, refer to the problem statement.

[Learnings from codeQA - Part 1](https://sankalp1999.notion.site/Learnings-from-codeQA-Part-1-5eb12ceb948040789d0a0aca1ac23329)

GitHub Link

Demo Link

In the last post, I wrote about making a question-answering system for codebases. I discussed why GPT-4 can't answer questions about your code out of the box, and the limits of the in-context learning approach. I talked about semantic code search with embeddings and why it matters to chunk the codebase properly.
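To make that concrete, here is a minimal sketch of embedding-based code search: embed each chunk once, then rank chunks against the query embedding by cosine similarity. The example chunks, the sentence-transformers model, and the `search` helper are illustrative assumptions, not the project's actual setup.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical code chunks produced by a chunking step (placeholders, not real code).
chunks = [
    "def load_config(path): ...",
    "class UserRepository: ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalized embeddings let us use a plain dot product as cosine similarity.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    """Return the top_k chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q
    best = np.argsort(-scores)[:top_k]
    return [(chunks[i], float(scores[i])) for i in best]

print(search("where is the configuration file read?"))
```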

I got into the details of syntax-level chunking using abstract syntax trees (ASTs) and the tree-sitter library. At the end, I showed how to extract methods, classes, and constructor declarations, and how to find references across the whole codebase.
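As a refresher, a query-based extraction along those lines might look like the sketch below, using the py-tree-sitter bindings with the Java grammar. The grammar path, file name, and exact API calls are assumptions (the py-tree-sitter API has changed across versions; this matches the older pre-0.22 releases), not the exact code from Part 1.

```python
from tree_sitter import Language, Parser

# Build/load a compiled grammar; the shared-library and grammar paths are assumptions.
Language.build_library("build/languages.so", ["vendor/tree-sitter-java"])
JAVA = Language("build/languages.so", "java")

parser = Parser()
parser.set_language(JAVA)

source = open("Example.java", "rb").read()
tree = parser.parse(source)

# A tree-sitter query that captures class, method, and constructor names.
query = JAVA.query(
    """
    (class_declaration name: (identifier) @class.name)
    (method_declaration name: (identifier) @method.name)
    (constructor_declaration name: (identifier) @constructor.name)
    """
)

for node, capture in query.captures(tree.root_node):
    print(capture, source[node.start_byte:node.end_byte].decode())
```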

What to expect in Part 2

If you have ever worked with embeddings, you will know that embedding search on its own is often not enough: you need a fair amount of pre-processing, re-ranking, and post-processing to get good results. In the next few sections, we discuss methods to improve our overall search, along with more implementation details and decisions.

Key topics covered in this part:

Adding LLM comments

Codebase indexing

Initial thoughts: framing codeQA as topK RAG on text