I recommend reading Part 1 before proceeding further; at the very least, refer to the problem statement.
[Learnings from codeQA - Part 1](https://sankalp1999.notion.site/Learnings-from-codeQA-Part-1-5eb12ceb948040789d0a0aca1ac23329)
In the last post, I wrote about making a question-answering system for codebases. I discussed why GPT-4 can't answer questions about your code out of the box, and the limits of the in-context learning approach. I talked about semantic code search with embeddings and why it matters to chunk the codebase properly.
I got into the details of syntax-level chunking using abstract syntax trees (ASTs) and the tree-sitter library. At the end, I showed how to extract methods, classes, constructor declarations, and find references across the whole codebase using tree-sitter.
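To recap the idea of syntax-level chunking, here is a minimal sketch. Part 1 used tree-sitter; the snippet below uses Python's built-in `ast` module as a lightweight stand-in (the `extract_chunks` helper and the sample source are illustrative, not from the actual codeQA implementation), but the principle is the same: walk the syntax tree and emit each class and function definition as its own chunk.

```python
import ast

SAMPLE = '''
class Greeter:
    def greet(self, name):
        return f"hello {name}"

def main():
    print(Greeter().greet("world"))
'''

def extract_chunks(code: str):
    """Walk the syntax tree and collect every class, function,
    and method definition as a (kind, name, source) chunk."""
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(
                (type(node).__name__, node.name, ast.get_source_segment(code, node))
            )
    return chunks
```

Chunking on definition boundaries like this keeps each embedded unit semantically whole, which is exactly why ASTs beat naive fixed-size text splitting for code.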
Key topics covered in this part:
If you have ever worked with embeddings, you will know that embedding search alone is not very effective; you need a lot of pre-processing, re-ranking, and post-processing to get good results. In the next few sections, we discuss methods to improve our overall search, along with more implementation details and decisions.
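The overall shape of such a pipeline is retrieve-then-rerank: a coarse first stage pulls candidates by embedding similarity, and a second stage reorders them with a cheaper, more precise signal. A toy sketch, with hand-made vectors and a lexical-overlap reranker standing in for a real embedding model and reranker (all names here are hypothetical):

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, corpus, top_k=10):
    # Stage 1: coarse retrieval by embedding similarity.
    return sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:top_k]

def rerank(query_terms, candidates):
    # Stage 2: cheap lexical re-rank — boost chunks whose text
    # mentions identifiers from the query.
    def overlap(doc):
        return len(query_terms & set(doc["text"].split()))
    return sorted(candidates, key=overlap, reverse=True)
```

In a real system the reranker is usually a cross-encoder or an LLM call, but the two-stage structure is the part that matters: the first stage trades precision for recall, and the second stage buys the precision back on a small candidate set.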
Codebase indexing