Learnings from codeQA - Part 1

Introduction

Have you ever wondered how to make a question-answering system that can answer queries about a specific codebase? Imagine being able to ask questions in natural language about a project you're working on and getting accurate answers with relevant code snippets and references.

This is the problem I worked on in my recent project, codeQA. It provides a minimal UI where the user can ask questions in English (just as you would ask ChatGPT) and get answers about the codebase with relevant snippets, file names and references. It supports Java, Python, Rust and JavaScript and can be extended to other languages easily.

GitHub Link

**Demo Link**

What to expect

This post reads like a survey: it focuses on the concepts and the approach taken rather than on code. Key topics covered are why in-context learning doesn’t work, naive semantic code search, types of chunking, and syntax-level chunking using the tree-sitter library.

I assume knowledge of basic deep learning topics like word embeddings. For readers who are not familiar with them, I try to provide relevant links and Notion toggles that explain the basics.

<aside> 💡 query → question
natural language → the language we speak, like English
LLM → Large language model

</aside>

In the next few sections, let’s break the problem down into sub-problems and see what we can do to make things work.

Problem statement

The user asks a natural language question about the codebase, and the app should be able to answer it.
The app should provide relevant code snippets and answer questions such as how a class is used or what its various constructors are.
The app should provide references, or use references, to answer questions.
The app should build on the above answers to help the user generate more code.

Questions can range from simple (single-hop) to complex (multi-hop); the latter might require the LLM to gather data from multiple sources before it can answer.

Examples

what is the git repository about