
Enhancing Code Understanding and Efficiency with LangChain and GPT

Updated: Feb 12



Code understanding with LangChain



In the realm of software development, understanding and navigating extensive codebases can be a daunting task. As developers, we often spend valuable time searching for specific code snippets or trying to comprehend complex logic. To address this challenge, we present a solution powered by LangChain and GPT: a streamlined Streamlit application designed to improve code understanding and efficiency. Users can input questions in plain text, and the system analyzes the entire codebase to return precise code snippets and insightful analysis. Join us as we explore the development and functionality of the app.


LangChain


LangChain is a framework for developing applications powered by language models. Its two significant advantages are:

  1. It easily connects a language model to other sources of data

  2. It allows a language model to interact with its environment

LangChain provides multiple tools to work with LLMs. The ones used in this blog are:


1. Indexes: Indexes refer to ways of structuring documents so that LLMs can interact with them effectively. Once a document is loaded, its text is split into smaller chunks. At query time, only the chunks relevant to the input are retrieved, using similarity scores, and passed to the language model.

2. Models: Two types of models are used here. The first is the LLM itself, which answers questions; in our case this is GPT-4. The second is a text embedding model, which produces embeddings for the document chunks and is later used to retrieve similar documents.

3. Prompts: A prompt is the input given to the model to generate a response, typically crafted to guide that response in a specific direction. It can be a question, an incomplete sentence, or a statement that sets the context for the generated text. The model uses the prompt as a starting point and generates a continuation based on the patterns and knowledge encoded in its training.

4. Chains: A chain combines a PromptTemplate, a model, and guardrails to take user input, format it accordingly, pass it to the model, get a response, and then validate and (if necessary) fix the model output.

5. Memory: In some applications (chatbots being a prime example), it is highly important to remember previous interactions, at both a short-term and a long-term level. Memory does exactly that. A minimal sketch tying these pieces together follows.
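To make these concepts concrete, here is a minimal sketch that ties them together. This is an illustration, not the app's actual code; it assumes the classic (pre-1.0) langchain package, faiss-cpu installed, an OPENAI_API_KEY in the environment, and a hypothetical input file example.py.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Index: split the document into chunks and embed them for similarity search.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(open("example.py").read())  # hypothetical input file
vectorstore = FAISS.from_texts(chunks, OpenAIEmbeddings())

# Models: GPT-4 answers the questions; OpenAIEmbeddings (above) embeds the chunks.
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Prompt: sets the context and direction for the model's response.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Given this code:\n{context}\n\nAnswer this question: {question}",
)

# Chain: retrieve relevant chunks, format the prompt, and call the model.
chain = LLMChain(llm=llm, prompt=prompt)
docs = vectorstore.similarity_search("What does this code do?", k=3)
answer = chain.run(
    context="\n".join(d.page_content for d in docs),
    question="What does this code do?",
)
# Memory (e.g., ConversationBufferMemory) can additionally carry prior turns.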


The Challenge of Code Understanding in Large Repositories


1. Complexity and Scale: Large repositories often contain thousands of code files, making it challenging to locate specific code snippets or understand the overall structure and logic of the codebase. Navigating through numerous directories and files becomes time-consuming and hampers developer productivity.

2. Lack of Contextual Understanding: Understanding the context and purpose of code is crucial for making informed decisions and implementing changes effectively. In large repositories, it can be difficult to grasp the broader picture and connect different code components without spending significant time and effort.

3. Time-Intensive Exploration: Traditional methods of exploring code, such as manually searching through files or relying on keyword-based search tools, require a substantial time investment. Developers may need to review multiple code files, test different scenarios, and consult external resources, leading to inefficiencies in code understanding and problem-solving.

4. Inefficient Collaboration: When multiple developers work on the same codebase, sharing knowledge and understanding code changes can become cumbersome. Lack of effective code understanding practices hampers collaboration, increases the likelihood of errors, and slows down the development process.

5. Keeping Up with Codebase Evolution: As codebases evolve with new features, bug fixes, and optimizations, it becomes challenging to stay up to date with the latest changes and understand their implications. Without efficient code understanding mechanisms, developers may struggle to grasp the impact of modifications on the overall system.


In the next sections, we will delve into the development and functionality of our Streamlit app, showcasing how it addresses these challenges and improves code understanding in large repositories.


Developing the Streamlit App


We will create a Streamlit app to demonstrate the use case, using Google Colab as the execution environment. The user will be able to input a link to a repository and ask relevant questions about it.


Approach: The development process follows a structured three-step approach:

  1. Data Input and Preprocessing:

  • The Streamlit app provides a user-friendly interface for inputting the repository link directly (e.g., https://github.com/<****>.git).

  • Once the link is submitted, the app reads and preprocesses the files, ensuring that the data is in an optimal format for subsequent analysis.

  • This crucial step prepares the data for further processing.


2. Model Integration:

  • This step focuses on creating a vectorstore and loading the relevant language model, which powers the analysis.

  • By leveraging the loaded model, the app generates responses based on the user's input and the linked repository.

  • The resulting chain draws on code snippets from the entire codebase, enhancing the depth of analysis.


3. Output Postprocessing and Visualization:

  • The final step involves refining the generated output to ensure clarity and relevance.

  • The app incorporates a memory tool that allows users to ask follow-up questions based on the previous output, fostering a seamless interaction.

  • The processed output is displayed on the Streamlit interface, providing users with a user-friendly and intuitive experience. Additionally, users can conveniently access their conversation history and download the entire conversation for future reference.


1. Download the repository locally and preprocess the files




The save_files function: Within this function, a dedicated folder named "repo" is created to store the downloaded files. If the folder already exists, it is deleted first to ensure a clean slate, and a new folder is created in its place. The repository specified by the user input is then downloaded and saved within this folder, enabling seamless access and organization of the codebase.
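A sketch of how save_files might look (the use of GitPython here is an assumption; any method of cloning the repository would do):

import os
import shutil
from git import Repo  # GitPython; cloning could equally be done via a git subprocess

def save_files(repo_url, folder="repo"):
    """Clone the user-supplied repository into a clean local folder."""
    if os.path.exists(folder):
        shutil.rmtree(folder)  # delete any previous copy for a clean slate
    Repo.clone_from(repo_url, folder)  # download the repository into "repo"
    return folder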


The read_files function: Once the files are successfully downloaded, the read_files function comes into play. It reads all the Python files from the designated folder, then segments the code by splitting each file into individual functions, recording the source of each function along the way.
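A sketch of read_files consistent with this description (the splitting heuristic and the source-string format are assumptions, inferred from the example output later in this post):

import os

def read_files(folder="repo"):
    """Read every Python file and split it into per-function chunks with sources."""
    docs = []
    for root, _, files in os.walk(folder):
        for name in files:
            if not name.endswith(".py"):
                continue  # only Python files are handled for now
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                code = f.read()
            # Naive split on top-level "def" so that each chunk is one function.
            for chunk in code.split("\ndef ")[1:]:
                func_name = chunk.split("(")[0].strip()
                docs.append({
                    "text": "def " + chunk,
                    "source": f"file name: {path} function name: {func_name}",
                })
    return docs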


2. Build the model with OpenAI embeddings and GPT-4


In the "model" function described here, our approach utilizes FAISS to establish a vectorstore specifically designed for the document chunks. These chunks, containing the pertinent code segments, are stored locally. This eliminates the need to recreate the vectorstore for each question, resulting in enhanced computational efficiency and streamlined processing.

To facilitate user interaction and query-based code analysis, a concise and informative prompt is employed, clearly indicating the nature of the document as a code repository requiring analysis and the retrieval of relevant code snippets. Leveraging the capabilities of the load_qa_chain function, which utilizes an Index for efficient document lookup, our application effectively generates answers to user queries within the specified code context.

For memory management and maintaining continuity in the conversation, our implementation incorporates LangChain's ConversationBufferMemory tool. Additionally, the session states provided by the Streamlit framework enable effective memory utilization, ensuring a seamless and intuitive user experience.
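A sketch of how this "model" function might look, again assuming the classic (pre-1.0) langchain API; the prompt wording, index path, and return values are assumptions rather than the app's exact code:

import os
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS

def model(docs, index_path="faiss_index"):
    embeddings = OpenAIEmbeddings()
    if os.path.exists(index_path):
        # Reuse the stored index instead of re-embedding on every question.
        vectorstore = FAISS.load_local(index_path, embeddings)
    else:
        documents = [
            Document(page_content=d["text"], metadata={"source": d["source"]})
            for d in docs
        ]
        vectorstore = FAISS.from_documents(documents, embeddings)
        vectorstore.save_local(index_path)

    # Prompt indicating the document is a code repository to be analyzed.
    template = (
        "This is a code repository that needs to be analyzed. Use the code "
        "context to answer and return the relevant code snippet with its source.\n"
        "{chat_history}\nCode context:\n{context}\nQuestion: {question}\nAnswer:"
    )
    prompt = PromptTemplate(
        input_variables=["chat_history", "context", "question"], template=template
    )
    # Memory keeps prior turns available for follow-up questions.
    memory = ConversationBufferMemory(memory_key="chat_history", input_key="question")
    llm = ChatOpenAI(model_name="gpt-4", temperature=0)
    chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt, memory=memory)
    return vectorstore, chain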


3. Output Postprocessing



When a user submits a question, the application retrieves the relevant output from the chain, ensuring accurate and insightful responses. These responses, along with their associated sources, are displayed within the Streamlit interface, providing a user-friendly and interactive experience.

To enhance usability and facilitate a comprehensive understanding of the conversation, our implementation incorporates the utilization of session states. This functionality enables the display of the complete conversation history, allowing users to effortlessly review past interactions and maintain context throughout their engagement with the application.

Moreover, the option to download the complete conversation is made available, providing users with the ability to save and access the conversation at their convenience.
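A sketch of this step on the Streamlit side (an illustration under the same assumptions as the earlier sketches; it reuses the model and read_files functions from above, and the widget labels are hypothetical):

import streamlit as st

# Keep the vectorstore, chain, and history across reruns via session state.
if "chain" not in st.session_state:
    st.session_state.vectorstore, st.session_state.chain = model(read_files())
    st.session_state.history = []  # list of (question, answer) pairs

question = st.text_input("Ask a question about the repository")
if question:
    relevant = st.session_state.vectorstore.similarity_search(question, k=4)
    result = st.session_state.chain(
        {"input_documents": relevant, "question": question}
    )
    st.session_state.history.append((question, result["output_text"]))

# Display the complete conversation history.
for q, a in st.session_state.history:
    st.markdown(f"**You:** {q}")
    st.markdown(f"**GPT:** {a}")

# Offer the whole conversation as a download.
transcript = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in st.session_state.history)
st.download_button("Download conversation", transcript, file_name="conversation.txt")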


To get the Streamlit interface running in Colab, we can use localtunnel. Running the !npx localtunnel --port command produces a public URL; upon clicking it, you will be prompted to provide an endpoint IP address. To obtain this IP address, run "!curl ipv4.icanhazip.com" within the notebook. Once you enter the IP address and submit it, you will be redirected to the Streamlit app.
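In notebook cells, the sequence looks roughly like this (port 8501 is Streamlit's default; launching the app in the background this way is an assumption, not taken from the original post):

!streamlit run app.py &>/dev/null &   # start the app in the background
!npx localtunnel --port 8501          # prints the public tunnel URL
!curl ipv4.icanhazip.com              # the IP address to paste into the tunnel page

The images below depict the step-by-step process.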


Local tunnel on Colab

Access the public IP on Colab

LangChain Output

Example:


Git Repo: https://github.com/openai/openai-python


Question 1: How to find the common suffix


GPT Answer:


You can use the get_common_xfix function to find the longest common suffix or prefix of all the values in a series. The code snippet for this function is provided below:
def get_common_xfix(series, xfix="suffix"):
    """
    Finds the longest common suffix or prefix of all the values in a series
    """
    common_xfix = ""
    while True:
        common_xfixes = (
            series.str[-(len(common_xfix) + 1) :]
            if xfix == "suffix"
            else series.str[: len(common_xfix) + 1]
        )  # first few or last few characters
        if (
            common_xfixes.nunique() != 1
        ):  # we found the character at which we don't have a unique xfix anymore
            break
        elif (
            common_xfix == common_xfixes.values[0]
        ):  # the entire first row is a prefix of every other row
            break
        else:  # the first or last few characters are still common across all rows - let's try to add one more
            common_x

Source: {'source': 'file name: C:/Users/shrig/Desktop/feehdb/test_code_langchain/test_omne/openai-python\\openai\\validators.py function name: common_completion_suffix_validator'}

Follow-up Question: can we code it in any other way?

GPT Answer:


Yes, you can also use the os.path.commonprefix or os.path.commonsuffix functions to find the longest common prefix or suffix of all the values in a series. The code snippet for this is provided below:
import os
def get_common_xfix(series, xfix="suffix"):
    """Finds the longest common suffix or prefix of all the values in a series"""
    if xfix == "suffix":
        common_xfix = os.path.commonprefix(series.values)
    else:
        common_xfix = os.path.commonsuffix(series.values)
    return common_xfix
Source: {'source': 'file name: C:/Users/shrig/Desktop/feehdb/test_code_langchain/test_omne/openai-python\openai\tests\test_util.py function name: api_key_file'}

Conclusion


Efficiently understanding and analyzing code is an essential aspect of software development, especially in large repositories. Our Streamlit app, powered by LangChain and GPT, changes the way developers interact with code by enabling them to input plain text questions and receive accurate code analysis and snippets. This powerful tool enhances productivity, saves valuable time, and significantly improves code comprehension.


Future Work


  1. Currently, only Python files are handled during the preprocessing stage, and they are split into functions using the def keyword; we can extend this to other file formats.

  2. Improve the formatting of the output

