ミツモア Tech blog

The technical blog of 株式会社ミツモア, the company that runs ミツモア

Improving aider's repo map to do large, simple refactors automatically.

Introduction

aider is an LLM-based tool that you run on the command line and ask to write code for you.

It’s similar to tools like Cursor and Copilot X, in that you can ask it to change your code or build new features.

However, it has two features that make it more useful for us:

  • You can script it, so you can use it as part of automation or more complex scripts.
  • It builds a repo map of your repository so that it can have contextual awareness of your codebase, even without any files added to the context.

These two points will be very useful for our specific use case.

At MeetsMore, we have some legacy codebases that have strict mode disabled in TypeScript, because they used to be JavaScript codebases many years ago that were then migrated to TypeScript.

We really want to enable it, but doing so produces roughly 10,000 tsc errors.

Fixing that many by hand isn’t feasible, but most of the errors are easy to fix individually, so we turned to aider for help.

aider is nice, but asking an LLM to do such a big refactor raises a cost problem. Adding all of the code needed to fix our TS problems would put far too many tokens into aider, which would be expensive and would also exceed the context window of most LLMs.

So we built a small scripting tool that takes tsc output as its input, groups the errors by file, and lets you select which files and errors you want to fix.

For each selected file, it then invokes aider, asking it to fix the errors in that file.
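
The grouping step can be sketched roughly as follows. This is an illustration rather than our actual tool: it assumes tsc's default (non-pretty) diagnostic format, and the names `group_errors_by_file` and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
import re
from collections import defaultdict

# Matches tsc's default (non-pretty) diagnostic lines, for example:
#   src/api/user.ts(12,5): error TS2322: Type 'string' is not assignable ...
TSC_ERROR = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): "
    r"error (?P<code>TS\d+): (?P<message>.*)$"
)


def group_errors_by_file(tsc_output: str) -> dict:
    """Group tsc diagnostics by the file they were reported in."""
    grouped = defaultdict(list)
    for raw_line in tsc_output.splitlines():
        match = TSC_ERROR.match(raw_line.strip())
        if match is None:
            continue  # summary lines and continuation lines are skipped
        error = match.groupdict()
        grouped[error.pop("file")].append(error)
    return dict(grouped)


def build_prompt(fname: str, errors: list) -> str:
    """Build the instruction for one file (hypothetical wording)."""
    lines = [f"Fix these TypeScript strict-mode errors in {fname}:"]
    lines += [f"- line {e['line']}: {e['code']}: {e['message']}" for e in errors]
    return "\n".join(lines)


sample = """\
src/api/user.ts(12,5): error TS2322: Type 'string | undefined' is not assignable to type 'string'.
src/api/user.ts(30,9): error TS2532: Object is possibly 'undefined'.
src/models/article.ts(4,3): error TS7006: Parameter 'x' implicitly has an 'any' type.
Found 3 errors in 2 files.
"""

grouped = group_errors_by_file(sample)
for fname, errors in grouped.items():
    print(build_prompt(fname, errors))
    print()
```

Each prompt would then be handed to aider together with the file name, for example through aider's --message option, so that only one file's errors are in play per invocation.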

We’re using Claude Haiku for this, and it costs less than $0.05 per file.

On our test repositories, it worked great, but when we tried it on our monorepo, it wasn’t working at all. 😟

So we investigated a little, and discovered that the issue was with aider’s repo map.

Repo Map

You should read Paul’s description of the aider repo map yourself, but here is a quick explanation of what it does:

  1. Goes through all files in your codebase and collects all symbols (functions, constants, etc).
  2. Builds a graph of the symbols and files, and scores how closely related they are using the PageRank algorithm.
  3. Sorts the symbols so the most relevant ones come first.
  4. Adds as much of the sorted list to the LLM context as the configured context limit allows.

This works well on small repos: aider can spot files that look relevant to the task and ask you to add them to the context. But a few flaws make it problematic on big monorepos:

  1. It assumes all symbols are unique. If you have 10 symbols called fetchRequest, it assumes those are all the same symbol.
  2. It doesn’t model relevance according to the files you are targeting for changes.
  3. Roughly, it assumes frequency of usage increases relevance.

On our repo, this meant that the repo map consisted mostly of our feature flag methods, or of very general symbols like name().
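
A toy model shows why this happens. The sketch below is not aider's implementation: it is a plain power-iteration PageRank over hypothetical data, with symbols matched by name only, which is the assumption described in the flaws above.

```python
from collections import defaultdict

# Hypothetical toy data: the identifiers each file defines and references.
DEFINES = {
    "flags.ts": {"isEnabled"},
    "models/user.ts": {"User"},
    "models/article.ts": {"Article"},
}
REFERENCES = {
    "routes/user.ts": ["isEnabled", "User", "Article"],
    "routes/article.ts": ["isEnabled", "Article"],
    "routes/home.ts": ["isEnabled"],
    "models/user.ts": ["isEnabled", "Article"],
    "models/article.ts": ["isEnabled"],
}


def build_edges(defines, references):
    """Link each referencing file to every file defining a symbol of that name."""
    definers = defaultdict(set)
    for fname, names in defines.items():
        for name in names:
            definers[name].add(fname)
    edges = []
    for src, names in references.items():
        for name in names:
            for dst in sorted(definers.get(name, ())):
                if dst != src:
                    edges.append((src, dst))
    return edges


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank held by dangling nodes is spread evenly."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


nodes = sorted(set(DEFINES) | set(REFERENCES))
rank = pagerank(nodes, build_edges(DEFINES, REFERENCES))
ranked = sorted(nodes, key=rank.get, reverse=True)
for fname in ranked:
    print(f"{rank[fname]:.3f}  {fname}")
```

In this toy graph flags.ts always ranks first, because every file references it, even though it tells the LLM nothing about the types involved in a particular error.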

Solution

We figured this could be fixed by improving aider and giving it alternative repo map strategies that you can select for different task types.

The aider code is clean, with well separated concerns, so we found it easy to find the repo map code and refactor it a little so that we could provide different strategies for building the repo map:

        ranked_tags = self.ranking_strategy.get_ranked_tags(
            chat_fnames,
            other_fnames,
            mentioned_fnames,
            mentioned_idents,
            progress=spin.step,
            get_rel_fname=self.get_rel_fname,
            get_tags=self.get_tags,
        )

This outsources the job of listing the ranked tags to a strategy class, and we can then build many of them.

First, we extracted the default PageRankStrategy, and then created our new ImportFloodStrategy.
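
The shape of that refactor can be sketched as below. Only the PageRankStrategy and ImportFloodStrategy names and the get_ranked_tags signature come from our change; the RankingStrategy base class, the STRATEGIES registry and make_strategy are hypothetical names used for illustration, and the ranking logic is stubbed out.

```python
from abc import ABC, abstractmethod


class RankingStrategy(ABC):
    """Hypothetical base class: turns the chat context into an ordered tag list."""

    @abstractmethod
    def get_ranked_tags(self, chat_fnames, other_fnames, mentioned_fnames,
                        mentioned_idents, progress=None,
                        get_rel_fname=None, get_tags=None):
        """Return tags ordered from most to least relevant."""


class PageRankStrategy(RankingStrategy):
    def get_ranked_tags(self, chat_fnames, other_fnames, mentioned_fnames,
                        mentioned_idents, progress=None,
                        get_rel_fname=None, get_tags=None):
        return []  # the existing PageRank ranking would live here


class ImportFloodStrategy(RankingStrategy):
    def get_ranked_tags(self, chat_fnames, other_fnames, mentioned_fnames,
                        mentioned_idents, progress=None,
                        get_rel_fname=None, get_tags=None):
        return []  # the import-following ranking would live here


# Maps the value of a command-line option to a strategy class.
STRATEGIES = {
    "pagerank": PageRankStrategy,
    "importflood": ImportFloodStrategy,
}


def make_strategy(name):
    """Instantiate the strategy selected by name, failing loudly on typos."""
    try:
        return STRATEGIES[name]()
    except KeyError:
        choices = ", ".join(sorted(STRATEGIES))
        raise ValueError(f"unknown ranking strategy {name!r}; choose from: {choices}") from None


strategy = make_strategy("importflood")
print(type(strategy).__name__)
```

Keeping the selection in one lookup table means a new strategy only needs a class and one registry entry.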

Now, we could invoke aider like this:

aider --ranking-strategy importflood --haiku --show-repo-map apps/mobile/src/components/PhotoCarousel/PhotoCarousel.tsx 

Combined with pip’s editable mode (pip install -e) inside a venv, this gave us a very easy way to test our changes: aider would print the repo map, and we could run it against multiple repos to see whether the repo map ‘looked’ any more relevant.

ImportFloodStrategy

First, let’s discuss what we really want for our specific use case.

Mostly, the TypeScript errors we were having fell into two categories:

  1. Easily fixed without any context.
  2. Type mismatches, where you need to know the types in order to fix the problem.

The second category is the one we need to improve aider to fix.

Imagine a situation where a /user API route imports a User model type, which also depends on an Article model type, and the issue is that a field in Article is optional and isn’t being handled.

This is a very common type of error among our roughly 10,000 errors.

For those, we want to build a repo map that looks something like this:

api/routes/user.ts imports all of the files in the diagram. I’ve highlighted in green the ones that we really need to see to fix the problem.

So, let’s write a new strategy that works like this:

  1. Identify all symbols we use in our target file.
  2. Follow the imports and find their definitions.
  3. Add those definitions to our repo map with a high priority.
  4. For the remainder of the repo map, if we have any space, populate it with the default PageRank behaviour.

Our get_ranked_tags implementation looks like this:

    def get_ranked_tags(
        self,
        chat_fnames,
        other_fnames,
        mentioned_fnames,
        mentioned_idents,
        progress=None,
        get_rel_fname=None,
        get_tags=None,
    ):
        # Relative paths of the files already in the chat
        chat_rel_fnames = {get_rel_fname(fname) for fname in chat_fnames}

        # Map relative path -> resolved path for every file imported by a chat file.
        # A dict keeps insertion order, so the result follows the import order.
        ranked_files = {}
        for chat_fname in chat_fnames:
            imported_modules = self.parse_imported_modules(chat_fname)
            for module_name in imported_modules:
                # Using a language specific resolver, resolve the import.
                fname = Resolver.resolve_any(chat_fname, module_name)
                if fname is None:
                    continue
                rel_fname = get_rel_fname(fname)
                # Skip files outside the repository and the chat files themselves
                if rel_fname.startswith('..') or rel_fname in chat_rel_fnames:
                    continue
                ranked_files.setdefault(rel_fname, fname)

        # Create ranked tags list
        ranked_tags = []
        for rel_fname, fname in ranked_files.items():
            # get_tags takes the file path first, then the relative path
            ranked_tags.extend(get_tags(fname, rel_fname))

        return ranked_tags
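
The fourth step of the plan, filling any remaining space with the default ranking, is not part of the method above. One way it might be layered on top is sketched here. merge_ranked_tags is a hypothetical helper, the tags are simplified to hashable tuples, and the limit counts tags, whereas the real repo map is trimmed by token budget.

```python
def merge_ranked_tags(primary, fallback, limit=None):
    """Return primary tags first, then fallback tags that are not already present.

    Tags must be hashable (tuples here). Order is preserved and duplicates
    are dropped, so import-derived tags always outrank the fallback ranking.
    """
    seen = set()
    merged = []
    for tag in list(primary) + list(fallback):
        if tag in seen:
            continue
        seen.add(tag)
        merged.append(tag)
    return merged if limit is None else merged[:limit]


# Hypothetical tags: (relative file name, symbol name, line number)
import_tags = [
    ("models/user.ts", "User", 3),
    ("models/article.ts", "Article", 5),
]
pagerank_tags = [
    ("flags.ts", "isEnabled", 10),
    ("models/user.ts", "User", 3),  # duplicate of an import-derived tag
    ("utils/name.ts", "name", 1),
]

merged = merge_ranked_tags(import_tags, pagerank_tags, limit=3)
for tag in merged:
    print(tag)
```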

Now we run into a problem.

Before, aider could use tree-sitter to parse a file in any language, build an AST, and find all the symbols; tree-sitter has a large set of language parsers available.

The repo map used to be built by simply parsing every file in the repository (with a cache to speed up future invocations), which never required knowing which file an import points to. Now we need to actually resolve imports.

This can be tricky, because there are many ways to import things, and they differ between languages. TypeScript, for example, can import things like:

import * as myUtils from '../lib/utils'
import { awsUtils } from '@aws/utils'
import { monorepoUtils } from '@meetsmore/utils'
// many more

And these files could be in node_modules, they could be relative, they could be in our monorepo workspace, and how they’re resolved even depends on our resolution strategy and package manager!
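
To make the three cases concrete, here is a rough sketch that pulls the module specifiers out of the statements above and labels them. It is a simplification under stated assumptions: a regular expression stands in for a real parser and only handles single-line imports, and the names extract_specifiers and classify_specifier, plus the idea of recognising workspace packages by a scope prefix, are hypothetical.

```python
import re

# Single-line `import ... from '...'` and bare `import '...'` statements only.
IMPORT_SPECIFIER = re.compile(
    r"""^\s*import\s+(?:.+?\s+from\s+)?['"]([^'"]+)['"]""",
    re.MULTILINE,
)


def extract_specifiers(source):
    """Return the module specifiers of the import statements found in source."""
    return IMPORT_SPECIFIER.findall(source)


def classify_specifier(specifier, workspace_scopes=("@meetsmore/",)):
    """Label a specifier by where its target file would have to be looked up."""
    if specifier.startswith("."):
        return "relative"   # resolved against the importing file's directory
    if specifier.startswith(tuple(workspace_scopes)):
        return "workspace"  # another package inside the monorepo
    return "external"       # something installed under node_modules


source = """\
import * as myUtils from '../lib/utils'
import { awsUtils } from '@aws/utils'
import { monorepoUtils } from '@meetsmore/utils'
// many more
"""

classified = [(s, classify_specifier(s)) for s in extract_specifiers(source)]
for specifier, kind in classified:
    print(f"{kind:10} {specifier}")
```

Each label needs a different lookup: only the relative case can be answered from the file system alone.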

So, we need to write Resolvers for each language.

import importlib.util
import sys
from pathlib import Path


class PythonResolver(Resolver):
    # Called without an instance, so this is a static method.
    @staticmethod
    def resolve(from_fname, module_name):
        """Resolve a module name to its defining file (python only)"""
        # Add the directory of the importing file to sys.path
        file_dir = str(Path(from_fname).parent.absolute())
        sys.path.insert(0, file_dir)

        try:
            # Try to find the module spec.
            # Note: for dotted names, find_spec imports the parent packages.
            spec = importlib.util.find_spec(module_name)
            if spec and spec.origin and Path(spec.origin).exists():
                return spec.origin
        except (ImportError, AttributeError, ValueError):
            # Unresolvable modules are treated as "not found"
            pass
        finally:
            # Remove the added path
            sys.path.pop(0)

        return None
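
For comparison, the relative-import case for TypeScript might be sketched like this. It is not our JavaScript resolver: resolve_relative_import and TS_EXTENSIONS are hypothetical names, and the lookup is a simplified approximation that ignores tsconfig path mappings, package.json fields, node_modules and workspace packages.

```python
import tempfile
from pathlib import Path

# Extensions tried in order when the specifier has none (an assumed order).
TS_EXTENSIONS = (".ts", ".tsx", ".d.ts", ".js", ".jsx")


def resolve_relative_import(from_fname, specifier):
    """Resolve a relative specifier such as '../models/user' to a file path.

    Returns None for non-relative specifiers and for targets that do not exist.
    """
    if not specifier.startswith("."):
        return None
    base = (Path(from_fname).parent / specifier).resolve()
    candidates = [base]                                              # exact file
    candidates += [Path(f"{base}{ext}") for ext in TS_EXTENSIONS]    # user -> user.ts
    candidates += [base / f"index{ext}" for ext in TS_EXTENSIONS]    # dir -> dir/index.ts
    for candidate in candidates:
        if candidate.is_file():
            return str(candidate)
    return None


# Build a throwaway project layout to exercise the resolver.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib" / "utils").mkdir(parents=True)
    (root / "lib" / "utils" / "index.ts").write_text("export const x = 1\n")
    (root / "models").mkdir()
    (root / "models" / "user.ts").write_text("export type User = {}\n")
    (root / "routes").mkdir()
    importer = root / "routes" / "user.ts"
    importer.write_text("")

    resolved_model = resolve_relative_import(importer, "../models/user")
    resolved_index = resolve_relative_import(importer, "../lib/utils")
    resolved_missing = resolve_relative_import(importer, "./missing")
    resolved_package = resolve_relative_import(importer, "@aws/utils")

print(resolved_model, resolved_index, resolved_missing, resolved_package)
```

Package and workspace imports are where the package manager and resolution strategy come in, which is why they are left out of this sketch.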

Current Status

We’ve done the following:

  1. Built a harness for scripting aider based on TypeScript errors.
  2. Modified aider so it supports alternate repo mapping strategies.
  3. Built an alternative mapping strategy that follows imports.
  4. Built a Python resolver and started on a JavaScript resolver.

Next steps:

  1. Finish the JavaScript resolver.
  2. Use the modified aider to fix our TypeScript errors.
  3. Open a PR in the aider repository to contribute our work back upstream.

We’ve paused this work for now as we have some other more urgent priorities, but we expect to finish it in January 2025.