Semantic Coupling

Library for analyzing semantic relationships between source code files.

What is it?

Semantic Coupling is a library I built to analyze relationships between source code files by examining their semantic similarity. Rather than relying solely on structural dependencies (imports, inheritance), it uses natural language processing to identify how closely related different files are based on their content and concepts.

This reveals implicit connections that traditional static analysis might miss – like two classes that deal with related concepts but don’t directly import each other.

GitHub: https://github.com/loehnertz/semantic-coupling

Why I built it

Traditional coupling metrics (afferent/efferent coupling, instability, etc.) only capture explicit structural relationships. But codebases also have semantic relationships – classes that should probably be in the same module because they deal with related concepts, even if they don’t directly depend on each other.

I built this library to explore whether NLP techniques could identify these hidden relationships. It turns out they can, and the results are useful for tasks such as microservice decomposition and validating module boundaries.

How it works

The library uses the SemanticCouplingCalculator as its main entry point. You provide:

A map of file names to their raw source code contents
The programming language (currently Java)
The natural language context (currently English)

The calculator returns a list of SemanticCoupling objects representing detected relationships with similarity scores.
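As a rough illustration of the first input, here is a minimal sketch of how the map of file names to raw source contents could be assembled from a directory using only the Java standard library. The class name SourceFileCollector and the choice of root-relative paths as keys are assumptions made for this example; the library itself only asks for a map of file names to contents, and the call into SemanticCouplingCalculator is deliberately left out because its exact signature is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper for this example; it is not part of the library.
final class SourceFileCollector {

    private SourceFileCollector() {}

    /** Walks root recursively and maps each .java file's root-relative path to its raw contents. */
    static Map<String, String> collectJavaSources(Path root) throws IOException {
        Map<String, String> files = new TreeMap<>(); // sorted keys keep the result deterministic
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".java"))
                 .forEach(p -> files.put(root.relativize(p).toString(), read(p)));
        }
        return files;
    }

    /** Reads one file as UTF-8 (an assumption about the encoding of the sources). */
    private static String read(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting map is what would be handed to the calculator together with the programming language and natural language settings.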

Behind the scenes, it analyzes the textual content of your code – variable names, method names, comments, string literals – and computes semantic similarity using NLP techniques. Files that use similar terminology and concepts score higher, even if they’re not structurally coupled.
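To make the idea concrete, here is a deliberately simple sketch of term-based similarity: it splits source text into lowercase terms (including camelCase parts), counts them, and compares two files with cosine similarity. This is an illustrative stand-in and not the library's actual algorithm; the class TermSimilaritySketch and all of its methods are hypothetical, and a real pipeline would likely also filter language keywords and stop words.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a bag-of-words comparison, not the library's implementation.
final class TermSimilaritySketch {

    private TermSimilaritySketch() {}

    /** Counts lowercase terms, splitting on non-letters and on camelCase boundaries. */
    static Map<String, Integer> termFrequencies(String source) {
        Map<String, Integer> counts = new HashMap<>();
        // Insert a space at each lower-to-upper transition: "orderTotal" -> "order Total".
        String spaced = source.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
        for (String token : spaced.split("[^A-Za-z]+")) {
            if (token.length() > 2) { // skip empty strings and very short tokens
                counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two term-frequency vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * (double) b.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Convenience wrapper: similarity of two raw source strings. */
    static double similarity(String sourceA, String sourceB) {
        return cosine(termFrequencies(sourceA), termFrequencies(sourceB));
    }
}
```

With this sketch, two files that both talk about invoices score higher than an invoice file and a shipping file, even though none of them reference each other.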

Use cases

Semantic coupling analysis is particularly useful for:

Microservice Decomposition – Identifying which classes belong together conceptually, even if they don’t have direct dependencies
Code Review – Spotting files that should be reviewed together because they deal with related concepts
Architecture Validation – Checking if module boundaries align with semantic groupings
Legacy Code Understanding – Discovering implicit relationships in unfamiliar codebases
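
For the architecture validation case, a check over the scored pairs might look like the sketch below: it flags pairs that score highly but live in different modules. Everything here is an assumption for illustration: ScoredPair is a hypothetical stand-in rather than the library's SemanticCoupling type, the module is taken to be the directory part of a '/'-separated file name, and the threshold is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a post-processing step one could run on scored file pairs.
final class BoundaryCheckSketch {

    /** Hypothetical stand-in for one scored relationship between two files. */
    static final class ScoredPair {
        final String fileA;
        final String fileB;
        final double score;

        ScoredPair(String fileA, String fileB, double score) {
            this.fileA = fileA;
            this.fileB = fileB;
            this.score = score;
        }
    }

    private BoundaryCheckSketch() {}

    /** Treats everything before the last '/' as the module; files without a '/' share the empty module. */
    static String moduleOf(String fileName) {
        int lastSlash = fileName.lastIndexOf('/');
        return lastSlash < 0 ? "" : fileName.substring(0, lastSlash);
    }

    /** Returns pairs that score at or above the threshold yet live in different modules. */
    static List<ScoredPair> crossModulePairs(List<ScoredPair> pairs, double threshold) {
        List<ScoredPair> flagged = new ArrayList<>();
        for (ScoredPair pair : pairs) {
            boolean sameModule = moduleOf(pair.fileA).equals(moduleOf(pair.fileB));
            if (pair.score >= threshold && !sameModule) {
                flagged.add(pair);
            }
        }
        return flagged;
    }
}
```

Each flagged pair is a candidate for a closer look: either the two files belong in the same module, or the shared vocabulary is coincidental.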

Tech stack

Language: Kotlin (100% of the codebase)
Build System: Maven
Distribution: JitPack repository for easy dependency management
License: Apache 2.0

The API provides both Kotlin and Java interfaces, making it easy to integrate into existing toolchains regardless of which JVM language you’re using.

What I learned

Building this library taught me that semantic analysis is a powerful complement to structural analysis, but it’s not a replacement. You need both to get a complete picture of code relationships.

I also learned that natural language matters – the same logic written with English identifiers and comments scores differently from a version written with German ones, because the analysis works on the words themselves. This is obvious in hindsight, but it wasn’t something I’d considered initially.

The library is open to contributions that add support for more programming languages and natural languages. It currently supports only Java and English, but the architecture is designed to be extensible to other combinations.