Related Transaction Clue Tracking System Technical Documentation

📋 Table of Contents

I. Background: What System Are We Building and Why? II. Unified Path Query: One Entry, Automatically Selects the Best Engine III. Path Query API: Parameter Governance and Full-Link Blacklist IV. Cycle Detection: From “Extended Paths” to “Interpretable Loops” V. Graph Construction and Fusion: Transforming an “Account-Level Graph” into a “Entity-Level Relationship Graph” VI. Summary: Turning Graph Algorithm Capabilities into a Viable Business Product

I. Background: What System Are We Building and Why?

The essence of many business analyses is not “querying a single table,” but finding interpretable links and patterns among a large number of entities and relationships. Clue discovery systems are designed for this type of problem: organizing data such as enterprises/persons/accounts/transactions/relationships into a graph, performing queries, merging, and analysis on the graph, and outputting a visualized relationship network and path results for the front-end, helping business personnel locate clues more quickly.

1.1 System Function

It transforms scattered entity and transaction data into a “traceable, interpretable, and visualizable” relationship network, used in scenarios such as relationship link discovery, fund flow tracking, and anomaly pattern identification.

1.2 Typical Application Scenarios

Enterprise Relationship Network Analysis: Query the relationship chain between a company and its upstream/downstream partners, related companies/personnel to assist with due diligence, risk control, and compliance reviews.
Fund Flow Tracking: Starting from a single entity, trace the destination/source of funds to identify intermediary accounts and key nodes.
Transaction Path Discovery: Given two groups of entities (or accounts), find the “shortest link” or “multiple alternative links” between them to explain “why they are related.”
Anomaly Pattern Identification (Cycle/Return Flow): Detect if funds circulate back to the starting point, helping to identify suspicious patterns like matched orders or round-tripping.
Account Fusion/Gang Identification: Merge the “account layer” into the “account holder layer,” or merge individuals into the “gang layer,” simplifying the graph into a business perspective that is easier to understand.

1.3 Data and Technology Landscape

Data Landscape: Nodes (person/enterprise/bank account/gang/industry, etc.) + relationships (transactions, account ownership, social relationships, regular connections, etc.). Relationship edges often carry attributes (transaction time, amount, method, notes, etc.).
Storage Landscape: Graph data is stored in Neo4j, while metadata and configurations are stored in MySQL (e.g., schema, display configurations, default database selection, blocked nodes).
Service Landscape: Django provides APIs, and the front-end visualizes and interacts with nodes + edges + path_list after receiving them.

1.4 “One Path Query” Implemented in Three Ways

Why implement “one path query” in three ways? Because user requirements vary significantly in different interaction contexts:

Need only one shortest explanatory link: It must be fast and stable, preferably providing results in seconds.
Need multiple alternative links: It must be controllable (direction, filtering, termination, blocking) without slowing down the system.
Deep paths (high number of hops): The search space expands exponentially, requiring stronger algorithms and engineering safeguards (e.g., temporary subgraphs, dynamic K, resource reclamation).

Correspondingly, we use the three implementations in the system as follows (detailed in Section II):

Shortest Single Path (Interaction “Only Need 1”): When limit == 1, the system prioritizes the shortest path capability. If constraints like blocked nodes exist, it uses a controllable path extension method to ensure “nodes that should not appear are excluded during the query phase.” The goal is maximum response speed + interpretable results.
Shallow Multi-Paths (Small Depth): When maxLevel is small and multiple alternative links are needed, an extensible query with direction/filtering/termination/blocking capabilities is used. The goal is constraint-friendly + stable performance.
Deep Multi-Paths (Large Depth): When maxLevel is large, the system switches to a graph algorithm engine (e.g., K-shortest path), combined with temporary subgraphs and dynamic parameter control. The goal is to avoid search explosion + provide sufficient candidate paths within controllable resources.

II. Unified Path Query: One Entry, Automatically Selects the Best Engine

Path queries are the function in graph analysis systems most prone to “slow results for demands, incomplete results for speed.” We’ve converged its implementation into a unified entry point: input start/end node sets, direction, depth range, number of results, transaction type preferences, and optional “blocked nodes,” with internal strategy routing.

2.1 Key Inputs and Business Semantics

Start/End Points: Supports combinations of multiple start and end points to avoid multiple front-end requests.
Direction Control (Transaction Direction):
- 0: Forward (start → end)
- 1: Reverse (start ← end)
- 2: Bidirectional (undirected/mixed)
Depth Range: minLevel/maxLevel controls path length (hops).
limit: Number of paths to return. We specifically optimize for limit == 1 as a strong scenario for “the shortest/most appropriate one.”
Transaction Type Switch: Supports filtering preferences like “only large amounts/only small amounts/both.”
Blocked Nodes (blacklist): Used for entities that should be “hidden/excluded” for business reasons (e.g., false-positive nodes, sensitive nodes, noisy nodes), requiring they cannot appear in the path.

2.2 Strategy 1: Ultra-Fast Shortest Path for `limit == 1`

When the user only needs one path, the system prioritizes the most efficient route:

Without Blocked Nodes: Directly uses Neo4j’s native shortest path capability, matching with direction patterns (forward/reverse/bidirectional) for optimal performance.
With Blocked Nodes: Since the native shortest path doesn’t easily allow strict exclusion of nodes during the “search phase,” it switches to APOC for path extension, using blacklistNodes in the extension parameters to filter out blocked nodes “preemptively.” Results are then sorted by hop count, and the shortest one is taken.

The benefit is that common interactions (viewing just one shortest relationship chain) are as fast as possible, and blocking rules take effect during the query phase, avoiding the poor experience of “fetching results and then deleting them, potentially leading to empty results.”

2.3 Strategy 2: Shallow Multi-Paths (`maxLevel < 6`) Use APOC, Balancing Performance and Control

When depth is shallow but multiple paths are needed, we use APOC’s extension capabilities:

Relationship Filtering: Combine “base relationships” (e.g., social relationships, bank accounts) with “transaction-type relationships” (account transactions/large/small) into a relationshipFilter, generating different filter strings based on the direction parameter.
Terminator Nodes: Explicitly set the end node set using terminatorNodes to avoid meaningless diffusion.
Blocked Nodes: Directly use APOC’s blacklistNodes for exclusion during the search phase.
Deduplication: Merge duplicate paths with the same node sequence to prevent identical paths from occupying multiple result slots.

The characteristic of this path is: very friendly to business conditions (direction, filtering, termination, blocking), and performance is stable when depth is shallow.

2.4 Strategy 3: Deep Multi-Paths (`maxLevel >= 6`) Use GDS Yen’s + Temporary Subgraph Projection

When depth increases, APOC extensions can easily lead to “search space explosion.” At this point, we switch to the GDS (Graph Data Science) solution. The core approach is:

Dynamically Create Temporary Subgraph Projection: Create a projection graph valid only for the lifecycle of this query, deleting it immediately after the query to avoid polluting the global graph and prevent long-term memory usage.
More Granular Direction Control: Not all relationships are affected by “transaction direction.” In the current implementation, “transaction-type relationships” (at least including account transaction, and can include large/small transactions based on front-end switches) are set with controllable direction:
- Forward: NATURAL
- Reverse: REVERSE
- Bidirectional: UNDIRECTED Other relationships remain NATURAL to ensure non-transaction edges like “social relationships/account relationships” are not incorrectly flipped.
Estimate Shortest Distance d First, Then Dynamically Set K:
- First, use a shortest path calculation to get the shortest hop count d.
- Then set K = 2^d, adding lower/upper limits (e.g., ensure enough candidate paths, but cap to prevent explosion). The intuition for this strategy is: The further the node pair, the more candidate paths are needed to find multiple “business-usable” links, but a hard upper limit must be set to protect the system.
Reconstruct Relationship Information After Yen’s K-Shortest Path Output: GDS outputs a node sequence. We match actual relationships based on adjacent node pairs to bring edge attributes (transaction time, amount, method, etc.) back to the front-end for display.
Blocked Node Handling: Due to the relationship between projection and reconstruction, blocked nodes in deep path scenarios are filtered again in the result phase (ensuring the final path contains no blocked nodes).
Resource Release: Regardless of query success or failure, ensure the temporary projection graph is deleted to avoid resource leakage.

The main benefits of adopting this solution for deep paths are: stronger stability and controllability, preventing a single deep query from crippling the system.

III. Path Query API: Parameter Governance and Full-Link Blacklist

The API entry point for path queries does two main things:

Parameter Normalization: Direction, depth, number of results, time range, amount range, transaction count range, etc., are all converted to unified backend types and default values at the entry point.
Blacklist Injection: Read the “blocked node list” from the current default database configuration and pass it as blacklist_nodes to the path query engine.

The value is that blocking rules don’t need to be scattered across various query logics; the API entry point is only responsible for “supplementing the rules,” and the underlying layer executes them uniformly.

Additionally, before path query results are returned, they undergo graph construction (see Section IV), and the results are written to a user-specific cache. This is used to “avoid re-querying the database when the same user performs consecutive operations (hide/append/switch fusion mode).”

IV. Cycle Detection: From “Extended Paths” to “Interpretable Loops”

The business goal of cycle detection is to start from accounts related to a certain entity and find cyclic paths where funds “circle back to themselves” in the network, used to identify patterns like round-tripping, matched orders, and money laundering.

4.1 Starting Set: Locate the “Account Set” from “Person/Enterprise”

The user selects “a certain entity” in the interface, but the real carriers of fund flow in the graph are bank account nodes. In implementation, a set of account node IDs is first queried based on the entity information (e.g., partial matching of the account holder field) as the starting set for subsequent expansion.

4.2 Use APOC for BFS Expansion with “Return to Starting Set” as Termination Condition

The core query logic is APOC expansion:

relationshipFilter: Limited to fund-related and necessary auxiliary relationships (e.g., account transactions, cash deposits, withdrawals, account relationships, social relationships, etc.), can include direction tendencies (common practice is mainly “outward diffusion”).
bfs: Using breadth-first is better for discovering short cycles first.
terminatorNodes: Set to the starting account set, allowing the path to naturally converge when “returning to any starting account,” forming a cycle.
blacklistNodes: Also supports excluding blocked nodes during the search phase.
uniqueness: Uses a global relationship deduplication strategy to reduce redundant traversal.

4.3 Temporal Validation: Turning “Having a Cycle” into “A Reasonable Fund Chain”

The existence of a cycle in a graph does not necessarily mean the fund chain is reasonable. Cycle detection applies a layer of temporal validation to the results:

Enforce time order constraints only on transaction-type edges (e.g., account transactions).
Eliminate paths where the time order is significantly reversed.

The significance of this step is to filter “topological cycles” into “cycles that better align with the actual sequence of business events,” improving the interpretability and usability of the results.

Finally, the cycle detection results reuse the same graph construction and fusion logic for output to the front-end (see Section IV).

V. Graph Construction and Fusion: Transforming an “Account-Level Graph” into an “Entity-Level Relationship Graph”

The raw query results from Neo4j are typically a collection of “nodes(path) + relationships(path),” but what the front-end needs is:

Nodes: With styles, icons, display names, clustering information.
Edges: With direction, categories, display fields (transaction time/amount), and the ability to be highlighted by the path list.

Therefore, we designed a unified graph construction function to convert the query results into a graph data structure suitable for visualization. At this stage, we also complete “account fusion” and “gang fusion.”

The system supports a graph simplification: when one end of a social relationship edge is a “gang” node, the related person is mapped to the gang node. In implementation, it first traverses the social relationship edges in the query results to build a person_id -> groupNode mapping.

Placing this step before graph construction is crucial because subsequent processing of both nodes and edges can use the same mapping table for fast substitution without repeatedly querying the database.

5.2 Account Fusion: Batch Query Account Holders to Avoid N+1

The goal of account fusion is to replace “bank account nodes” with their holders (person/enterprise) nodes, thereby elevating the presentation from “between accounts” to “between entities.”

A key optimization here is that instead of querying the holder for each account individually, we:

Collect all distinct bank account IDs appearing in the query results.
Perform one batch query to get the “account → holder” mapping.
Construct lightweight holder node objects and cache them for subsequent replacement.

This reduces holder queries from N times to 1 batch, making a significant difference when there are many path results or many accounts.

5.3 Endpoint Replacement: Replace Not Only Nodes but Also Edge Endpoints

If account fusion only replaces nodes without replacing edge endpoints, it can lead to broken graph structures or confusion where “edges still point to account nodes.” During graph construction, for each relationship, we simultaneously:

If the start point is a bank account → replace with its holder.
If the end point is a bank account → replace with its holder.
Then apply gang fusion replacements.
If the start and end points ultimately fall on the same node (self-loop), skip it.

5.4 Path Visualization Support: Each Edge Carries a “Set of Paths It Belongs To”

To support the front-end “clicking a path in the list → highlighting that path in the graph,” the graph construction phase attaches, for each edge, a list of path IDs it belongs to (e.g., id2path_ids). It also generates a path_list for the UI (path titles, length, summary of nodes passed, list of edge IDs).

5.5 Deduplication and Graph Quality: Edge Deduplication + Bidirectional Edge Folding + Discrete Point Filtering

It’s common in real data that:

The same relationship appears in multiple paths.
Bidirectional relationships cause both “A—B” and “B—A” to appear in the graph.
Disconnected discrete points appear after graph construction (affecting readability).

Therefore, several post-processing steps are applied after graph construction:

Deduplicate by Edge ID: Keep only one instance of the same relationship.
Fold Bidirectional Edges: For specific categories (considered “undirected relationships”), keep only one edge for the same unordered pair to reduce visual clutter.
Filter Discrete Points: Keep only nodes that appear at least in the set of edge endpoints; then apply blocked node filtering to ensure the front-end sees a “compact and interpretable subgraph.”

Finally, node size calculation, edge style adjustment, and legend statistics are performed to enhance readability and interaction experience.

VI. Summary: Turning Graph Algorithm Capabilities into a Viable Business Product

Looking back at this implementation, the core isn’t “which algorithm is more advanced,” but rather:

Unified Entry: A single API/single core function reduces branching and maintenance costs.
Layered Strategy: limit == 1, shallow multi-paths, and deep multi-paths each use the most appropriate engine.
Rule Pre-positioning: Blocked nodes are applied during the query phase as much as possible, with a final safeguard in the result phase.
Engineering Optimizations: Batch queries replace N+1 queries, temporary subgraph lifecycle management, deduplication, and folding improve visualization quality.
Interpretability: Cycle detection adds temporal validation, path lists are highlightable, and edges carry transaction summaries.

If you’re also working on systems involving “graph databases + relationship analysis + visualization,” I hope this approach can provide some inspiration for your work.