Related Transaction Clue Tracking System Technical Documentation
đ Table of Contents
I. Background: What System Are We Building and Why? II. Unified Path Query: One Entry, Automatically Selects the Best Engine III. Path Query API: Parameter Governance and Full-Link Blacklist IV. Cycle Detection: From âExtended Pathsâ to âInterpretable Loopsâ V. Graph Construction and Fusion: Transforming an âAccount-Level Graphâ into a âEntity-Level Relationship Graphâ VI. Summary: Turning Graph Algorithm Capabilities into a Viable Business Product
I. Background: What System Are We Building and Why?
The essence of many business analyses is not âquerying a single table,â but finding interpretable links and patterns among a large number of entities and relationships. Clue discovery systems are designed for this type of problem: organizing data such as enterprises/persons/accounts/transactions/relationships into a graph, performing queries, merging, and analysis on the graph, and outputting a visualized relationship network and path results for the front-end, helping business personnel locate clues more quickly.
1.1 System Function
It transforms scattered entity and transaction data into a âtraceable, interpretable, and visualizableâ relationship network, used in scenarios such as relationship link discovery, fund flow tracking, and anomaly pattern identification.
1.2 Typical Application Scenarios
- Enterprise Relationship Network Analysis: Query the relationship chain between a company and its upstream/downstream partners, related companies/personnel to assist with due diligence, risk control, and compliance reviews.
- Fund Flow Tracking: Starting from a single entity, trace the destination/source of funds to identify intermediary accounts and key nodes.
- Transaction Path Discovery: Given two groups of entities (or accounts), find the âshortest linkâ or âmultiple alternative linksâ between them to explain âwhy they are related.â
- Anomaly Pattern Identification (Cycle/Return Flow): Detect if funds circulate back to the starting point, helping to identify suspicious patterns like matched orders or round-tripping.
- Account Fusion/Gang Identification: Merge the âaccount layerâ into the âaccount holder layer,â or merge individuals into the âgang layer,â simplifying the graph into a business perspective that is easier to understand.
1.3 Data and Technology Landscape
- Data Landscape: Nodes (person/enterprise/bank account/gang/industry, etc.) + relationships (transactions, account ownership, social relationships, regular connections, etc.). Relationship edges often carry attributes (transaction time, amount, method, notes, etc.).
- Storage Landscape: Graph data is stored in Neo4j, while metadata and configurations are stored in MySQL (e.g., schema, display configurations, default database selection, blocked nodes).
- Service Landscape: Django provides APIs, and the front-end visualizes and interacts with
nodes + edges + path_listafter receiving them.
1.4 âOne Path Queryâ Implemented in Three Ways
Why implement âone path queryâ in three ways? Because user requirements vary significantly in different interaction contexts:
- Need only one shortest explanatory link: It must be fast and stable, preferably providing results in seconds.
- Need multiple alternative links: It must be controllable (direction, filtering, termination, blocking) without slowing down the system.
- Deep paths (high number of hops): The search space expands exponentially, requiring stronger algorithms and engineering safeguards (e.g., temporary subgraphs, dynamic K, resource reclamation).
Correspondingly, we use the three implementations in the system as follows (detailed in Section II):
- Shortest Single Path (Interaction “Only Need 1”): When
limit == 1, the system prioritizes the shortest path capability. If constraints like blocked nodes exist, it uses a controllable path extension method to ensure ânodes that should not appear are excluded during the query phase.â The goal is maximum response speed + interpretable results. - Shallow Multi-Paths (Small Depth): When
maxLevelis small and multiple alternative links are needed, an extensible query with direction/filtering/termination/blocking capabilities is used. The goal is constraint-friendly + stable performance. - Deep Multi-Paths (Large Depth): When
maxLevelis large, the system switches to a graph algorithm engine (e.g., K-shortest path), combined with temporary subgraphs and dynamic parameter control. The goal is to avoid search explosion + provide sufficient candidate paths within controllable resources.
II. Unified Path Query: One Entry, Automatically Selects the Best Engine
Path queries are the function in graph analysis systems most prone to âslow results for demands, incomplete results for speed.â Weâve converged its implementation into a unified entry point: input start/end node sets, direction, depth range, number of results, transaction type preferences, and optional âblocked nodes,â with internal strategy routing.
2.1 Key Inputs and Business Semantics
- Start/End Points: Supports combinations of multiple start and end points to avoid multiple front-end requests.
- Direction Control (Transaction Direction):
0: Forward (start â end)1: Reverse (start â end)2: Bidirectional (undirected/mixed)
- Depth Range:
minLevel/maxLevelcontrols path length (hops). - limit: Number of paths to return. We specifically optimize for
limit == 1as a strong scenario for âthe shortest/most appropriate one.â - Transaction Type Switch: Supports filtering preferences like âonly large amounts/only small amounts/both.â
- Blocked Nodes (blacklist): Used for entities that should be âhidden/excludedâ for business reasons (e.g., false-positive nodes, sensitive nodes, noisy nodes), requiring they cannot appear in the path.
2.2 Strategy 1: Ultra-Fast Shortest Path for limit == 1
When the user only needs one path, the system prioritizes the most efficient route:
- Without Blocked Nodes: Directly uses Neo4jâs native shortest path capability, matching with direction patterns (forward/reverse/bidirectional) for optimal performance.
- With Blocked Nodes: Since the native shortest path doesnât easily allow strict exclusion of nodes during the âsearch phase,â it switches to APOC for path extension, using
blacklistNodesin the extension parameters to filter out blocked nodes âpreemptively.â Results are then sorted by hop count, and the shortest one is taken.
The benefit is that common interactions (viewing just one shortest relationship chain) are as fast as possible, and blocking rules take effect during the query phase, avoiding the poor experience of âfetching results and then deleting them, potentially leading to empty results.â
2.3 Strategy 2: Shallow Multi-Paths (maxLevel < 6) Use APOC, Balancing Performance and Control
When depth is shallow but multiple paths are needed, we use APOCâs extension capabilities:
- Relationship Filtering: Combine âbase relationshipsâ (e.g., social relationships, bank accounts) with âtransaction-type relationshipsâ (account transactions/large/small) into a
relationshipFilter, generating different filter strings based on the direction parameter. - Terminator Nodes: Explicitly set the end node set using
terminatorNodesto avoid meaningless diffusion. - Blocked Nodes: Directly use APOCâs
blacklistNodesfor exclusion during the search phase. - Deduplication: Merge duplicate paths with the same node sequence to prevent identical paths from occupying multiple result slots.
The characteristic of this path is: very friendly to business conditions (direction, filtering, termination, blocking), and performance is stable when depth is shallow.
2.4 Strategy 3: Deep Multi-Paths (maxLevel >= 6) Use GDS Yenâs + Temporary Subgraph Projection
When depth increases, APOC extensions can easily lead to âsearch space explosion.â At this point, we switch to the GDS (Graph Data Science) solution. The core approach is:
- Dynamically Create Temporary Subgraph Projection: Create a projection graph valid only for the lifecycle of this query, deleting it immediately after the query to avoid polluting the global graph and prevent long-term memory usage.
- More Granular Direction Control: Not all relationships are affected by âtransaction direction.â In the current implementation, âtransaction-type relationshipsâ (at least including
account transaction, and can includelarge/small transactionsbased on front-end switches) are set with controllable direction:- Forward: NATURAL
- Reverse: REVERSE
- Bidirectional: UNDIRECTED Other relationships remain NATURAL to ensure non-transaction edges like âsocial relationships/account relationshipsâ are not incorrectly flipped.
- Estimate Shortest Distance d First, Then Dynamically Set K:
- First, use a shortest path calculation to get the shortest hop count
d. - Then set
K = 2^d, adding lower/upper limits (e.g., ensure enough candidate paths, but cap to prevent explosion). The intuition for this strategy is: The further the node pair, the more candidate paths are needed to find multiple âbusiness-usableâ links, but a hard upper limit must be set to protect the system.
- First, use a shortest path calculation to get the shortest hop count
- Reconstruct Relationship Information After Yenâs K-Shortest Path Output: GDS outputs a node sequence. We match actual relationships based on adjacent node pairs to bring edge attributes (transaction time, amount, method, etc.) back to the front-end for display.
- Blocked Node Handling: Due to the relationship between projection and reconstruction, blocked nodes in deep path scenarios are filtered again in the result phase (ensuring the final path contains no blocked nodes).
- Resource Release: Regardless of query success or failure, ensure the temporary projection graph is deleted to avoid resource leakage.
The main benefits of adopting this solution for deep paths are: stronger stability and controllability, preventing a single deep query from crippling the system.
III. Path Query API: Parameter Governance and Full-Link Blacklist
The API entry point for path queries does two main things:
- Parameter Normalization: Direction, depth, number of results, time range, amount range, transaction count range, etc., are all converted to unified backend types and default values at the entry point.
- Blacklist Injection: Read the âblocked node listâ from the current default database configuration and pass it as
blacklist_nodesto the path query engine.
The value is that blocking rules donât need to be scattered across various query logics; the API entry point is only responsible for âsupplementing the rules,â and the underlying layer executes them uniformly.
Additionally, before path query results are returned, they undergo graph construction (see Section IV), and the results are written to a user-specific cache. This is used to âavoid re-querying the database when the same user performs consecutive operations (hide/append/switch fusion mode).â
IV. Cycle Detection: From âExtended Pathsâ to âInterpretable Loopsâ
The business goal of cycle detection is to start from accounts related to a certain entity and find cyclic paths where funds âcircle back to themselvesâ in the network, used to identify patterns like round-tripping, matched orders, and money laundering.
4.1 Starting Set: Locate the âAccount Setâ from âPerson/Enterpriseâ
The user selects âa certain entityâ in the interface, but the real carriers of fund flow in the graph are bank account nodes. In implementation, a set of account node IDs is first queried based on the entity information (e.g., partial matching of the account holder field) as the starting set for subsequent expansion.
4.2 Use APOC for BFS Expansion with âReturn to Starting Setâ as Termination Condition
The core query logic is APOC expansion:
- relationshipFilter: Limited to fund-related and necessary auxiliary relationships (e.g., account transactions, cash deposits, withdrawals, account relationships, social relationships, etc.), can include direction tendencies (common practice is mainly âoutward diffusionâ).
- bfs: Using breadth-first is better for discovering short cycles first.
- terminatorNodes: Set to the starting account set, allowing the path to naturally converge when âreturning to any starting account,â forming a cycle.
- blacklistNodes: Also supports excluding blocked nodes during the search phase.
- uniqueness: Uses a global relationship deduplication strategy to reduce redundant traversal.
4.3 Temporal Validation: Turning âHaving a Cycleâ into âA Reasonable Fund Chainâ
The existence of a cycle in a graph does not necessarily mean the fund chain is reasonable. Cycle detection applies a layer of temporal validation to the results:
- Enforce time order constraints only on transaction-type edges (e.g., account transactions).
- Eliminate paths where the time order is significantly reversed.
The significance of this step is to filter âtopological cyclesâ into âcycles that better align with the actual sequence of business events,â improving the interpretability and usability of the results.
Finally, the cycle detection results reuse the same graph construction and fusion logic for output to the front-end (see Section IV).
V. Graph Construction and Fusion: Transforming an âAccount-Level Graphâ into an âEntity-Level Relationship Graphâ
The raw query results from Neo4j are typically a collection of ânodes(path) + relationships(path),â but what the front-end needs is:
- Nodes: With styles, icons, display names, clustering information.
- Edges: With direction, categories, display fields (transaction time/amount), and the ability to be highlighted by the path list.
Therefore, we designed a unified graph construction function to convert the query results into a graph data structure suitable for visualization. At this stage, we also complete âaccount fusionâ and âgang fusion.â
5.1 Gang Fusion: First Refine âSocial Relationshipsâ into a âPerson â Gangâ Mapping
The system supports a graph simplification: when one end of a social relationship edge is a âgangâ node, the related person is mapped to the gang node. In implementation, it first traverses the social relationship edges in the query results to build a person_id -> groupNode mapping.
Placing this step before graph construction is crucial because subsequent processing of both nodes and edges can use the same mapping table for fast substitution without repeatedly querying the database.
5.2 Account Fusion: Batch Query Account Holders to Avoid N+1
The goal of account fusion is to replace âbank account nodesâ with their holders (person/enterprise) nodes, thereby elevating the presentation from âbetween accountsâ to âbetween entities.â
A key optimization here is that instead of querying the holder for each account individually, we:
- Collect all distinct bank account IDs appearing in the query results.
- Perform one batch query to get the âaccount â holderâ mapping.
- Construct lightweight holder node objects and cache them for subsequent replacement.
This reduces holder queries from N times to 1 batch, making a significant difference when there are many path results or many accounts.
5.3 Endpoint Replacement: Replace Not Only Nodes but Also Edge Endpoints
If account fusion only replaces nodes without replacing edge endpoints, it can lead to broken graph structures or confusion where âedges still point to account nodes.â During graph construction, for each relationship, we simultaneously:
- If the start point is a bank account â replace with its holder.
- If the end point is a bank account â replace with its holder.
- Then apply gang fusion replacements.
- If the start and end points ultimately fall on the same node (self-loop), skip it.
5.4 Path Visualization Support: Each Edge Carries a âSet of Paths It Belongs Toâ
To support the front-end âclicking a path in the list â highlighting that path in the graph,â the graph construction phase attaches, for each edge, a list of path IDs it belongs to (e.g., id2path_ids). It also generates a path_list for the UI (path titles, length, summary of nodes passed, list of edge IDs).
5.5 Deduplication and Graph Quality: Edge Deduplication + Bidirectional Edge Folding + Discrete Point Filtering
Itâs common in real data that:
- The same relationship appears in multiple paths.
- Bidirectional relationships cause both âAâBâ and âBâAâ to appear in the graph.
- Disconnected discrete points appear after graph construction (affecting readability).
Therefore, several post-processing steps are applied after graph construction:
- Deduplicate by Edge ID: Keep only one instance of the same relationship.
- Fold Bidirectional Edges: For specific categories (considered âundirected relationshipsâ), keep only one edge for the same unordered pair to reduce visual clutter.
- Filter Discrete Points: Keep only nodes that appear at least in the set of edge endpoints; then apply blocked node filtering to ensure the front-end sees a âcompact and interpretable subgraph.â
Finally, node size calculation, edge style adjustment, and legend statistics are performed to enhance readability and interaction experience.
VI. Summary: Turning Graph Algorithm Capabilities into a Viable Business Product
Looking back at this implementation, the core isnât âwhich algorithm is more advanced,â but rather:
- Unified Entry: A single API/single core function reduces branching and maintenance costs.
- Layered Strategy:
limit == 1, shallow multi-paths, and deep multi-paths each use the most appropriate engine. - Rule Pre-positioning: Blocked nodes are applied during the query phase as much as possible, with a final safeguard in the result phase.
- Engineering Optimizations: Batch queries replace N+1 queries, temporary subgraph lifecycle management, deduplication, and folding improve visualization quality.
- Interpretability: Cycle detection adds temporal validation, path lists are highlightable, and edges carry transaction summaries.
If youâre also working on systems involving âgraph databases + relationship analysis + visualization,â I hope this approach can provide some inspiration for your work.