Refactoring Legacy Code to Data Model¶
Related docs: Main AGENTS.md | Create Module | Add Node Type
IMPORTANT: A critical task for AI agents is refactoring legacy Cartography code from handwritten Cypher queries to the modern data model approach. This guide provides a step-by-step procedure to safely perform these refactors.
Overview¶
Legacy Cartography modules use handwritten Cypher queries to create nodes and relationships. The modern approach uses declarative data models that automatically generate optimized queries. Refactoring improves maintainability, performance, and consistency.
Step 1: Prevent Regressions (CRITICAL)¶
Before touching any code, ensure you have comprehensive test coverage:
1a. Identify the Sync Function¶
Locate the main
sync_*()function for the moduleThis is usually named like
sync_ec2_instances(),sync_users(), etc.Example:
cartography.intel.aws.ec2.instances.sync()
1b. Ensure Integration Test Exists¶
Check for integration tests in
tests/integration/cartography/intel/[module]/The test MUST call the sync function directly
If no test exists, CREATE IT FIRST before any refactoring:
# Example: tests/integration/cartography/intel/aws/ec2/test_instances.py
from unittest.mock import patch
import cartography.intel.aws.ec2.instances
from tests.data.aws.ec2.instances import MOCK_INSTANCES_DATA
from tests.integration.util import check_nodes, check_rels
TEST_UPDATE_TAG = 123456789
TEST_AWS_ACCOUNT_ID = "123456789012"
@patch.object(cartography.intel.aws.ec2.instances, "get", return_value=MOCK_INSTANCES_DATA)
def test_sync_ec2_instances(mock_get, neo4j_session):
"""Test that EC2 instances sync correctly"""
# Act - Call the sync function
cartography.intel.aws.ec2.instances.sync(
neo4j_session,
boto3_session=None, # Mocked
regions=["us-east-1"],
current_aws_account_id=TEST_AWS_ACCOUNT_ID,
update_tag=TEST_UPDATE_TAG,
common_job_parameters={
"UPDATE_TAG": TEST_UPDATE_TAG,
"AWS_ID": TEST_AWS_ACCOUNT_ID,
},
)
# Assert - Check expected nodes exist
expected_nodes = {
("i-1234567890abcdef0", "running"),
("i-0987654321fedcba0", "stopped"),
}
assert check_nodes(neo4j_session, "EC2Instance", ["id", "state"]) == expected_nodes
CRITICAL: Run the test and ensure it passes before proceeding
If the test doesn’t exist or fails, fix it first - no exceptions
Step 2: Convert to Data Model¶
Now safely convert the legacy code to use the modern data model:
2a. Create Data Model Schema Files¶
Create schema files in cartography/models/[module]/:
# cartography/models/aws/ec2/instances.py
from dataclasses import dataclass
from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties, CartographyNodeSchema
from cartography.models.core.relationships import CartographyRelSchema, LinkDirection, make_target_node_matcher
@dataclass(frozen=True)
class EC2InstanceNodeProperties(CartographyNodeProperties):
id: PropertyRef = PropertyRef("id")
lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
instanceid: PropertyRef = PropertyRef("InstanceId")
state: PropertyRef = PropertyRef("State")
# ... other properties
@dataclass(frozen=True)
class EC2InstanceToAWSAccountRel(CartographyRelSchema):
target_node_label: str = "AWSAccount"
target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
"id": PropertyRef("AWS_ID", set_in_kwargs=True),
})
direction: LinkDirection = LinkDirection.INWARD
rel_label: str = "RESOURCE"
properties: EC2InstanceToAWSAccountRelProperties = EC2InstanceToAWSAccountRelProperties()
@dataclass(frozen=True)
class EC2InstanceSchema(CartographyNodeSchema):
label: str = "EC2Instance"
properties: EC2InstanceNodeProperties = EC2InstanceNodeProperties()
sub_resource_relationship: EC2InstanceToAWSAccountRel = EC2InstanceToAWSAccountRel()
2b. Replace load_* Functions¶
Replace handwritten Cypher in load functions with data model load() calls:
# Before (legacy)
def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):
ingest_instances = """
UNWIND $instances_list as instance
MERGE (i:EC2Instance{id: instance.id})
ON CREATE SET i.firstseen = timestamp()
SET i.instanceid = instance.InstanceId,
i.state = instance.State,
i.lastupdated = $update_tag
WITH i
MATCH (owner:AWSAccount{id: $aws_account_id})
MERGE (owner)-[r:RESOURCE]->(i)
ON CREATE SET r.firstseen = timestamp()
SET r.lastupdated = $update_tag
"""
neo4j_session.run(ingest_instances, instances_list=data, aws_account_id=current_aws_account_id, update_tag=update_tag)
# After (data model)
def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):
load(
neo4j_session,
EC2InstanceSchema(),
data,
lastupdated=update_tag,
AWS_ID=current_aws_account_id,
)
If you still need a handwritten write query during a refactor, do not keep neo4j_session.run(...) for that write path. Use run_write_query() so the query executes with Cartography’s managed transaction and retry behavior.
2c. Replace cleanup_* Functions¶
Replace handwritten cleanup with data model cleanup:
# Before (legacy)
def cleanup_ec2_instances(neo4j_session, common_job_parameters):
run_cleanup_job('aws_import_ec2_instances_cleanup.json', neo4j_session, common_job_parameters)
# After (data model)
def cleanup_ec2_instances(neo4j_session, common_job_parameters):
GraphJob.from_node_schema(EC2InstanceSchema(), common_job_parameters).run(neo4j_session)
2d. Test Continuously¶
Run your integration test after each change
Ensure it still passes - if not, debug before continuing
You may need to update minor details in tests due to data model differences
Step 3: Cleanup Legacy Files¶
Once tests pass, clean up legacy infrastructure:
3a. Remove Index Entries¶
Remove manual index entries from cartography/data/indexes.cypher:
# Remove entries like these - data model creates indexes automatically
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.lastupdated);
Note: Only remove indexes for nodes you’ve converted to data model. Leave others untouched.
3b. Remove Cleanup Job Files¶
Remove corresponding cleanup JSON files from cartography/data/jobs/cleanup/:
# Remove files like:
rm cartography/data/jobs/cleanup/aws_import_ec2_instances_cleanup.json
Note: Only remove cleanup files for modules you’ve fully converted.
Common Refactoring Patterns¶
Pattern 1: Simple Node Migration¶
Most legacy nodes can be directly converted to data model schemas.
Pattern 2: Complex Relationships¶
For modules with complex relationships, you may need:
One-to-Many relationships (see Add Node Type)
Composite Node Pattern for nodes that get data from multiple sources
Pattern 3: MatchLinks for Complex Cases¶
Use MatchLinks sparingly, only when:
Connecting two existing node types from separate data sources
Rich relationship properties that don’t belong in nodes
Things You May Encounter¶
Multiple Intel Modules Modifying Same Nodes¶
When refactoring modules that modify the same node type:
Use Simple Relationship Pattern if only referencing by ID
Use Composite Node Pattern for different views of the same entity from different data sources (see Add Relationship)
Legacy Test Adjustments¶
Older tests may need small tweaks:
Update expected property names if data model changes them
Adjust relationship directions if needed
Remove tests for manual cleanup jobs (data model handles this)
Complex Cypher Queries¶
Some legacy queries are complex. Break them down:
Identify what nodes/relationships are being created
Map to data model schemas
Use multiple
load()calls if needed
What NOT to Test¶
Do NOT explicitly test cleanup functions unless there’s a specific concern:
Data model handles complex cleanup cases automatically
Testing cleanup adds unnecessary boilerplate
Focus tests on data ingestion, not cleanup behavior
When to Stop and Ask¶
Refactors can be complex. Stop and ask the user if you encounter:
Unclear business logic in legacy Cypher
Complex relationships that don’t map clearly to data model
Test failures you can’t resolve
Multiple modules that seem interdependent
Refactoring Checklist¶
Before submitting a refactor:
[ ] Integration test exists and passes for the sync function
[ ] Data model schemas defined with proper relationships
[ ] Legacy load functions converted to use
load()[ ] Legacy cleanup functions converted to use
GraphJob.from_node_schema()[ ] Tests still pass after all changes
[ ] Index entries removed from
indexes.cypher[ ] Cleanup JSON files removed from cleanup directory
[ ] No regressions - all functionality preserved
Success Criteria¶
A successful refactor should:
Preserve all functionality - tests pass
Use data model - no handwritten Cypher for CRUD operations
Clean up legacy files - indexes and cleanup jobs removed
Maintain performance - no significant speed degradation
Follow patterns - consistent with other modern modules