Refactoring Legacy Code to Data Model

Related docs: Main AGENTS.md | Create Module | Add Node Type

IMPORTANT: A critical task for AI agents is refactoring legacy Cartography code from handwritten Cypher queries to the modern data model approach. This guide provides a step-by-step procedure to safely perform these refactors.

Overview

Legacy Cartography modules use handwritten Cypher queries to create nodes and relationships. The modern approach uses declarative data models that automatically generate optimized queries. Refactoring improves maintainability, performance, and consistency.

Step 1: Prevent Regressions (CRITICAL)

Before touching any code, ensure you have comprehensive test coverage:

1a. Identify the Sync Function

Locate the module’s main sync function
It is usually named sync() or sync_*(), for example sync_ec2_instances() or sync_users()
Example: cartography.intel.aws.ec2.instances.sync()
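
In a large module, listing the candidate entry points programmatically can be quicker than reading the file top to bottom. The following is a minimal sketch using only the standard library. The helper name find_sync_functions and the sample source are hypothetical, and it assumes sync functions are defined at module top level with names starting with "sync".

```python
import ast


def find_sync_functions(source: str) -> list[str]:
    """Return names of top-level functions whose name starts with 'sync'."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("sync")
    ]


# Hypothetical module source, used only for illustration.
EXAMPLE_SOURCE = '''
def get(boto3_session, region):
    return []

def load_ec2_instances(neo4j_session, data):
    pass

def sync_ec2_instances(neo4j_session, boto3_session, regions):
    pass
'''

sync_names = find_sync_functions(EXAMPLE_SOURCE)
print(sync_names)
```

Against a real module, pass the file contents (for example via pathlib.Path(...).read_text()) instead of the sample string.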

1b. Ensure Integration Test Exists

Check for integration tests in tests/integration/cartography/intel/[module]/
The test MUST call the sync function directly
If no test exists, CREATE IT FIRST before any refactoring:

# Example: tests/integration/cartography/intel/aws/ec2/test_instances.py
from unittest.mock import patch
import cartography.intel.aws.ec2.instances
from tests.data.aws.ec2.instances import MOCK_INSTANCES_DATA
from tests.integration.util import check_nodes, check_rels

TEST_UPDATE_TAG = 123456789
TEST_AWS_ACCOUNT_ID = "123456789012"

@patch.object(cartography.intel.aws.ec2.instances, "get", return_value=MOCK_INSTANCES_DATA)
def test_sync_ec2_instances(mock_get, neo4j_session):
    """Test that EC2 instances sync correctly"""
    # Act - Call the sync function
    cartography.intel.aws.ec2.instances.sync(
        neo4j_session,
        boto3_session=None,  # Mocked
        regions=["us-east-1"],
        current_aws_account_id=TEST_AWS_ACCOUNT_ID,
        update_tag=TEST_UPDATE_TAG,
        common_job_parameters={
            "UPDATE_TAG": TEST_UPDATE_TAG,
            "AWS_ID": TEST_AWS_ACCOUNT_ID,
        },
    )

    # Assert - Check expected nodes exist
    expected_nodes = {
        ("i-1234567890abcdef0", "running"),
        ("i-0987654321fedcba0", "stopped"),
    }
    assert check_nodes(neo4j_session, "EC2Instance", ["id", "state"]) == expected_nodes

CRITICAL: Run the test and ensure it passes before proceeding
If the test doesn’t exist or fails, fix it first - no exceptions

Step 2: Convert to Data Model

Now safely convert the legacy code to use the modern data model:

2a. Create Data Model Schema Files

Create schema files in cartography/models/[module]/:

# cartography/models/aws/ec2/instances.py
from dataclasses import dataclass

from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties, CartographyNodeSchema
from cartography.models.core.relationships import (
    CartographyRelProperties,
    CartographyRelSchema,
    LinkDirection,
    TargetNodeMatcher,
    make_target_node_matcher,
)

@dataclass(frozen=True)
class EC2InstanceNodeProperties(CartographyNodeProperties):
    id: PropertyRef = PropertyRef("id")
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
    instanceid: PropertyRef = PropertyRef("InstanceId")
    state: PropertyRef = PropertyRef("State")
    # ... other properties

# Relationship properties referenced by EC2InstanceToAWSAccountRel below
@dataclass(frozen=True)
class EC2InstanceToAWSAccountRelProperties(CartographyRelProperties):
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)

@dataclass(frozen=True)
class EC2InstanceToAWSAccountRel(CartographyRelSchema):
    target_node_label: str = "AWSAccount"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
        "id": PropertyRef("AWS_ID", set_in_kwargs=True),
    })
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "RESOURCE"
    properties: EC2InstanceToAWSAccountRelProperties = EC2InstanceToAWSAccountRelProperties()

@dataclass(frozen=True)
class EC2InstanceSchema(CartographyNodeSchema):
    label: str = "EC2Instance"
    properties: EC2InstanceNodeProperties = EC2InstanceNodeProperties()
    sub_resource_relationship: EC2InstanceToAWSAccountRel = EC2InstanceToAWSAccountRel()
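
The node properties in this schema read the keys "id", "InstanceId", and "State" from each dict passed to load(), so the data must already be flat when it reaches the load step. Raw API responses are often nested, and a transform step usually does the flattening. The following is a minimal sketch of such a transform, assuming input shaped like an EC2 DescribeInstances response (reservations containing instances); the function name transform_ec2_instances and the sample data are hypothetical and may not match the module you are refactoring.

```python
from typing import Any


def transform_ec2_instances(reservations: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten nested reservation data into one flat dict per instance."""
    instances = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            instances.append({
                # Keys match the PropertyRef names used by the schema.
                "id": instance["InstanceId"],
                "InstanceId": instance["InstanceId"],
                # Nested {"Name": ...} state becomes a plain string (or None).
                "State": instance.get("State", {}).get("Name"),
            })
    return instances


# Hypothetical sample input, for illustration only.
SAMPLE = [
    {"Instances": [
        {"InstanceId": "i-1234567890abcdef0", "State": {"Name": "running"}},
        {"InstanceId": "i-0987654321fedcba0", "State": {"Name": "stopped"}},
    ]},
]
flat_instances = transform_ec2_instances(SAMPLE)
```

Keeping the flattening in a transform function means the schema stays declarative and the integration test can assert on plain property values.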

2b. Replace load_* Functions

Replace handwritten Cypher in load functions with data model load() calls:

# Before (legacy)
def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):
    ingest_instances = """
    UNWIND $instances_list as instance
    MERGE (i:EC2Instance{id: instance.id})
    ON CREATE SET i.firstseen = timestamp()
    SET i.instanceid = instance.InstanceId,
        i.state = instance.State,
        i.lastupdated = $update_tag
    WITH i
    MATCH (owner:AWSAccount{id: $aws_account_id})
    MERGE (owner)-[r:RESOURCE]->(i)
    ON CREATE SET r.firstseen = timestamp()
    SET r.lastupdated = $update_tag
    """
    neo4j_session.run(ingest_instances, instances_list=data, aws_account_id=current_aws_account_id, update_tag=update_tag)

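# After (data model)
# Requires these imports at the top of the intel module:
#   from cartography.client.core.tx import load
#   from cartography.models.aws.ec2.instances import EC2InstanceSchema
def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):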
    load(
        neo4j_session,
        EC2InstanceSchema(),
        data,
        lastupdated=update_tag,
        AWS_ID=current_aws_account_id,
    )

If you still need a handwritten write query during a refactor, do not keep neo4j_session.run(...) for that write path. Use run_write_query() so the query executes with Cartography’s managed transaction and retry behavior.
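
To see which direct neo4j_session.run(...) calls remain in a module, a small static scan can help. The following is a minimal sketch using the standard library ast module. The helper name find_session_run_calls and the sample source are hypothetical, and it assumes the session variable is literally named neo4j_session. It does not distinguish read queries from write queries, so review each hit by hand.

```python
import ast


def find_session_run_calls(source: str, session_name: str = "neo4j_session") -> list[int]:
    """Return line numbers of `<session_name>.run(...)` calls in source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == session_name
        ):
            lines.append(node.lineno)
    return sorted(lines)


# Hypothetical module source, used only for illustration.
EXAMPLE = '''def load_a(neo4j_session, data):
    neo4j_session.run("MERGE (n:Thing{id: $id})", id=1)

def cleanup_a(neo4j_session, params):
    GraphJob.from_node_schema(ThingSchema(), params).run(neo4j_session)
'''
leftover_lines = find_session_run_calls(EXAMPLE)
print(leftover_lines)
```

The GraphJob call in the sample is not reported because its .run() is invoked on the job object rather than on the session.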

2c. Replace cleanup_* Functions

Replace handwritten cleanup with data model cleanup:

# Before (legacy)
def cleanup_ec2_instances(neo4j_session, common_job_parameters):
    run_cleanup_job('aws_import_ec2_instances_cleanup.json', neo4j_session, common_job_parameters)

# After (data model)
# Requires: from cartography.graph.job import GraphJob
def cleanup_ec2_instances(neo4j_session, common_job_parameters):
    GraphJob.from_node_schema(EC2InstanceSchema(), common_job_parameters).run(neo4j_session)

2d. Test Continuously

Run your integration test after each change
Ensure it still passes - if not, debug before continuing
You may need to update minor details in tests due to data model differences
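
When a test starts failing after a change, comparing the expected and actual node sets directly usually shows which "minor details" moved. The following is a minimal sketch of such a comparison. The helper name diff_node_sets is hypothetical, and the actual_nodes value is made up to illustrate a property that changed during a refactor; in a real test it would come from check_nodes().

```python
def diff_node_sets(expected: set[tuple], actual: set[tuple]) -> dict[str, set[tuple]]:
    """Split a mismatch into rows that disappeared and rows that appeared."""
    return {
        "missing": expected - actual,
        "unexpected": actual - expected,
    }


expected_nodes = {
    ("i-1234567890abcdef0", "running"),
    ("i-0987654321fedcba0", "stopped"),
}
# Hypothetical post-refactor result in which one state value was lost.
actual_nodes = {
    ("i-1234567890abcdef0", "running"),
    ("i-0987654321fedcba0", None),
}

report = diff_node_sets(expected_nodes, actual_nodes)
print(report)
```

A row that shows up in both "missing" and "unexpected" with the same id points to a changed property rather than a lost node, which often traces back to a PropertyRef key that does not match the transformed data.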

Step 3: Cleanup Legacy Files

Once tests pass, clean up legacy infrastructure:

3a. Remove Index Entries

Remove manual index entries from cartography/data/indexes.cypher:

# Remove entries like these - data model creates indexes automatically
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.lastupdated);

Note: Only remove indexes for nodes you’ve converted to the data model. Leave others untouched.
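
Filtering by label keeps the removal limited to converted nodes. The following is a minimal sketch, assuming index statements follow the one-line form shown above. The helper name drop_indexes_for is hypothetical, and the EC2Subnet line is included only as an example of an unconverted label that must survive.

```python
import re

# Captures the node label from statements like
# "CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);"
INDEX_LABEL = re.compile(r"FOR\s+\(\w+:(\w+)\)\s+ON", re.IGNORECASE)


def drop_indexes_for(lines: list[str], converted_labels: set[str]) -> list[str]:
    """Return lines with index statements for converted labels removed."""
    kept = []
    for line in lines:
        match = INDEX_LABEL.search(line)
        if match and match.group(1) in converted_labels:
            continue  # index is now created by the data model
        kept.append(line)
    return kept


INDEX_LINES = [
    "CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);",
    "CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.lastupdated);",
    "CREATE INDEX IF NOT EXISTS FOR (n:EC2Subnet) ON (n.id);",
]
remaining = drop_indexes_for(INDEX_LINES, {"EC2Instance"})
print(remaining)
```

Lines that do not look like index statements (comments, blank lines) pass through unchanged, so the rest of the file is left as it was.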

3b. Remove Cleanup Job Files

Remove corresponding cleanup JSON files from cartography/data/jobs/cleanup/:

# Remove files like:
rm cartography/data/jobs/cleanup/aws_import_ec2_instances_cleanup.json

Note: Only remove cleanup files for modules you’ve fully converted.
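
Before deleting a cleanup JSON file, it is worth confirming that no Python source still refers to it by name. The following is a minimal sketch of that check. The helper name find_references is hypothetical, and the demonstration runs against a throwaway directory with two made-up module files rather than the real repository.

```python
import tempfile
from pathlib import Path


def find_references(root: Path, filename: str) -> list[Path]:
    """Return Python files under root whose text still mentions filename."""
    return sorted(
        path
        for path in root.rglob("*.py")
        if filename in path.read_text(encoding="utf-8")
    )


# Demonstration against a temporary directory with hypothetical modules.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "instances.py").write_text(
        "run_cleanup_job('aws_import_ec2_instances_cleanup.json', s, p)\n",
        encoding="utf-8",
    )
    (root / "subnets.py").write_text("# already converted\n", encoding="utf-8")
    hits = find_references(root, "aws_import_ec2_instances_cleanup.json")
    hit_names = [path.name for path in hits]

print(hit_names)
```

An empty result suggests the file is safe to remove; any hit means a module still calls the legacy cleanup job and is not fully converted.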

Common Refactoring Patterns

Pattern 1: Simple Node Migration

Most legacy nodes can be directly converted to data model schemas.

Pattern 2: Complex Relationships

For modules with complex relationships, you may need:

One-to-Many relationships (see Add Node Type)
Composite Node Pattern for nodes that get data from multiple sources

Pattern 3: MatchLinks for Complex Cases

Use MatchLinks sparingly, only when:

You are connecting two existing node types that come from separate data sources
The relationship carries rich properties that don’t belong on the nodes

Things You May Encounter

Multiple Intel Modules Modifying Same Nodes

When refactoring modules that modify the same node type:

Use Simple Relationship Pattern if only referencing by ID
Use Composite Node Pattern for different views of the same entity from different data sources (see Add Relationship)

Legacy Test Adjustments

Older tests may need small tweaks:

Update expected property names if data model changes them
Adjust relationship directions if needed
Remove tests for manual cleanup jobs (data model handles this)

Complex Cypher Queries

Some legacy queries are complex. Break them down:

Identify what nodes/relationships are being created
Map to data model schemas
Use multiple load() calls if needed
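
A quick inventory of the labels and relationship types in a legacy query gives a starting list of schemas to define. The following is a rough sketch using regular expressions. The helper name summarize_query is hypothetical, and the patterns assume simple "(var:Label" and "[var:TYPE" forms like those in the legacy example earlier in this guide. It is not a Cypher parser and will miss multi-label nodes or dynamically built queries.

```python
import re

# "(i:EC2Instance" -> EC2Instance ; "[r:RESOURCE" -> RESOURCE
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def summarize_query(query: str) -> dict[str, set[str]]:
    """List node labels and relationship types referenced by a query."""
    return {
        "node_labels": set(NODE_LABEL.findall(query)),
        "rel_types": set(REL_TYPE.findall(query)),
    }


LEGACY_QUERY = """
UNWIND $instances_list as instance
MERGE (i:EC2Instance{id: instance.id})
ON CREATE SET i.firstseen = timestamp()
SET i.instanceid = instance.InstanceId,
    i.state = instance.State,
    i.lastupdated = $update_tag
WITH i
MATCH (owner:AWSAccount{id: $aws_account_id})
MERGE (owner)-[r:RESOURCE]->(i)
ON CREATE SET r.firstseen = timestamp()
SET r.lastupdated = $update_tag
"""

summary = summarize_query(LEGACY_QUERY)
print(summary)
```

The summary includes labels that are only matched (here AWSAccount) as well as those that are merged, so you still need to read the query to decide which labels become node schemas and which are relationship targets.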

What NOT to Test

Do NOT explicitly test cleanup functions unless there’s a specific concern:

Data model handles complex cleanup cases automatically
Testing cleanup adds unnecessary boilerplate
Focus tests on data ingestion, not cleanup behavior

When to Stop and Ask

Refactors can be complex. Stop and ask the user if you encounter:

Unclear business logic in legacy Cypher
Complex relationships that don’t map clearly to data model
Test failures you can’t resolve
Multiple modules that seem interdependent

Refactoring Checklist

Before submitting a refactor:

[ ] Integration test exists and passes for the sync function
[ ] Data model schemas defined with proper relationships
[ ] Legacy load functions converted to use load()
[ ] Legacy cleanup functions converted to use GraphJob.from_node_schema()
[ ] Tests still pass after all changes
[ ] Index entries removed from indexes.cypher
[ ] Cleanup JSON files removed from cleanup directory
[ ] No regressions - all functionality preserved

Success Criteria

A successful refactor should:

Preserve all functionality - tests pass
Use data model - no handwritten Cypher for CRUD operations
Clean up legacy files - indexes and cleanup jobs removed
Maintain performance - no significant speed degradation
Follow patterns - consistent with other modern modules