Refactoring Legacy Code to Data Model

Related docs: Main AGENTS.md | Create Module | Add Node Type

IMPORTANT: A critical task for AI agents is refactoring legacy Cartography code from handwritten Cypher queries to the modern data model approach. This guide provides a step-by-step procedure to safely perform these refactors.

Overview

Legacy Cartography modules use handwritten Cypher queries to create nodes and relationships. The modern approach uses declarative data models that automatically generate optimized queries. Refactoring improves maintainability, performance, and consistency.
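To make "declarative" concrete, here is a toy sketch of how a property mapping can be turned into an ingest query. It is not Cartography's actual query builder; the function name `build_node_ingest_query` and the `$items` parameter are assumptions made for illustration only.

```python
def build_node_ingest_query(label: str, property_map: dict[str, str]) -> str:
    """Build a MERGE-style ingest query from a node-property -> source-field mapping.

    Toy illustration only; the real data model query builder is more involved.
    """
    id_field = property_map["id"]
    # Every non-id property becomes one SET assignment.
    assignments = [
        f"n.{node_prop} = item.{source_field}"
        for node_prop, source_field in property_map.items()
        if node_prop != "id"
    ]
    assignments.append("n.lastupdated = $lastupdated")
    set_clause = ",\n    ".join(assignments)
    return (
        "UNWIND $items AS item\n"
        f"MERGE (n:{label}{{id: item.{id_field}}})\n"
        "ON CREATE SET n.firstseen = timestamp()\n"
        f"SET {set_clause}"
    )


query = build_node_ingest_query(
    "EC2Instance",
    {"id": "id", "instanceid": "InstanceId", "state": "State"},
)
print(query)
```

The point of the sketch is that the mapping alone carries the information a handwritten query repeats by hand, which is why a schema definition can replace the Cypher string.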

Step 1: Prevent Regressions (CRITICAL)

Before touching any code, ensure you have comprehensive test coverage:

1a. Identify the Sync Function

  • Locate the module’s main sync function, named either sync() or sync_*()

  • Module-specific names look like sync_ec2_instances(), sync_users(), etc.

  • Example: cartography.intel.aws.ec2.instances.sync()
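If a module is large, a quick scan can list the candidate functions. The sketch below uses only the standard library; the helper name `find_sync_functions` and the sample source are hypothetical and not part of Cartography.

```python
import ast


def find_sync_functions(source: str) -> list[str]:
    """Return names of top-level functions called sync or sync_*."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and (node.name == "sync" or node.name.startswith("sync_"))
    ]


# Hypothetical module source, shortened for the example
example_source = '''
def get(boto3_session, region): ...
def load_ec2_instances(neo4j_session, data): ...
def sync(neo4j_session, boto3_session, regions): ...
'''
print(find_sync_functions(example_source))
```

In practice you would pass the text of the intel module file you are refactoring instead of the inline string.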

1b. Ensure Integration Test Exists

  • Check for integration tests in tests/integration/cartography/intel/[module]/

  • The test MUST call the sync function directly

  • If no test exists, CREATE IT FIRST before any refactoring:

# Example: tests/integration/cartography/intel/aws/ec2/test_instances.py
from unittest.mock import patch
import cartography.intel.aws.ec2.instances
from tests.data.aws.ec2.instances import MOCK_INSTANCES_DATA
from tests.integration.util import check_nodes, check_rels

TEST_UPDATE_TAG = 123456789
TEST_AWS_ACCOUNT_ID = "123456789012"

@patch.object(cartography.intel.aws.ec2.instances, "get", return_value=MOCK_INSTANCES_DATA)
def test_sync_ec2_instances(mock_get, neo4j_session):
    """Test that EC2 instances sync correctly"""
    # Act - Call the sync function
    cartography.intel.aws.ec2.instances.sync(
        neo4j_session,
        boto3_session=None,  # Mocked
        regions=["us-east-1"],
        current_aws_account_id=TEST_AWS_ACCOUNT_ID,
        update_tag=TEST_UPDATE_TAG,
        common_job_parameters={
            "UPDATE_TAG": TEST_UPDATE_TAG,
            "AWS_ID": TEST_AWS_ACCOUNT_ID,
        },
    )

    # Assert - Check expected nodes exist
    expected_nodes = {
        ("i-1234567890abcdef0", "running"),
        ("i-0987654321fedcba0", "stopped"),
    }
    assert check_nodes(neo4j_session, "EC2Instance", ["id", "state"]) == expected_nodes

  • CRITICAL: Run the test and ensure it passes before proceeding

  • If the test doesn’t exist or fails, fix it first - no exceptions

Step 2: Convert to Data Model

Now safely convert the legacy code to use the modern data model:

2a. Create Data Model Schema Files

Create schema files in cartography/models/[module]/:

# cartography/models/aws/ec2/instances.py
from dataclasses import dataclass
from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties, CartographyNodeSchema
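from cartography.models.core.relationships import (
    CartographyRelProperties,
    CartographyRelSchema,
    LinkDirection,
    TargetNodeMatcher,
    make_target_node_matcher,
)

@dataclass(frozen=True)
class EC2InstanceNodeProperties(CartographyNodeProperties):
    id: PropertyRef = PropertyRef("id")
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
    instanceid: PropertyRef = PropertyRef("InstanceId")
    state: PropertyRef = PropertyRef("State")
    # ... other properties

# Relationship properties used by EC2InstanceToAWSAccountRel below
@dataclass(frozen=True)
class EC2InstanceToAWSAccountRelProperties(CartographyRelProperties):
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)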

@dataclass(frozen=True)
class EC2InstanceToAWSAccountRel(CartographyRelSchema):
    target_node_label: str = "AWSAccount"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({
        "id": PropertyRef("AWS_ID", set_in_kwargs=True),
    })
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "RESOURCE"
    properties: EC2InstanceToAWSAccountRelProperties = EC2InstanceToAWSAccountRelProperties()

@dataclass(frozen=True)
class EC2InstanceSchema(CartographyNodeSchema):
    label: str = "EC2Instance"
    properties: EC2InstanceNodeProperties = EC2InstanceNodeProperties()
    sub_resource_relationship: EC2InstanceToAWSAccountRel = EC2InstanceToAWSAccountRel()

2b. Replace load_* Functions

Replace handwritten Cypher in load functions with data model load() calls:

# Before (legacy)
def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):
    ingest_instances = """
    UNWIND $instances_list as instance
    MERGE (i:EC2Instance{id: instance.id})
    ON CREATE SET i.firstseen = timestamp()
    SET i.instanceid = instance.InstanceId,
        i.state = instance.State,
        i.lastupdated = $update_tag
    WITH i
    MATCH (owner:AWSAccount{id: $aws_account_id})
    MERGE (owner)-[r:RESOURCE]->(i)
    ON CREATE SET r.firstseen = timestamp()
    SET r.lastupdated = $update_tag
    """
    neo4j_session.run(ingest_instances, instances_list=data, aws_account_id=current_aws_account_id, update_tag=update_tag)

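# After (data model)
from cartography.client.core.tx import load
from cartography.models.aws.ec2.instances import EC2InstanceSchema

def load_ec2_instances(neo4j_session, data, region, current_aws_account_id, update_tag):
    load(
        neo4j_session,
        EC2InstanceSchema(),
        data,
        # Keyword arguments supply the PropertyRefs declared with set_in_kwargs=True
        lastupdated=update_tag,
        AWS_ID=current_aws_account_id,
    )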

If you still need a handwritten write query during a refactor, do not keep neo4j_session.run(...) for that write path. Use run_write_query() so the query executes with Cartography’s managed transaction and retry behavior.
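To find leftover direct session calls after a refactor, a small static check can help. This is a sketch using the standard library only; the helper name `find_session_run_calls` is hypothetical, and it assumes the session variable is literally named `neo4j_session`. It flags read queries as well as writes, so each hit still needs a manual look.

```python
import ast


def find_session_run_calls(source: str, session_name: str = "neo4j_session") -> list[int]:
    """Return line numbers of `<session_name>.run(...)` calls in the source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == session_name
        ):
            lines.append(node.lineno)
    return sorted(lines)


# Hypothetical module source: one legacy write and one data model cleanup
module_source = '''
def load_ec2_instances(neo4j_session, data, update_tag):
    neo4j_session.run("MERGE (i:EC2Instance{id: $id})", id=data["id"])

def cleanup(neo4j_session, common_job_parameters):
    GraphJob.from_node_schema(EC2InstanceSchema(), common_job_parameters).run(neo4j_session)
'''
print(find_session_run_calls(module_source))
```

Only the direct `neo4j_session.run(...)` call is reported; the `GraphJob(...).run(neo4j_session)` call is ignored because the receiver is not the session itself.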

2c. Replace cleanup_* Functions

Replace handwritten cleanup with data model cleanup:

# Before (legacy)
def cleanup_ec2_instances(neo4j_session, common_job_parameters):
    run_cleanup_job('aws_import_ec2_instances_cleanup.json', neo4j_session, common_job_parameters)

# After (data model)
from cartography.graph.job import GraphJob

def cleanup_ec2_instances(neo4j_session, common_job_parameters):
    GraphJob.from_node_schema(EC2InstanceSchema(), common_job_parameters).run(neo4j_session)

2d. Test Continuously

  • Run your integration test after each change

  • Ensure it still passes - if not, debug before continuing

  • You may need to update minor details in tests due to data model differences

Step 3: Cleanup Legacy Files

Once tests pass, clean up legacy infrastructure:

3a. Remove Index Entries

Remove manual index entries from cartography/data/indexes.cypher:

// Remove entries like these; the data model creates indexes automatically
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.lastupdated);

Note: Only remove indexes for nodes you’ve converted to the data model. Leave all other entries untouched.
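As a sketch of that selective removal, the helper below drops index lines only for the labels you name and keeps everything else. The function name `drop_index_lines` is hypothetical, and the regular expression assumes the `FOR (n:Label)` form shown above.

```python
import re

# Captures the node label in "FOR (n:Label)"
INDEX_LABEL = re.compile(r"FOR\s+\(\w+:(\w+)\)", re.IGNORECASE)


def drop_index_lines(index_text: str, converted_labels: set[str]) -> str:
    """Return index_text without index statements for the converted labels."""
    kept = []
    for line in index_text.splitlines():
        match = INDEX_LABEL.search(line)
        if match and match.group(1) in converted_labels:
            continue  # index is now created by the data model
        kept.append(line)
    return "\n".join(kept)


# EC2Subnet stands in for a label that has not been converted yet
indexes = """CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.id);
CREATE INDEX IF NOT EXISTS FOR (n:EC2Instance) ON (n.lastupdated);
CREATE INDEX IF NOT EXISTS FOR (n:EC2Subnet) ON (n.id);"""
print(drop_index_lines(indexes, {"EC2Instance"}))
```

Review the resulting diff by hand before committing, since other modules may still rely on an index for a label that looks converted.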

3b. Remove Cleanup Job Files

Remove corresponding cleanup JSON files from cartography/data/jobs/cleanup/:

# Remove files like:
rm cartography/data/jobs/cleanup/aws_import_ec2_instances_cleanup.json

Note: Only remove cleanup files for modules you’ve fully converted.
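Before deleting a cleanup file, it can be worth confirming that no Python module still references it. The following is a minimal sketch; `files_referencing` is a hypothetical helper, and the temporary directory stands in for the real source tree.

```python
import tempfile
from pathlib import Path


def files_referencing(root: Path, needle: str) -> list[str]:
    """Return relative paths of .py files under root whose text contains needle."""
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*.py")
        if needle in path.read_text(encoding="utf-8")
    )


# Demo on a throwaway directory instead of the real repository
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "instances.py").write_text(
        "run_cleanup_job('aws_import_ec2_instances_cleanup.json', s, p)\n",
        encoding="utf-8",
    )
    (root / "subnets.py").write_text("# already converted\n", encoding="utf-8")
    remaining = files_referencing(root, "aws_import_ec2_instances_cleanup.json")

print(remaining)  # a non-empty list means the JSON file is still in use
```

If the list is non-empty, finish converting those callers first and delete the JSON file afterwards.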

Common Refactoring Patterns

Pattern 1: Simple Node Migration

Most legacy nodes can be directly converted to data model schemas.

Pattern 2: Complex Relationships

For modules with complex relationships, you may need:

  • One-to-Many relationships (see Add Node Type)

  • Composite Node Pattern for nodes that get data from multiple sources

Things You May Encounter

Multiple Intel Modules Modifying Same Nodes

When refactoring modules that modify the same node type:

  • Use Simple Relationship Pattern if only referencing by ID

  • Use Composite Node Pattern for different views of the same entity from different data sources (see Add Relationship)

Legacy Test Adjustments

Older tests may need small tweaks:

  • Update expected property names if the data model changes them

  • Adjust relationship directions if needed

  • Remove tests for manual cleanup jobs (data model handles this)

Complex Cypher Queries

Some legacy queries are complex. Break them down:

  1. Identify what nodes/relationships are being created

  2. Map to data model schemas

  3. Use multiple load() calls if needed
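The first step above can be roughed out mechanically. The sketch below lists the node labels and relationship types that a legacy query merges; `summarize_merges` is a hypothetical helper, and the regular expressions only cover simple patterns like the ones in this guide, so treat the result as a starting inventory and not a parser.

```python
import re

# MERGE (var:Label ...) -> captures the node label
NODE_MERGE = re.compile(r"MERGE\s+\(\s*\w*\s*:\s*(\w+)")
# MERGE (a)-[r:TYPE]->(b) or MERGE (a)<-[r:TYPE]-(b) -> captures the relationship type
REL_MERGE = re.compile(r"MERGE\s+\(\w+\)\s*<?-\[\s*\w*\s*:\s*(\w+)\s*\]->?\s*\(\w+\)")


def summarize_merges(query: str) -> dict[str, list[str]]:
    """List node labels and relationship types created by MERGE clauses."""
    return {
        "node_labels": NODE_MERGE.findall(query),
        "rel_types": REL_MERGE.findall(query),
    }


legacy_query = """
UNWIND $instances_list as instance
MERGE (i:EC2Instance{id: instance.id})
SET i.lastupdated = $update_tag
WITH i
MATCH (owner:AWSAccount{id: $aws_account_id})
MERGE (owner)-[r:RESOURCE]->(i)
SET r.lastupdated = $update_tag
"""
summary = summarize_merges(legacy_query)
print(summary)
```

Each node label in the summary is a candidate node schema, and each relationship type is a candidate relationship schema to attach to it.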

What NOT to Test

Do NOT explicitly test cleanup functions unless there’s a specific concern:

  • The data model handles complex cleanup cases automatically

  • Testing cleanup adds unnecessary boilerplate

  • Focus tests on data ingestion, not cleanup behavior

When to Stop and Ask

Refactors can be complex. Stop and ask the user if you encounter:

  • Unclear business logic in legacy Cypher

  • Complex relationships that don’t map clearly to the data model

  • Test failures you can’t resolve

  • Multiple modules that seem interdependent

Refactoring Checklist

Before submitting a refactor:

  • [ ] Integration test exists and passes for the sync function

  • [ ] Data model schemas defined with proper relationships

  • [ ] Legacy load functions converted to use load()

  • [ ] Legacy cleanup functions converted to use GraphJob.from_node_schema()

  • [ ] Tests still pass after all changes

  • [ ] Index entries removed from indexes.cypher

  • [ ] Cleanup JSON files removed from cleanup directory

  • [ ] No regressions - all functionality preserved

Success Criteria

A successful refactor should:

  1. Preserve all functionality - tests pass

  2. Use data model - no handwritten Cypher for CRUD operations

  3. Clean up legacy files - indexes and cleanup jobs removed

  4. Maintain performance - no significant speed degradation

  5. Follow patterns - consistent with other modern modules