@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 7% (0.07x) speedup for ParseTableBase.serialize in python/ccxt/static_dependencies/lark/parsers/lalr_analysis.py

⏱️ Runtime: 1.33 milliseconds → 1.24 milliseconds (best of 216 runs)

📝 Explanation and details

The optimized code achieves a 7% speedup through two key micro-optimizations that reduce overhead in hot code paths:

1. Eliminated Dictionary Double Lookup in Enumerator.get
Changed from manual if item not in self.enums check + assignment to setdefault(item, len(self.enums)). This avoids the double hash table lookup (once for not in, once for assignment) when adding new items, which is particularly effective since the profiler shows this method is called 6,030 times with 1,164 new insertions.
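The pattern can be sketched with a minimal standalone `Enumerator` (a thin wrapper around a dict, with illustrative method names; not the actual lark source):

```python
class Enumerator:
    """Assigns a stable integer ID to each distinct item, in first-seen order."""
    def __init__(self):
        self.enums = {}

    def get_slow(self, item):
        # Before: two hash lookups when the item is new, plus one on return
        if item not in self.enums:              # lookup 1
            self.enums[item] = len(self.enums)  # lookup 2 (insert)
        return self.enums[item]                 # lookup 3

    def get(self, item):
        # After: setdefault inserts (if missing) and returns in a single lookup.
        # len(self.enums) is evaluated before the insert, so new items get
        # consecutive IDs starting at 0.
        return self.enums.setdefault(item, len(self.enums))

e = Enumerator()
assert e.get('a') == 0
assert e.get('b') == 1
assert e.get('a') == 0  # existing items keep their ID
```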

2. Reduced Attribute Lookups in ParseTableBase.serialize

  • Stored tokens.get as local variable tokens_get to avoid repeated attribute lookups
  • Stored Reduce as local variable Reduce_action for faster identity comparisons
  • Replaced dictionary comprehension with explicit loops to better leverage these local references

The optimization is most effective for large-scale serialization workloads - test results show the biggest gains (14-18% speedup) occur with hundreds of states and tokens, where the attribute lookup overhead compounds. The serialize method processes nested loops over states and tokens, making these micro-optimizations meaningful when called frequently.

Performance Context: Since this is part of a parser's serialization logic (LALR analysis), these optimizations directly benefit parser table generation and serialization workflows, where Enumerator.get assigns unique IDs to tokens and serialize processes the entire parse table structure.

Correctness verification report:

⚙️ Existing Unit Tests: 🔘 None Found
🌀 Generated Regression Tests: 1408 Passed
⏪ Replay Tests: 🔘 None Found
🔎 Concolic Coverage Tests: 🔘 None Found
📊 Tests Coverage: 100.0%
🌀 Generated Regression Tests and Runtime
from typing import Any, Dict, Generic, Tuple, TypeVar

# imports
import pytest
from ccxt.static_dependencies.lark.parsers.lalr_analysis import ParseTableBase

# --- Begin function to test and dependencies ---

# Minimal Action implementation for testing
class Action:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Action) and self.name == other.name
    def __hash__(self):
        return hash(self.name)
    def __repr__(self):
        return f"Action({self.name!r})"

Reduce = Action('Reduce')


StateT = TypeVar("StateT")

# --- End function to test and dependencies ---

# Helper class for testing .serialize
class DummySerializable:
    def __init__(self, value):
        self.value = value
    def serialize(self, memo):
        # Return value + memo for test purposes
        return (self.value, memo)

# -------------------- Unit Tests --------------------

# Basic Test Cases

def test_serialize_basic_single_state_single_token():
    # Test with one state, one token, non-Reduce action
    states = {
        'S0': {'a': (Action('Shift'), 42)}
    }
    start_states = {'main': 'S0'}
    end_states = {'main': 'S0'}
    pt = ParseTableBase(states, start_states, end_states)
    memo = 'memo1'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 4.50μs -> 4.21μs (6.67% faster)

def test_serialize_basic_reduce_action():
    # Test with Reduce action and serializable argument
    states = {
        'S0': {'b': (Reduce, DummySerializable(99))}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S0'})
    memo = 'memo2'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 3.53μs -> 3.15μs (11.9% faster)

def test_serialize_basic_multiple_states_tokens():
    # Test with multiple states and tokens
    states = {
        'S0': {'a': (Action('Shift'), 1), 'b': (Reduce, DummySerializable(2))},
        'S1': {'b': (Action('Goto'), 3)}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S1'})
    memo = 'memo3'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 4.51μs -> 4.19μs (7.54% faster)
    # tokens should enumerate 'a' and 'b' as 0 and 1 (order not guaranteed)
    tokens = result['tokens']
    # Each token used in states is mapped to exactly one index
    for state, actions in states.items():
        for token in actions:
            assert sum(1 for v in tokens.values() if v == token) == 1
    # Check Reduce serialization: Reduce entries become (1, arg.serialize(memo))
    for idx, (action, arg) in result['states']['S0'].items():
        if tokens[idx] == 'b':
            assert (action, arg) == (1, (2, memo))
        else:
            assert (action, arg) == (0, 1)

# Edge Test Cases

def test_serialize_empty_states():
    # Test with no states
    pt = ParseTableBase({}, {}, {})
    memo = 'memo4'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 2.29μs -> 2.19μs (4.85% faster)

def test_serialize_state_with_no_actions():
    # State exists, but has no actions/tokens
    states = {
        'S0': {}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S0'})
    memo = 'memo5'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 2.68μs -> 2.36μs (13.7% faster)

def test_serialize_tokens_with_non_string_names():
    # Tokens are not strings (e.g., integers, tuples)
    states = {
        'S0': {1: (Action('Shift'), 10), (2, 'x'): (Reduce, DummySerializable(20))}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S0'})
    memo = 'memo6'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 4.11μs -> 4.02μs (2.27% faster)
    tokens = result['tokens']
    # Check Reduce serialization for the tuple token
    for idx, (action, arg) in result['states']['S0'].items():
        if tokens[idx] == (2, 'x'):
            assert (action, arg) == (1, (20, memo))
        else:
            assert (action, arg) == (0, 10)

def test_serialize_duplicate_tokens_in_different_states():
    # Same token in different states should be mapped to the same index
    states = {
        'S0': {'a': (Action('Shift'), 1)},
        'S1': {'a': (Reduce, DummySerializable(2))}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S1'})
    memo = 'memo7'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 4.01μs -> 3.75μs (7.05% faster)
    tokens = result['tokens']
    # Only one index for 'a', shared by both states
    indices = [idx for idx, tok in tokens.items() if tok == 'a']
    assert len(indices) == 1

def test_serialize_action_identity_vs_equality():
    # Reduce action is checked by identity, not equality
    states = {
        'S0': {'a': (Action('Reduce'), DummySerializable(1)), 'b': (Reduce, DummySerializable(2))}
    }
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': 'S0'})
    memo = 'memo8'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 3.85μs -> 3.72μs (3.28% faster)
    tokens = result['tokens']
    # Only 'b' should be serialized as Reduce (identity check, not equality)
    for idx, (action, arg) in result['states']['S0'].items():
        token = tokens[idx]
        if token == 'b':
            assert action == 1
        else:
            assert action == 0

def test_serialize_start_end_states_with_non_string_keys():
    # start_states and end_states with non-string keys
    states = {
        0: {'a': (Action('Shift'), 1)}
    }
    start_states = {('main', 1): 0}
    end_states = {('main', 2): 0}
    pt = ParseTableBase(states, start_states, end_states)
    memo = 'memo9'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 3.38μs -> 3.11μs (8.88% faster)

# Large Scale Test Cases

def test_serialize_large_number_of_tokens_and_states():
    # 100 states, each with 10 tokens
    num_states = 100
    num_tokens = 10
    states = {}
    for i in range(num_states):
        actions = {}
        for j in range(num_tokens):
            token = f"tok_{j}"
            # Alternate between Reduce and non-Reduce
            if (i + j) % 2 == 0:
                actions[token] = (Reduce, DummySerializable(i * 100 + j))
            else:
                actions[token] = (Action('Shift'), i * 100 + j)
        states[f"S{i}"] = actions
    start_states = {'main': 'S0'}
    end_states = {'main': f"S{num_states-1}"}
    pt = ParseTableBase(states, start_states, end_states)
    memo = 'memo_large'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 158μs -> 138μs (14.3% faster)
    # Each state should have 10 actions
    for state in states:
        assert len(result['states'][state]) == num_tokens
    # Check Reduce serialization for a few states
    for state in ['S0', 'S1', 'S2']:
        for idx, (action, arg) in result['states'][state].items():
            token = result['tokens'][idx]
            i = int(state[1:])
            j = int(token.split('_')[1])
            if (i + j) % 2 == 0:
                assert (action, arg) == (1, (i * 100 + j, memo))
            else:
                assert (action, arg) == (0, i * 100 + j)

def test_serialize_large_tokens_non_string():
    # 100 tokens that are tuples
    num_states = 10
    num_tokens = 100
    states = {}
    for i in range(num_states):
        actions = {}
        for j in range(num_tokens):
            token = (j, f"tok{j}")
            actions[token] = (Reduce, DummySerializable(i * 100 + j))
        states[f"S{i}"] = actions
    pt = ParseTableBase(states, {'main': 'S0'}, {'main': f"S{num_states-1}"})
    memo = 'memo_large2'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 183μs -> 177μs (3.65% faster)
    # Each state should have 100 actions
    for state in states:
        assert len(result['states'][state]) == num_tokens
    # Spot check Reduce serialization: every action here is Reduce
    for state in ['S0', 'S9']:
        for idx, (action, arg) in result['states'][state].items():
            assert action == 1

def test_serialize_performance_large_scale():
    # Test that large scale does not take excessive time or memory
    num_states = 50
    num_tokens = 20
    states = {}
    for i in range(num_states):
        actions = {}
        for j in range(num_tokens):
            token = f"token_{j}"
            actions[token] = (Reduce if j % 2 == 0 else Action('Shift'), DummySerializable(i * 100 + j))
        states[f"State_{i}"] = actions
    pt = ParseTableBase(states, {'main': 'State_0'}, {'main': f"State_{num_states-1}"})
    memo = 'memo_perf'
    codeflash_output = pt.serialize(memo); result = codeflash_output # 142μs -> 135μs (5.33% faster)
    # Spot check a few entries for correctness
    for i in [0, num_states-1]:
        state = f"State_{i}"
        for idx, (action, arg) in result['states'][state].items():
            token = result['tokens'][idx]
            j = int(token.split('_')[1])
            if j % 2 == 0:
                assert (action, arg) == (1, (i * 100 + j, memo))
            else:
                assert action == 0
                assert arg is states[state][token][1]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from typing import Any, Dict, Generic, Tuple, TypeVar

# imports
import pytest
from ccxt.static_dependencies.lark.parsers.lalr_analysis import ParseTableBase


# Dummy Action and Reduce for compatibility
class Action:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Action) and self.name == other.name
    def __hash__(self):
        return hash(self.name)
    def __repr__(self):
        return f"Action({self.name!r})"

Reduce = Action('Reduce')

StateT = TypeVar("StateT")

class DummyArg:
    """Dummy argument for testing .serialize(memo) calls."""
    def __init__(self, value):
        self.value = value
    def serialize(self, memo):
        # For testing, return a tuple with the memo and value
        return (memo, self.value)
    def __eq__(self, other):
        return isinstance(other, DummyArg) and self.value == other.value
    def __repr__(self):
        return f"DummyArg({self.value!r})"

# unit tests

# --- Basic Test Cases ---

def test_serialize_basic_single_state_single_token():
    # Basic: One state, one token, non-Reduce action
    states = {
        0: {'a': (Action('Shift'), 'arg1')}
    }
    start_states = {'S': 0}
    end_states = {'E': 0}
    ptb = ParseTableBase(states, start_states, end_states)
    memo = 'memo1'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 4.32μs -> 4.10μs (5.46% faster)

def test_serialize_basic_reduce_action():
    # Basic: One state, one token, Reduce action
    arg = DummyArg('foo')
    states = {
        1: {'b': (Reduce, arg)}
    }
    ptb = ParseTableBase(states, {'S': 1}, {'E': 1})
    memo = 'memo2'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 3.48μs -> 3.37μs (3.17% faster)

def test_serialize_basic_multiple_states_tokens():
    # Basic: Multiple states, multiple tokens, mixed actions
    arg1 = DummyArg('foo')
    arg2 = DummyArg('bar')
    states = {
        0: {'x': (Action('Shift'), 'argX'), 'y': (Reduce, arg1)},
        1: {'y': (Action('Shift'), 'argY'), 'z': (Reduce, arg2)}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 1})
    memo = 'memo3'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 4.85μs -> 4.55μs (6.44% faster)
    # tokens should map 0,1,2 to x,y,z in order of first appearance
    expected_tokens = {}
    all_tokens = []
    for state_actions in states.values():
        for token in state_actions:
            if token not in all_tokens:
                all_tokens.append(token)
    for i, token in enumerate(all_tokens):
        expected_tokens[i] = token
    assert result['tokens'] == expected_tokens
    # For each state, for each token, check the serialized tuple
    for state, actions in states.items():
        for token, (action, arg) in actions.items():
            token_idx = [k for k, v in result['tokens'].items() if v == token][0]
            if action is Reduce:
                expected = (1, arg.serialize(memo))
            else:
                expected = (0, arg)
            assert result['states'][state][token_idx] == expected

# --- Edge Test Cases ---

def test_serialize_empty_states():
    # Edge: No states
    states = {}
    start_states = {}
    end_states = {}
    ptb = ParseTableBase(states, start_states, end_states)
    memo = 'memo4'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 2.31μs -> 2.15μs (7.19% faster)

def test_serialize_state_with_no_actions():
    # Edge: State with no actions
    states = {
        0: {}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 0})
    memo = 'memo5'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 2.57μs -> 2.37μs (8.52% faster)

def test_serialize_duplicate_tokens_in_different_states():
    # Edge: Same token in different states
    arg1 = DummyArg('foo')
    arg2 = DummyArg('bar')
    states = {
        0: {'dup': (Action('Shift'), 'arg1')},
        1: {'dup': (Reduce, arg2)}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 1})
    memo = 'memo6'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 4.05μs -> 3.83μs (5.85% faster)
    # 'dup' gets a single shared index across both states
    token_idx = list(result['tokens'].keys())[0]
    assert len(result['tokens']) == 1
    assert result['tokens'][token_idx] == 'dup'

def test_serialize_token_ordering_consistency():
    # Edge: Token order should be consistent with first appearance
    states = {
        0: {'a': (Action('Shift'), 1), 'b': (Action('Shift'), 2)},
        1: {'b': (Action('Shift'), 3), 'c': (Action('Shift'), 4)},
        2: {'a': (Action('Shift'), 5), 'c': (Action('Shift'), 6)}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 2})
    memo = 'memo7'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 5.34μs -> 5.10μs (4.87% faster)
    # Tokens are indexed in order of first appearance: a, b, c
    assert result['tokens'] == {0: 'a', 1: 'b', 2: 'c'}
    # Every token used in states has an index
    for state, actions in states.items():
        for token in actions:
            assert token in result['tokens'].values()

def test_serialize_non_string_tokens():
    # Edge: Tokens are not strings (e.g. integers, tuples)
    arg = DummyArg('baz')
    states = {
        0: {42: (Action('Shift'), 'int_token'), (1,2): (Reduce, arg)}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 0})
    memo = 'memo8'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 4.01μs -> 3.69μs (8.68% faster)

def test_serialize_action_identity_vs_equality():
    # Edge: Only 'is Reduce' triggers serialization, not ==Reduce
    arg = DummyArg('baz')
    states = {
        0: {'tok': (Action('Reduce'), arg)},  # not 'is' Reduce
        1: {'tok': (Reduce, arg)}
    }
    ptb = ParseTableBase(states, {'S': 0}, {'E': 1})
    memo = 'memo9'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 3.95μs -> 3.53μs (12.0% faster)
    token_idx = [k for k, v in result['tokens'].items() if v == 'tok'][0]
    # State 0 used Action('Reduce') (equal but not identical): serialized as non-Reduce
    assert result['states'][0][token_idx][0] == 0
    # State 1 used the Reduce singleton: serialized as Reduce
    assert result['states'][1][token_idx] == (1, arg.serialize(memo))

# --- Large Scale Test Cases ---

def test_serialize_large_number_of_states_and_tokens():
    # Large scale: 100 states, each with 10 tokens
    num_states = 100
    num_tokens = 10
    states = {}
    token_names = [f"tok{i}" for i in range(num_tokens)]
    args = [DummyArg(i) for i in range(num_tokens)]
    for s in range(num_states):
        actions = {}
        for t in range(num_tokens):
            # Alternate Reduce and non-Reduce actions
            action = Reduce if (t % 2 == 0) else Action('Shift')
            arg = args[t]
            actions[token_names[t]] = (action, arg)
        states[s] = actions
    start_states = {'S': 0}
    end_states = {'E': num_states-1}
    ptb = ParseTableBase(states, start_states, end_states)
    memo = 'memo_large'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 144μs -> 127μs (14.0% faster)
    # tokens: 0..9 mapped to tok0..tok9
    for idx, name in result['tokens'].items():
        assert name == f"tok{idx}"
    # states: for each state, for each token, check the serialized tuple
    for s in range(num_states):
        for t in range(num_tokens):
            token = token_names[t]
            token_idx = [k for k, v in result['tokens'].items() if v == token][0]
            action, arg = states[s][token]
            if action is Reduce:
                expected = (1, arg.serialize(memo))
            else:
                expected = (0, arg)
            assert result['states'][s][token_idx] == expected

def test_serialize_large_number_of_tokens_unique():
    # Large scale: One state, 1000 unique tokens
    num_tokens = 1000
    tokens = [f"t{i}" for i in range(num_tokens)]
    args = [DummyArg(i) for i in range(num_tokens)]
    actions = {}
    for i in range(num_tokens):
        action = Reduce if i % 3 == 0 else Action('Shift')
        actions[tokens[i]] = (action, args[i])
    states = {0: actions}
    ptb = ParseTableBase(states, {'S': 0}, {'E': 0})
    memo = 'memo_many'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 203μs -> 201μs (0.938% faster)
    # tokens: 0..999 mapped to t0..t999
    for idx, name in result['tokens'].items():
        assert name == f"t{idx}"
    # states: for each token, check the serialized tuple
    for i in range(num_tokens):
        token = tokens[i]
        token_idx = [k for k, v in result['tokens'].items() if v == token][0]
        action, arg = actions[token]
        if action is Reduce:
            expected = (1, arg.serialize(memo))
        else:
            expected = (0, arg)
        assert result['states'][0][token_idx] == expected

def test_serialize_large_number_of_states_sparse_tokens():
    # Large scale: 500 states, each with 2 tokens, tokens overlap
    num_states = 500
    tokens = ['A', 'B']
    states = {}
    for s in range(num_states):
        actions = {}
        for t in tokens:
            action = Reduce if s % 2 == 0 else Action('Shift')
            arg = DummyArg(s*10 + ord(t))
            actions[t] = (action, arg)
        states[s] = actions
    ptb = ParseTableBase(states, {'S': 0}, {'E': num_states-1})
    memo = 'memo_sparse'
    codeflash_output = ptb.serialize(memo); result = codeflash_output # 235μs -> 198μs (18.3% faster)
    # states: for each state, for each token, check the serialized tuple
    # ('A' and 'B' are enumerated first, so their indices are 0 and 1)
    for s in range(num_states):
        for t_idx, t in enumerate(tokens):
            action, arg = states[s][t]
            if action is Reduce:
                expected = (1, arg.serialize(memo))
            else:
                expected = (0, arg)
            assert result['states'][s][t_idx] == expected
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-ParseTableBase.serialize-mhx7we4a and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 09:19
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025