1022 enhance utf handling #1456
base: main
Conversation
…r to Validation_args test instantiations
    reader = DatasetJSONReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
Could you please move these to the top of the file among the other imports for consistency.
    reader = DatasetNDJSONReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
here too.
tests/unit/test_xpt_reader.py (Outdated)
    data = f.read()
    reader = XPTReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
and here.
    @click.option(
        "--encoding",
Could you please also add the short-form flag for encoding, for consistency. I also think we should restrict the set of values this flag accepts, which would ensure that only valid encoding names are passed to the internal engine functionality. Please update README.md for the new flag as well.
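One way to restrict the values is `type=click.Choice([...])`; if we keep the `validate_encoding` callback that appears later in this diff, a minimal sketch could look like the following (the allow-list itself is an assumption, the PR would decide the real set):

```python
import codecs

import click

# Hypothetical allow-list; normalized via codecs.lookup so aliases such as
# "UTF8" or "latin1" match their canonical names.
SUPPORTED_ENCODINGS = {
    codecs.lookup(e).name
    for e in ("utf-8", "utf-16", "utf-32", "cp1252", "latin-1")
}


def validate_encoding(ctx, param, value):
    """Click callback: fail fast on unknown or unsupported encoding names."""
    if value is None:
        return value
    try:
        name = codecs.lookup(value).name  # raises LookupError for unknown names
    except LookupError:
        raise click.BadParameter(f"Unknown encoding: {value!r}")
    if name not in SUPPORTED_ENCODINGS:
        raise click.BadParameter(
            f"Unsupported encoding: {value!r}. Choose from: {sorted(SUPPORTED_ENCODINGS)}"
        )
    return name
```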
…te README documentation
…org/cdisc-rules-engine into 1022-Enhance-UTF-Handling
…tation parameter in DummyDataService
…dataset metadata reading failures
RamilCDISC left a comment:
All looks good to me now. @gerrycampion could you please confirm if you are okay with the new logic and new function for validating the encoding in core.py? I will put the closing comment after your confirmation.
@RakeshBobba03 @SFJohnson24 @gerrycampion The current approach tries to read JSON as UTF-8/16/32 and XPT as UTF-8/latin-1/cp1252/UTF-16/32. Another reason to avoid this: in theory, data encoded with one encoding can be read without errors by another encoding, yet yield different characters. For example, the UTF-8 bytes for "ã" also decode without error in latin-1, but as "Ã£"; if UTF-8 is tried first it will "work", while reading the same bytes with latin-1 gives a different result. Specifying UTF-8 as the default would be more straightforward and clearer for the user. I think it is better to take the explicit approach: the default encoding is UTF-8, and the user can specify another encoding. This will impact 1023, because if one of the encodings worked we would need to push it up to the top level so the report can use it later (the report should use the same encoding as the reader). I also think it can be important for a user to control which encoding is used, because when a regulatory agency reads and checks the data, they need to know what the correct encoding is. Do you think we can change it?
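A quick plain-Python illustration of that ambiguity:

```python
# UTF-8 encodes "ã" as the two bytes 0xC3 0xA3.
data = "ã".encode("utf-8")

# Both decodes succeed without raising, but they disagree:
print(data.decode("utf-8"))    # ã
print(data.decode("latin-1"))  # Ã£  (latin-1 maps each byte to one character)
```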
| """ | ||
| service_name = name or self._default_service_name | ||
| if service_name in self._reader_map: | ||
| return self._reader_map[service_name](self.dataset_implementation) |
From what I see in the previous logic, self.dataset_implementation was consistently passed to all reader classes in _reader_map.
In the updated logic, however, the USDM reader is instantiated without dataset_implementation. Could you please clarify why dataset_implementation is not being passed here?
My thinking was that the factory pattern already varies what parameters get passed based on what each reader actually needs. For example, XPTReader, DatasetJSONReader, and DatasetNDJSONReader get both dataset_implementation and encoding because they use both, while ParquetReader only gets dataset_implementation because that's what it needs (it doesn't accept encoding in its __init__ since it hardcodes UTF-8).
In the case of JSONReader, it only reads raw JSON and returns a dictionary, and it never actually creates Dataset objects (that happens later in USDMDataService). Also, throughout the codebase, JSONReader() is instantiated with no parameters when called directly (like in datasetjson_metadata_reader.py, usdm_data_service.py, dummy_data_service.py, etc.), so I was trying to match that pattern in the factory.
That said, I can see the argument for consistency: it would make the factory code more uniform and predictable. And since DataReaderInterface has a default parameter, passing dataset_implementation wouldn't break anything; it just wouldn't be used. Looking at it now, I realize that if we want consistency, we could pass dataset_implementation to JSONReader just like we do for ParquetReader (which also gets only dataset_implementation, not encoding).
I'm a bit torn between keeping it minimal versus keeping it consistent. What are your thoughts on this? Do you think the consistency benefit outweighs the redundancy, or does the current approach make sense?
@gerrycampion Could you please add your opinion. What do you think would be better approach for the codebase?
@DmitryMK I agree this approach would make the most sense.
    service_name = name or self._default_service_name
    if service_name in self._reader_map:
  -     return self._reader_map[service_name](self.dataset_implementation)
  +     reader_class = self._reader_map[service_name]
To answer the question, I think the simplest solution is to just add this to the DataReaderInterface init params. The implementing classes can decide whether or not to use it. No need for the different conditions in the factory.
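A rough sketch of that shape (the factory's `encoding` attribute and the `"utf-8"` default here are assumptions; the rest follows the diff above):

```python
from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset


class DataReaderInterface:
    # Every reader accepts both parameters; subclasses simply ignore what
    # they don't use (e.g. ParquetReader can ignore encoding).
    def __init__(self, dataset_implementation=PandasDataset, encoding: str = "utf-8"):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding


class DataReaderFactory:
    # _reader_map, _default_service_name, dataset_implementation, and encoding
    # are assumed to be initialized elsewhere, as in the existing factory.
    def get_service(self, name: str = None):
        service_name = name or self._default_service_name
        if service_name in self._reader_map:
            # One uniform call, no per-reader conditions:
            return self._reader_map[service_name](
                self.dataset_implementation, encoding=self.encoding
            )
```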
    def __init__(self, dataset_implementation=PandasDataset, encoding: str = None):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding
Remove this since it will be in the DataReaderInterface
    def __init__(self, dataset_implementation, encoding: str = None):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding
Remove this since it will be in the DataReaderInterface
    @property
    def _encoding(self):
        return self.encoding or "utf-8"
Remove this, since the default should be set in core.py and passed from the factory to the DataReaderInterface.
    def from_file(self, file_path, encoding: str = None):
        try:
            with open(file_path, "rb") as fp:
                json = load(fp)
                return json
        encoding = encoding or "utf-8"
Remove this, since the default should be set in core.py and passed from the factory to the DataReaderInterface.
    @click.option(
        "-e",
        "--encoding",
        default=None,
The help says this defaults to utf-8, so add the default "utf-8" here and remove it from the hardcoded locations in the rest of the code. Also ensure there is a default encoding value of utf-8 when core is called from the rule tester. Maybe the DataReaderFactory or DataReaderInterface should own the default encoding, but we shouldn't set it at the data-reader subclass level.
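So roughly (a sketch reusing the `validate_encoding` callback outlined earlier; `show_default` keeps the help text honest):

```python
import click


@click.command()
@click.option(
    "-e",
    "--encoding",
    default="utf-8",  # the single place the default is defined
    show_default=True,
    callback=validate_encoding,  # assumed importable from wherever it lives
    help="File encoding for reading datasets. "
    "Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc.",
)
def validate(encoding):
    ...
```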
        callback=validate_encoding,
        help=(
            "File encoding for reading datasets. "
            "If not specified, defaults to UTF-8. "
"UTF-8" or "utf-8"? I think either is valid, but it's good to be consistent.
| "[████████████████████████████--------] | ||
| 78%"is printed. | ||
| -jcf, --jsonata-custom-functions Pair containing a variable name and a Path to directory containing a set of custom JSONata functions. Can be specified multiple times | ||
| -e, --encoding TEXT File encoding for reading datasets. If not specified, defaults to UTF-8. Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc. |
consistent capitalization again
Would it be a lot of work to add tests for failing gracefully with the utf-8 encoding and passing with at least one other encoding?
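Something along these lines might be enough; a pytest sketch, assuming `JSONReader` ends up taking `encoding` in its constructor as discussed above (the import path, signature, and the surfaced `UnicodeDecodeError` are assumptions):

```python
import json

import pytest

from cdisc_rules_engine.services.data_readers.json_reader import JSONReader  # path assumed


@pytest.fixture
def utf16_json_file(tmp_path):
    # A file whose bytes are valid UTF-16 but invalid UTF-8 (the UTF-16 BOM
    # alone already breaks a UTF-8 decode).
    path = tmp_path / "data.json"
    path.write_bytes(json.dumps({"name": "ã"}, ensure_ascii=False).encode("utf-16"))
    return str(path)


def test_fails_gracefully_with_utf8(utf16_json_file):
    with pytest.raises(UnicodeDecodeError):
        JSONReader(encoding="utf-8").from_file(utf16_json_file)


def test_passes_with_matching_encoding(utf16_json_file):
    assert JSONReader(encoding="utf-16").from_file(utf16_json_file) == {"name": "ã"}
```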
    def read_json_file(self, file_path: str) -> dict:
  -     return JSONReader().from_file(file_path)
  +     return JSONReader().from_file(file_path, encoding=self.encoding)
JSONReader should also be modified so that it uses the encoding from the interface. The encoding here, and in the other calls to from_file, will need to be passed to the constructor, not to from_file.
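That is, something like this sketch (module path assumed; the base class is the DataReaderInterface discussed above), so call sites set the encoding once at construction:

```python
from json import load

from cdisc_rules_engine.services.data_readers.data_reader_interface import (
    DataReaderInterface,  # path assumed
)


class JSONReader(DataReaderInterface):
    def from_file(self, file_path):
        # self.encoding was set once in the constructor (core.py -> factory ->
        # interface), so every call reads with the same, explicit encoding.
        with open(file_path, encoding=self.encoding) as fp:
            return load(fp)
```

The call above then becomes `JSONReader(encoding=self.encoding).from_file(file_path)`.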
This PR adds UTF encoding support to the dataset readers to handle international characters and non-ASCII data. It adds an optional --encoding CLI parameter that propagates through the validation pipeline to all data readers and metadata readers. When encoding is not specified, the readers automatically detect encoding with fallbacks: JSON/NDJSON readers try UTF-8, UTF-16, and UTF-32 in sequence, while XPT readers try UTF-8, UTF-16, UTF-32, cp1252, and latin-1 to handle smart quotes and other Windows-1252 characters. This resolves UnicodeDecodeError issues when processing JSON files with international characters and XPT files containing non-UTF-8 characters from Excel exports. All readers (JSONReader, DatasetJSONReader, DatasetNDJSONReader, XPTReader) and metadata readers have been updated to support both explicit encoding specification and automatic detection.
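With the flag in place, an illustrative invocation (other flags elided, exact companion flags per the engine's documented CLI) could be:

```
python core.py validate --encoding utf-16 ...
```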