1022 enhance utf handling #1456
base: main
Conversation
…r to Validation_args test instantiations
    reader = DatasetJSONReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
Could you please move these to the top of the file among the other imports for consistency.
    reader = DatasetNDJSONReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
here too.
tests/unit/test_xpt_reader.py (Outdated)
    data = f.read()
    reader = XPTReader()
    from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
and here.
    @click.option(
        "--encoding",
Could you please also add the short-form flag for encoding, for consistency. I also think we should restrict the set of values this flag accepts, which would ensure that only valid encoding names are passed to the internal engine functionality. Please update README.md for the new flag as well.
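One way to restrict the values is `type=click.Choice([...])`; if we keep the `validate_encoding` callback that appears later in this diff, a minimal sketch could look like the following (the allow-list itself is an assumption, the PR would decide the real set):

```python
import codecs

import click

# Hypothetical allow-list; normalized via codecs.lookup so aliases such as
# "UTF8" or "latin1" match their canonical names.
SUPPORTED_ENCODINGS = {
    codecs.lookup(e).name
    for e in ("utf-8", "utf-16", "utf-32", "cp1252", "latin-1")
}


def validate_encoding(ctx, param, value):
    """Click callback: fail fast on unknown or unsupported encoding names."""
    if value is None:
        return value
    try:
        name = codecs.lookup(value).name  # raises LookupError for unknown names
    except LookupError:
        raise click.BadParameter(f"Unknown encoding: {value!r}")
    if name not in SUPPORTED_ENCODINGS:
        raise click.BadParameter(
            f"Unsupported encoding: {value!r}. Choose from: {sorted(SUPPORTED_ENCODINGS)}"
        )
    return name
```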
…te README documentation
…org/cdisc-rules-engine into 1022-Enhance-UTF-Handling
…tation parameter in DummyDataService
…dataset metadata reading failures
RamilCDISC left a comment:
All looks good to me now. @gerrycampion could you please confirm if you are okay with the new logic and new function for validating the encoding in core.py? I will put the closing comment after your confirmation.
@RakeshBobba03 @SFJohnson24 @gerrycampion The current approach tries to read JSON as UTF-8/16/32 and XPT as UTF-8/latin-1/cp1252/UTF-16/32. Another reason to avoid this: in theory, data encoded with one encoding can be read without errors by another encoding, yet yield different characters. For example, the UTF-8 bytes for "ã" also decode without error in latin-1, but as "Ã£"; if UTF-8 is tried first it will "work", while reading the same bytes with latin-1 gives a different result. Specifying UTF-8 as the default would be more straightforward and clearer for the user. I think it is better to take the explicit approach: the default encoding is UTF-8, and the user can specify another encoding. This will impact 1023, because if one of the encodings worked we would need to push it up to the top level so the report can use it later (the report should use the same encoding as the reader). I also think it can be important for a user to control which encoding is used, because when a regulatory agency reads and checks the data, they need to know what the correct encoding is. Do you think we can change it?
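A quick plain-Python illustration of that ambiguity:

```python
# UTF-8 encodes "ã" as the two bytes 0xC3 0xA3.
data = "ã".encode("utf-8")

# Both decodes succeed without raising, but they disagree:
print(data.decode("utf-8"))    # ã
print(data.decode("latin-1"))  # Ã£  (latin-1 maps each byte to one character)
```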
| """ | ||
| service_name = name or self._default_service_name | ||
| if service_name in self._reader_map: | ||
| return self._reader_map[service_name](self.dataset_implementation) |
From what I see in the previous logic, self.dataset_implementation was consistently passed to all reader classes in _reader_map.
In the updated logic, however, the USDM reader is instantiated without dataset_implementation. Could you please clarify why dataset_implementation is not being passed here?
My thinking was that the factory pattern already varies what parameters get passed based on what each reader actually needs. For example, XPTReader, DatasetJSONReader, and DatasetNDJSONReader get both dataset_implementation and encoding because they use both, while ParquetReader only gets dataset_implementation because that's what it needs (it doesn't accept encoding in its __init__ since it hardcodes UTF-8).
In the case of JSONReader, it only reads raw JSON and returns a dictionary, and it never actually creates Dataset objects (that happens later in USDMDataService). Also, throughout the codebase, JSONReader() is instantiated with no parameters when called directly (like in datasetjson_metadata_reader.py, usdm_data_service.py, dummy_data_service.py, etc.), so I was trying to match that pattern in the factory.
That said, I can see the argument for consistency: it would make the factory code more uniform and predictable. And since DataReaderInterface has a default parameter, passing dataset_implementation wouldn't break anything; it just wouldn't be used. Looking at it now, I realize that if we want consistency, we could pass dataset_implementation to JSONReader just like we do for ParquetReader (which also gets only dataset_implementation, not encoding).
I'm a bit torn between keeping it minimal versus keeping it consistent. What are your thoughts on this? Do you think the consistency benefit outweighs the redundancy, or does the current approach make sense?
@gerrycampion Could you please add your opinion. What do you think would be better approach for the codebase?
@DmitryMK I agree this approach would make the most sense.
    service_name = name or self._default_service_name
    if service_name in self._reader_map:
  -     return self._reader_map[service_name](self.dataset_implementation)
  +     reader_class = self._reader_map[service_name]
To answer the question, I think the simplest solution is to just add this to the DataReaderInterface init params. The implementing classes can decide whether or not to use it. No need for the different conditions in the factory.
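A rough sketch of that shape (the factory's `encoding` attribute and the `"utf-8"` default here are assumptions; the rest follows the diff above):

```python
from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset


class DataReaderInterface:
    # Every reader accepts both parameters; subclasses simply ignore what
    # they don't use (e.g. ParquetReader can ignore encoding).
    def __init__(self, dataset_implementation=PandasDataset, encoding: str = "utf-8"):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding


class DataReaderFactory:
    # _reader_map, _default_service_name, dataset_implementation, and encoding
    # are assumed to be initialized elsewhere, as in the existing factory.
    def get_service(self, name: str = None):
        service_name = name or self._default_service_name
        if service_name in self._reader_map:
            # One uniform call, no per-reader conditions:
            return self._reader_map[service_name](
                self.dataset_implementation, encoding=self.encoding
            )
```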
    def __init__(self, dataset_implementation=PandasDataset, encoding: str = None):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding
Remove this since it will be in the DataReaderInterface
    def __init__(self, dataset_implementation, encoding: str = None):
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding
Remove this since it will be in the DataReaderInterface
    @property
    def _encoding(self):
        return self.encoding or "utf-8"
Remove this, since the default should be set in core.py and passed from the factory to the DataReaderInterface.
    def from_file(self, file_path, encoding: str = None):
        try:
            with open(file_path, "rb") as fp:
                json = load(fp)
                return json
        encoding = encoding or "utf-8"
Remove this, since the default should be set in core.py and passed from the factory to the DataReaderInterface.
    @click.option(
        "-e",
        "--encoding",
        default=None,
The help says this defaults to utf-8, so add the default "utf-8" here and remove it from the hardcoded locations in the rest of the code. Also ensure there is a default encoding value of utf-8 when core is called from the rule tester. Maybe the DataReaderFactory or DataReaderInterface should own the default encoding, but we shouldn't set it at the data-reader subclass level.
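So roughly (a sketch reusing the `validate_encoding` callback outlined earlier; `show_default` keeps the help text honest):

```python
import click


@click.command()
@click.option(
    "-e",
    "--encoding",
    default="utf-8",  # the single place the default is defined
    show_default=True,
    callback=validate_encoding,  # assumed importable from wherever it lives
    help="File encoding for reading datasets. "
    "Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc.",
)
def validate(encoding):
    ...
```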
        callback=validate_encoding,
        help=(
            "File encoding for reading datasets. "
            "If not specified, defaults to UTF-8. "
"UTF-8" or "utf-8"? I think either is valid, but it's good to be consistent.
| "[████████████████████████████--------] | ||
| 78%"is printed. | ||
| -jcf, --jsonata-custom-functions Pair containing a variable name and a Path to directory containing a set of custom JSONata functions. Can be specified multiple times | ||
| -e, --encoding TEXT File encoding for reading datasets. If not specified, defaults to UTF-8. Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc. |
consistent capitalization again
Would it be a lot of work to add tests for failing gracefully with the utf-8 encoding and passing with at least one other encoding?
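Something along these lines might be enough; a pytest sketch, assuming `JSONReader` ends up taking `encoding` in its constructor as discussed above (the import path, signature, and the surfaced `UnicodeDecodeError` are assumptions):

```python
import json

import pytest

from cdisc_rules_engine.services.data_readers.json_reader import JSONReader  # path assumed


@pytest.fixture
def utf16_json_file(tmp_path):
    # A file whose bytes are valid UTF-16 but invalid UTF-8 (the UTF-16 BOM
    # alone already breaks a UTF-8 decode).
    path = tmp_path / "data.json"
    path.write_bytes(json.dumps({"name": "ã"}, ensure_ascii=False).encode("utf-16"))
    return str(path)


def test_fails_gracefully_with_utf8(utf16_json_file):
    with pytest.raises(UnicodeDecodeError):
        JSONReader(encoding="utf-8").from_file(utf16_json_file)


def test_passes_with_matching_encoding(utf16_json_file):
    assert JSONReader(encoding="utf-16").from_file(utf16_json_file) == {"name": "ã"}
```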
    def read_json_file(self, file_path: str) -> dict:
  -     return JSONReader().from_file(file_path)
  +     return JSONReader().from_file(file_path, encoding=self.encoding)
JSONReader should also be modified so that it uses the encoding from the interface. The encoding here, and in the other calls to from_file, will need to be passed to the constructor, not to from_file.
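That is, something like this sketch (module path assumed; the base class is the DataReaderInterface discussed above), so call sites set the encoding once at construction:

```python
from json import load

from cdisc_rules_engine.services.data_readers.data_reader_interface import (
    DataReaderInterface,  # path assumed
)


class JSONReader(DataReaderInterface):
    def from_file(self, file_path):
        # self.encoding was set once in the constructor (core.py -> factory ->
        # interface), so every call reads with the same, explicit encoding.
        with open(file_path, encoding=self.encoding) as fp:
            return load(fp)
```

The call above then becomes `JSONReader(encoding=self.encoding).from_file(file_path)`.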
This PR adds UTF encoding support to the dataset readers to handle international characters and non-ASCII data. It adds an optional --encoding CLI parameter that propagates through the validation pipeline to all data readers and metadata readers. When encoding is not specified, the readers automatically detect encoding with fallbacks: JSON/NDJSON readers try UTF-8, UTF-16, and UTF-32 in sequence, while XPT readers try UTF-8, UTF-16, UTF-32, cp1252, and latin-1 to handle smart quotes and other Windows-1252 characters. This resolves UnicodeDecodeError issues when processing JSON files with international characters and XPT files containing non-UTF-8 characters from Excel exports. All readers (JSONReader, DatasetJSONReader, DatasetNDJSONReader, XPTReader) and metadata readers have been updated to support both explicit encoding specification and automatic detection.
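With the flag in place, an illustrative invocation (other flags elided, exact companion flags per the engine's documented CLI) could be:

```
python core.py validate --encoding utf-16 ...
```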