Avoid errors on array full nan values #4
Conversation
Walkthrough: The changes in this pull request primarily focus on improving the handling of NaN values.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
tests/test.py (2)
12-21: Consider enhancing path resolution robustness
While the current implementation is an improvement with fallback logic, consider these enhancements:
- Add error handling if both paths fail
- Consider making the data directory configurable via environment variable
```diff
 DATA_DIR = os.path.join(
     os.path.abspath(os.getcwd()), "datas"
 )
 if not os.path.isdir(DATA_DIR):
     DATA_DIR = os.path.join(
         os.path.abspath(os.getcwd()), "tests", "datas"
     )
+if not os.path.isdir(DATA_DIR):
+    DATA_DIR = os.getenv('PYFINA_DATA_DIR', DATA_DIR)
+    if not os.path.isdir(DATA_DIR):
+        raise FileNotFoundError(f"Could not find data directory in any location: {DATA_DIR}")
```
45-45: Consider conditional plot display for automated environments
While adding `plt.show()` improves interactive testing, consider making it conditional for automated environments.

```diff
 figure.savefig(f"feed_{FEED_NB}.png")
-plt.show()
+if os.environ.get('DISPLAY_PLOTS', '').lower() == 'true':
+    plt.show()
```

PyFina/PyFina.py (3)
148-154: LGTM! The NaN handling logic is robust.
The implementation correctly handles arrays with NaN values by:
- Checking for the presence of finite values
- Finding the first non-NaN value and its index
- Backfilling NaN values at the start if needed
Consider making the condition more explicit for better readability:

```diff
-if finiteness_obj.sum() > 0:
+if np.any(finiteness_obj):  # More explicit way to check for any finite values
```
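As a standalone illustration (not PyFina code), the two checks agree on any boolean mask:

```python
import numpy as np

# Mask of finite entries; both expressions ask "is any entry finite?"
data = np.array([np.nan, np.nan, 3.0, 4.0])
finiteness = np.isfinite(data)

print(finiteness.sum() > 0)  # True
print(np.any(finiteness))    # True

# On an all-NaN array both are False, so the guarded branch is skipped
print(np.any(np.isfinite(np.full(4, np.nan))))  # False
```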
152-154: Consider optimizing the array creation.
The current implementation creates a temporary array of ones and then multiplies it. This can be done more efficiently using `np.full()`.

```diff
-    obj[:first_non_nan_index] = (
-        np.ones(first_non_nan_index) * first_non_nan_value
-    )
+    obj[:first_non_nan_index] = np.full(first_non_nan_index, first_non_nan_value)
```
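A quick standalone check that the two forms are equivalent (illustrative variable values, not from the codebase):

```python
import numpy as np

first_non_nan_index = 3
first_non_nan_value = 21.5

# Identical results; np.full allocates and fills in one step,
# avoiding the temporary array of ones and the multiply
via_ones = np.ones(first_non_nan_index) * first_non_nan_value
via_full = np.full(first_non_nan_index, first_non_nan_value)

print(np.array_equal(via_ones, via_full))  # True
```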
148-154: Consider improving the handling of all-NaN arrays.
When all values are NaN, the metadata fields (`first_non_nan_value` and `first_non_nan_index`) will be -1, which might be confusing. Consider:
- Using `None` or `np.nan` instead of -1 to indicate no valid values
- Adding documentation about these special cases
- Adding a property method to explicitly check if all values are NaN
Example implementation:

```python
@property
def is_all_nan(self) -> bool:
    """Return True if all values in the array are NaN."""
    return self.first_non_nan_index == -1
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
`tests/datas/2.dat` is excluded by `!**/*.dat`
📒 Files selected for processing (2)
- PyFina/PyFina.py (1 hunks)
- tests/test.py (3 hunks)
🔇 Additional comments (4)
tests/test.py (3)
4-4: LGTM!
The `os` import is appropriately added and is necessary for the new path handling functionality.
35-35: LGTM! Explicit timezone handling is a good practice
The change to use an explicit UTC timezone with `datetime.timezone.utc` is a significant improvement. It prevents ambiguity in timestamp handling and ensures consistent behavior across different systems and timezones.
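For a concrete sense of why this matters, a quick standalone example (not from the test suite):

```python
import datetime

# An aware datetime has an unambiguous epoch timestamp regardless of the
# machine's local timezone; a naive one is interpreted in local time.
aware = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
print(aware.timestamp())  # 1704067200.0
print(aware.isoformat())  # 2024-01-01T00:00:00+00:00
```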
Line range hint 29-29: Verify NaN handling implementation
The code calls `check_starting_nan`; let's verify its implementation to ensure proper NaN value handling as mentioned in the PR objectives.
✅ Verification successful
Let me gather more specific information about the NaN handling implementation.
NaN handling is properly implemented
The codebase has a comprehensive NaN handling implementation in the PyFina class:
- Tracks the number of NaN values via the `nb_nan` property
- Detects whether data starts with NaN via the `starting_by_nan` flag
- Stores the first non-NaN value and its index
- Handles NaN values by replacing them with previous values when `remove_nan=True`
- The `check_starting_nan` function provides diagnostic information about NaN occurrences
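The backfill behaviour summarised above can be sketched as a standalone helper (an illustration with a hypothetical function name, not the PyFina implementation):

```python
import numpy as np

def backfill_leading_nans(values: np.ndarray) -> np.ndarray:
    """Replace leading NaNs with the first finite value, if one exists."""
    out = values.copy()
    finite = np.isfinite(out)
    if np.any(finite):
        first_idx = np.where(finite)[0][0]
        out[:first_idx] = out[first_idx]
    return out

print(backfill_leading_nans(np.array([np.nan, np.nan, 2.0, 3.0])))
# leading NaNs become 2.0; an all-NaN input is returned unchanged
```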
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for check_starting_nan implementation
ast-grep --pattern 'def check_starting_nan($$$)'

# Search for any NaN handling in the codebase
rg -l 'nan|NaN|isnan' --type py
```
Length of output: 135
Script:
```shell
#!/bin/bash
# Get implementation details of NaN handling in PyFina.py
rg -A 5 'nan|NaN|isnan' PyFina/PyFina.py

# Get implementation details in test files
rg -A 5 'nan|NaN|isnan' tests/test.py tests/multigraph.py

# Look for check_starting_nan function specifically
rg -A 10 'check_starting_nan' PyFina/PyFina.py tests/test.py tests/multigraph.py
```
Length of output: 7588
PyFina/PyFina.py (1)
148-154: Verify handling of arrays full of NaN values.
To ensure the changes meet the PR objective of avoiding errors with NaN-filled arrays, please verify:
- The behavior when all input values are NaN
- Edge cases with different array sizes
- Memory usage with large arrays
✅ Verification successful
The implementation safely handles arrays full of NaN values
After reviewing the implementation in PyFina.py, I can confirm that the code properly handles arrays full of NaN values:
- The code explicitly checks if `nb_nan < npts` before processing, preventing any issues with fully NaN arrays
- The `finiteness_obj.sum() > 0` check ensures safe handling when no finite values exist
- Memory usage is optimized by using numpy's efficient array operations
- The implementation includes proper tracking of NaN statistics through class attributes:
  - `nb_nan`: counts total NaN values
  - `first_non_nan_value`: stores the first valid value (-1 if none exists)
  - `first_non_nan_index`: stores the first valid index (-1 if none exists)
  - `starting_by_nan`: tracks whether the array starts with NaN
The code is defensive and will not raise errors even with edge cases like arrays of all NaN values.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for existing test cases handling full NaN arrays
rg -l "test.*nan" tests/
ast-grep --pattern 'def test_$_(self):
  $$$
  np.nan
  $$$
}'
```
Length of output: 91
Script:
```shell
#!/bin/bash
# Let's try a different approach to find test files and NaN-related tests
fd "test.*\.py" tests/
# Search for NaN-related code in test files
rg "nan|NaN" tests/ -i
# Look for array creation or NaN handling in tests
rg "np\.(array|zeros|ones|full|empty)" tests/
# Check the implementation file for NaN handling
cat PyFina/PyFina.py
```
Length of output: 6751
mano8 left a comment:
Ok
Avoid the exception `IndexError: index 0 is out of bounds for axis 0 with size 0`, raised by `np.where(finiteness_obj)[0][0]` on line 148 when `finiteness_obj.sum() == 0`.
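The failure mode can be reproduced in a few lines (standalone sketch mirroring the quoted exception, not PyFina code):

```python
import numpy as np

# Minimal reproduction: np.where on an all-False mask yields an empty
# index array, so indexing element 0 raises IndexError
finiteness_obj = np.isfinite(np.full(5, np.nan))
try:
    first = np.where(finiteness_obj)[0][0]
except IndexError as exc:
    print(exc)  # index 0 is out of bounds for axis 0 with size 0

# Guarding on the sum (or np.any) avoids the exception entirely
if finiteness_obj.sum() > 0:
    first = np.where(finiteness_obj)[0][0]
```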
@mano8: Just wondering after integrating your data into the tests: why do you record temperature (I guess it is temperature, but maybe something else) every second? Do you use a wired sensor? I guess not, otherwise you wouldn't have so many NaN values :-)
Apologies for the file size. This file was generated from other files, with NaN values intentionally added for testing purposes. The source files contain solar battery voltage records sampled every second. They were created for testing in a different project. Feel free to exclude this file from the repository. :-)
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
PyFina/PyFina.py (1)
119-124: Consider improving error message formatting
The validation logic is good, but the error message formatting could be more consistent.

```diff
-    raise ValueError("Error: invalid start value, "
-                     "start must be upper than end time value "
-                     "defined by start_time + (npoints * interval) from meta."
-                     )
+    raise ValueError(
+        "Error: invalid start value, start must be less than end time value "
+        "defined by start_time + (npoints * interval) from meta."
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- PyFina/PyFina.py (3 hunks)
🔇 Additional comments (4)
PyFina/PyFina.py (4)
80-80: LGTM: Good addition of end_time metadata
The end_time calculation is correct and provides useful metadata for validation.
110-114: LGTM: Better initialization with NaN values
Initializing arrays with NaN instead of zeros is a safer approach as it explicitly marks unset values. This change aligns well with the PR's objective of improving NaN handling.
150-152: LGTM: Good addition of bounds checking
The early exit when position is out of bounds prevents potential buffer overruns and improves robustness.
159-165: LGTM: Robust improvement to NaN handling
The changes effectively address the issue of arrays filled with NaN values:
- Using `finiteness_obj.sum() > 0` is more robust than the previous condition
- Proper handling of arrays starting with NaN values
- Fixes the IndexError issue mentioned in previous reviews
- Fixes the IndexError issue mentioned in previous reviews
This implementation successfully achieves the PR's objective of avoiding errors with NaN-filled arrays.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
PyFina/PyFina.py (5)
80-80: LGTM! Consider adding a docstring update.
The end_time calculation is correct and useful for validation. Consider updating the function's docstring to document this new metadata field.

```diff
 def getMeta(feed_id: int, data_dir: str) -> Literal[False] | dict[str, int]:
     """
     decoding the .meta file
     feed_id (4 bytes, Unsigned integer)
     npoints (4 bytes, Unsigned integer, Legacy : use instead filesize//4 )
     interval (4 bytes, Unsigned integer)
     start_time (4 bytes, Unsigned integer)
+    Returns:
+        dict with keys: interval, start_time, npoints, end_time
+        where end_time is the timestamp of the last data point
     """
```
110-114: Clean up initialization code and improve logging placement.
The NaN initialization is a good improvement, but there are some cleanup opportunities:
- Remove the commented-out code, as it is no longer needed
- Consider moving the debug log after both arrays are initialized

```diff
-    obj.fill(np.nan)
-    #obj = np.zeros(npts).view(cls)
-    pyfina_logger.debug(obj)
-    raw_obj = np.empty(npts)
-    raw_obj.fill(np.nan)
+    obj.fill(np.nan)
+    raw_obj = np.empty(npts)
+    raw_obj.fill(np.nan)
+    pyfina_logger.debug(obj)
```
119-124: Improve error message formatting.
The validation logic is good, but the error message formatting could be improved for consistency and readability.

```diff
-    raise ValueError("Error: invalid start value, "
-                     "start must be upper than end time value "
-                     "defined by start_time + (npoints * interval) from meta."
-                     )
+    raise ValueError(
+        "Invalid start value: must be less than end_time "
+        f"({meta['end_time']}) from meta"
+    )
```
150-152: LGTM! Consider making the termination more explicit.
The early termination logic is good for performance. Consider making it more explicit with a break statement.

```diff
-        # End reading loop if pos out of bounds
-        else:
-            time = end
+        # Position out of bounds, no more valid data to read
+        else:
+            break
```
Line range hint 1-180: Consider memory optimization for large datasets.
The current implementation loads the entire array into memory. For very large datasets, consider implementing a lazy-loading mechanism or streaming approach to reduce memory usage.
Would you like guidance on implementing a memory-efficient version for large datasets?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- PyFina/PyFina.py (3 hunks)
🔇 Additional comments (1)
PyFina/PyFina.py (1)
159-165: LGTM! Good fix for NaN handling edge cases.
The changes effectively address:
- The IndexError when all values are NaN (from a previous review)
- Proper handling of starting NaN values when `remove_nan` is True
The implementation is robust and handles edge cases well.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
PyFina/PyFina.py (3)
113-116: Consider removing the debug logging statement.
The array initialization with NaN values is a good improvement. However, the debug logging statement on line 116 might not be necessary, as it will print large arrays.

```diff
 obj.fill(np.nan)
 raw_obj = np.empty(npts)
 raw_obj.fill(np.nan)
-pyfina_logger.debug(obj)
```
121-125: Improve error message formatting.
The time validation is a good addition, but the error message could be more concise and better formatted.

```diff
-    raise ValueError("Error: invalid start value, start must be less than end time value "
-                     "defined by start_time + (npoints * interval) from meta."
-                     )
+    raise ValueError(
+        "Invalid start value: must be less than end_time "
+        f"({meta['end_time']}) from meta"
+    )
```
Line range hint 1-185: Consider future separation of concerns.
While the current implementation is solid, consider splitting the file I/O operations and data processing into separate classes in the future. This would:
- Make the code more maintainable
- Allow for easier testing
- Enable support for different storage backends
Example structure:
- `PyFinaReader`: handles file I/O and metadata
- `PyFinaProcessor`: handles data processing and NaN handling
- `PyFina`: orchestrates the above classes
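A minimal sketch of how such a split might look (all class and method names hypothetical, not from the codebase):

```python
import numpy as np

class PyFinaReader:
    """File I/O and metadata (sketch; real code would parse .meta/.dat files)."""
    def read_meta(self) -> dict:
        # hardcoded stand-in for parsed metadata
        return {"interval": 10, "start_time": 0, "npoints": 4}

class PyFinaProcessor:
    """Data processing, e.g. backfilling leading NaNs (sketch)."""
    def backfill(self, values: np.ndarray) -> np.ndarray:
        finite = np.isfinite(values)
        if np.any(finite):
            # copy first finite value over the leading NaNs
            values[:np.where(finite)[0][0]] = values[finite][0]
        return values

# The orchestrating class would wire these together
reader = PyFinaReader()
processor = PyFinaProcessor()
meta = reader.read_meta()
data = processor.backfill(np.array([np.nan, 1.5, 2.0, np.nan]))
```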
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- PyFina/PyFina.py (4 hunks)
- tests/test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test.py
🔇 Additional comments (4)
PyFina/PyFina.py (4)
65-67: LGTM! Documentation improvements are clear and accurate.
The added documentation clearly explains the return value structure and the meaning of end_time.
83-83: LGTM! Metadata enhancement with end_time calculation.
The end_time calculation is correct and provides valuable information for time range validation.
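For reference, the arithmetic behind the new metadata field, using the formula quoted in the error message (illustrative values, not from the codebase):

```python
# end_time = start_time + npoints * interval
start_time = 1_600_000_000  # hypothetical Unix timestamp
interval = 10               # seconds between points
npoints = 360
end_time = start_time + npoints * interval
print(end_time)  # 1600003600
```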
151-153: LGTM! Efficient loop termination added.
Good optimization to break the loop early when the position is out of bounds.
160-166: LGTM! Robust NaN handling implementation.
The improved NaN handling successfully addresses the previous IndexError issues by:
- Using `finiteness_obj.sum() > 0` for safer array indexing
- Properly handling arrays that start with NaN values
This resolves the issue mentioned in the previous review comment about IndexError exceptions.
Merged commit 1ee4224 into Open-Building-Management:main
thanks @mano8 👍
Changes:
Summary by CodeRabbit

New Features
- Added an `end_time` key in metadata for better validation of start times.

Bug Fixes
- Corrected the `DATA_DIR` variable to ensure correct directory referencing.