
Conversation

@mano8 mano8 commented Nov 28, 2024

Changes:

  • Add PhpFina noisy data files
  • Avoid errors on arrays full of NaN values
  • Improve tests/test.py

Summary by CodeRabbit

  • New Features

    • Improved handling of NaN values during data processing for enhanced clarity and efficiency.
    • Updated error messaging for file operations to provide clearer debugging information.
    • Added functionality to display generated plots visually after saving.
    • Introduced a new end_time key in metadata for better validation of start times.
  • Bug Fixes

    • Adjusted construction of the DATA_DIR variable to ensure correct directory referencing.
    • Enhanced UTC timestamp processing for better accuracy.

coderabbitai bot commented Nov 28, 2024

Walkthrough

The changes in this pull request primarily focus on improving the handling of NaN values in the PyFina class's __new__ method, enhancing the logic for reading data files and validating timestamps. Modifications include the addition of an end_time key in the metadata and refined error handling during file operations. In the tests/test.py file, adjustments were made to the DATA_DIR variable for correct directory resolution and to timestamp handling for explicit UTC timezone settings. A new line was added to display plots visually after generation.

Changes

File Change Summary
PyFina/PyFina.py Added end_time to metadata in getMeta, modified __new__ method for NaN handling and error management.
tests/test.py Updated DATA_DIR construction for correct path resolution, changed timestamp handling to explicitly set UTC timezone, and added plt.show() to display plots.
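
To make the NaN-handling change concrete, here is a minimal numpy sketch of the leading-NaN backfill described in the walkthrough; it is illustrative only, not the PR's actual code.

import numpy as np

# Backfill a leading run of NaN values with the first finite value,
# guarding against arrays that contain no finite values at all.
values = np.array([np.nan, np.nan, 3.2, 3.1, np.nan, 2.9])
finiteness = np.isfinite(values)
if finiteness.sum() > 0:
    first_idx = np.where(finiteness)[0][0]   # index of the first finite value
    values[:first_idx] = values[first_idx]   # backfill the leading NaNs
print(values)   # [3.2 3.2 3.2 3.1 nan 2.9]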

Poem

In the meadow where data flows,
A rabbit hops where the NaN goes.
With paths aligned and time set right,
We celebrate changes, oh what a sight!
From files we read, to plots we show,
A dance of numbers, let the knowledge grow! 🐇✨



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
tests/test.py (2)

12-21: Consider enhancing path resolution robustness

While the current implementation is an improvement with fallback logic, consider these enhancements:

  1. Add error handling if both paths fail
  2. Consider making the data directory configurable via environment variable
 DATA_DIR = os.path.join(
     os.path.abspath(os.getcwd()),
     "datas"
 )
 if not os.path.isdir(DATA_DIR):
     DATA_DIR = os.path.join(
         os.path.abspath(os.getcwd()),
         "tests",
         "datas"
     )
+    if not os.path.isdir(DATA_DIR):
+        DATA_DIR = os.getenv('PYFINA_DATA_DIR', DATA_DIR)
+        if not os.path.isdir(DATA_DIR):
+            raise FileNotFoundError(f"Could not find data directory in any location: {DATA_DIR}")

45-45: Consider conditional plot display for automated environments

While adding plt.show() improves interactive testing, consider making it conditional for automated environments.

 figure.savefig(f"feed_{FEED_NB}.png")
-plt.show()
+if os.environ.get('DISPLAY_PLOTS', '').lower() == 'true':
+    plt.show()
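
With a guard like this, CI runs stay headless by default, while a developer can set DISPLAY_PLOTS=true locally to inspect the figure interactively after it is saved.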
PyFina/PyFina.py (3)

148-154: LGTM! The NaN handling logic is robust.

The implementation correctly handles arrays with NaN values by:

  1. Checking for the presence of finite values
  2. Finding the first non-NaN value and its index
  3. Backfilling NaN values at the start if needed

Consider making the condition more explicit for better readability:

-if finiteness_obj.sum() > 0:
+if np.any(finiteness_obj):  # More explicit way to check for any finite values

152-154: Consider optimizing the array creation.

The current implementation creates a temporary array of ones and then multiplies it. This can be more efficiently done using np.full().

-                    obj[:first_non_nan_index] = (
-                        np.ones(first_non_nan_index) * first_non_nan_value
-                    )
+                    obj[:first_non_nan_index] = np.full(first_non_nan_index, first_non_nan_value)

148-154: Consider improving the handling of all-NaN arrays.

When all values are NaN, the metadata fields (first_non_nan_value and first_non_nan_index) will be -1, which might be confusing. Consider:

  1. Using None or np.nan instead of -1 to indicate no valid values
  2. Adding documentation about these special cases
  3. Adding a property method to explicitly check if all values are NaN

Example implementation:

@property
def is_all_nan(self) -> bool:
    """Return True if all values in the array are NaN."""
    return self.first_non_nan_index == -1
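
Hypothetical usage at a call site, if such a property were adopted (construction of the instance omitted; names are illustrative):

# Assuming `feed` is a PyFina instance built as in tests/test.py:
if feed.is_all_nan:
    print("no finite samples in the requested window")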
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4bc91f4 and 3111533.

⛔ Files ignored due to path filters (1)
  • tests/datas/2.dat is excluded by !**/*.dat
📒 Files selected for processing (2)
  • PyFina/PyFina.py (1 hunks)
  • tests/test.py (3 hunks)
🔇 Additional comments (4)
tests/test.py (3)

4-4: LGTM!

The os import is appropriately added and necessary for the new path handling functionality.


35-35: LGTM! Explicit timezone handling is a good practice

The change to use explicit UTC timezone with datetime.timezone.utc is a significant improvement. It prevents ambiguity in timestamp handling and ensures consistent behavior across different systems and timezones.
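
For contrast, a small standalone sketch (not the test's actual code) showing why the explicit timezone matters:

import datetime

# Naive datetime: .timestamp() interprets it in the machine's local timezone,
# so the epoch value differs between systems.
naive = datetime.datetime(2024, 11, 28, 12, 0, 0)

# Aware datetime: pinning timezone.utc makes the epoch value identical everywhere.
aware = datetime.datetime(2024, 11, 28, 12, 0, 0, tzinfo=datetime.timezone.utc)

print(naive.timestamp(), aware.timestamp())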


Line range hint 29-29: Verify NaN handling implementation

The code calls check_starting_nan, but let's verify its implementation to ensure proper NaN value handling as mentioned in the PR objectives.

✅ Verification successful

Let me gather more specific information about the NaN handling implementation.


NaN handling is properly implemented

The codebase has a comprehensive NaN-handling implementation in the PyFina class (a plain-numpy sketch follows the list below):

  • Tracks number of NaN values via nb_nan property
  • Detects if data starts with NaN via starting_by_nan flag
  • Stores first non-NaN value and its index
  • Handles NaN values by replacing them with previous values when remove_nan=True
  • check_starting_nan function provides diagnostic information about NaN occurrences
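
The sketch below shows how such diagnostics can be derived with plain numpy; the attribute names mirror the list above, and the code is illustrative rather than the library's own:

import numpy as np

raw = np.array([np.nan, np.nan, 21.5, 21.4, np.nan])
finite = np.isfinite(raw)

nb_nan = int((~finite).sum())                                           # 3
starting_by_nan = bool(np.isnan(raw[0]))                                # True
first_non_nan_index = int(np.argmax(finite)) if finite.any() else -1    # 2
first_non_nan_value = float(raw[first_non_nan_index]) if finite.any() else -1  # 21.5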
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for check_starting_nan implementation
ast-grep --pattern 'def check_starting_nan($$$)'

# Search for any NaN handling in the codebase
rg -l 'nan|NaN|isnan' --type py

Length of output: 135


Script:

#!/bin/bash
# Get implementation details of NaN handling in PyFina.py
rg -A 5 'nan|NaN|isnan' PyFina/PyFina.py

# Get implementation details in test files
rg -A 5 'nan|NaN|isnan' tests/test.py tests/multigraph.py

# Look for check_starting_nan function specifically
rg -A 10 'check_starting_nan' PyFina/PyFina.py tests/test.py tests/multigraph.py

Length of output: 7588

PyFina/PyFina.py (1)

148-154: Verify handling of arrays full of NaN values.

To ensure the changes meet the PR objective of avoiding errors with NaN-filled arrays, please verify:

  1. The behavior when all input values are NaN
  2. Edge cases with different array sizes
  3. Memory usage with large arrays
✅ Verification successful

The implementation safely handles arrays full of NaN values

After reviewing the implementation in PyFina.py, I can confirm that the code properly handles arrays full of NaN values:

  1. The code explicitly checks if nb_nan < npts before processing, preventing any issues with fully NaN arrays
  2. The finiteness_obj.sum() > 0 check ensures safe handling when no finite values exist
  3. Memory usage is optimized by using numpy's efficient array operations
  4. The implementation includes proper tracking of NaN statistics through class attributes:
    • nb_nan: counts total NaN values
    • first_non_nan_value: stores first valid value (-1 if none exists)
    • first_non_nan_index: stores first valid index (-1 if none exists)
    • starting_by_nan: tracks if array starts with NaN

The code is defensive and will not raise errors even with edge cases like arrays of all NaN values.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for existing test cases handling full NaN arrays
rg -l "test.*nan" tests/
ast-grep --pattern 'def test_$_(self):
  $$$
  np.nan
  $$$
}'

Length of output: 91


Script:

#!/bin/bash
# Let's try a different approach to find test files and NaN-related tests
fd "test.*\.py" tests/

# Search for NaN-related code in test files
rg "nan|NaN" tests/ -i

# Look for array creation or NaN handling in tests
rg "np\.(array|zeros|ones|full|empty)" tests/

# Check the implementation file for NaN handling
cat PyFina/PyFina.py

Length of output: 6751

@mano8 mano8 left a comment


Ok

@mano8 mano8 (Contributor Author) commented

Avoids the exception IndexError: index 0 is out of bounds for axis 0 with size 0, raised by np.where(finiteness_obj)[0][0] on line 148 when finiteness_obj.sum() == 0.
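
For context, a minimal numpy reproduction of that failure mode and the guard that avoids it (illustrative, not the PR's exact code):

import numpy as np

finiteness_obj = np.isfinite(np.full(4, np.nan))   # all False for an all-NaN array
print(finiteness_obj.sum())                        # 0
empty = np.where(finiteness_obj)[0]                # empty index array
# empty[0] would raise:
#   IndexError: index 0 is out of bounds for axis 0 with size 0
if finiteness_obj.sum() > 0:                       # the guard added in this PR
    first = empty[0]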


@alexandrecuer alexandrecuer Dec 5, 2024


@mano8 : Just wondering, after integrating your data in the tests: why do you record temperature (I guess it is temperature, but maybe something else) every second? Do you use a wired sensor? I guess not, otherwise you wouldn't have so many NaNs :-)

@mano8 mano8 (Contributor Author) commented

Apologies for the file size. This file was generated from other files, with NaN values intentionally added for testing purposes. The source files contain solar battery voltage records sampled every second. They were created for testing in a different project. Feel free to exclude this file from the repository. :-)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
PyFina/PyFina.py (1)

119-124: Consider improving error message formatting

The validation logic is good, but the error message formatting could be more consistent.

-            raise ValueError("Error: invalid start value, "
-                             "start must be upper than end time value "
-                             "defined by start_time + (npoints * interval) from meta."
-                            )
+            raise ValueError(
+                "Error: invalid start value, start must be less than end time value "
+                "defined by start_time + (npoints * interval) from meta."
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3111533 and c6c5c65.

📒 Files selected for processing (1)
  • PyFina/PyFina.py (3 hunks)
🔇 Additional comments (4)
PyFina/PyFina.py (4)

80-80: LGTM: Good addition of end_time metadata

The end_time calculation is correct and provides useful metadata for validation.


110-114: LGTM: Better initialization with NaN values

Initializing arrays with NaN instead of zeros is a safer approach as it explicitly marks unset values. This change aligns well with the PR's objective of improving NaN handling.


150-152: LGTM: Good addition of bounds checking

The early exit when position is out of bounds prevents potential buffer overruns and improves robustness.


159-165: LGTM: Robust improvement to NaN handling

The changes effectively address the issue of arrays filled with NaN values:

  1. Using finiteness_obj.sum() > 0 is more robust than the previous condition
  2. Proper handling of arrays starting with NaN values
  3. Fixes the IndexError issue mentioned in previous reviews

This implementation successfully achieves the PR's objective of avoiding errors with NaN-filled arrays.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
PyFina/PyFina.py (5)

80-80: LGTM! Consider adding a docstring update.

The end_time calculation is correct and useful for validation. Consider updating the function's docstring to document this new metadata field.

 def getMeta(feed_id: int, data_dir: str) -> Literal[False] | dict[str, int]:
     """
     decoding the .meta file
     feed_id (4 bytes, Unsigned integer)
     npoints (4 bytes, Unsigned integer, Legacy : use instead filesize//4 )
     interval (4 bytes, Unsigned integer)
     start_time (4 bytes, Unsigned integer)
+    Returns:
+        dict with keys: interval, start_time, npoints, end_time
+        where end_time is the timestamp of the last data point
     """

110-114: Clean up initialization code and improve logging placement.

The NaN initialization is a good improvement, but there are some cleanup opportunities:

  1. Remove the commented out code as it's no longer needed
  2. Consider moving the debug log after both arrays are initialized
-        obj.fill(np.nan)
-        #obj = np.zeros(npts).view(cls)
-        pyfina_logger.debug(obj)
-        raw_obj = np.empty(npts)
-        raw_obj.fill(np.nan)
+        obj.fill(np.nan)
+        raw_obj = np.empty(npts)
+        raw_obj.fill(np.nan)
+        pyfina_logger.debug(obj)

119-124: Improve error message formatting.

The validation logic is good, but the error message formatting could be improved for consistency and readability.

-            raise ValueError("Error: invalid start value, "
-                             "start must be upper than end time value "
-                             "defined by start_time + (npoints * interval) from meta."
-                            )
+            raise ValueError(
+                "Invalid start value: must be less than end_time "
+                f"({meta['end_time']}) from meta"
+            )

150-152: LGTM! Consider making the termination more explicit.

The early termination logic is good for performance. Consider making it more explicit with a break statement.

-                # End reading loop if pos out of bounds
-                else:
-                    time = end
+                # Position out of bounds, no more valid data to read
+                else:
+                    break

Line range hint 1-180: Consider memory optimization for large datasets.

The current implementation loads the entire array into memory. For very large datasets, consider implementing a lazy-loading mechanism or streaming approach to reduce memory usage.

Would you like guidance on implementing a memory-efficient version for large datasets?
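
One possible direction, sketched under the assumption that a PhpFina .dat file is a flat sequence of 4-byte little-endian floats; the file path and slicing are illustrative, and this is not a drop-in replacement for the current reader:

import numpy as np

# Memory-map the feed file so slices are read lazily from disk instead of
# loading the whole array up front.
raw = np.memmap("tests/datas/1.dat", dtype="<f4", mode="r")
window = np.asarray(raw[1000:2000])   # only this slice is materialized in RAM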

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c6c5c65 and a078b0b.

📒 Files selected for processing (1)
  • PyFina/PyFina.py (3 hunks)
🔇 Additional comments (1)
PyFina/PyFina.py (1)

159-165: LGTM! Good fix for NaN handling edge cases.

The changes effectively address:

  1. The IndexError when all values are NaN (from previous review)
  2. Proper handling of starting NaN values when remove_nan is True

The implementation is robust and handles edge cases well.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
PyFina/PyFina.py (3)

113-116: Consider removing debug logging statement.

The array initialization with NaN values is a good improvement. However, the debug logging statement on line 116 might not be necessary as it will print large arrays.

 obj.fill(np.nan)
 raw_obj = np.empty(npts)
 raw_obj.fill(np.nan)
-pyfina_logger.debug(obj)

121-125: Improve error message formatting.

The time validation is a good addition, but the error message could be more concise and better formatted.

-            raise ValueError("Error: invalid start value, start must be less than end time value "
-                             "defined by start_time + (npoints * interval) from meta."
-                            )
+            raise ValueError(
+                "Invalid start value: must be less than end_time "
+                f"({meta['end_time']}) from meta"
+            )

Line range hint 1-185: Consider future separation of concerns.

While the current implementation is solid, consider splitting the file I/O operations and data processing into separate classes in the future. This would:

  1. Make the code more maintainable
  2. Allow for easier testing
  3. Enable support for different storage backends

Example structure (a minimal skeleton follows the list):

  • PyFinaReader: Handles file I/O and metadata
  • PyFinaProcessor: Handles data processing and NaN handling
  • PyFina: Orchestrates the above classes
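
A hypothetical skeleton of that split; class names follow the list above, and the method signatures are illustrative only:

class PyFinaReader:
    """File I/O: locate feed files, decode the .meta header, yield raw samples."""
    def __init__(self, feed_id: int, data_dir: str):
        self.feed_id = feed_id
        self.data_dir = data_dir

class PyFinaProcessor:
    """Data processing: resampling and NaN backfilling of raw samples."""
    def process(self, raw, remove_nan: bool = True):
        raise NotImplementedError

class PyFina:
    """Facade wiring a reader and a processor together."""
    def __init__(self, reader: PyFinaReader, processor: PyFinaProcessor):
        self.reader = reader
        self.processor = processor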
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a078b0b and 6fa2c2d.

📒 Files selected for processing (2)
  • PyFina/PyFina.py (4 hunks)
  • tests/test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test.py
🔇 Additional comments (4)
PyFina/PyFina.py (4)

65-67: LGTM! Documentation improvements are clear and accurate.

The added documentation clearly explains the return value structure and the meaning of end_time.


83-83: LGTM! Metadata enhancement with end_time calculation.

The end_time calculation is correct and provides valuable information for time range validation.


151-153: LGTM! Efficient loop termination added.

Good optimization to break the loop early when the position is out of bounds.


160-166: LGTM! Robust NaN handling implementation.

The improved NaN handling successfully addresses the previous IndexError issues by:

  1. Using finiteness_obj.sum() > 0 for safer array indexing
  2. Properly handling arrays that start with NaN values

This resolves the issue mentioned in the previous review comment about IndexError exceptions.

@alexandrecuer alexandrecuer merged commit 1ee4224 into Open-Building-Management:main Dec 5, 2024
1 of 6 checks passed
@alexandrecuer

thanks @mano8 👍
