
Conversation

@SangChengC
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @SangChengC, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends LightLLM's capabilities by integrating the Neo Chat multimodal model. The changes encompass modifications to the KV cache management, the introduction of a specialized inference state, and the implementation of custom transformer layers and weight loading logic tailored for Neo Chat. A key feature is the advanced image processing pipeline, which includes 2D Rotary Position Embeddings and optimized Triton kernels, allowing the model to efficiently process and understand visual information alongside text.

Highlights

  • Neo Chat Model Integration: Introduced full support for the 'Neo Chat' multimodal model, including its unique KV cache handling, inference state, transformer layers, and weight loading mechanisms.
  • Multimodal Position Embeddings: Implemented a 2D Rotary Position Embedding (RoPE) mechanism for images, allowing separate positional encoding for height and width dimensions within the attention mechanism (an illustrative sketch follows this list).
  • Custom Triton Kernels: Developed specialized Triton kernels for Neo Chat's context attention and position ID generation, optimizing performance for multimodal inputs.
  • Image Preprocessing Pipeline: Added a robust image preprocessing pipeline that includes smart resizing, dynamic resolution handling, and conversion to patch embeddings with 2D RoPE application.
  • Tokenizer and Visual Model Registration: Integrated the Neo Chat tokenizer and visual model components into the LightLLM server, enabling proper instantiation and usage of the new model.
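
To make the 2D RoPE highlight concrete, here is a minimal illustrative sketch (not the PR's implementation; all function and variable names are hypothetical): each image patch gets a (row, column) index, half of the rotary channel pairs are rotated by the row position and the other half by the column position.

```python
import torch


def rope_2d_cos_sin(grid_h: int, grid_w: int, head_dim: int, base: float = 10000.0):
    """Build cos/sin tables where half the rotary pairs encode the patch row and
    half encode the patch column. Assumes head_dim is divisible by 4."""
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))  # [half // 2]
    rows = torch.arange(grid_h).float()
    cols = torch.arange(grid_w).float()
    ang_h = torch.einsum("h,f->hf", rows, inv_freq)  # [grid_h, half // 2]
    ang_w = torch.einsum("w,f->wf", cols, inv_freq)  # [grid_w, half // 2]
    ang_h = ang_h[:, None, :].expand(grid_h, grid_w, -1)
    ang_w = ang_w[None, :, :].expand(grid_h, grid_w, -1)
    ang = torch.cat([ang_h, ang_w], dim=-1).reshape(grid_h * grid_w, half)
    return ang.cos(), ang.sin()  # each [num_patches, head_dim // 2]


def apply_rope_2d(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x ([num_patches, num_heads, head_dim]) by the 2D angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[:, None, :], sin[:, None, :]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The PR's Triton kernels presumably fuse this kind of rotation into attention for performance; the sketch only pins down the height/width split described above.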


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for the neo_chat multimodal model, adding new model definitions, inference logic, custom Triton kernels, and vision processing components. The changes are extensive and well-structured for integrating the new model.

My review highlights several areas for improvement to enhance code quality and robustness. I've pointed out a number of debug print statements that should be removed from the production code. There is a bare except block that could hide potential errors and should be made more specific. Additionally, a local import within a function should be moved to the top level of the file.

More critically, I've identified a comment in the tokenizer logic that indicates uncertainty about a calculation, which could point to a potential bug. I've also found that the visual model's data type is hardcoded, ignoring the configured data_type, which could lead to issues. Addressing these points will improve the correctness and maintainability of the new model's implementation.

Comment on lines +236 to +244
try:
    ntk_alpha = float(os.environ.get("LIGHTLLM_NTK_ALPHA", 1))
    assert ntk_alpha >= 1
    if ntk_alpha > 1:
        logger.info(f"Note: NTK enabled, alpha set to {ntk_alpha}")
    max_seq_len *= ntk_alpha
    base = base * (ntk_alpha ** (partial_head_dim / (partial_head_dim - 2)))  # Base change formula
except:
    pass

Severity: high

Using a bare except: is generally discouraged as it can catch and silence a wide range of unexpected errors, making debugging difficult. It's better to catch specific exceptions that you expect might occur, such as ValueError or AssertionError, and log them for better diagnostics.

Suggested change:

try:
    ntk_alpha = float(os.environ.get("LIGHTLLM_NTK_ALPHA", 1))
    assert ntk_alpha >= 1
    if ntk_alpha > 1:
        logger.info(f"Note: NTK enabled, alpha set to {ntk_alpha}")
    max_seq_len *= ntk_alpha
    base = base * (ntk_alpha ** (partial_head_dim / (partial_head_dim - 2)))  # Base change formula
except (ValueError, AssertionError) as e:
    logger.warning(f"Could not apply NTK scaling: {e}")

)
grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
token_num = int((grid_h * grid_w) * (self.downsample_ratio ** 2))
# Do grid_h and grid_w here need to be multiplied by self.downsample_ratio? Check the code again carefully.

Severity: high

This comment indicates uncertainty about the correctness of the calculation for grid_h and grid_w. This should be investigated and clarified to prevent potential bugs. If the logic is correct, the comment should be removed. If it's incorrect, it needs to be fixed.
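
One way to see why the question matters: scaling each grid dimension by the downsample ratio and then multiplying is arithmetically equivalent to scaling the patch count by the squared ratio, as long as grid_h and grid_w are each divisible by 1/downsample_ratio; the two readings only diverge through integer truncation. A tiny illustrative check, with made-up values:

```python
grid_h, grid_w, downsample_ratio = 32, 32, 0.5

tokens_from_product = int((grid_h * grid_w) * (downsample_ratio ** 2))        # as written in the PR
tokens_from_scaled_dims = int(grid_h * downsample_ratio) * int(grid_w * downsample_ratio)

print(tokens_from_product, tokens_from_scaled_dims)  # 256 256
```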

elif self.model_type == "gemma3":
    self.model = Gemma3VisionModel()
elif self.model_type == "neo_chat":
    self.model = NeoVisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval().bfloat16()

Severity: high

The model's data type is hardcoded to bfloat16, which ignores the data_type provided in kvargs. This can lead to unexpected behavior or errors if a different data type like float16 is intended. The model should be cast to the data type from the configuration to ensure correctness.

Suggested change:

self.model = NeoVisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval()
if self.data_type in ["bf16", "bfloat16"]:
    self.model.bfloat16()
elif self.data_type in ["fp16", "float16"]:
    self.model.half()


base = self.config.get("rope_theta", float(default_base))

print(f"base is {base}")

Severity: medium

This print statement appears to be for debugging purposes and should be removed before merging.

rope_scaling_factor = self.config.get("rope_scaling", {}).get("factor", 1.0)

base = self.config.get("rope_theta_hw", float(default_base))
print(f"hw_base is {base}")

Severity: medium

This print statement appears to be for debugging purposes and should be removed before merging.

Comment on lines +64 to +65
a = img["start_idx"]
print(f"img start_idx: {a}")

Severity: medium

These lines, including a print statement, appear to be for debugging and should be removed.

Comment on lines 157 to 159
from lightllm.models.llama.triton_kernel.token_attention_softmax_and_reducev import (
    token_softmax_reducev_fwd,
)

Severity: medium

Imports should be at the top of the file for better readability, to follow standard Python conventions, and to avoid the overhead of re-importing on every function call. Please move this import to the top of the file.
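
Concretely, the fix is to hoist the import to module scope with the file's other imports (same statement, just moved):

```python
# At the top of the module, alongside its existing imports, so it is resolved once
# at import time rather than on every function call.
from lightllm.models.llama.triton_kernel.token_attention_softmax_and_reducev import (
    token_softmax_reducev_fwd,
)
```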


# must divide merge_length
cur_num = int(img_tensors[-1].shape[0] * (self.downsample_ratio ** 2))
print(f"cur_num is {cur_num}")

Severity: medium

This print statement appears to be for debugging and should be removed.

)
pixel_values, grid_hw = preprocess_pixel_values(transform(new_image).to(torch.float32), patch_size=patch_size)

print(f"Transfer image_size from ({image.height, image.width}) to ({new_image.height, new_image.width})")

Severity: medium

This print statement seems to be for debugging purposes. It should be removed or replaced with a proper logger call if this information is valuable for diagnostics.
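
If the resize information is worth keeping, a debug-level log line is the usual replacement. A sketch, assuming the module builds its logger the way other lightllm modules do (the init_logger helper below is that assumption; adjust to whatever logger the module already has):

```python
from lightllm.utils.log_utils import init_logger

logger = init_logger(__name__)

# Replaces the print: record the resize at debug level so it can be silenced in production.
logger.debug(
    "Resized image from (%d, %d) to (%d, %d)",
    image.height, image.width, new_image.height, new_image.width,
)
```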
