Skip to content

Conversation

@mebegu
Copy link
Collaborator

@mebegu mebegu commented Jun 3, 2025

No description provided.

@mebegu mebegu self-assigned this Jun 3, 2025

StatusOr<BackedDataElements> DataElementsFromKubescape() {
BackedDataElements data_elements(11);
data_elements.emplace_back("time_", "", types::DataType::STRING);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you didn't use a types::DataType::TIME64NS column? You might run into compatibility issues (PxL, etc) later since it differs from how all of Pixie's other tables are (ever data table has a time_ TIME64NS column).

In addition to this, types::DataType::TIME64NS is significantly more memory efficient than types::DataType::STRING. The former is a 64 bit value (fixed size) while the latter is a variable byte value (~ 1 byte for every character in the timestamp string).

Separately, have you tested this end to end? I ask because our in memory database requires a proper time column if I remember correctly.

I've seen issues in the past where a query never completes in these cases. When I experienced this, I think the data table in question didn't have any time_ column, so maybe it wouldn't manifest the same.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it was my mistake. I was trying to assign the time string in ISO format from the log. I overlooked time_ and time. I am gonna differentiate the naming.

Comment on lines +280 to +292
// Fill each column — make sure your schema matches this order!
r.Append(0, types::StringValue(stringify(doc["time"], doc.GetAllocator())), kMaxStringBytes);
r.Append(1, types::StringValue(stringify(doc["level"], doc.GetAllocator())), kMaxStringBytes);
r.Append(2, types::StringValue(stringify(doc["RuleID"], doc.GetAllocator())), kMaxStringBytes);
r.Append(3, types::StringValue(stringify(doc["message"], doc.GetAllocator())), kMaxStringBytes);
r.Append(4, types::StringValue(stringify(doc["msg"], doc.GetAllocator())), kMaxStringBytes);
r.Append(5, types::StringValue(stringify(doc["event"], doc.GetAllocator())), kMaxStringBytes);
r.Append(6, types::StringValue(stringify(doc["BaseRuntimeMetadata"], doc.GetAllocator())), kMaxStringBytes);
r.Append(7, types::StringValue(stringify(doc["CloudMetadata"], doc.GetAllocator())), kMaxStringBytes);
r.Append(8, types::StringValue(stringify(doc["RuntimeK8sDetails"], doc.GetAllocator())), kMaxStringBytes);
r.Append(9, types::StringValue(stringify(doc["RuntimeProcessDetails"], doc.GetAllocator())), kMaxStringBytes);
sole::uuid u = sole::uuid4();
r.Append(10, types::UInt128Value(u.ab, u.cd));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine to use, but seems brittle in the event that the schema changes (hard coded column offsets, column names hard coded, etc).

Is there a material difference between this and the FileSourceConnector::TransferDataFromJSON function? It seems to me if the data table schema is created properly, the ::TransferDataFromJSON function would behave the same (assuming the time_ column can be changed; see my comment from above).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually for kubescape, the current FileSource is valid to process. I separated it to file extension .kubescape with its own function to extend it in future. Like providing clickhouse write etc.

But if you recommend using just .json extension for kubescape and not implement it separately for now, I am okay with it 😄 I just wanted to explicitly show you what we are trying to do.

Comment on lines +215 to +219
std::string timeStr = "empty";
if (doc.HasMember("time") && doc["time"].IsString()) {
timeStr = doc["time"].GetString();
}
r.Append(0, types::StringValue(timeStr), kMaxStringBytes);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations will time be missing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, in regular workflow, it wont be. I will remove default

Comment on lines +221 to +226
// Extract "node_name"
std::string nodeNameStr = "empty";
if (doc.HasMember("node_name") && doc["node_name"].IsString()) {
nodeNameStr = doc["node_name"].GetString();
}
r.Append(1, types::StringValue(nodeNameStr), kMaxStringBytes);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider how to extend the DynamicRecordBuilder to have default values for columns. It would be nice if the schema defined if a column had a default value and could handle this case without the TransferData function needing to specially handle specific columns.

I don't think this should block you from moving forward with this depending on the priorities, but I think it would be possible to enhance TransferDataFromJSON to handle this use case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +234 to +237
if (key != "time" && key != "node_name") {
typeStr = key;
payloadVal = &it->value;
break;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels dangerous to depend on the sort order of rapidjson's object iteration. Can you explain more details on identifying this type field and why it won't be known up front?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, the tetragon always have 3 root keys. time, node_name, and the third one alternating in a fix list. process_exec, process_exit, process_kprobe, process_uprobe, process_tracepoint, process_loader.

I agree if there is a 4th property, it would be dangerous. Maybe I can check if key in [process_exec, process_exit, process_kprobe, process_uprobe, process_tracepoint, process_loader]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants