extend filesource to handle files with tetragon and kubescape extension #10
base: main
Conversation
StatusOr<BackedDataElements> DataElementsFromKubescape() {
  BackedDataElements data_elements(11);
  data_elements.emplace_back("time_", "", types::DataType::STRING);
Is there a reason you didn't use a types::DataType::TIME64NS column? You might run into compatibility issues (PxL, etc.) later since this differs from how all of Pixie's other tables work (every data table has a time_ TIME64NS column).
In addition, types::DataType::TIME64NS is significantly more memory efficient than types::DataType::STRING. The former is a fixed-size 64-bit value, while the latter is variable length (roughly one byte for every character in the timestamp string).
Separately, have you tested this end to end? I ask because, if I remember correctly, our in-memory database requires a proper time column.
I've seen issues in the past where a query never completes in that situation. When I experienced it, the data table in question didn't have any time_ column at all, so the problem may not manifest the same way here.
Ah, that was my mistake. I was trying to assign the ISO-format time string from the log and conflated time_ with time. I'll differentiate the naming.
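For reference, here is a minimal, self-contained sketch (not an existing Pixie helper, and simplified around fractional seconds and timezones) of how the ISO-8601 time string from the log could be turned into the 64-bit nanosecond value a TIME64NS column stores:

#include <cstdint>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Parses a UTC timestamp such as "2024-05-01T12:34:56Z" into nanoseconds since
// the epoch. Fractional seconds are ignored in this sketch.
int64_t IsoStringToNanos(const std::string& iso_time) {
  std::tm tm = {};
  std::istringstream ss(iso_time);
  ss >> std::get_time(&tm, "%Y-%m-%dT%H:%M:%S");
  const std::time_t secs = timegm(&tm);  // Interpret the broken-down time as UTC (glibc extension).
  return static_cast<int64_t>(secs) * 1'000'000'000LL;
}

// The parsed value could then go into the time_ column, assuming the schema
// declares it as types::DataType::TIME64NS:
//   r.Append(0, types::Time64NSValue(IsoStringToNanos(doc["time"].GetString())));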
// Fill each column — make sure your schema matches this order!
r.Append(0, types::StringValue(stringify(doc["time"], doc.GetAllocator())), kMaxStringBytes);
r.Append(1, types::StringValue(stringify(doc["level"], doc.GetAllocator())), kMaxStringBytes);
r.Append(2, types::StringValue(stringify(doc["RuleID"], doc.GetAllocator())), kMaxStringBytes);
r.Append(3, types::StringValue(stringify(doc["message"], doc.GetAllocator())), kMaxStringBytes);
r.Append(4, types::StringValue(stringify(doc["msg"], doc.GetAllocator())), kMaxStringBytes);
r.Append(5, types::StringValue(stringify(doc["event"], doc.GetAllocator())), kMaxStringBytes);
r.Append(6, types::StringValue(stringify(doc["BaseRuntimeMetadata"], doc.GetAllocator())), kMaxStringBytes);
r.Append(7, types::StringValue(stringify(doc["CloudMetadata"], doc.GetAllocator())), kMaxStringBytes);
r.Append(8, types::StringValue(stringify(doc["RuntimeK8sDetails"], doc.GetAllocator())), kMaxStringBytes);
r.Append(9, types::StringValue(stringify(doc["RuntimeProcessDetails"], doc.GetAllocator())), kMaxStringBytes);
sole::uuid u = sole::uuid4();
r.Append(10, types::UInt128Value(u.ab, u.cd));
This works, but it seems brittle if the schema ever changes (hard-coded column offsets, hard-coded column names, etc.).
Is there a material difference between this and the FileSourceConnector::TransferDataFromJSON function? It seems to me that if the data table schema is created properly, ::TransferDataFromJSON would behave the same (assuming the time_ column can be changed; see my comment above).
Actually, for kubescape the current FileSource can already process the data. I split it out under the .kubescape file extension with its own function so it can be extended in the future, e.g. adding a ClickHouse write path.
But if you'd rather we stick with the plain .json extension for kubescape and not implement it separately for now, I'm okay with that 😄 I just wanted to show explicitly what we are trying to do.
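For illustration, here is a sketch of driving the appends from a single key list instead of hard-coding each offset. The keys mirror the kubescape schema in this PR, and r, stringify, and kMaxStringBytes come from the diff; the loop itself is illustrative rather than Pixie's actual ::TransferDataFromJSON implementation. It would replace the block of Append calls inside the transfer function:

// Keys in schema/column order; adding or reordering a column only touches this list.
static constexpr const char* kKubescapeJsonKeys[] = {
    "time", "level", "RuleID", "message", "msg", "event",
    "BaseRuntimeMetadata", "CloudMetadata", "RuntimeK8sDetails",
    "RuntimeProcessDetails"};

for (size_t col = 0; col < std::size(kKubescapeJsonKeys); ++col) {  // std::size: <iterator>
  // The column index is derived from the key's position in the schema list.
  r.Append(col, types::StringValue(stringify(doc[kKubescapeJsonKeys[col]], doc.GetAllocator())),
           kMaxStringBytes);
}
// The uuid column (index 10) would still be appended separately, as in the diff.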
std::string timeStr = "empty";
if (doc.HasMember("time") && doc["time"].IsString()) {
  timeStr = doc["time"].GetString();
}
r.Append(0, types::StringValue(timeStr), kMaxStringBytes);
In what situations will time be missing?
I think in the regular workflow it won't be missing. I'll remove the default.
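For reference, a sketch of the stricter handling, assuming the surrounding transfer function returns a Status (the error::Internal call follows the usual Pixie error pattern, but the exact call and wording here are illustrative):

// Fail the record instead of silently writing a placeholder value.
if (!doc.HasMember("time") || !doc["time"].IsString()) {
  return error::Internal("tetragon record is missing the required 'time' field");
}
r.Append(0, types::StringValue(doc["time"].GetString()), kMaxStringBytes);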
// Extract "node_name"
std::string nodeNameStr = "empty";
if (doc.HasMember("node_name") && doc["node_name"].IsString()) {
  nodeNameStr = doc["node_name"].GetString();
}
r.Append(1, types::StringValue(nodeNameStr), kMaxStringBytes);
I think we should consider how to extend the DynamicRecordBuilder to support default values for columns. It would be nice if the schema declared whether a column has a default value, so this case could be handled without the TransferData function special-casing specific columns.
I don't think this should block you from moving forward, depending on priorities, but I think TransferDataFromJSON could be enhanced to handle this use case.
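Purely as a sketch of that idea (ColumnSpec and the default handling below are hypothetical, not an existing Pixie interface):

#include <optional>
#include <string>

// Hypothetical: each schema element carries an optional default, so a generic
// JSON transfer loop can fall back to it instead of special-casing columns.
struct ColumnSpec {
  std::string name;
  types::DataType type;
  std::optional<std::string> default_value;  // Used when the JSON key is absent.
};

// Inside a generic TransferDataFromJSON-style loop (illustrative):
//   if (!doc.HasMember(col.name.c_str())) {
//     if (!col.default_value.has_value()) {
//       return error::Internal("missing required column: " + col.name);
//     }
//     r.Append(i, types::StringValue(*col.default_value), kMaxStringBytes);
//     continue;
//   }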
if (key != "time" && key != "node_name") {
  typeStr = key;
  payloadVal = &it->value;
  break;
Depending on the order of rapidjson's object iteration feels dangerous. Can you explain in more detail how this type field is identified and why it isn't known up front?
As far as I know, tetragon events always have three root keys: time, node_name, and a third key that alternates within a fixed list: process_exec, process_exit, process_kprobe, process_uprobe, process_tracepoint, process_loader.
I agree that a fourth property would make this dangerous. Maybe I can check whether the key is in [process_exec, process_exit, process_kprobe, process_uprobe, process_tracepoint, process_loader] instead.
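That allow-list check would make the intent explicit. A sketch of it, reusing the variable names from the diff (the absl::flat_hash_set and its include are assumptions about what is convenient to use here):

#include "absl/container/flat_hash_set.h"

// Known tetragon event payload keys, per the list above.
static const absl::flat_hash_set<std::string> kTetragonEventKeys = {
    "process_exec",   "process_exit",       "process_kprobe",
    "process_uprobe", "process_tracepoint", "process_loader"};

for (auto it = doc.MemberBegin(); it != doc.MemberEnd(); ++it) {
  const std::string key = it->name.GetString();
  // Match against the fixed set instead of relying on rapidjson's member order.
  if (kTetragonEventKeys.contains(key)) {
    typeStr = key;
    payloadVal = &it->value;
    break;
  }
}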