clustering detector hits by time as well #559
base: main
Conversation
Thanks for the initiative, I will look into this in more detail once I find a quiet moment.

1000 rows. These are the exact commands that I ran:

I see, thanks!
Would it be very cumbersome for you to split these two things into two commits, to see what exactly the changes to the clustering algorithm are? And whether the long run-time of 2 minutes results from the updates to the new …
And it also looks like there are no tests right now for this. You could add some lines to the tests:

```julia
# Cluster events by radius
@test_nowarn clustered_evts = SolidStateDetectors.cluster_detector_hits(evts, 10u"µm")
@test length(clustered_evts) == length(evts)
@test length(flatview(clustered_evts.pos)) <= length(flatview(evts.pos))

# Cluster events by time
@test_nowarn clustered_evts = SolidStateDetectors.cluster_detector_hits(evts, 10u"µm", 100u"s")
@test length(clustered_evts) == length(evts)
@test length(flatview(clustered_evts.pos)) <= length(flatview(evts.pos))
```

This will ensure that these methods get covered by the tests and will result in errors whenever something is changed that might break this implementation.
Force-pushed 4300699 to d370502.
I have split the changes into two commits as you asked, adding the tests as well. Benchmarks after the first commit (without time clustering): … Benchmarks after the second commit (with time clustering): … The runtime varies, but is usually between 5 and 10 minutes for the trials that I did today (1000 events).

A few important questions:
a) Add the old function for position-only clustering along with the new functions (it also has better performance), or
b) keep these parameters as kwargs, with some default value if the user doesn't specify the clustering radius. Example: if the user only wants position clustering, then `radius_time` is kept at infinity (this will have performance issues).
fhagemann
left a comment
Here are some comments.
I'm wondering here: would it maybe be more efficient to
- split all events by time (separate a row into multiple rows if the hits are more than `eps_time` apart). I expect most of the Geant4 hits to have identical/very similar `thit` entries if they resulted from the same nuclear decay.
- then, perform the spatial clustering like we used to before (or the more efficient implementation)?
Based on your experience in implementing this, do you think that this would be faster?
Or do you have some feeling for what is limiting the runtime of this clustering?
See my previous message:
In the end, there can be other methods:

```julia
# main method
function cluster_detector_hits(
        table::TypedTables.Table;
        cluster_radius::Union{<:Real, Unitful.Length} = NaN, # keyword argument
        cluster_time::Union{<:Real, Unitful.Time} = NaN,     # NaN = skip the clustering/splitting
    )
    @assert :pos in TypedTables.columnnames(table) "Table has no column `pos`"
    @assert :thit in TypedTables.columnnames(table) "Table has no column `thit`"
    @assert :edep in TypedTables.columnnames(table) "Table has no column `edep`"
    @assert :detno in TypedTables.columnnames(table) "Table has no column `detno`"
    clustered_nt = map(
        evt -> cluster_detector_hits(evt.detno, evt.edep, evt.pos, evt.thit, cluster_radius, cluster_time),
        table
    )
end

# convenience functions (allowing for the old method syntax)
cluster_detector_hits(table::TypedTables.Table, cluster_radius::Unitful.Length) = cluster_detector_hits(table; cluster_radius)
cluster_detector_hits(table::TypedTables.Table, cluster_time::Unitful.Time) = cluster_detector_hits(table; cluster_time)
cluster_detector_hits(table::TypedTables.Table, cluster_radius::Unitful.Length, cluster_time::Unitful.Time) = cluster_detector_hits(table; cluster_radius, cluster_time)
cluster_detector_hits(table::TypedTables.Table, cluster_time::Unitful.Time, cluster_radius::Unitful.Length) = cluster_detector_hits(table; cluster_radius, cluster_time)
```

But we can worry about the convenience functions in the very last step.
That would lead to incorrect clustering. Consider this toy example with `eps_time` = 10 s and hit 1 at 1 s and 1 m: clustering by time first, all hits end up in the same cluster (all consecutive gaps are below `eps_time`). (Earlier, hit 2 formed the common link between hit 1 and hit 3 for `eps_time`. But as hit 2 finally is not in the same cluster, hit 1 and hit 3 should not be kept in the same cluster.)
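To illustrate the chaining concern with hypothetical numbers (not the elided values from the toy example above): with `eps_time` = 10 s, hits at 1 s, 8 s and 15 s are chained into one time cluster, because hit 2 links hits 1 and 3, even though hits 1 and 3 are 14 s apart.

```julia
# Hypothetical numbers illustrating transitive (single-linkage) chaining:
# both consecutive gaps (7 s) are below eps_time = 10 s, so all three hits
# end up in one group, although the extremes are 14 s apart.
thit = [1.0, 8.0, 15.0]   # hit times in seconds
eps_time = 10.0
gaps = diff(sort(thit))   # [7.0, 7.0]
all(gaps .<= eps_time)    # true -> hits 1, 2, 3 chain into a single cluster
```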
I see what you mean. So, even if the hits were … and … Let's assume this toy model: … If I choose … Just to think out loud (this doesn't mean that it is the best way to go): the time grouping could either start with the first hit and combine everything that's within `eps_time`, or … The time clustering should be there to make sure that we only drift hits together if they actually coincide "enough" in time, whereas the space clustering makes sure that we do not simulate thousands of small charge clouds (which might be the output of Geant4), especially if they are all very close together.
As of …

```julia
using ArraysOfArrays

elem_ptr = deepcopy(evts.thit.elem_ptr)
kernel_size = deepcopy(evts.thit.kernel_size)
added = 0
for i in eachindex(evts.thit)
    thit = evts.thit[i]
    d = findall(diff(thit) .> zero(eltype(thit)))
    isempty(d) && continue
    splice!(elem_ptr, i + added, elem_ptr[i + added] .+ [0; d])
    append!(kernel_size, (() for _ in d))
    added += length(d)
end
evts_split = Table(merge(NamedTuple(
    c => begin
        if c == :evtno
            collect(eachindex(kernel_size))
        else
            f = getproperty(evts, c)
            VectorOfArrays(f.data, elem_ptr, kernel_size)
        end
    end
    for c in columnnames(evts))))
```
Thanks for clarifying the use-case of this algorithm. As we just want to bring together the hits that are "close enough" in space and time, we might not need the dbscan algorithm at all; this could be achieved with much simpler algorithms. For the 1:1000 hits example that you gave, dbscan would give the same cluster however we implement it: as long as the hits are dense in time (or space), it will keep on adding them, which would lead to huge clusters, and that is not what we need. Using the following algorithm (logic) would be better for our purpose: … The implementation for this would be straightforward as well, which means the runtime will be low.
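One way such a simpler time grouping could look, following the split-by-`eps_time` idea discussed above (a minimal sketch with a hypothetical helper name, not the PR's implementation): sort the hits by time and start a new group whenever the gap to the previous hit exceeds `eps_time`.

```julia
# Hypothetical sketch: group hits by time, starting a new group whenever
# two consecutive (sorted) hit times are more than eps_time apart.
function group_by_time_gaps(thit::AbstractVector{<:Real}, eps_time::Real)
    p = sortperm(thit)            # indices of hits in time order
    groups = [[p[1]]]
    for k in 2:length(p)
        if thit[p[k]] - thit[p[k-1]] > eps_time
            push!(groups, [p[k]]) # gap too large: start a new group
        else
            push!(last(groups), p[k])
        end
    end
    return groups                 # vector of index groups into the hit list
end

# Example: gaps of 2 s, 90 s and 5 s with eps_time = 10 s give two groups
group_by_time_gaps([1.0, 3.0, 93.0, 98.0], 10.0)  # [[1, 2], [3, 4]]
```

This runs in O(n log n) for the sort plus a single linear pass, which matches the "straightforward, low runtime" expectation stated above.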
I agree for the time grouping/splitting. However, I would still like to keep the dbscan for space clustering.

We may actually want to move away from DBSCAN as the default clustering algorithm. DBSCAN can group connected elongated/curved/... agglomerations of charges. We want something that prefers spherical clusters.

I've had a little chat with ChatGPT about this, exploring options and possible implementations: https://chatgpt.com/share/69108494-2248-800e-9d83-3b1b29157252
Force-pushed d370502 to 0217418.
I have used this approach to cluster by time first and then by radius. The `pos` column in the output table is now of `CartesianPoint` type. No new dependencies compared to the previously existing code.
fhagemann
left a comment
Thanks! I had a quick look, but didn't run any code.
Maybe add to the tests how you would expect this function to behave (e.g. what `eltype` the column `pos` has after running this function, etc.).
Force-pushed 0217418 to 6f956a9.
Force-pushed 6f956a9 to a3d71f4.
Ok, sorry for making this chaotic, but in PR #558 I added code that would make sure that the OLD implementation of … Could you build your changes on top of this, and see if the tests require changes?
One thing that I saw when running this: we might wanna split events that are more than …
Clarification on this: |
I would say we might want to keep the old numbers. Also, the original table might have `evtno` being a Vector of event numbers, so we might want to separate those arrays the same way we would separate the times, energies and positions.
Force-pushed c7d4da5 to 489ffb0.
After merging the latest changes from main, the geant4 table is giving unitless position values: … This is causing the …
Force-pushed 8dda4c5 to e329c88.
Yes, this is why it's preferred to use …
```julia
@assert is_detector_hits_table(table) "Table does not have the correct format"
```

```julia
# Collect all time-clustered results
all_results = []
```
We should avoid declaring Arrays with an unknown element type, as this will default to `Any` and not be type-stable.
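A small generic illustration of that point (not the PR's code): `[]` creates a `Vector{Any}`, while specifying a concrete element type keeps the container, and any code consuming it, type-stable.

```julia
# `[]` with no element type defaults to Vector{Any} (not type-stable)
untyped = []
push!(untyped, 1.0)

# declaring the element type keeps the container concrete
typed = Float64[]
push!(typed, 1.0)

eltype(untyped)  # Any
eltype(typed)    # Float64
```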
Force-pushed 3e5cb97 to 723af9d.
fhagemann
left a comment
We're almost there! :)
```julia
@assert !isempty(table) "Input table is empty"
firstrow = first(table)
```
We might not even need this, as the check `is_detector_hits_table` will verify that certain columns exist and that they have the correct type. Can you work with `table` instead of `firstrow` (i.e. inferring the eltypes from `table` directly)?
```julia
result_columns = (
    evtno = evtno_col,
    detno = VectorOfVectors([r.detno for r in all_results]),
    thit = VectorOfVectors([r.thit for r in all_results]),
    edep = VectorOfVectors([r.edep for r in all_results]),
    pos = VectorOfVectors([r.pos for r in all_results])
)
```
How do we deal with events that initially had more columns? Can we do the same thing we did for `evtno` and just clone the entries if a row was split into multiple rows by the clustering?
Maybe then, the result type can just be a similar type as the original `Table`(?)
You can check how we used to do it for the original `Table` (merging the "old" `Table` with the new one).
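A sketch of that cloning idea (hypothetical helper name, assuming a `counts` vector saying into how many rows each original row was split): each extra column's entry is repeated once per resulting row.

```julia
# Hypothetical sketch: replicate the entries of an extra column when
# original row i was split into counts[i] rows by the clustering.
clone_column(col::AbstractVector, counts::AbstractVector{<:Integer}) =
    reduce(vcat, [fill(col[i], counts[i]) for i in eachindex(col)])

clone_column(["a", "b"], [2, 1])  # ["a", "a", "b"]
```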
Force-pushed 9c37f3d to e7cf6d1.
```julia
evtno_col = Int[]
```

```julia
for (evtno, evt) in enumerate(table)
    @assert is_detector_hits_table(table)
```
I would like to keep the assertion message that we had before:
"Table does not have the correct format" (or also happy to discuss another message that's more descriptive)
```julia
detno = VectorOfVectors([r.detno for r in all_results]),
thit = VectorOfVectors([r.thit for r in all_results]),
edep = VectorOfVectors([r.edep for r in all_results]),
pos = VectorOfVectors([r.pos for r in all_results])
```
Do we need to convert the new columns into `VectorOfVectors`, or is that automatically done by the `similar` call when defining `col_vectors`? Do we test somewhere in the tests that the new column types are `VectorOfVectors`?
Force-pushed e7cf6d1 to 0a4b746.
I have added proper splitting of the extra columns that might be in the table. Thank you for your patience with me!
The more I think about it -- we might wanna do it a bit differently. This method should still receive the additional (keyword) argument (see SolidStateDetectors.jl/src/ChargeClustering/ChargeClustering.jl, lines 55 to 68 in 3daf777).
But instead of passing it down to the other `cluster_detector_hits`, we should do the time-splitting already in this method BEFORE passing the table columns further down to the other `cluster_detector_hits`.
This would allow us to let the other `cluster_detector_hits` (see SolidStateDetectors.jl/src/ChargeClustering/ChargeClustering.jl, lines 3 to 52 in 3daf777) …
PS: While browsing through old code and cleaning some up, I also found an old leftover of …
for issue #523
Clustering was supposed to be based on position and time simultaneously.
To achieve this, I implemented a Spatio-Temporal DBSCAN (ST-DBSCAN) algorithm, which includes a point in a cluster only when both the spatial and the temporal constraints are satisfied.
Traditional DBSCAN uses a Euclidean metric, which mixes the position and time dimensions. This could incorrectly group points that are spatially close but temporally far apart (for example, position inside but time outside the threshold, with the combined Euclidean distance still inside the threshold). The new approach avoids that issue by handling both constraints explicitly.
That is why I have used this new algorithm. No prior implementation of this existed in Julia libraries, so I have coded it from scratch.
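The dual-constraint neighborhood test can be sketched like this (a generic illustration with made-up names, not the PR's implementation): a hit only counts as a neighbor if both the spatial distance and the time difference are within their respective thresholds, instead of one mixed Euclidean distance.

```julia
# Generic ST-DBSCAN-style neighbor test (illustrative only):
# point j is a neighbor of i only if BOTH constraints hold.
st_neighbors(pos, t, i, eps_space, eps_time) =
    [j for j in eachindex(t) if j != i &&
        sqrt(sum(abs2, pos[j] .- pos[i])) <= eps_space &&
        abs(t[j] - t[i]) <= eps_time]

pos = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]
t   = [0.0, 2.0, 50.0]
# hit 3 is spatially close to hit 1 but temporally far away,
# so only hit 2 qualifies as a neighbor of hit 1:
st_neighbors(pos, t, 1, 2.0, 10.0)  # [2]
```

A plain Euclidean metric over (x, y, t) would have to trade the two thresholds off against each other, whereas this predicate enforces each one separately.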
A new dependency, NearestNeighbors.jl, was added to Project.toml for this algorithm.
```julia
tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm", 100u"s")
```

When this is run, clustering will happen with the time constraint as well. Cluster properties (`pos`, `edep`) are calculated as before, now with the addition of `thit`.
(This was run on the same case that you demonstrated to me in a previous comment.)
The problem is that it takes a lot of computing resources to compute the clusters. I tried using the best approaches (`KDTree` and `inrange` from NearestNeighbors.jl), but the issue arises from the complexity of the problem, as we have to simultaneously cluster based on position and time.
These are the computing resources typically used when clustering with the `cluster_detector_hits` function:

```
133.839229 seconds (35.63 M allocations: 1.751 GiB, 0.57% gc time, 99.71% compilation time)
```

(There might be a better approach or performance gains to be found. I tried my best and came up with this.)
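An aside on that measurement: 99.71% of the quoted time is compilation, so the first call dominates; timing a second call (or using BenchmarkTools.jl) gives a better picture of the steady-state runtime. A generic illustration, not the PR's code:

```julia
# The first call to a fresh Julia function includes JIT compilation;
# the second call shows the steady-state runtime.
f(x) = sum(abs2, x)
x = rand(10^6)
@time f(x)  # dominated by compilation on the first call
@time f(x)  # closer to the true runtime
```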
Open to any modifications that might be required for it to be integrated with the whole project.