
@sm2909 sm2909 commented Nov 6, 2025

For issue #523.

Clustering was supposed to be based on position and time simultaneously.
To achieve this, I implemented a Spatio-Temporal DBSCAN (ST-DBSCAN) algorithm, which includes a point in a cluster only when both spatial and temporal constraints are satisfied.

Traditional DBSCAN uses a single Euclidean metric, which mixes the position and time dimensions. This could incorrectly group points that are spatially close but temporally far apart (for example, position within its threshold but time outside its threshold, while the combined Euclidean distance is still within the threshold). The new approach avoids that issue by handling both constraints explicitly.

That is why I have used this new algorithm. No prior implementation of this existed in Julia libraries, so I have coded it from scratch.

A new dependency, NearestNeighbors.jl (for KDTree and inrange), was added to Project.toml for this algorithm.
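For illustration (a sketch, not the PR code), the dual-constraint neighbourhood query at the heart of ST-DBSCAN could look like this; st_neighbors, pts, and times are hypothetical names, and units are assumed to be stripped beforehand:

    using NearestNeighbors, StaticArrays

    # pts: positions as SVector{3}s; times: hit times (plain numbers, units stripped)
    function st_neighbors(kdtree::KDTree, pts::AbstractVector{<:SVector{3}},
                          times::AbstractVector, i::Int, eps_pos::Real, eps_time::Real)
        spatial = inrange(kdtree, pts[i], eps_pos)                  # spatially close candidates
        filter(j -> abs(times[j] - times[i]) <= eps_time, spatial)  # keep only temporally close ones
    end

    # usage sketch: kdtree = KDTree(pts); st_neighbors(kdtree, pts, times, 1, 20e-6, 100.0)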

tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm", 100u"s")

When this is run, clustering happens with the time constraint as well. Cluster properties (pos, edep) are calculated as before, now with the addition of thit.

tc.thit[1]
3-element view(::Vector{Quantity{Float32, 𝐓, Unitful.FreeUnits{(s,), 𝐓, nothing}}}, 1:3) with eltype Quantity{Float32, 𝐓, Unitful.FreeUnits{(s,), 𝐓, nothing}}:
 1.7250684f6 s
 1.7992052f6 s
 1.7991948f6 s

(This was run on the same case you demonstrated in your previous comment.)

The problem is that it takes a lot of computing resources to compute the clusters. I tried the best approaches available (KDTree and inrange from NearestNeighbors.jl), but the issue arises from the complexity of the problem, as we have to cluster on position and time simultaneously.

These are the computing resources typically used when clustering with the cluster_detector_hits function:

133.839229 seconds (35.63 M allocations: 1.751 GiB, 0.57% gc time, 99.71% compilation time)

(There might be a better approach or performance gains to be found. I tried my best and came up with this.)

I am open to any modifications that might be required to integrate this with the whole project.

@fhagemann
Collaborator

Thanks for the initiative, I will look into this in more detail once I find a quiet moment.
Maybe a quick question to get an order of magnitude: the 133.839229 seconds you quote is for how many events/rows of the initial table?

@sm2909
Author

sm2909 commented Nov 7, 2025

1000 rows

These are the exact commands that I ran:

using SolidStateDetectors, Geant4, Unitful
sim = Simulation{Float32}(SSD_examples[:InvertedCoax])
source = IsotopeSource(90, 228, 0.0, 0.0, CartesianPoint(0.05,0,0.05))
app = G4JLApplication(sim, source)
t = run_geant4_simulation(app, 1_000)
@time tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm", 100u"s")

@fhagemann
Collaborator

I see, thanks!
So, what I'm seeing is that these changes do two things:

  • First, updating the current method of cluster_detector_hits to the new implementation of CartesianPoint (Change AbstractCoordinatePoint to not be a FieldVector #489), which on the current main branch does not seem to be working with the Geant4 output.
  • Second, updating the clustering algorithm to cluster by time and/or position.

Would it be very cumbersome for you to split these two things into two commits, to see what exactly the changes to the clustering algorithm are?

And whether the long run-time of 2 minutes results from the updates to the new CartesianPoints (maybe benchmark this before implementing the timing clustering and see how long it takes to run for 1000 events) or the new clustering?

@fhagemann
Collaborator

fhagemann commented Nov 7, 2025

And it also looks like there are no tests right now for this.

You could add some lines in @testset "Run Geant4 simulations" in test/test_geant4.jl
after creating and checking the evts table and before running the # Generate waveforms block, something like this:

    # Cluster events by radius
    @test_nowarn clustered_evts = SolidStateDetectors.cluster_detector_hits(evts, 10u"µm")
    @test length(clustered_evts) == length(evts)
    @test length(flatview(clustered_evts.pos)) <= length(flatview(evts.pos))
    
    # Cluster events by time
    @test_nowarn clustered_evts = SolidStateDetectors.cluster_detector_hits(evts, 10u"µm", 100u"s")
    @test length(clustered_evts) == length(evts)
    @test length(flatview(clustered_evts.pos)) <= length(flatview(evts.pos))

(note that flatview requires using ArraysOfArrays at the beginning of the file as well).

This will ensure that these methods get covered by the tests and will result in errors whenever something is changed that might break this implementation.

@sm2909
Author

sm2909 commented Nov 8, 2025

I have split the changes into 2 commits as you asked, adding the tests as well.

Benchmarks after first commit: (without time clustering)

@time tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm")
  6.686447 seconds (6.65 M allocations: 390.600 MiB, 10.31% gc time, 96.51% compilation time)

Benchmarks after second commit: (with time clustering)

@time tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm", 100u"s")
573.671077 seconds (151.57 M allocations: 7.010 GiB, 0.77% gc time, 99.91% compilation time)
@time tc = SolidStateDetectors.cluster_detector_hits(t, 20u"μm", 100u"s")
405.883093 seconds (151.57 M allocations: 7.010 GiB, 0.68% gc time, 99.91% compilation time)

The runtime varies, but was usually between 5 and 10 minutes for the trials I did today (1000 events).

@sm2909
Author

sm2909 commented Nov 8, 2025

A few important questions:

  1. The current clustering output table gives pos as a static vector. Should these also be returned as CartesianPoint to match the new patterns?

  2. The current code only clusters on BOTH pos AND time; individual clustering is not available. Older calls with only the pos parameter might break. There are two solutions to this:

a) add the old function for position-only clustering alongside the new functions (it has better performance as well)

b) keep these parameters as kwargs with default values if the user doesn't specify them. Example: if the user only wants position clustering, then radius_time is kept at infinity. (This will have performance issues.)

Collaborator

@fhagemann fhagemann left a comment


Here are some comments.

I'm wondering here: would it maybe be more efficient to

  1. split all events by time (separate a row into multiple rows if the hits are more than eps_time apart). I expect most of the Geant4 hits to have identical/very similar thit entries if they resulted from the same nuclear decay.
  2. then, perform the spatial clustering like we used to before (or the more efficient implementation)?

Based on your experience in implementing this, do you think that this would be faster?
Or do you have some feeling for what is limiting the runtime of this clustering?

@fhagemann
Collaborator

A few important questions:

1. The current clustering output table gives pos as a static vector. Should these also be returned as CartesianPoint to match the new patterns?

2. The current code only clusters on BOTH pos AND time; individual clustering is not available. Older calls with only the pos parameter might break. There are two solutions to this:

a) add the old function for position-only clustering alongside the new functions (it has better performance as well)

b) keep these parameters as kwargs with default values if the user doesn't specify them. Example: if the user only wants position clustering, then radius_time is kept at infinity. (This will have performance issues.)

See my previous message:
I would like to see whether it might make sense to somewhat separate time and spatial clustering.
Right now, we might need to CLUSTER in space, but actually SPLIT in time (one row can have hits from different positions and times -> we should split events at different times into different rows, and THEN cluster hits that are close in space into fewer hits).

cluster_detector_hits could then consist of two steps: split in time, and cluster in space.
Then we could make a base method that takes both cluster_radius and cluster_time as keyword arguments, specifying them to be Unitful.Length or Unitful.Time with default values of NaN that would just skip the splitting/clustering.

In the end, there can be other methods of cluster_detector_hits that would just call the base method based on the inputs, e.g.

#main method
function cluster_detector_hits(
    table::TypedTables.Table;
    cluster_radius::Union{<:Real, Unitful.Length} = NaN, # keyword argument
    cluster_time::Union{<:Real, Unitful.Time} = NaN,     # NaN = skip the clustering/splitting
    )
    @assert :pos in TypedTables.columnnames(table) "Table has no column `pos`"
    @assert :thit in TypedTables.columnnames(table) "Table has no column `thit`"
    @assert :edep in TypedTables.columnnames(table) "Table has no column `edep`"
    @assert :detno in TypedTables.columnnames(table) "Table has no column `detno`"
    clustered_nt = map(
        evt -> cluster_detector_hits(evt.detno, evt.edep, evt.pos, evt.thit, cluster_radius, cluster_time),
        table
    )
end

# convenience functions (allowing for the old method syntax)
cluster_detector_hits(table::TypedTables.Table, cluster_radius::Unitful.Length)  = cluster_detector_hits(table; cluster_radius)
cluster_detector_hits(table::TypedTables.Table, cluster_time::Unitful.Time) = cluster_detector_hits(table; cluster_time)
cluster_detector_hits(table::TypedTables.Table, cluster_radius::Unitful.Length, cluster_time::Unitful.Time) = cluster_detector_hits(table; cluster_radius, cluster_time)
cluster_detector_hits(table::TypedTables.Table, cluster_time::Unitful.Time, cluster_radius::Unitful.Length) = cluster_detector_hits(table; cluster_radius, cluster_time)

But we can worry about the convenience functions in the very last step.

@sm2909
Author

sm2909 commented Nov 9, 2025

Here are some comments.

I'm wondering here: would it maybe be more efficient to

  1. split all events by time (separate a row into multiple rows if the hits are more than eps_time apart). I expect most of the Geant4 hits to have identical/very similar thit entries if they resulted from the same nuclear decay.
  2. then, perform the spatial clustering like we used to before (or the more efficient implementation)?

Based on your experience in implementing this, do you think that this would be faster? Or do you have some feeling for what is limiting the runtime of this clustering?

That would lead to incorrect clustering. Consider this toy example:

eps_time = 10s
eps_pos = 10m
(consider 1D positions here for easier visualization)

hit 1: 1s, 1m
hit 2: 10s, 100m
hit 3: 15s, 5m

Clustering by time first puts all hits in the same cluster (consecutive gaps are within eps_time).
Then clustering by position, [1, 3] form one cluster and [2] another.
But [1, 3] should not belong to the same cluster, as they are 14 s apart, more than eps_time.

(Initially, hit 2 formed the temporal link between hits 1 and 3. But since hit 2 does not end up in the same cluster, hits 1 and 3 should not be kept together.)

@fhagemann
Collaborator

fhagemann commented Nov 9, 2025

That would lead to incorrect clustering. Consider this toy example:

eps_time = 10s, eps_pos = 10m (consider 1D positions here for easier visualization)

hit 1: 1s, 1m; hit 2: 10s, 100m; hit 3: 15s, 5m

Clustering by time first puts all hits in the same cluster (consecutive gaps are within eps_time). Then clustering by position, [1, 3] form one cluster and [2] another. But [1, 3] should not belong to the same cluster, as they are 14 s apart, more than eps_time.

(Initially, hit 2 formed the temporal link between hits 1 and 3. But since hit 2 does not end up in the same cluster, hits 1 and 3 should not be kept together.)

I see what you mean.
The idea behind time clustering would be to separate events that are more than a given eps_time apart. This is because, in our pulse shape simulation, we start drifting all charge clouds at the same initial time.
Unlike with distance, I would also like to make sure that grouped events are never more than eps_time apart.

So, even if the hits were

hit 1: 1s, 1m 
hit 2: 10s, 1m
hit 3: 15s, 1m

and eps_time was 10s, I would like to NEVER group hit 1 and hit 3.

Let's assume this toy model:

for N in 1:1000:
    hit N: N ns, 1m

If I choose eps_time = 1u"ns" because my simulation step is 1u"ns", chain growth would group all hits within the full 1u"µs" together, even though this might not be what I want. At most, I would like the first two hits to be combined, and a new group started with the third hit, etc.

Just to think out loud (this doesn't mean it is the best way to go): the time grouping could either start with the first hit and combine everything that is within eps_time after the first hit.
OR we could sort all events by time, start grouping with the two hits that are closest together, group them, and then keep adding more events if they are within eps_time, or start new groups afterwards.

The time clustering should be there to make sure that we only drift hits together if they actually coincide "enough" in time, whereas the space clustering makes sure that we do not simulate thousands of small charge clouds (which might be the output of Geant4), especially if they are all very close together.
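For illustration, a minimal sketch of the first option (a hypothetical group_by_time helper, not code from this PR): each group is anchored at its earliest hit and absorbs everything within eps_time of that anchor, so no group ever spans more than eps_time.

    function group_by_time(times::AbstractVector, eps_time)
        order = sortperm(times)
        groups = Vector{Vector{Int}}()
        for i in order
            if !isempty(groups) && times[i] - times[first(last(groups))] <= eps_time
                push!(last(groups), i)   # still within eps_time of this group's anchor
            else
                push!(groups, [i])       # too late relative to the anchor: start a new group
            end
        end
        groups
    end

    # For the toy model above with eps_time = 1u"ns": the hits at 1 ns and 2 ns are
    # grouped, the hit at 3 ns starts a new group, and so on.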

@fhagemann
Collaborator

As of SolidStateDetectors@0.10.27 (pre #489), I was using this to just split rows of a Table evts with non-identical thit:

using ArraysOfArrays

elem_ptr = deepcopy(evts.thit.elem_ptr)         # sub-vector boundaries of the flat data
kernel_size = deepcopy(evts.thit.kernel_size)
added = 0
for i in eachindex(evts.thit)
    thit = evts.thit[i]
    d = findall(diff(thit) .> zero(eltype(thit)))   # indices where the (sorted) time changes
    isempty(d) && continue
    # insert new boundaries so that the row is split at every time change
    splice!(elem_ptr, i + added, elem_ptr[i + added] .+ [0; d])
    append!(kernel_size, (() for _ in d))
    added += length(d)
end

# rebuild the Table on the same flat data with the new boundaries;
# evtno is renumbered to enumerate the split rows
evts_split = Table(merge(NamedTuple(c => begin
    if c == :evtno
        collect(eachindex(kernel_size))
    else
        f = getproperty(evts, c)
        VectorOfArrays(f.data, elem_ptr, kernel_size)
    end end
for c in columnnames(evts))))

@sm2909
Author

sm2909 commented Nov 9, 2025

That would lead to incorrect clustering. Consider this toy example:
eps_time = 10s, eps_pos = 10m (consider 1D positions here for easier visualization)
hit 1: 1s, 1m; hit 2: 10s, 100m; hit 3: 15s, 5m
Clustering by time first puts all hits in the same cluster (consecutive gaps are within eps_time). Then clustering by position, [1, 3] form one cluster and [2] another. But [1, 3] should not belong to the same cluster, as they are 14 s apart, more than eps_time.
(Initially, hit 2 formed the temporal link between hits 1 and 3. But since hit 2 does not end up in the same cluster, hits 1 and 3 should not be kept together.)

I see what you mean. The idea behind time clustering would be to separate events that are more than a given eps_time apart. This is because, in our pulse shape simulation, we start drifting all charge clouds at the same initial time. Unlike with distance, I would also like to make sure that grouped events are never more than eps_time apart.

So, even if the hits were

hit 1: 1s, 1m 
hit 2: 10s, 1m
hit 3: 15s, 1m

and eps_time was 10s, I would like to NEVER group hit 1 and hit 3.

Let's assume this toy model:

for N in 1:1000:
    hit N: N ns, 1m

If I choose eps_time = 1u"ns" because my simulation step is 1u"ns", chain growth would group all hits within the full 1u"µs" together, even though this might not be what I want. At most, I would like the first two hits to be combined, and a new group started with the third hit, etc.

Just to think out loud (this doesn't mean it is the best way to go): the time grouping could either start with the first hit and combine everything that is within eps_time after the first hit. OR we could sort all events by time, start grouping with the two hits that are closest together, group them, and then keep adding more events if they are within eps_time, or start new groups afterwards.

The time clustering should be there to make sure that we only drift hits together if they actually coincide "enough" in time, whereas the space clustering makes sure that we do not simulate thousands of small charge clouds (which might be the output of Geant4), especially if they are all very close together.

Thanks for clarifying the use-case of this algorithm.

Since we just want to bring together the hits that are close enough in space and time, we might not need the DBSCAN algorithm at all. Rather, this could be achieved with much simpler algorithms.

The 1:1000 hits example you gave would yield the same single cluster with DBSCAN, however we implement it. As long as the hits are dense in time (or space), it will keep adding them, which would lead to huge clusters, and that is not what we need.

Using the following algorithm (logic outline) would better serve our purpose:

for each hit:
    if hit is already in a cluster: continue
    find all other hits that are not yet in a cluster and lie within eps of this hit
    add those hits to this hit's cluster
end

The implementation would be straightforward as well, which means the runtime will be low.
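A runnable sketch of this logic (hypothetical names, not the PR code); the inner loop could be replaced by a KDTree inrange query for large inputs:

    using LinearAlgebra, StaticArrays

    # Every still-unassigned hit seeds a cluster and absorbs all other unassigned
    # hits within eps_pos AND eps_time of it, in a single O(n^2) pass.
    function greedy_cluster(pts::AbstractVector{<:SVector{3}}, times::AbstractVector,
                            eps_pos, eps_time)
        labels = zeros(Int, length(pts))    # 0 = not yet in any cluster
        ncl = 0
        for i in eachindex(pts)
            labels[i] == 0 || continue      # already assigned
            ncl += 1
            labels[i] = ncl
            for j in (i+1):lastindex(pts)
                if labels[j] == 0 &&
                        norm(pts[j] - pts[i]) <= eps_pos &&
                        abs(times[j] - times[i]) <= eps_time
                    labels[j] = ncl         # absorb j into i's cluster
                end
            end
        end
        labels
    end

Unlike DBSCAN, clusters here never grow by chaining: a hit joins a cluster only if it is within eps of that cluster's seed hit.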

@fhagemann
Collaborator

I agree regarding time grouping/splitting. However, I would still like to keep DBSCAN for space clustering.

@oschulz
Member

oschulz commented Nov 9, 2025

We may actually want to move away from DBSCAN as the default clustering algorithm. DBSCAN can group connected elongated/curved/... agglomerations of charges. We want something that prefers spherical clusters.

@oschulz
Member

oschulz commented Nov 9, 2025

I've had a little chat with ChatGPT about this exploring options and possible implementation: https://chatgpt.com/share/69108494-2248-800e-9d83-3b1b29157252

@sm2909 sm2909 changed the title from "clustering detector hits by time as well using st-dbscan algorithm" to "clustering detector hits by time as well" on Nov 10, 2025
@sm2909
Author

sm2909 commented Nov 10, 2025

we could sort all events by time, start grouping with the two hits that are closest together, group them, and then keep adding more events if they are within eps_time, or start new groups afterwards.

I have used this approach to cluster by time first and then by radius.

The pos column in the output table is now of CartesianPoint type.

There are no new dependencies beyond the previously existing code.

Collaborator

@fhagemann fhagemann left a comment


Thanks! I had a quick look, but didn't run any code.

Maybe add to the tests how you would expect this function to behave
(e.g. what eltype the column pos has after running this function etc.).

@fhagemann fhagemann added the labels bug (Something isn't working), enhancement (Improvement of existing features), and convenience (Improve user-friendliness) on Nov 11, 2025
@sm2909
Author

sm2909 commented Nov 12, 2025

  1. Added some more tests: checking the eltype of pos, the sum of all edep, and that there are no NaNs in the output table.

  2. The push! from before has been moved to a helper function to save allocations.

  3. The pos type is converted to CartesianPoint in the function (with table input) to avoid errors.

  4. Removed the Clustering@0.14 support method as you suggested.

@fhagemann
Collaborator

Ok, sorry for making this chaotic, but in PR #558 I added code that makes sure that the OLD implementation of cluster_detector_hits works, and merged this into main.
This also included a clever way to deal with both CartesianPoint and StaticVector as the pos column of the Table.
d6a62f3
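(For context, one conceivable way to handle both pos element types is plain dispatch — a hedged sketch, not necessarily what #558 actually does:)

    using SolidStateDetectors, StaticArrays

    # hypothetical normalization helpers, not the #558 implementation
    _as_cartesian(p::CartesianPoint) = p
    _as_cartesian(p::StaticVector{3}) = CartesianPoint(p...)

    # pos = _as_cartesian.(evt.pos)   # normalize a hits vector before clustering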

Could you build your changes on top of this, and see if the tests require changes?

@fhagemann
Collaborator

One thing that I saw when running this: we might want to split events that are more than cluster_time apart into different rows of the table, such that the table after time splitting might have more rows than the initial one.
Because SSD will simulate pulse shapes for each row, we do not want events far apart in time to be simulated together.

@sm2909
Author

sm2909 commented Dec 5, 2025

One thing that I saw when running this: we might want to split events that are more than cluster_time apart into different rows of the table, such that the table after time splitting might have more rows than the initial one. Because SSD will simulate pulse shapes for each row, we do not want events far apart in time to be simulated together.

Clarification on this:
Should the values in the evtno column of the output table have the event numbers from the original experiment, or should they just reflect the serial number of the table row?

@fhagemann
Collaborator

fhagemann commented Dec 5, 2025

Clarification on this: Should the values in the evtno column of the output table have the event numbers from the original experiment, or should they just reflect the serial number of the table row?

I would say we might want to keep the old numbers.
But we can also introduce a keyword argument to let the user choose if they want the original evtno or a counter that is equal to the row of the table.

Also, the original table might have evtno being a Vector of event numbers, so we might then want to separate those arrays the same way we separate times, energies, and positions.

@sm2909
Author

sm2909 commented Dec 6, 2025

using SolidStateDetectors, Geant4, Unitful
sim = Simulation{Float32}(SSD_examples[:InvertedCoax])
source = IsotopeSource(90, 228, 0.0, 0.0, CartesianPoint(0.05,0,0.05))
app = G4JLApplication(sim, source)
t = run_geant4_simulation(app, 1_000)

After merging the latest changes from main, the Geant4 table gives unitless position values:

t.pos[1]
7-element view(::Vector{CartesianPoint{Float32}}, 1:7) with eltype CartesianPoint{Float32}:
 CartesianPoint{Float32}(0.029364103f0, -0.010049292f0, 0.042470355f0)
 CartesianPoint{Float32}(0.029358402f0, -0.010037613f0, 0.04245329f0)
 CartesianPoint{Float32}(0.029364105f0, -0.010049293f0, 0.04247036f0)
 CartesianPoint{Float32}(0.02936739f0, -0.010051437f0, 0.042446814f0)
 CartesianPoint{Float32}(0.029364103f0, -0.010049292f0, 0.04247036f0)
 CartesianPoint{Float32}(0.02934997f0, -0.010030221f0, 0.042460836f0)
 CartesianPoint{Float32}(0.029364105f0, -0.010049293f0, 0.04247036f0)

This is causing ustrip to fail.

@fhagemann
Collaborator

fhagemann commented Dec 7, 2025

Yes, this is why it's preferred to use to_internal_units on CartesianPoint (or SVector, depending on which it is).
I undid your merge commit and re-merged while resolving the merge conflicts.
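(As a hedged illustration of the difference, assuming to_internal_units passes plain numbers through while converting and stripping unitful ones:)

    using Unitful

    x_unitful  = 0.029364103f0u"m"
    x_unitless = 0.029364103f0

    ustrip(u"m", x_unitful)     # 0.029364103f0
    # ustrip(u"m", x_unitless)  # throws: a plain Float32 has no units to strip
    # SolidStateDetectors.to_internal_units(x) is assumed to handle both cases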

@assert is_detector_hits_table(table) "Table does not have the correct format"

# Collect all time-clustered results
all_results = []
Collaborator


We should avoid declaring Arrays with unknown element-type, as this will default to Any and not be type-stable.
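(A small illustration; process and rows are hypothetical names:)

    results_any   = []          # Vector{Any}: element accesses box and are not type-stable
    results_typed = Float64[]   # Vector{Float64}: concrete element type, type-stable
    # for computed rows, letting the compiler infer the eltype also works:
    # results = [process(r) for r in rows]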

@sm2909 sm2909 force-pushed the fix/issue-#523 branch 3 times, most recently from 3e5cb97 to 723af9d on December 8, 2025
Collaborator

@fhagemann fhagemann left a comment


We're almost there! :)

Comment on lines 83 to 84
@assert !isempty(table) "Input table is empty"
firstrow = first(table)
Collaborator


We might not even need this, as the check is_detector_hits_table will check whether certain columns exist and that they have the correct type. Can you work with table instead of firstrow (i.e. inferring eltypes from table directly)?
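(For illustration, element types could be inferred from the columns directly; PT and TT are hypothetical names:)

    using ArraysOfArrays

    PT = eltype(flatview(table.pos))    # e.g. CartesianPoint{Float32}
    TT = eltype(flatview(table.edep))   # e.g. a Quantity{Float32} energy type
    # unlike first(table), this also works when the table is empty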

Comment on lines 113 to 119
result_columns = (
    evtno = evtno_col,
    detno = VectorOfVectors([r.detno for r in all_results]),
    thit = VectorOfVectors([r.thit for r in all_results]),
    edep = VectorOfVectors([r.edep for r in all_results]),
    pos = VectorOfVectors([r.pos for r in all_results])
)
Collaborator

@fhagemann fhagemann Dec 8, 2025


How do we deal with events that had more columns initially? Can we do the same thing we did for evtno and just clone the entries if a row was split into multiple rows by the clustering?
Maybe then, the result type can just be similar to the original Table(?)
You can check how we used to do it for the original Table (merging the "old" Table with the new one).

@sm2909 sm2909 force-pushed the fix/issue-#523 branch 2 times, most recently from 9c37f3d to e7cf6d1 on December 12, 2025
evtno_col = Int[]

for (evtno, evt) in enumerate(table)
@assert is_detector_hits_table(table)
Collaborator


I would like to keep the assertion message that we had before:
"Table does not have the correct format" (or also happy to discuss another message that's more descriptive)

Comment on lines 103 to 106
detno = VectorOfVectors([r.detno for r in all_results]),
thit = VectorOfVectors([r.thit for r in all_results]),
edep = VectorOfVectors([r.edep for r in all_results]),
pos = VectorOfVectors([r.pos for r in all_results])
Collaborator

@fhagemann fhagemann Dec 12, 2025


Do we need to convert the new columns into VectorOfVectors or is that automatically done by the similar when defining col_vectors? Do we test somewhere in the tests that the new column types are VectorOfVectors?
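(A possible test addition, as a sketch:)

    @test clustered_evts.pos  isa VectorOfVectors
    @test clustered_evts.edep isa VectorOfVectors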

@sm2909
Author

sm2909 commented Dec 22, 2025

I have added proper splitting of the extra columns that might be in the table.

Thank you for your patience with me!

@fhagemann
Collaborator

The more I think about it -- we might want to do it a bit differently.

This method should still receive the additional (keyword) argument cluster_time.

function cluster_detector_hits(table::TypedTables.Table, cluster_radius::RealQuantity)
    @assert is_detector_hits_table(table) "Table does not have the correct format"
    clustered_nt = map(
        evt -> cluster_detector_hits(evt.detno, evt.edep, evt.pos, cluster_radius),
        table
    )
    TypedTables.Table(merge(
        TypedTables.columns(table),
        map(
            VectorOfVectors,
            TypedTables.columns(clustered_nt)
        )
    ))
end

But instead of passing it down to the other cluster_detector_hits, we should do the time-splitting already in this method BEFORE passing the table columns further down to the other cluster_detector_hits.

This would allow us to leave the other cluster_detector_hits mostly untouched (no need to add any time-splitting functionality in here), except for the performance improvements (e.g. in unit handling etc.):

function cluster_detector_hits(
    detno::AbstractVector{<:Integer},
    edep::AbstractVector{TT},
    pos::AbstractVector{<:Union{<:StaticVector{3,PT}, <:CartesianPoint{PT}}},
    cluster_radius::RealQuantity
) where {TT <: RealQuantity, T <: Real, PT <: Union{T, <:Unitful.Length{T}}}
    unsorted = TypedTables.Table(detno = detno, edep = edep, pos = pos)
    sorting_idxs = sortperm(unsorted.detno)
    sorted = unsorted[sorting_idxs]
    grouped = TypedTables.Table(consgroupedview(sorted.detno, TypedTables.columns(sorted)))
    r_detno = similar(detno, 0)
    r_edep = similar(edep, 0)
    r_pos = similar(pos, 0)
    ustripped_cradius = _parse_value(float(T), cluster_radius, internal_length_unit)
    for d_hits_nt in grouped
        d_hits = TypedTables.Table(d_hits_nt)
        d_detno = first(d_hits.detno)
        @assert all(isequal(d_detno), d_hits.detno)
        if length(d_hits) > 3
            clusters = Clustering.dbscan(hcat((to_internal_units.(getindex.(d_hits.pos, i)) for i in 1:3)...)',
                ustripped_cradius, leafsize = 20, min_neighbors = 1, min_cluster_size = 1).clusters
            for c in clusters
                idxs = vcat(c.boundary_indices, c.core_indices)
                @assert length(idxs) == c.size
                c_hits = view(d_hits, idxs)
                push!(r_detno, d_detno)
                esum = sum(c_hits.edep)
                push!(r_edep, esum)
                if esum ≈ zero(TT)   # avoid dividing by zero in the weights below
                    push!(r_pos, barycenter(c_hits.pos))
                else
                    weights = ustrip.(Unitful.NoUnits, c_hits.edep .* inv(esum))
                    push!(r_pos, barycenter(c_hits.pos, StatsBase.Weights(weights)))
                end
            end
        else
            append!(r_detno, d_hits.detno)
            append!(r_edep, d_hits.edep)
            append!(r_pos, d_hits.pos)
        end
    end
    (detno = r_detno, edep = r_edep, pos = r_pos)
end
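For concreteness, the time-splitting step described above could look roughly like this (a sketch under assumptions: a hypothetical split_rows_by_time helper and a group_by_time-style grouping as sketched earlier in this thread):

    # split each row into one row per time group, cloning all scalar columns;
    # the spatial clustering can then run per resulting row
    function split_rows_by_time(table::TypedTables.Table, cluster_time)
        rows = NamedTuple[]
        for evt in table
            for idxs in group_by_time(evt.thit, cluster_time)   # grouping as sketched earlier
                push!(rows, map(c -> c isa AbstractVector ? c[idxs] : c, evt))
            end
        end
        TypedTables.Table(rows)   # a type-stable version would rebuild VectorOfVectors columns
    end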

PS: While browsing through old code and cleaning some up, I also found an old leftover of Table utilities in https://github.com/JuliaPhysics/SolidStateDetectors.jl/blob/main/src/MCEventsProcessing/table_utils.jl. Maybe some of the functions there could be helpful in implementing this quickly.
