From c21d852077921d63f593699ac6e3543bc2059f26 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Fri, 19 Jan 2018 13:54:17 +0530 Subject: [PATCH 1/7] Added spec specs/unmanage_cluster.adoc tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 213 ++++++++++++++++++++++++++++++++++++ 1 file changed, 213 insertions(+) create mode 100644 specs/unmanage_cluster.adoc diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc new file mode 100644 index 0000000..c7adaca --- /dev/null +++ b/specs/unmanage_cluster.adoc @@ -0,0 +1,213 @@ += Introduce a un-manage cluster mechanism in tendrl + +The intent of this change is to introduce an un-manage cluster functionality in +tendrl. This makes the cluster known to tendrl but not managed anymore, meaning +the monitoring, alerting and management of the cluster is no more possible from +tendrl. At later stage (if required) admin can decide to re-import the cluster +to start managing it again. + +The un-manage functionality is helpful for scenario where admin wants to bring +down the cluster for some critical maintenance activities and doesn't want the +monitoring etc to be performed for that period. + +== Problem description + +There are situations when admin needs some critical maintenance of the cluster +and during this period he doesn't want any monitoring etc taking place. Also +of he decides to dismantle the cluster at some stage we should have a mechsnism +using which the cluster could be marked as un-managed from tendrl side. + +Tendrl also should provide a provision to re-import the cluster at later stage +if admin wants and the process should be quite seamless and no or very less +manual intervention required for this job to be performed. + + +== Use Cases + +This addresses the un-managing and re-import an un-managed cluster at later +stage. 
The un-manage functionality in tendrl needs to take care of below things + +* Stop any services which got started as part of tendrl managing the storage +nodes and disable the services +* Set the cluster state properly so that the same is marked and listed as +un-managed in UI dashboards. No operations should be allowed on the un-managed +cluster and there should not be any monitoring, alerting or entities management +supported on this cluster anymore +* User should have an option to re-import the cluster if needed later and it +should seamlessly work as usual + + +== Proposed change + +* On un-manage cluster start a flow in tendrl server node's node-agent which +creates child jobs on storage nodes to stop tendrl specific services like +collectd and tendrl-gluster-integration + +* Mark the cluster flag `is_managed` as `False` so that the cluster could be +listed as un-managed in UI dashboards and all the possible actions could be +disabled for it + +* Archive the graphite (monitoring) data for the cluster in archive location so +the grafana dashboards dont list the cluster and its entities anymore + +* Delete the grafana alert dashboards for the cluster and its dependent entities + +The logic here goes like + +** Start a flow in node-agent on tendrl server node for un-manage cluster + +** The first atom of the above flow invokes child jobs on the storage node's +node-agent to stop tendrl specific services and marking them dissabled + +** In the main atom of the un-manage cluster flow remove if any etcd details for +the cluster and then mark the cluster is_managed flag as `False` + +** One of the atoms now un-manage cluster flow, invokes a flow in +monitoring-integration to archive the graphite data for the cluser + +** Finally another atom invokes a flow in monitoring-integration to remove the +grafana alert dashboards for the cluster and its dependent entities + +So the structure of the un-manage cluster flow would look something as below + +``` +UnmanageCluster: 
+ tags: + - "tendrl/monitor" + atoms: + - tendrl.objects.Cluster.atoms.StopMonitoringServices + - tendrl.objects.Cluster.atoms.StopIntegrationServices + - tendrl.objects.Cluster.atoms.DeleteClusterDetails + - tendrl.objects.Cluster.atoms.DeleteMonitoringDetails + help: "Unmanage a Gluster Cluster" + enabled: true + inputs: + mandatory: + - TendrlContext.integration_id + run: tendrl.flows.UnmanageCluster + type: Update + uuid: 2f94a48a-05d7-408c-b400-e27827f4efed + version: 1 +``` + +=== Alternatives + +None + +=== Data model impact + +None + +=== Impacted Modules: + +==== Tendrl API impact: + +* Introduce an API `cluster/{int-id}/unmanage` for triggering an un-manage +cluster fow + +==== Notifications/Monitoring impact: + +* A flow to archive the cluster specific graphite data + +* A flow to remove the grafana alerts dashboards for the cluster and its +dependent entities + +* Raise an alert once cluster got un-managed with details like where to look +for old graphite data etc + +==== Tendrl/common impact: + +* A flow un-manage cluster to be tergetted at tendrl server node + +==== Tendrl/node_agent impact: + +None + +==== Sds integration impact: + +None + +==== Tendrl Dashboard impact: + +* UX requirements for invoking an un-manage cluster flow for an existing cluster +is captured at https://redhat.invisionapp.com/share/8QCOEVEY9 + +=== Security impact: + +None + +=== Other end user impact: + +User gets an option to un-mnaage an existing cluster and can re-import at later +stage + +=== Performance impact: + +None + +=== Other deployer impact: + +The tendrl-ansible module need to provide a mechanism to setup tendrl components +and dependencies on additional new node in the cluster. + + details to be added here of the plyabooks etc. 
+ === Developer impact: None == Implementation: * https://github.com/Tendrl/commons/issues/797 === Assignee(s): Primary assignee: shtripat mbukatov === Work Items: * https://github.com/Tendrl/specifications/issues/252 == Dependencies: None == Testing: * Check if UI dashboard has an option to trigget un-manage cluster flow * Check if the flow gets completed successfully and verify if the grafana dashboard reflects and cluster details available now for the selected cluster * Verify that not grafana alert dashboards available now for the un-managed cluster * Verify that the clusters list report the cluster as un-managed and import option is enabled now * Try to import the cluster back and it should be successful. All grafana dashboards, grafana alert dashboards and UI reflect the cluster details back * Invoke the REST end point `clusters/{int-id}/unmanage` and the cluster should be un-managed successfully == Documentation impact: * New un-manage cluster feature should be documented with details like what all gets disabled / removed in case a cluster is un-managed * New API end point should be documented with sample input / output structures == References: * https://redhat.invisionapp.com/share/8QCOEVEY9 * https://github.com/Tendrl/commons/pull/798 * https://github.com/Tendrl/monitoring-integration/pull/317 From eafcc918b3e2740c8cedceb1d684c1beb00f9c41 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Mon, 5 Feb 2018 11:23:17 +0530 Subject: [PATCH 2/7] Incorporated review comments and added more expected behavior details tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index c7adaca..44c595f 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -184,18 +184,28 @@ None * Check
if the flow gets completed successfully and verify if the grafana dashboard reflects and cluster details available now for the selected cluster -* Verify that not grafana alert dashboards available now for the un-managed +* Verify that no grafana alert dashboards available now for the un-managed cluster * Verify that the clusters list report the cluster as un-managed and import option is enabled now * Try to import the cluster back and it should be successful. All grafana -dashboards, grafana alert dashboards and UI reflect the cluster details back +dashboards, grafana alert dashboards and UI reflect the cluster details back * Invoke the REST end point `clusters/{int-id}/unmanage` and the cluster should be un-managed successfully +* On un-manage cluster completion, the alert dashboards in grafana would vanish +for the entities of the cluster like volume, bricks etc. Verify to make sure the +same happens as expected + +* Once cluster is un-managed the details of the cluster would vanish from +dashboards in grafana. 
Verify the same happens as expected + +* Verify that the final alert post un-manage flow, tells about removal of +details from grafana dashboards and grafana alert dashboards + == Documentation impact: @@ -204,6 +214,9 @@ gets disabled / removed in case a cluster is un-managed * New API end point should be documented with sample input / output structures +* The expected behavior post un-manage call in grafana dashboards should be +clearly mentioned in documents + == References: * https://redhat.invisionapp.com/share/8QCOEVEY9 From 53fdc5dc2a3433888f04becf0e69bb94975851df Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Wed, 7 Feb 2018 12:38:23 +0530 Subject: [PATCH 3/7] Added UI impact details to specification tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 91 +++++++++++++++++++++++++++++++++---- 1 file changed, 82 insertions(+), 9 deletions(-) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index 44c595f..c3c7c6a 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -14,7 +14,7 @@ monitoring etc to be performed for that period. There are situations when admin needs some critical maintenance of the cluster and during this period he doesn't want any monitoring etc taking place. Also -of he decides to dismantle the cluster at some stage we should have a mechsnism +if he decides to dismantle the cluster at some stage we should have a mechanism using which the cluster could be marked as un-managed from tendrl side. 
Tendrl also should provide a provision to re-import the cluster at later stage @@ -57,13 +57,13 @@ The logic here goes like ** Start a flow in node-agent on tendrl server node for un-manage cluster ** The first atom of the above flow invokes child jobs on the storage node's -node-agent to stop tendrl specific services and marking them dissabled +node-agent to stop tendrl specific services and marking them disabled ** In the main atom of the un-manage cluster flow remove if any etcd details for the cluster and then mark the cluster is_managed flag as `False` ** One of the atoms now un-manage cluster flow, invokes a flow in -monitoring-integration to archive the graphite data for the cluser +monitoring-integration to archive the graphite data for the cluster ** Finally another atom invokes a flow in monitoring-integration to remove the grafana alert dashboards for the cluster and its dependent entities @@ -103,7 +103,7 @@ None ==== Tendrl API impact: * Introduce an API `cluster/{int-id}/unmanage` for triggering an un-manage -cluster fow +cluster flow ==== Notifications/Monitoring impact: @@ -117,7 +117,7 @@ for old graphite data etc ==== Tendrl/common impact: -* A flow un-manage cluster to be tergetted at tendrl server node +* A flow un-manage cluster to be targeted at tendrl server node ==== Tendrl/node_agent impact: @@ -129,8 +129,76 @@ None ==== Tendrl Dashboard impact: -* UX requirements for invoking an un-manage cluster flow for an existing cluster -is captured at https://redhat.invisionapp.com/share/8QCOEVEY9 +* Following changes required in UI dashboards based on UX designs mentioned at +https://redhat.invisionapp.com/share/8QCOEVEY9 + +** Add an option namely `Unmanage` under kebab menu for each successfully +imported and managed cluster + +** Add a dialog box which opens up on click event of `Unmanage` option from +kebab menu of the cluster. 
This dialog box is for confirmation from user to +start un-manage flow for the cluster + +===== Workflow + +* User clicks the `Unmanage` option from the kebab menu for a managed cluster + +* The click event triggers a dialog box with appropriate message. A sample +message is available at +https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239640 + +* There are 3 possible actions on this dialog + +** `Close` icon to close the dialog and no action performed for un-managing the +cluster. User would be directed back to clusters list page + +** `Cancel` button to close the dialog and no action performed for un-managing the +cluster. User would be directed back to clusters list page + +** `Unmanage` button to start the un-manage cluster task in backend. A message +with task details gets displayed on dialog box. Sample message available at +https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/273239844 + +** This final message after submission of the task for un-managing cluster would +also provide a button to view the task details. A button `View Task Progress` is +available for the same. 
User can opt to close this dialog and later user context +menus to check the task updates + +** Once a cluster is being moved to un-managed state, the changes in properties +listed for cluster are as below + +*** `Import Status` changed to `Unmanaging` + +*** `Is Managed` changed to `no` + +*** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden + +*** `View Details` link would be available to check the task details + +*** `Dashboard` button would be disabled + +*** Kebab menu for the un-managed cluster would be hidden + +** Once the un-manage cluster task gets completed a global notification gets +received + +** If task was successful, the state of the cluster would be changed to ready to +import + +If task failed due to some issues, the cluster details would listed as below in + +*** `Import Status` changed to `Unmanage failed` + +*** `Is managed` changed to `no` + +*** The columns `Volume Profiling`, `Volumes` and `Alerts` would be hidden + +*** `View Details` link would be available to check the errors + +*** `Dashboard` button would be disabled + +*** Kebab menu for the un-managed cluster would be hidden + === Security impact: @@ -138,7 +206,7 @@ None === Other end user impact: -User gets an option to un-mnaage an existing cluster and can re-import at later +User gets an option to un-manage an existing cluster and can re-import at later stage === Performance impact: @@ -167,6 +235,7 @@ None Primary assignee: shtripat mbukatov + a2batic === Work Items: @@ -179,7 +248,7 @@ None == Testing: -* Check if UI dashboard has an option to trigget un-manage cluster flow +* Check if UI dashboard has an option to trigger un-manage cluster flow * Check if the flow gets completed successfully and verify if the grafana dashboard reflects and cluster details available now for the selected cluster @@ -224,3 +293,7 @@ clearly mentioned in documents * https://github.com/Tendrl/commons/pull/798 * https://github.com/Tendrl/monitoring-integration/pull/317 + +* 
https://github.com/Tendrl/ui/issues/801 + +* https://github.com/Tendrl/api/issues/349 From 08cad58294db9ccdecc74a615d30474afcdf6939 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Thu, 8 Feb 2018 11:45:54 +0530 Subject: [PATCH 4/7] Added more details about API dependencies tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index c3c7c6a..07f42d8 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -90,13 +90,38 @@ UnmanageCluster: version: 1 ``` +* While import flow in progress the values of `cluster_job_id` and +`cluster_job_status` should be set with import job id and `Importing` +respectively + +* Once import flow is successful the value of `cluster_job_status` would be set +as `done` + +* If import flow fails the value of `cluster_job_status` would be set as +`Import failed` + +* While un-manage flow in progress the values of `cluster_job_id` and +`cluster_job_status` should be set with un-manage job id and `Unmanaging` +respectively + +* Once un-manage flow is successful the value of `cluster_job_status` would be +set as `done` + +* If un-manage flow fails the value of `cluster_job_status` would be set as +`Unmanage failed` + + === Alternatives None === Data model impact -None +* Change the fields `import_job_id` and `import_status` as `cluster_job_id` and +`cluster_job_status` respectively for cluster entity + +* The same fields would be updated with appropriate details while import and +un-manage flows on cluster === Impacted Modules: @@ -244,7 +269,7 @@ Primary assignee: == Dependencies: -None +* https://github.com/Tendrl/api/issues/349 == Testing: @@ -295,5 +320,3 @@ clearly mentioned in documents * https://github.com/Tendrl/monitoring-integration/pull/317 * https://github.com/Tendrl/ui/issues/801 - -* 
https://github.com/Tendrl/api/issues/349 From 1552da350a2a3a90cd7b3deebf59122e88d8a428 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Fri, 9 Feb 2018 09:46:45 +0530 Subject: [PATCH 5/7] Added details of data model changes for this flow tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index 07f42d8..5510d71 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -29,10 +29,12 @@ stage. The un-manage functionality in tendrl needs to take care of below things * Stop any services which got started as part of tendrl managing the storage nodes and disable the services + * Set the cluster state properly so that the same is marked and listed as un-managed in UI dashboards. No operations should be allowed on the un-managed cluster and there should not be any monitoring, alerting or entities management supported on this cluster anymore + * User should have an option to re-import the cluster if needed later and it should seamlessly work as usual @@ -90,25 +92,21 @@ UnmanageCluster: version: 1 ``` -* While import flow in progress the values of `cluster_job_id` and -`cluster_job_status` should be set with import job id and `Importing` -respectively +* While import flow in progress the values of `current_job` and `statuss` +should be set with `{'job_id': 'import job id', 'job_name': 'ImportCluster', +'status': 'in_progress'}` id and `Importing` respectively -* Once import flow is successful the value of `cluster_job_status` would be set -as `done` +* Once import flow is successful the value of `status` would be set as `done` -* If import flow fails the value of `cluster_job_status` would be set as -`Import failed` +* If import flow fails the value of `status` would be set as `failed` -* While un-manage flow in progress the values of 
`cluster_job_id` and -`cluster_job_status` should be set with un-manage job id and `Unmanaging` -respectively +* While un-manage flow in progress the values of `current_job` and `status` +should be set with `{'job_id': 'unmanage job id', 'job_name': 'UnmanageCluster', +'status': 'in_progress'}` and `Unmanaging` respectively -* Once un-manage flow is successful the value of `cluster_job_status` would be -set as `done` +* Once un-manage flow is successful the value of `status` would be set as `done` -* If un-manage flow fails the value of `cluster_job_status` would be set as -`Unmanage failed` +* If un-manage flow fails the value of `status` would be set as `failed` === Alternatives @@ -117,12 +115,19 @@ None === Data model impact -* Change the fields `import_job_id` and `import_status` as `cluster_job_id` and -`cluster_job_status` respectively for cluster entity +* Change the fields `import_job_id` and `import_status` as `current_job` and +`status` respectively for cluster entity * The same fields would be updated with appropriate details while import and un-manage flows on cluster +* The field `current_job` would maintain a dict containing `status`, `job_name` +and `job_id` for currently running job on cluster + +* The field `status` would maintain values like `importing`, `unmanaging`, `syncing` or `unknown` at a time.
This maintains any flows running status on the +cluster + === Impacted Modules: ==== Tendrl API impact: From 0887eb178a566ab3a9e0de26d5cd861749a49440 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Tue, 13 Feb 2018 20:45:51 +0530 Subject: [PATCH 6/7] Un-manage + import scenario in case of previous failed import case tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 34 +++++++++++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index 5510d71..ed0bc9c 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -10,6 +10,11 @@ The un-manage functionality is helpful for scenario where admin wants to bring down the cluster for some critical maintenance activities and doesn't want the monitoring etc to be performed for that period. +Also, in a scenario where cluster import fails, the user might need to +resolve the issues reported during the failed import and then re-import the cluster. +This flow would need an un-manage of the cluster first and then a fresh import +of the cluster. + == Problem description There are situations when admin needs some critical maintenance of the cluster @@ -21,6 +26,9 @@ Tendrl also should provide a provision to re-import the cluster at later stage if admin wants and the process should be quite seamless and no or very less manual intervention required for this job to be performed. +In case there is a failure in cluster import, tendrl needs to provide an option +to un-manage and import the cluster again.
+ == Use Cases @@ -38,6 +46,9 @@ supported on this cluster anymore * User should have an option to re-import the cluster if needed later and it should seamlessly work as usual +* User should have an option to un-manage a import failed cluster and import it +again in tendrl + == Proposed change @@ -49,6 +60,8 @@ collectd and tendrl-gluster-integration listed as un-managed in UI dashboards and all the possible actions could be disabled for it +* Delete cluster entity details from tendrl central store + * Archive the graphite (monitoring) data for the cluster in archive location so the grafana dashboards dont list the cluster and its entities anymore @@ -92,7 +105,7 @@ UnmanageCluster: version: 1 ``` -* While import flow in progress the values of `current_job` and `statuss` +* While import flow in progress the values of `current_job` and `status` should be set with `{'job_id': 'import job id', 'job_name': 'ImportCluster', 'status': 'in_progress'}` id and `Importing` respectively @@ -108,6 +121,21 @@ should be set with `{'job_id': 'unmanage job id', 'job_name': 'ImportCluster', * If un-manage flow fails the value of `status` would be set as `failed` +* If an import cluster fails tendrl UI needs to keep import cluster option open +and if user selects the option, it should throw a dialog telling about the +previous import failure and if user confirms to go ahead about un-manage and +then import the cluster, UI should submit an un-manage cluster first. If the +un-manage cluster task succeeds, then UI should submit a import for the same +cluster + +* UI needs to have client side storage option to retain the previous un-manage +cluster task-id for reference and for showing the details of the tasks in UI + +* So if there is an import failure for a cluster user tries import again for the +cluster after user confirmation UI submits two tasks one by one. One for +un-manage cluster and after success import cluster. 
UI should maintain both the +tasks details for detailing in UI + === Alternatives @@ -305,6 +333,10 @@ dashboards in grafana. Verify the same happens as expected * Verify that the final alert post un-manage flow, tells about removal of details from grafana dashboards and grafana alert dashboards +* Verify the scenario when a cluster import fails, and user is able to start +an un-manage + reimport cluster option from UI. UI should be able to list details +of both the tasks in this scenario + == Documentation impact: From b25a4b14b0a4711c225fb545102d7e0f74333714 Mon Sep 17 00:00:00 2001 From: Shubhendu Date: Thu, 15 Feb 2018 07:58:15 +0530 Subject: [PATCH 7/7] Added scenarios handled in UI in case of failures tendrl-bug-id: Tendrl/specifications#252 Signed-off-by: Shubhendu --- specs/unmanage_cluster.adoc | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/specs/unmanage_cluster.adoc b/specs/unmanage_cluster.adoc index ed0bc9c..38e2279 100644 --- a/specs/unmanage_cluster.adoc +++ b/specs/unmanage_cluster.adoc @@ -257,6 +257,20 @@ If task failed due to some issues, the cluster details would listed as below in *** Kebab menu for the un-managed cluster would be hidden +* If a previous import failed or cluster is in mis-configured state after import +(import failed with errors field not populated for cluster), the import and +un-manage both the options would be enabled in UI. If user selects the import +option now, it lands in import cluster view/page. If there was a previous import +failed, then a modal dialog shows up and the message would be something like `Import +cluster previously failed with . Before import, you need to correct the +issues and then un-manage the cluster`. This dialog has `Ok` and `Cancel` +buttons. + +* If un-manage fails, it would provide a tooltip/info with failure message `If +un-manage fails, resolve the issue and then try un-manage cluster again`.
It +would show a message saying `Unmanage Cluster` failed, with a `View Details` +hyperlink in the cluster list view. + === Security impact:
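
The `current_job` / `status` data-model changes introduced in the later patches of this series amount to a small state machine for flows running against a cluster. A minimal Python sketch of those transitions follows; the class and method names here are illustrative assumptions, not Tendrl's actual implementation:

```python
class ClusterState:
    """Models the cluster fields from the spec's "Data model impact" section."""

    FLOW_STATES = ("importing", "unmanaging", "syncing", "unknown")

    def __init__(self, integration_id):
        self.integration_id = integration_id
        self.is_managed = True
        self.current_job = None  # {'job_id': ..., 'job_name': ..., 'status': ...}
        self.status = "unknown"

    def start_flow(self, job_id, job_name, flow_state):
        # While a flow is in progress, current_job holds the running job and
        # the cluster-level status holds the flow state (e.g. 'unmanaging')
        if flow_state not in self.FLOW_STATES:
            raise ValueError("unexpected flow state: %s" % flow_state)
        self.current_job = {
            "job_id": job_id,
            "job_name": job_name,
            "status": "in_progress",
        }
        self.status = flow_state

    def finish_flow(self, succeeded):
        # On completion the job status becomes 'done' or 'failed'
        self.current_job["status"] = "done" if succeeded else "failed"
        if self.current_job["job_name"] == "UnmanageCluster":
            # Per the UI behaviour described in the spec, the cluster is
            # listed with "Is Managed: no" after an un-manage attempt,
            # whether the task succeeded or failed
            self.is_managed = False
        self.status = "unknown"


# Example run-through of a successful un-manage flow
cluster = ClusterState("11111111-2222-3333-4444-555555555555")
cluster.start_flow("job-42", "UnmanageCluster", "unmanaging")
print(cluster.status)                 # -> unmanaging
cluster.finish_flow(succeeded=True)
print(cluster.current_job["status"])  # -> done
print(cluster.is_managed)             # -> False
```

On success the cluster is left known-but-unmanaged and ready for re-import; on failure `current_job["status"]` becomes `failed`, matching the `Unmanage failed` listing the UI sections describe.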