Request #35570

Hâpy 2.9: onenodestart service fails when the Hâpy Node server reboots

Added by Joël Cuissinat 9 months ago. Updated 9 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: 09/15/2023
Due date:
% Done: 0%


Description

This does not seem to prevent the VMs hosted on the node from restarting. Could they simply be taking too long?

Squash test: https://dev-eole.ac-dijon.fr/squash/executions/14706

root@hapy-node:~# diagnose 
*** Test du module hapy-node version 2.9.0 (hapy-node 0000000A) ***

Attention, serveur opérationnel mais des services ne sont pas démarrés :

● onenodestart.service loaded
root@hapy-node:~# systemctl status  onenodestart 
× onenodestart.service - OpenNebula Node starter
     Loaded: loaded (/lib/systemd/system/onenodestart.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-15 17:52:00 CEST; 1min 58s ago
    Process: 849 ExecStart=/usr/share/eole/sbin/onevm-all -c ${CREDS} -a resume (code=exited, status=255/EXCEPTION)
   Main PID: 849 (code=exited, status=255/EXCEPTION)
        CPU: 1.403s

sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/vendor_ruby/rubygems/specification.rb:1671: warning: previous definition of DateTimeFormat was here
sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb:12: warning: already initialized constant Kernel::RUBYGEMS_ACTIVATION_MONITOR
sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/vendor_ruby/rubygems/core_ext/kernel_require.rb:12: warning: previous definition of RUBYGEMS_ACTIVATION_MONITOR was here
sept. 15 17:50:58 hapy-node onevm-all[849]: Resume 16 - ttylinux-16... scheduled
sept. 15 17:50:58 hapy-node onevm-all[849]: Resume 15 - ttylinux-15... scheduled
sept. 15 17:52:00 hapy-node onevm-all[849]: Wait 60s for VMs to resume............................................................. FAIL
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Main process exited, code=exited, status=255/EXCEPTION
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Failed with result 'exit-code'.
sept. 15 17:52:00 hapy-node systemd[1]: Failed to start OpenNebula Node starter.
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Consumed 1.403s CPU time.
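
The `Wait 60s for VMs to resume... FAIL` line suggests that the unit fails on a timeout in a polling loop, after the resume requests themselves were accepted ("scheduled"). Below is a minimal Python sketch of that kind of wait-with-deadline logic. It is not the code of `onevm-all`; `wait_for_vms`, the `get_state` callback and the `"RUNNING"` state name are hypothetical and only serve the illustration.

```python
import time
from typing import Callable, Iterable


def wait_for_vms(
    vm_ids: Iterable[int],
    get_state: Callable[[int], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every VM reports "RUNNING" or the timeout expires.

    get_state is a hypothetical callback returning a state name for a VM id.
    Returns True when all VMs are running, False when the deadline is reached.
    """
    pending = set(vm_ids)
    deadline = clock() + timeout
    while pending:
        # Keep only the VMs that are not running yet.
        pending = {vm for vm in pending if get_state(vm) != "RUNNING"}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

A caller that exits with a non-zero status when such a function returns `False` would produce the failure seen above both when the VMs need more than 60 seconds and when they never resume at all.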

History

#1 Updated by Joël Cuissinat 9 months ago

If I reboot the Hâpy server, it is the onenode service that fails:

root@hapy:~# systemctl status onenode
× onenode.service - OpenNebula Node starter
     Loaded: loaded (/lib/systemd/system/onenode.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-15 18:10:04 CEST; 1min 49s ago
    Process: 3089 ExecStart=/usr/share/eole/sbin/onevm-all -c ${CREDS} -a resume (code=exited, status=255/EXCEPTION)
   Main PID: 3089 (code=exited, status=255/EXCEPTION)
        CPU: 1.157s

sept. 15 18:09:01 hapy onevm-all[3089]: /usr/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb:12: warning: already initialized constant Kernel::RUBYGEMS_ACTIVATION_MONITOR
sept. 15 18:09:01 hapy onevm-all[3089]: /usr/lib/ruby/vendor_ruby/rubygems/core_ext/kernel_require.rb:12: warning: previous definition of RUBYGEMS_ACTIVATION_MONITOR was here
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 14 - ttylinux-14... scheduled
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 4 - Eolebase FI 2.8.1-4... scheduled
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 3 - install-eole-2.9.0-amd64-3... scheduled
sept. 15 18:10:04 hapy onevm-all[3089]: Wait 60s for VMs to resume............................................................. FAIL
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Main process exited, code=exited, status=255/EXCEPTION
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Failed with result 'exit-code'.
sept. 15 18:10:04 hapy systemd[1]: Failed to start OpenNebula Node starter.
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Consumed 1.157s CPU time.

#2 Updated by Daniel Dehennin 9 months ago

When starting an aca.hapy-2.9.0-instance-AvecImport, I get this in the `/var/log/one/0.log` log:

Thu Aug 31 22:04:21 2023 [Z0][VM][I]: New LCM state is SAVE_SUSPEND
Thu Aug 31 22:04:22 2023 [Z0][VMM][E]: save: Command "virsh --connect qemu+tcp://localhost/system save 77e31ad3-0226-40f4-83b9-06cf4d0e1217 /var/lib/one//datastores/100/0/checkpoint" failed: error: failed to connect to the hypervisor error: End of file while reading data: Input/output error Could not save 77e31ad3-0226-40f4-83b9-06cf4d0e1217 to /var/lib/one//datastores/100/0/checkpoint
Thu Aug 31 22:04:22 2023 [Z0][VMM][I]: ExitCode: 0
Thu Aug 31 22:04:22 2023 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.

[…]
Mon Sep 18 09:17:07 2023 [Z0][VMM][D]: Message received: RESTORE FAILURE 0 ERROR: restore: Command "set -e -o pipefail  # extract the xml from the checkpoint  virsh --connect qemu+tcp://localhost/system save-image-dumpxml /var/lib/one//datastores/100/0/checkpoint > /var/lib/one//datastores/100/0/checkpoint.xml  # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific # DS_ID where the checkpoint is placed. This is done in case there was a # system DS migration  sed -i "s%/var/lib/one//datastores/[0-9]\+/0/%/var/lib/one//datastores/100/0/%g" /var/lib/one//datastores/100/0/checkpoint.xml sed -i "s%/var/lib/one/datastores/[0-9]\+/0/%/var/lib/one//datastores/100/0/%g" /var/lib/one//datastores/100/0/checkpoint.xml" failed: error: operation failed: failed to read qemu header Could not recalculate paths in /var/lib/one//datastores/100/0/checkpoint.xml ExitCode: 1

The suspend/resume feature appears to have a problem.
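
In the excerpt above, the `save` command fails to connect to the hypervisor yet the driver reports `ExitCode: 0`, and the later restore fails while reading the checkpoint. To spot such lines quickly in a VM log, a small filter along these lines might help. This is only a sketch: the regular expression is inferred from the few log lines quoted here, and `find_problems` and `LogEntry` are hypothetical names.

```python
import re
from typing import Iterable, List, NamedTuple

# Line layout inferred from the excerpt, e.g.
# "Thu Aug 31 22:04:22 2023 [Z0][VMM][E]: save: Command ..."
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\[(?P<zone>Z\d+)\]\[(?P<module>\w+)\]\[(?P<level>[A-Z])\]: "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    zone: str
    module: str
    level: str
    message: str


def find_problems(lines: Iterable[str]) -> List[LogEntry]:
    """Return entries logged at level E or whose message mentions FAILURE."""
    problems = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a log line in the expected layout
        if match["level"] == "E" or "FAILURE" in match["message"]:
            problems.append(LogEntry(**match.groupdict()))
    return problems
```

It could be fed an open file, for example `find_problems(open("/var/log/one/0.log"))`. The check on `FAILURE` is there because the restore error above is logged at level `D`, not `E`.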

#3 Updated by Daniel Dehennin 9 months ago

We should check whether the SUSPEND feature works with the SHARED transfer method on an FS-type datastore; it may only be available with the QCOW2 transfer method, I no longer remember…
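
As a first step for that check, the drivers configured on the datastore can be read from its XML description, such as the output of `onedatastore show 100 -x` (datastore 100 being the one that appears in the checkpoint paths above). The sketch below only parses such XML. The sample document is made up for illustration and is not taken from the affected server, and `datastore_drivers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def datastore_drivers(xml_text: str) -> Dict[str, str]:
    """Extract ID, NAME, DS_MAD and TM_MAD from a datastore XML document."""
    root = ET.fromstring(xml_text)
    return {
        tag: (root.findtext(tag) or "").strip()
        for tag in ("ID", "NAME", "DS_MAD", "TM_MAD")
    }


# Illustrative sample only (assumed values, not real output from the server).
SAMPLE_XML = """
<DATASTORE>
  <ID>100</ID>
  <NAME>example-system-ds</NAME>
  <DS_MAD>-</DS_MAD>
  <TM_MAD>shared</TM_MAD>
</DATASTORE>
"""

if __name__ == "__main__":
    print(datastore_drivers(SAMPLE_XML))
```

Comparing the `TM_MAD` value of a datastore where suspend/resume works with one where it fails would help confirm or rule out the transfer method as the cause.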
