Issue #35570
Hâpy 2.9: onenodestart service fails when the Hâpy Node server is rebooted
Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
09/15/2023
Due date:
% Done:
0%
Description
This does not seem to prevent the VMs hosted on the node from restarting; maybe they simply take too long to resume?
Test squash : https://dev-eole.ac-dijon.fr/squash/executions/14706
root@hapy-node:~# diagnose
*** Test du module hapy-node version 2.9.0 (hapy-node 0000000A) ***
Attention, serveur opérationnel mais des services ne sont pas démarrés :
 ● onenodestart.service loaded
root@hapy-node:~# systemctl status onenodestart
× onenodestart.service - OpenNebula Node starter
     Loaded: loaded (/lib/systemd/system/onenodestart.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-15 17:52:00 CEST; 1min 58s ago
    Process: 849 ExecStart=/usr/share/eole/sbin/onevm-all -c ${CREDS} -a resume (code=exited, status=255/EXCEPTION)
   Main PID: 849 (code=exited, status=255/EXCEPTION)
        CPU: 1.403s

sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/vendor_ruby/rubygems/specification.rb:1671: warning: previous definition of DateTimeFormat was here
sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb:12: warning: already initialized constant Kernel::RUBYGEMS_ACTIVATION_MONITOR
sept. 15 17:50:54 hapy-node onevm-all[849]: /usr/lib/ruby/vendor_ruby/rubygems/core_ext/kernel_require.rb:12: warning: previous definition of RUBYGEMS_ACTIVATION_MONITOR was here
sept. 15 17:50:58 hapy-node onevm-all[849]: Resume 16 - ttylinux-16... scheduled
sept. 15 17:50:58 hapy-node onevm-all[849]: Resume 15 - ttylinux-15... scheduled
sept. 15 17:52:00 hapy-node onevm-all[849]: Wait 60s for VMs to resume............................................................. FAIL
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Main process exited, code=exited, status=255/EXCEPTION
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Failed with result 'exit-code'.
sept. 15 17:52:00 hapy-node systemd[1]: Failed to start OpenNebula Node starter.
sept. 15 17:52:00 hapy-node systemd[1]: onenodestart.service: Consumed 1.403s CPU time.
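Note that the resume operations themselves are reported as scheduled; it is the 60 s wait for the VMs to reach the RUNNING state that times out and makes the unit exit with status 255. A minimal sketch of that kind of wait, done by hand with the standard OpenNebula CLI (the real onevm-all is a Ruby script; the use of the XML output, the LCM_STATE value 3 for RUNNING and the 60 s budget are assumptions of this sketch, and the shell is assumed to have working OpenNebula credentials like the unit's ${CREDS}):

# Sketch only, not the onevm-all implementation: poll the frontend until
# every VM reports LCM_STATE 3 (RUNNING), give up after 60 seconds.
for _ in $(seq 60); do
    total=$(onevm list --xml | grep -c '<LCM_STATE>')
    running=$(onevm list --xml | grep -c '<LCM_STATE>3</LCM_STATE>')
    [ "$total" -gt 0 ] && [ "$total" -eq "$running" ] && { echo "all VMs resumed"; break; }
    sleep 1
done
echo "VMs running: ${running:-0}/${total:-0}"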
History
#1 Updated by Joël Cuissinat 18 days ago
If I reboot the Hâpy server, it is the onenode service that ends up failing:
root@hapy:~# systemctl status onenode
× onenode.service - OpenNebula Node starter
     Loaded: loaded (/lib/systemd/system/onenode.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-15 18:10:04 CEST; 1min 49s ago
    Process: 3089 ExecStart=/usr/share/eole/sbin/onevm-all -c ${CREDS} -a resume (code=exited, status=255/EXCEPTION)
   Main PID: 3089 (code=exited, status=255/EXCEPTION)
        CPU: 1.157s

sept. 15 18:09:01 hapy onevm-all[3089]: /usr/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb:12: warning: already initialized constant Kernel::RUBYGEMS_ACTIVATION_MONITOR
sept. 15 18:09:01 hapy onevm-all[3089]: /usr/lib/ruby/vendor_ruby/rubygems/core_ext/kernel_require.rb:12: warning: previous definition of RUBYGEMS_ACTIVATION_MONITOR was here
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 14 - ttylinux-14... scheduled
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 4 - Eolebase FI 2.8.1-4... scheduled
sept. 15 18:09:02 hapy onevm-all[3089]: Resume 3 - install-eole-2.9.0-amd64-3... scheduled
sept. 15 18:10:04 hapy onevm-all[3089]: Wait 60s for VMs to resume............................................................. FAIL
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Main process exited, code=exited, status=255/EXCEPTION
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Failed with result 'exit-code'.
sept. 15 18:10:04 hapy systemd[1]: Failed to start OpenNebula Node starter.
sept. 15 18:10:04 hapy systemd[1]: onenode.service: Consumed 1.157s CPU time.
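To check whether the VMs do come back on their own after the unit has already failed (as the description suggests), a quick manual inspection right after boot could look like the following; /var/lib/one/.one/one_auth is only an assumed value for the ${CREDS} variable used by the unit:

# Unit log for the current boot, then current VM states (STAT column: runn, susp, boot, ...)
journalctl -b -u onenode --no-pager
onevm list
# Re-run the exact command launched by the unit, by hand, to watch it live
# (credentials path below is an assumption; the unit reads it from ${CREDS})
/usr/share/eole/sbin/onevm-all -c /var/lib/one/.one/one_auth -a resume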
#2 Updated by Daniel Dehennin 15 days ago
While starting an aca.hapy-2.9.0-instance-AvecImport, I get this in the `/var/log/one/0.log` logs:
Thu Aug 31 22:04:21 2023 [Z0][VM][I]: New LCM state is SAVE_SUSPEND
Thu Aug 31 22:04:22 2023 [Z0][VMM][E]: save: Command "virsh --connect qemu+tcp://localhost/system save 77e31ad3-0226-40f4-83b9-06cf4d0e1217 /var/lib/one//datastores/100/0/checkpoint" failed: error: failed to connect to the hypervisor
error: End of file while reading data: Input/output error
Could not save 77e31ad3-0226-40f4-83b9-06cf4d0e1217 to /var/lib/one//datastores/100/0/checkpoint
Thu Aug 31 22:04:22 2023 [Z0][VMM][I]: ExitCode: 0
Thu Aug 31 22:04:22 2023 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
[…]
Mon Sep 18 09:17:07 2023 [Z0][VMM][D]: Message received: RESTORE FAILURE 0 ERROR: restore: Command "set -e -o pipefail

# extract the xml from the checkpoint
virsh --connect qemu+tcp://localhost/system save-image-dumpxml /var/lib/one//datastores/100/0/checkpoint > /var/lib/one//datastores/100/0/checkpoint.xml

# Replace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
# DS_ID where the checkpoint is placed. This is done in case there was a
# system DS migration
sed -i "s%/var/lib/one//datastores/[0-9]\+/0/%/var/lib/one//datastores/100/0/%g" /var/lib/one//datastores/100/0/checkpoint.xml
sed -i "s%/var/lib/one/datastores/[0-9]\+/0/%/var/lib/one//datastores/100/0/%g" /var/lib/one//datastores/100/0/checkpoint.xml" failed: error: operation failed: failed to read qemu header
Could not recalculate paths in /var/lib/one//datastores/100/0/checkpoint.xml
ExitCode: 1
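The restore error ("failed to read qemu header") is consistent with the earlier failure of the save command at suspend time: virsh could not reach the hypervisor, so the checkpoint file was most likely never written correctly even though the driver reported ExitCode: 0. A quick check on the node, with the paths taken from the log above, could be:

# Is the checkpoint there, and does it look like a libvirt/QEMU save image?
ls -lh /var/lib/one/datastores/100/0/checkpoint
file /var/lib/one/datastores/100/0/checkpoint
# Try to extract the domain XML exactly as the restore script does
virsh --connect qemu+tcp://localhost/system save-image-dumpxml \
      /var/lib/one/datastores/100/0/checkpoint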
It looks like the suspend/resume feature has a problem.
#3 Updated by Daniel Dehennin 15 days ago
We should check whether the SUSPEND feature works with a SHARED transfer method on an FS-type datastore; maybe it is only available with the QCOW2 transfer method, I don't remember… One way to check which transfer method the system datastore actually uses is shown below.
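Standard OpenNebula CLI; the datastore ID 100 is taken from the paths in the log above:

# TM_MAD is the transfer method (shared, qcow2, ssh, ...); TYPE/DS_MAD describe the datastore itself
onedatastore show 100 | grep -E 'TM_MAD|DS_MAD|TYPE'
onedatastore list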