Koozali.org formerly Contribs.org

DAR backup with NFS occasional failure

sages

DAR backup with NFS occasional failure
« on: February 28, 2020, 09:20:44 AM »
I'm running SME 9.2 in server-only mode, with the DAR backup to workstation writing to an NFS folder on the local network.
The NFS server is an openmediavault machine that is configured for WOL and autoshutdown.
Occasionally, probably every couple of weeks, the backup fails/stalls.
My initial thoughts were that the autoshutdown on the OMV server was occurring before the backup had completed and that I could resolve the issue by adjusting the autoshutdown parameters.
Further investigation is suggesting that this may not be the issue.

What I am seeing is a fail2ban update event around the time that the backup is failing.


Feb 28 01:40:27 fwbox esmith::event[25801]: Processing event: fail2ban-update
Feb 28 01:40:27 fwbox esmith::event[25801]: Running event handler: /etc/e-smith/events/actions/generic_template_expand
Feb 28 01:40:28 fwbox esmith::event[25801]: expanding /etc/rc.d/init.d/masq
Feb 28 01:40:28 fwbox esmith::event[25801]: generic_template_expand=action|Event|fail2ban-update|Action|generic_template_expand|Start|1582825227 878704|End|1582825228 935508|Elapsed|1.056804
Feb 28 01:40:28 fwbox esmith::event[25801]: Running event handler: /etc/e-smith/events/actions/adjust-services
Feb 28 01:40:29 fwbox esmith::event[25801]: adjusting non-supervised masq (adjust)
Feb 28 01:40:30 fwbox esmith::event[25801]: adjust-services=action|Event|fail2ban-update|Action|adjust-services|Start|1582825228 936387|End|1582825230 369579|Elapsed|1.433192
Feb 28 02:00:22 fwbox kernel: nfs: server 192.168.128.10 not responding, still trying
Feb 28 02:24:00 fwbox kernel: INFO: task dar:23036 blocked for more than 120 seconds.

I'm wondering if the resultant adjust-services is causing the NFS connection to break.

Q. Rather than wait potentially another couple of weeks for the stars to align to try and investigate this further, can I just run the adjust-services action from the command line whilst I have an NFS connection active to see if they are related?
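
For anyone wanting to try this kind of test, here is a minimal sketch in Python. It assumes the share is mounted at the hypothetical path /mnt/backup, that a Python 3 interpreter is available (it may not be on a stock SME 9.2), and that the event can be raised with SME's signal-event command; the event name is copied from the log above. It probes the mount, fires the event, then probes again. The probe runs in a child process with a timeout so that a hung NFS mount should not freeze the script itself.

```python
import subprocess
import sys

# Hypothetical mount point of the NFS share; adjust to the real one.
NFS_MOUNT = "/mnt/backup"
# Event name taken from the log above; signal-event is assumed to be on PATH.
EVENT_COMMAND = ["signal-event", "fail2ban-update"]


def mount_responds(path, timeout=15):
    """Return True if a statvfs() on path completes within timeout seconds."""
    probe = [sys.executable, "-c",
             "import os, sys; os.statvfs(sys.argv[1])", path]
    try:
        result = subprocess.run(probe, timeout=timeout,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False  # the probe hung, which is what a dead NFS server looks like
    return result.returncode == 0


def run_experiment(path=NFS_MOUNT, command=EVENT_COMMAND):
    """Probe the mount, run the command, probe again; return (before, after)."""
    before = mount_responds(path)
    subprocess.run(command, check=True)
    after = mount_responds(path)
    return before, after


# Usage on the SME box, as root, while the share is mounted:
# before, after = run_experiment()
```

If both values come back True, the event on its own is probably not what breaks the connection.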

Not overly stressed about this but I am interested in investigating this a bit more.
...

janet

Re: DAR backup with NFS occasional failure
« Reply #1 on: February 28, 2020, 12:23:28 PM »
sages

You do not say what time the backup is scheduled to start & whether it is already running & creating backup files at the time the error message occurs (ie nfs not responding).

The fail2ban event is at 1.40 whereas the nfs not responding message is at 2.00, 20 mins later, so I do not see them as related.
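
The gap can be checked directly from the log timestamps. A small sketch, assuming only the three timestamps shown in the log excerpt (syslog lines carry no year, so 2020 is supplied from the post dates):

```python
from datetime import datetime

# Timestamps copied from the log excerpt; syslog omits the year, so 2020
# (the year of the posts) is supplied explicitly.
EVENTS = [
    ("fail2ban-update finished", "Feb 28 01:40:30"),
    ("nfs server not responding", "Feb 28 02:00:22"),
    ("dar task blocked", "Feb 28 02:24:00"),
]


def parse_syslog_time(stamp, year=2020):
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp into a datetime."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_seconds(events):
    """Return the seconds elapsed between each consecutive pair of events."""
    times = [parse_syslog_time(stamp) for _, stamp in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


for (first, _), (second, _), gap in zip(EVENTS, EVENTS[1:],
                                        gaps_in_seconds(EVENTS)):
    print(f"{first} -> {second}: {gap // 60} min {gap % 60} s")
```

The first gap works out to just under 20 minutes, which is long for a direct cause and effect unless some timeout sits in between.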

If you think fail2ban is involved, then disable fail2ban & see if the backup runs OK then.
Any access attempts that fail2ban would identify & ban are still prevented from gaining access by SME Server, so it is still quite safe to run without fail2ban enabled.

Quote
Feb 28 02:00:22 fwbox kernel: nfs: server 192.168.128.10 not responding, still trying
Feb 28 02:24:00 fwbox kernel: INFO: task dar:23036 blocked for more than 120 seconds.

sages

Re: DAR backup with NFS occasional failure
« Reply #2 on: February 28, 2020, 01:15:36 PM »

Backup to workstation was running when it failed this time.
Usually files are created before the backup fails. However, whatever is causing the NFS connection to fail isn't occurring at a fixed time, nor, so far, is it predictable. So it could fail anywhere between the WOL to the NFS server and any stage of the backup process.

The log entries above are as a result of transferring my fault finding from the openmediavault NFS server to the SME server. And in this instance files had been written to the NFS server. As I mentioned my original thoughts were directed to the autoshutdown on the OMV box and I neglected to properly inspect the SME logs.
 
I have a system with:
1/ fail2ban being triggered whenever an outside source trips the conditions, i.e. effectively random, and possibly nothing to do with my issue.
2/ DAR backup running (fixed time, so easy to plan for).
3/ NFS service running on an external OMV server with an autoshutdown service that is (maybe) shutting down too early as a result of how it is configured, i.e. it isn't seeing enough activity from my SME connection during the backup and decides to shut down. Pseudo-random.

Yes, I could disable fail2ban. However as the error is occurring seemingly randomly sometime every couple of weeks I could spend a lot of time waiting for the gods to line up and create the failure condition.
If running the masq expand event doesn't cause the fault then I suppose I could try and trip fail2ban with a simulated attack.

To try and avoid spending several weeks hoping for the gods to line up I'm now looking to have a 'controlled environment' where I can logically try some known combinations of events.
...

sages

Re: DAR backup with NFS occasional failure
« Reply #3 on: February 28, 2020, 01:34:24 PM »
Quote
The fail2ban event is at 1.40 whereas the nfs not responding message is at 2.00, 20 mins later, so I do not see them as related.

Fair enough. I considered that there may be some timeouts involved but haven't confirmed.

From observations, DAR accesses the backup files at varying intervals after the actual backup slices are completed and before the catalog is updated. But I'd have to observe it in operation to see if it's in the order of 20 minutes.
...

TerryF

Re: DAR backup with NFS occasional failure
« Reply #4 on: February 29, 2020, 01:09:35 AM »
Just thinking out loud: what ethernet cards are at both ends? Are there any reported issues, any firmware updates?

I once had a similar issue, with backups failing randomly; it ended up being an ethernet cable plug that was problematic.

--
qui scribit bis legit

sages

Re: DAR backup with NFS occasional failure
« Reply #5 on: February 29, 2020, 01:33:00 AM »
Ethernet cards are just whatever is on the respective MB's.
I've had the SME box for several years with no known network connectivity issues; the openmediavault box and the use of NFS are more recent, ~6 months.
I know what you mean regarding the cables though. I'll work my way through the whole process to see if I can isolate the cause.
My original question was just one part of that.
...

sages

Re: DAR backup with NFS occasional failure
« Reply #6 on: February 29, 2020, 03:28:56 AM »
It appears that the TCP connection for NFS has an idle timeout.
During the backup there are varying periods when a process is occurring on the SME machine with no traffic to the NFS mount. This shouldn't cause an issue as the TCP connection is reestablished once the SME requires access to the NFS share.
In my use case the autoshutdown settings on my OMV machine check several items: a TCP connection to the NFS port, CPU load, network traffic and hard drive access.
After several checks over a period of time it will do an autoshutdown.
So it looks like everything is working as it was designed.
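
To make the interaction concrete, here is a simplified model of that kind of check loop. It is a sketch only, not the OMV autoshutdown code: the field names and the "three consecutive idle checks" threshold are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One periodic check on the NFS server (all fields are illustrative)."""
    nfs_tcp_connected: bool
    cpu_busy: bool
    network_busy: bool
    disk_busy: bool

    def is_idle(self):
        return not (self.nfs_tcp_connected or self.cpu_busy
                    or self.network_busy or self.disk_busy)


def shutdown_check(samples, idle_checks_needed=3):
    """Return the 1-based index of the check that triggers shutdown, or None.

    Shutdown fires once idle_checks_needed consecutive samples are idle;
    any activity resets the counter.
    """
    consecutive_idle = 0
    for index, sample in enumerate(samples, start=1):
        consecutive_idle = consecutive_idle + 1 if sample.is_idle() else 0
        if consecutive_idle >= idle_checks_needed:
            return index
    return None


busy = Sample(True, False, True, True)       # backup slice being written
quiet = Sample(False, False, False, False)   # SME working locally, NFS connection idled out

# The backup is still running, but a long quiet spell looks like idleness.
timeline = [busy, busy, quiet, quiet, quiet, busy]
print(shutdown_check(timeline))
```

In this model the server shuts down during the quiet spell even though the backup has not finished, which matches the behaviour described above.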
I'm looking at what I can use as either an alternative trigger for the autoshutdown or a keep-alive for the connection.
And I tried running the fail2ban-update and it didn't impact the connection.
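
One possible shape for a keep-alive is a small heartbeat that rewrites a file on the share while the backup runs. A minimal sketch, assuming Python 3 is available and that the autoshutdown checks would count this traffic as activity (not verified here); the file name and interval are hypothetical:

```python
import os
import threading
import time


def heartbeat(directory, stop_event, interval=60.0,
              filename=".backup-heartbeat"):
    """Rewrite a small file in directory every interval seconds until stopped."""
    path = os.path.join(directory, filename)
    while not stop_event.is_set():
        with open(path, "w") as handle:
            handle.write(f"{time.time()}\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write out to the server
        stop_event.wait(interval)      # sleeps, but wakes early when stopped


def run_with_heartbeat(directory, work, interval=60.0):
    """Call work() while a background heartbeat touches the share."""
    stop_event = threading.Event()
    worker = threading.Thread(target=heartbeat,
                              args=(directory, stop_event, interval),
                              daemon=True)
    worker.start()
    try:
        return work()
    finally:
        stop_event.set()
        worker.join(timeout=5)  # do not wait forever if the share has hung
```

Here work would be whatever launches the backup, for example a function that calls the backup command and waits for it to finish.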
...

TerryF

Re: DAR backup with NFS occasional failure
« Reply #7 on: February 29, 2020, 12:42:10 PM »
I am no dev/code writing guru, but, :-) seems you just need a keepalive message from the sme instance to the NFS one..

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
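
For reference, the mechanism that HOWTO describes looks roughly like this on a socket a program owns. This is only an illustration: the kernel NFS client manages its own sockets, so these options cannot be applied to an NFS mount from user space, and the timing values are arbitrary examples.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive for sock and, where supported, tune its timing.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between probes
    probes:   unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing options are Linux-specific, so guard for them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```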

sages

Re: DAR backup with NFS occasional failure
« Reply #8 on: February 29, 2020, 02:28:07 PM »
Quote
I am no dev/code writing guru, but, :-) seems you just need a keepalive message from the sme instance to the NFS one..

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

Nice one thanks.

I'll investigate this further.
I'm currently using a baseball bat approach: as I know what the backup periods are, I've configured a 'disable autoshutdown' window over the backup period on the openmediavault server. Not ideal as it relies on a timer, but it should give me time to sort it out properly.
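
The timer logic amounts to a time-of-day window test. A minimal sketch with hypothetical window times (the real backup schedule is not stated in this thread):

```python
import datetime

# Hypothetical window; set these to bracket the real backup schedule.
WINDOW_START = datetime.time(1, 0)
WINDOW_END = datetime.time(4, 0)


def in_backup_window(now, start=WINDOW_START, end=WINDOW_END):
    """Return True if the time-of-day now falls inside [start, end).

    Handles windows that wrap past midnight (for example 23:00 to 02:00).
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# Autoshutdown would be skipped whenever this prints True.
print(in_backup_window(datetime.datetime.now().time()))
```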
...