[Nfd-dev] Emergency

Sun Aug 27 08:05:07 PDT 2017

All:

Well, having said all of that, we actually had a crash on most of the Testbed nodes overnight.
I have located one log file with DEBUG turned on that has a FATAL error in it.
I have it available here:

https://www.arl.wustl.edu/~jdd/REMAP_CRASH/nfd.log.10.gz

I am still going through it but one obivious thing is lines like this one:

1503809296.363506 DEBUG: [Forwarder] onIncomingInterest face=272 interest=/ndn/multicast/.../.../... <snip 15950 characters> ../.../.../.../.../.../.../.../.../…

Note that I removed 15950 characters to get it to fit into this email.
So, nfd is reporting a interest /ndn/multicast/…/…/   then with repeating “…/“ for over 16000 characters.

The crashes all seemed to occur about 12:15AM Central time (11:15 Mountain, 10:15 Pacific, …).
Althought there are a few machines where it happened about 1 hour 15 minutes earlier.  So, perhaps there
were multiple times it happened, but I just have one in a log file.

Any thoughts?

John

On Aug 26, 2017, at 1:47 PM, Dehart, John <jdd at wustl.edu<mailto:jdd at wustl.edu>> wrote:

All:

I have not been able to recreate the problem we had during the SIGCOMM NDN Tutorial.

There were three things going on that might have had some impact.
1. I had all my AWS instances up for a long time (> 12 hours) waiting for the demonstrations.
2. Haitao was doing some work with the namespace-tree app.
3. Junxiao was doing some work with his tunneling app.
But so far none of those seems to cause any problems now.
It is possible that is some combination of those three, but that will take some time to re-create.

So, I would like to see if anyone has any insight into what may have happened.
Here are some details from the log files I saved from the WU gateway node:

ndnops at wundngw:/var/log/ndn/SIGCOMM_SNAFU$ grep -B 1 FATAL *
nfd.log.2-1503351121.547045 INFO: [RibManager] Adding route /ndn/edu/uiuc nexthop=275 origin=nlsr cost=42464
nfd.log.2:1503351231.628809 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.2-1503352381.309113 INFO: [RibManager] Removing route /ndn/edu/ucla/%C1.Router/cs/spurs nexthop=275 origin=nlsr
nfd.log.2:1503352491.336944 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.2-1503352888.689848 INFO: [FibManager] add-nexthop result: FAIL reason: unknown-faceid: 334
nfd.log.2:1503352998.742155 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.3-1503300702.536604 INFO: [RibManager] Adding route /ndn/br/ufpa/%C1.Router/hostname nexthop=277 origin=nlsr cost=41860
nfd.log.3:1503300711.619102 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503257126.710786 INFO: [RibManager] Adding route /ndn/edu/memphis nexthop=8699 origin=nlsr cost=21465
nfd.log.4:1503257278.816893 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503259091.220589 INFO: [AutoPrefixPropagator] no signing identity available for: /ndn/edu/arizona
nfd.log.4:1503259201.264904 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503260778.081639 INFO: [RibManager] Removing route /ndn/de/uni-goettingen/%C1.Router/ndntestbed nexthop=276 origin=nlsr
nfd.log.4:1503260888.015256 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503268049.795983 INFO: [FaceTable] Removed face id=350 remote=fd://39 local=unix:///run/nfd.sock
nfd.log.4:1503268216.088842 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503269941.426706 INFO: [RibManager] Removing route /ndn/edu/memphis nexthop=275 origin=nlsr
nfd.log.4:1503270051.260809 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4.snip:1503257278.816893 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4.snip-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4.snip:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
ndnops at wundngw:/var/log/ndn/SIGCOMM_SNAFU$

And let me highlight one of those which seems typical:

--
nfd.log.4-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
—

And here is a little more of the log file from that one:
1503257410.676895 INFO: [UnicastUdpTransport] [id=0,local=udp4://128.252.153.194:6363,remote=udp4://141.225.11.173:6363] Creating transport
1503257410.676988 INFO: [FaceTable] Added face id=277 remote=udp4://141.225.11.173:6363 local=udp4://128.252.153.194:6363
1503257664.566506 INFO: [UnicastUdpTransport] [id=0,local=udp4://128.252.153.194:6363,remote=udp4://193.147.51.186:6363] Creating transport
1503257664.566544 INFO: [FaceTable] Added face id=278 remote=udp4://193.147.51.186:6363 local=udp4://128.252.153.194:6363
1503257664.570254 INFO: [UnixStreamTransport] [id=0,local=unix:///run/nfd.sock,remote=fd://38] Creating transport
1503257664.570280 INFO: [FaceTable] Added face id=279 remote=fd://38 local=unix:///run/nfd.sock
1503257664.570940 INFO: [RibManager] Adding route /ndn/broadcast nexthop=277 origin=autoreg cost=255
1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
1503257919.686039 INFO: [CsPolicy] setLimit 10

That last message is the start of a restart of NFD.

What I see here is that we are trying to add two faces, 278 and 279 and then add a route for /ndn/broadcast and then a route
for /ndn/multicast and then 110 seconds later we get a FATAL error.
I’ve looked at the NFD code a little bit and it does seem that the error is as others suggested probably related to a timeout in the RibManager.

And if I look at this code from fib-updater.cpp: FibUpdater::onUpdateError():

  if (code == ndn::nfd::Controller::ERROR_TIMEOUT && nTimeouts < MAX_NUM_TIMEOUTS) {
    sendAddNextHopUpdate(update, onSuccess, onFailure, ++nTimeouts);
  }

It would make sense that it would keep trying until the 11th time and so if the timeout was 10 seconds
this would all make sense. I haven’t found where the timeout period is but I’m guessing it is 10 seconds.
And this seems to agree with the FIB Updater section of the NFD dev-guide.

So, then the question is, what would cause the RibManager to fail 10 successive times in trying to do a Fib Update?

Any thoughts?

John

_______________________________________________
Nfd-dev mailing list
Nfd-dev at lists.cs.ucla.edu<mailto:Nfd-dev at lists.cs.ucla.edu>
http://www.lists.cs.ucla.edu/mailman/listinfo/nfd-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.lists.cs.ucla.edu/pipermail/nfd-dev/attachments/20170827/aebedb22/attachment-0001.html>