[Nfd-dev] Emergency

Dehart, John jdd at wustl.edu
Sat Aug 26 11:47:39 PDT 2017


All:

I have not been able to recreate the problem we had during the SIGCOMM NDN Tutorial.

There were three things going on that might have had some impact.
1. I had all my AWS instances up for a long time (> 12 hours) waiting for the demonstrations.
2. Haitao was doing some work with the namespace-tree app.
3. Junxiao was doing some work with his tunneling app.
But so far none of those seems to cause any problems now.
It is possible that is some combination of those three, but that will take some time to re-create.

So, I would like to see if anyone has any insight into what may have happened.
Here are some details from the log files I saved from the WU gateway node:

ndnops at wundngw:/var/log/ndn/SIGCOMM_SNAFU$ grep -B 1 FATAL *
nfd.log.2-1503351121.547045 INFO: [RibManager] Adding route /ndn/edu/uiuc nexthop=275 origin=nlsr cost=42464
nfd.log.2:1503351231.628809 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.2-1503352381.309113 INFO: [RibManager] Removing route /ndn/edu/ucla/%C1.Router/cs/spurs nexthop=275 origin=nlsr
nfd.log.2:1503352491.336944 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.2-1503352888.689848 INFO: [FibManager] add-nexthop result: FAIL reason: unknown-faceid: 334
nfd.log.2:1503352998.742155 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.3-1503300702.536604 INFO: [RibManager] Adding route /ndn/br/ufpa/%C1.Router/hostname nexthop=277 origin=nlsr cost=41860
nfd.log.3:1503300711.619102 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503257126.710786 INFO: [RibManager] Adding route /ndn/edu/memphis nexthop=8699 origin=nlsr cost=21465
nfd.log.4:1503257278.816893 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503259091.220589 INFO: [AutoPrefixPropagator] no signing identity available for: /ndn/edu/arizona
nfd.log.4:1503259201.264904 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503260778.081639 INFO: [RibManager] Removing route /ndn/de/uni-goettingen/%C1.Router/ndntestbed nexthop=276 origin=nlsr
nfd.log.4:1503260888.015256 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503268049.795983 INFO: [FaceTable] Removed face id=350 remote=fd://39 local=unix:///run/nfd.sock
nfd.log.4:1503268216.088842 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4-1503269941.426706 INFO: [RibManager] Removing route /ndn/edu/memphis nexthop=275 origin=nlsr
nfd.log.4:1503270051.260809 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4.snip:1503257278.816893 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
--
nfd.log.4.snip-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4.snip:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
ndnops at wundngw:/var/log/ndn/SIGCOMM_SNAFU$


And let me highlight one of those which seems typical:

--
nfd.log.4-1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
nfd.log.4:1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
—

And here is a little more of the log file from that one:
1503257410.676895 INFO: [UnicastUdpTransport] [id=0,local=udp4://128.252.153.194:6363,remote=udp4://141.225.11.173:6363] Creating transport
1503257410.676988 INFO: [FaceTable] Added face id=277 remote=udp4://141.225.11.173:6363 local=udp4://128.252.153.194:6363
1503257664.566506 INFO: [UnicastUdpTransport] [id=0,local=udp4://128.252.153.194:6363,remote=udp4://193.147.51.186:6363] Creating transport
1503257664.566544 INFO: [FaceTable] Added face id=278 remote=udp4://193.147.51.186:6363 local=udp4://128.252.153.194:6363
1503257664.570254 INFO: [UnixStreamTransport] [id=0,local=unix:///run/nfd.sock,remote=fd://38] Creating transport
1503257664.570280 INFO: [FaceTable] Added face id=279 remote=fd://38 local=unix:///run/nfd.sock
1503257664.570940 INFO: [RibManager] Adding route /ndn/broadcast nexthop=277 origin=autoreg cost=255
1503257664.621400 INFO: [RibManager] Adding route /ndn/multicast nexthop=277 origin=autoreg cost=255
1503257774.617839 FATAL: [NFD] Non-recoverable error: request timed out code: 10060
1503257919.686039 INFO: [CsPolicy] setLimit 10

That last message is the start of a restart of NFD.


What I see here is that we are trying to add two faces, 278 and 279 and then add a route for /ndn/broadcast and then a route
for /ndn/multicast and then 110 seconds later we get a FATAL error.
I’ve looked at the NFD code a little bit and it does seem that the error is as others suggested probably related to a timeout in the RibManager.

And if I look at this code from fib-updater.cpp: FibUpdater::onUpdateError():

  if (code == ndn::nfd::Controller::ERROR_TIMEOUT && nTimeouts < MAX_NUM_TIMEOUTS) {
    sendAddNextHopUpdate(update, onSuccess, onFailure, ++nTimeouts);
  }

It would make sense that it would keep trying until the 11th time and so if the timeout was 10 seconds
this would all make sense. I haven’t found where the timeout period is but I’m guessing it is 10 seconds.
And this seems to agree with the FIB Updater section of the NFD dev-guide.

So, then the question is, what would cause the RibManager to fail 10 successive times in trying to do a Fib Update?

Any thoughts?

John


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.lists.cs.ucla.edu/pipermail/nfd-dev/attachments/20170826/000619ca/attachment-0001.html>


More information about the Nfd-dev mailing list