[Nfd-dev] Memory related issues with NFD.

Beichuan Zhang bzhang at cs.arizona.edu
Mon May 9 06:58:09 PDT 2016


Hi Anil,

Can you create a Redmine issue (http://redmine.named-data.net) to document all the information and discussion in one place? Most of us are working on a paper deadline this week, so responses may be slow.

Thanks,

Beichuan

> On May 9, 2016, at 1:39 AM, Anil Jangam <anilj.mailing at gmail.com> wrote:
> 
> I tried to investigate using the stack traces obtained from the Valgrind report. Can you please analyze these further and check whether there is an issue here? 
> 
> =================================
> ./NFD/rib/rib-manager.cpp
> 188   m_keyChain.sign(*responseData);
> 189   m_face.put(*responseData);
> 
> ./NFD/daemon/mgmt/manager-base.cpp
> 98   m_keyChain.sign(*responseData);
> 99   m_face->put(*responseData);
> 
> Each time, this allocates a ~9K block of memory. I am not sure when it is released, but it is the top contributor to the memory build-up. 
> 
> ./ndn-cxx/src/encoding/encoder.cpp
> 27 Encoder::Encoder(size_t totalReserve/* = 8800*/, size_t reserveFromBack/* = 400*/)
> 28   : m_buffer(new Buffer(totalReserve))
> 29 {
> 30   m_begin = m_end = m_buffer->end() - (reserveFromBack < totalReserve ? reserveFromBack : 0);
> 31 }
> 
> 83     Buffer* buf = new Buffer(size);
> 84     std::copy_backward(m_buffer->begin(), m_buffer->end(), buf->end());
> 85 
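> 
> For illustration, here is a minimal standalone sketch of the sign/put pattern quoted above (my own example, not code taken from NFD). As the trace suggests, signing a Data wire-encodes it through the Encoder with its 8800-byte default reserve, so each management response costs roughly one ~9K Buffer allocation that lives until the last Block sharing it is destroyed:
> 
> #include <ndn-cxx/data.hpp>
> #include <ndn-cxx/security/key-chain.hpp>
> #include <iostream>
> #include <memory>
> 
> int main()
> {
>   ndn::KeyChain keyChain;
> 
>   for (int i = 0; i < 3; ++i) {
>     auto responseData = std::make_shared<ndn::Data>(ndn::Name("/localhost/nfd/response"));
>     keyChain.sign(*responseData);                    // wire-encodes via the Encoder
>     const ndn::Block& wire = responseData->wireEncode();
>     std::cout << "response " << i << ": " << wire.size() << " bytes on the wire\n";
>     // The underlying ~8.8K Buffer is released only when responseData (and
>     // every Block copied from its wire encoding) goes out of scope.
>   }
>   return 0;
> }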
> 
> =================================
> The other possibility is that the dead-nonce-list is not getting cleared after the *loop detection duration*. Or perhaps the ndn::Block is not released after ndn::Name::wireEncode() is done for the first time. This is the second biggest contributor to the memory build-up. 
> Ref: https://github.com/named-data/NFD/blob/master/daemon/table/dead-nonce-list.hpp#L39
> 
> ./NFD/daemon/table/dead-nonce-list.cpp
> 105 DeadNonceList::Entry
> 106 DeadNonceList::makeEntry(const Name& name, uint32_t nonce)
> 107 {
> 108   Block nameWire = name.wireEncode();
> 
> ./ndn-cxx/src/encoding/block.cpp
> 344       m_subBlocks.push_back(Block(m_buffer,
> 345                                   type,
> 346                                   element_begin, element_end,
> 347                                   begin, element_end));
> 348 
> 
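> On the second hypothesis, here is a small sketch (again my own, not NFD code): ndn::Name caches the Block produced by wireEncode(), so the encoding and its underlying Buffer stay alive for as long as the Name, or any Block copied from it, does. As far as I can tell from dead-nonce-list.hpp, the entries themselves are only 64-bit hashes, so the nameWire Block in makeEntry() should just be a temporary.
> 
> #include <ndn-cxx/name.hpp>
> #include <iostream>
> 
> int main()
> {
>   ndn::Name name("/example/prefix/with/several/components");
> 
>   const ndn::Block& first = name.wireEncode();   // encodes and caches the Block
>   const ndn::Block& second = name.wireEncode();  // returns the cached Block
> 
>   // Both calls return the same cached Block; its Buffer is not released
>   // until the Name itself (and every copy of that Block) is destroyed.
>   std::cout << std::boolalpha
>             << (first.getBuffer() == second.getBuffer()) << std::endl;
>   return 0;
> }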
> 
> On Sat, May 7, 2016 at 12:24 AM, Anil Jangam <anilj.mailing at gmail.com> wrote:
> Hello All,
> 
> We debugged this issue further and below are our findings.
> 
> - The issue is also reproducible on a standalone NFD. Vince tried about 100 registration requests and saw a consistent increase in memory. The increase is present even if RibManager::sendSuccessResponse is not called. 
> 
> - The memory grows even if we bypass the RibManager completely by using "nfdc add-nexthop", and the problem is present in the latest NFD code as well, since Vince tested with the most up-to-date version. 
> 
> - Another possibility we considered was response messages getting cached in the CS, leading to increased memory consumption by NFD. To rule this out, we set the CS size to 1 by calling 'ndnHelper.setCsSize(1);' before installing the NDN L3 stack on the nodes (see the sketch after this list), but we still see memory growth. 
> 
> - I also checked that the default CS size is 100 packets, so even at the default, the CS should not grow beyond 100 packets. We therefore do not think the CS is causing this growth.  
>  42 StackHelper::StackHelper()
>  43   : m_needSetDefaultRoutes(false)
>  44   , m_maxCsSize(100)
>  45   , m_isRibManagerDisabled(false)
>  46   , m_isFaceManagerDisabled(false)
>  47   , m_isStatusServerDisabled(false)
>  48   , m_isStrategyChoiceManagerDisabled(false)
> 
> - It seems to be some internal pipeline issue: whether we performed 1000 add-nexthop commands or 1000 registration commands for the same prefix, the memory increase was observed.
> 
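> To make the CS test above concrete, here is roughly the scenario setup we used, sketched against the ndnSIM 2.x helper API rather than copied from our actual scenario:
> 
> #include "ns3/core-module.h"
> #include "ns3/network-module.h"
> #include "ns3/ndnSIM-module.h"
> 
> using namespace ns3;
> 
> int
> main(int argc, char* argv[])
> {
>   NodeContainer nodes;
>   nodes.Create(2);
> 
>   ndn::StackHelper ndnHelper;
>   ndnHelper.setCsSize(1);   // effectively disable caching in the CS
>   ndnHelper.InstallAll();   // install the NDN L3 stack (NFD) on every node
> 
>   Simulator::Stop(Seconds(20.0));
>   Simulator::Run();
>   Simulator::Destroy();
>   return 0;
> }
> 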
> As mentioned above, we believe this issue is also present in standalone NFD; it has perhaps not been reported yet because of the scale involved. Since I am running 100+ nodes, each with its own NFD instance, on my laptop (8 GB RAM), the growth is very quick. 
> 
> We need your input to debug this issue further. 
> 
> Thanks,
> /anil
> 
> On Wed, May 4, 2016 at 1:47 PM, Anil Jangam <anilj.mailing at gmail.com> wrote:
> Here are some more data points from Valgrind Massif analysis. I ran it for 25 and 50 nodes. 
> 
> /anil.
> 
> 
> On Wed, May 4, 2016 at 2:26 AM, Anil Jangam <anilj.mailing at gmail.com> wrote:
> Hi Junxiao,
> 
> The memory leak is now closed by backporting the fix you referred to. However, the growth in memory consumption is still evident; this time, I believe it is bloating of the process size. Looking at the attached Valgrind logs, can you please comment on whether this is a legitimate requirement of NFD, or whether it is just holding on to resources without really needing them? I see allocations emanating from the RibManager and from Interest processing as some of the major contributors. 
> 
> As you said earlier, these are perhaps fixed in the main branch of NFD but not yet ported into ndnSIM's NFD fork. Please check; the reports are attached.
> 
> 50 node simulation valgrind summary:
> -------------------------------------------------------
> ==9587== LEAK SUMMARY:
> ==9587==    definitely lost: 0 bytes in 0 blocks
> ==9587==    indirectly lost: 0 bytes in 0 blocks
> ==9587==      possibly lost: 2,263,514 bytes in 67,928 blocks
> ==9587==    still reachable: 1,474,943,776 bytes in 3,910,237 blocks
> ==9587==         suppressed: 0 bytes in 0 blocks
> ==9587== 
> ==9587== For counts of detected and suppressed errors, rerun with: -v
> ==9587== ERROR SUMMARY: 37 errors from 37 contexts (suppressed: 0 from 0)
> 
> 25 node simulation valgrind summary:
> -------------------------------------------------------
> ==9287== LEAK SUMMARY:
> ==9287==    definitely lost: 0 bytes in 0 blocks
> ==9287==    indirectly lost: 0 bytes in 0 blocks
> ==9287==      possibly lost: 400,259 bytes in 11,100 blocks
> ==9287==    still reachable: 437,147,930 bytes in 1,132,024 blocks
> ==9287==         suppressed: 0 bytes in 0 blocks
> ==9287== 
> ==9287== For counts of detected and suppressed errors, rerun with: -v
> ==9287== ERROR SUMMARY: 31 errors from 31 contexts (suppressed: 0 from 0)
> 
> /anil.
> 
> 
> 
> 
> On Tue, May 3, 2016 at 7:42 AM, Junxiao Shi <shijunxiao at email.arizona.edu> wrote:
> Hi Anil
> 
> The call stack in the Valgrind report indicates that you are running NFD within ndnSIM.
> #3236 is fixed in NFD commit 9c903e063ea8bdb324a421458eed4f51990ccd2c on Oct 04, 2015. However, ndnSIM's NFD fork dates back to Aug 21, 2015, and does not contain the fix.
> You may try to backport that commit to ndnSIM's NFD fork, or ask ndnSIM developers to upgrade their fork.
> 
> Yours, Junxiao
> 
> 
> On Mon, May 2, 2016 at 5:23 PM, Anil Jangam <anilj.mailing at gmail.com> wrote:
> Hi Junxiao,
> 
> I am observing a memory leak with NFD, and to verify it I did a couple of Valgrind-enabled simulation runs with 25 and 50 nodes. Based on the Valgrind report and the output of the 'top' command, I see that RAM consumption grows consistently and rapidly. My scaling test is affected: I am not able to run the simulation for a longer time and/or with a higher number of nodes. I also see a very high number of timeouts.
> 
> I see an NFD leak issue in the closed state that confirms this leak; it was closed owing to its small size. Perhaps it is showing up at high scale?
> http://redmine.named-data.net/issues/3236/
> 
> Please check the attached Valgrind report. Let me know what other data you may need to debug this further. Also, can you please suggest a solution or workaround?
> 
> /anil.
> 
> 
> 
> 
> 
> _______________________________________________
> Nfd-dev mailing list
> Nfd-dev at lists.cs.ucla.edu
> http://www.lists.cs.ucla.edu/mailman/listinfo/nfd-dev


