[Ndn-interest] A synopsis of NDN

Klaus Schneider klaus at cs.arizona.edu
Mon Apr 2 00:07:29 PDT 2018

Dear Michael,

Thanks for the detailed reply.

I can see your point about the complexity of the task you describe. It 
is certainly very hard to create a script that returns perfect results, 
even if the task is just removing duplicates.

I think the ability to sort papers by relevance (as in Google Scholar) 
reduces the need for perfection to some degree. Sure, there are many 
duplicates, non-English papers, and papers of very low quality, but 
they all end up at the bottom of the list, so they don't hurt too much.
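To make the idea concrete, here is a minimal sketch of "dedup by title, then rank by citation count". Everything in it is illustrative: the field names and sample entries are made up, and a real pipeline would of course scrape them from GS, IEEE, etc.

```python
# Sketch: drop near-duplicate titles, then sort by citation count so
# low-quality leftovers sink to the bottom of the list.
# Entry fields ("title", "citations") are hypothetical, not any
# particular site's schema.
from difflib import SequenceMatcher

def is_duplicate(title_a, title_b, threshold=0.9):
    """Treat two entries as duplicates if their normalized titles
    are nearly identical."""
    a = " ".join(title_a.lower().split())
    b = " ".join(title_b.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedup_and_rank(entries):
    """Keep the first copy of each title, then rank by citations."""
    kept = []
    for e in entries:
        if not any(is_duplicate(e["title"], k["title"]) for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["citations"], reverse=True)

papers = [
    {"title": "Named Data Networking", "citations": 2000},
    {"title": "named  data  networking", "citations": 5},  # duplicate
    {"title": "A Survey of ICN", "citations": 300},
]
ranked = dedup_and_rank(papers)
```

This only illustrates the ranking argument above; as Michael points out below, the hard part is everything around it.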

On the other hand, I can also see the value of your list, which is less 
complete but more correct: some entries may be missing, but any entry 
that is present has a higher probability of actually belonging on the 
list.

I guess the larger point is that it's hard for any one person to compete 
with Google ;)
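For the re-import problem you mention below (a hand-corrected entry being re-added as a "new" result on the next run), one common approach is to key each entry by a stable fingerprint, e.g. normalized title plus year, which survives manual edits to venue, pages, and so on. A rough sketch, with made-up field names:

```python
# Sketch: skip scraped entries whose fingerprint (normalized title +
# year) matches something already in the library, so hand corrections
# to other fields don't cause re-imports. Fields are illustrative.
import hashlib
import re

def fingerprint(entry):
    """Stable key: lowercase title stripped of punctuation, plus year."""
    title = re.sub(r"[^a-z0-9 ]", "", entry["title"].lower())
    title = " ".join(title.split())
    key = f"{title}|{entry['year']}"
    return hashlib.sha1(key.encode()).hexdigest()

def merge(existing, scraped):
    """Add only scraped entries whose fingerprint is unseen."""
    seen = {fingerprint(e) for e in existing}
    return existing + [e for e in scraped if fingerprint(e) not in seen]

library = [
    {"title": "NDN: A Case Study", "year": 2014, "venue": "fixed by hand"},
]
new_batch = [
    {"title": "NDN: a case study", "year": 2014, "venue": "wrong"},  # same paper
    {"title": "Another Paper", "year": 2016, "venue": "X"},
]
merged = merge(library, new_batch)
```

This of course only handles one of the many failure modes you list; it is meant as a sketch of the matching step, not a working system.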

Best regards,

On 01/04/18 23:36, Michael Hucka wrote:
> On Sun, 1 Apr 2018 20:06:17 -0700, Klaus Schneider wrote:
>> I think your time might be better spent writing a script or
>> search filter to query these sites (GS, IEEE, ...) and then
>> removing duplicates, non-English papers, etc., rather than
>> gathering all papers by hand.
>> This would bring a number of benefits, such as:
>> - Being always up to date
>> - Sorting by date and relevance (citation count)
>> - Listing related work (cited by)
>> - Including a link to the pdf files
>> It would also automatically include the techreports that Spyros
>> mentioned (the NDN techreport site is indexed by Google Scholar).
> Have you actually tried to do something like this?
> Speaking as someone who's been doing research and writing software since the 1980s, I feel I can say with reasonably high confidence that developing a working scheme like this would require a significant amount of time and effort.  (Either that, or we have different ideas in our minds of what constitutes a good implementation.)
> The problem is not scraping a lot of references from somewhere.  Here's an example of a problem: detecting when Google Scholar's database incorrectly says something is a journal paper when it is actually a conference paper.  A human or a weak AI has to look at the paper to figure it out.  I did that myself, in a lot of cases, and then made hand corrections.  Now this leads to a new requirement for an automation scheme: detect that an existing-but-modified entry is the same as something returned by Google Scholar, so that the next time you run the workflow, it doesn't automatically add it again thinking it's a different entry.  Yes, of course, the problem can be solved.  But this is just one example. Each little new problem adds time to the implementation and its debugging, as well as complexity to the overall system, effort to produce documentation, and software to maintain over time.
> Linking to the PDFs introduces another wrinkle.  Although I can't share the PDFs publicly because of copyright reasons, I actually have them for probably 99% of the references.   The bibliography I put online has DOIs that link to a lot of the publications directly, so people can get to the PDFs, but they will need to have access to the publication due to copyrights -- it was the best compromise I could come up with, even though I wish I could do more.  Now, the DOIs link to the publisher's page.  To link to the PDFs directly is another level of complexity altogether (there is a lot of variation in journal page formats).  Only in some limited cases like the NDN tech reports or perhaps the IEEE pages could you easily and regularly link to the PDFs.
> I use Paperpile, which has built-in recognizers for Google Scholar and many publishers' sites.  It can actually import PDFs automatically, extract metadata from the PDF, and query Google Scholar for the bib entry.  It's freaking *amazing*, and makes this kind of work go very quickly.  I don't know how much effort was required to implement its capabilities, but it is clearly not a weekend script.  And even as good as it is, it's not perfect -- it doesn't always work.  That gives us an idea of what it takes to find PDFs.
> I don't disagree that it would be nice and useful to have the automation you describe.  Who wouldn't like that?  My point here is that unless you are aware of new technology I'm overlooking (which is entirely possible!), I think doing this would be a more difficult engineering problem than it may seem.  And even if it were implemented, that wouldn't be the end of it: software has to be maintained over time, and adapted when service providers change their API or data format. (Which most definitely happens; in fact, last year, Google Scholar changed its page layout, and this completely broke another bibliography system I used to use called Sente.)
> My conclusion is that developing this automation would not be time better spent for me, and in any case, I have too much on my plate already to even start.  But, perhaps one of the NDN or CCN teams could undertake the development of something like this as an activity.
> Finally, I apologize for the length of this message, and I think further discussion of this matter would be off-topic for this mailing list.  If people really are interested in continuing discussions, I could throw together a Google group for it.
> Best regards,
> MH
> --
> Mike Hucka, Ph.D. -- mhucka at caltech.edu -- http://www.cds.caltech.edu/~mhucka
> Dept. of Computing + Mathematical Sciences, California Institute of Technology

More information about the Ndn-interest mailing list