[Ndn-interest] Repo vs aggregation queries

Thu Jul 18 16:17:28 PDT 2019

> On Jul 18, 2019, at 4:00 PM, Junxiao Shi <shijunxiao at email.arizona.edu> wrote:
> 
> Dear folks
> 
> I saw a new paper coming out last week: The Role of Data Repositories in Named Data Networking. https://ieeexplore.ieee.org/abstract/document/8756944
> It explains the importance of data repositories as an architecture component, and how the repo could be used as a storage of application data. The paper asserts that having a repo makes data available even if the data producer application is offline.
> 
> While the above points are all correct, I have one doubt: how to support aggregation queries in this architecture?
> 
> I have a building sensing application.
> Each data point is, for example, the temperature measurement in a certain room at a certain time point. Using either passive or active data insertion, this Data could be stored into the repo. Then, anyone can ask for a single data point by expressing an Interest.
> Paired with a (general purpose) namespace enumeration protocol, it's also possible for a consumer to discover what Data are available in a repo.
> 
> A common use case is an aggregation query. For example, a consumer wants to know what's the maximum temperature among all the data points collected in a set of rooms within a time period. We further assume that the possible queries are not known in advance; for example, the time period could be arbitrary, and not necessarily aligned to the hour/day.
> In a relational database, this use case is supported by a simple SQL query. The SQL server spends computation power, but network usage is minimal.
> In a plain repo, this would require the consumer to retrieve every data point from the repo, and run the aggregation operator locally. If the query covers a long time period, the number of Data retrieved could be on the order of 10^4. Isn't this a huge waste of network bandwidth?
> Of course, once an aggregation has been performed by someone, the result can be stored in a repo for future usage. But the first aggregation is very expensive.
> 
> I can think about adding the aggregation operators into the repo. However, this would require the repo to understand application semantics, at a level much higher than understanding data retrieval pattern.
> At that level, did the "repo" stop being a repo and become a distributed database?
> 
> On the other hand, I could have a separate "database" application answering aggregation queries. In this case, the database application could easily provide individual data points as well. Then, is there any value to still have the repo, and store every data point twice?

I would not add aggregation operators to the repo. I would, however, have an aggregation app serving a slightly different namespace that is able to locally access the data in the repo and provide the aggregated result to any clients requesting it. That's going to be local, not network, I/O. The aggregated response might be cached, possibly in the same repo as the raw measurements. It doesn't require storing the data twice.
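
A minimal sketch of that arrangement, in Python, might look like the following. Everything named here is an assumption made for illustration: the dict standing in for the repo's local storage, the name layout /building/agg/max/<rooms>/<start>/<end>, and the function names. No real NDN or repo API is used; the point is only that the aggregation app reads raw measurements locally and answers (and caches) results under its own namespace.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for the repo's local storage:
# (room, unix_time) -> temperature reading.
RepoStore = Dict[Tuple[str, int], float]


def max_temperature(store: RepoStore, rooms: Iterable[str],
                    start: int, end: int) -> Optional[float]:
    """Return the maximum reading for the given rooms with
    start <= time < end, or None if nothing matches."""
    wanted = set(rooms)
    readings = [
        value
        for (room, ts), value in store.items()
        if room in wanted and start <= ts < end
    ]
    return max(readings) if readings else None


def answer_aggregate(store: RepoStore, cache: Dict[str, float],
                     name: str) -> Optional[float]:
    """Answer a name of the (assumed) form
    /building/agg/max/<room,room,...>/<start>/<end>.

    The raw data is read locally from the store, so only the
    aggregated result would ever cross the network."""
    if name in cache:
        return cache[name]

    parts = name.strip("/").split("/")
    if len(parts) != 6 or parts[:3] != ["building", "agg", "max"]:
        raise ValueError("not an aggregation name: " + name)

    rooms = parts[3].split(",")
    start, end = int(parts[4]), int(parts[5])
    result = max_temperature(store, rooms, start, end)

    # Cache only real results; an empty period may fill in later.
    # The cache could be the same repo that holds the measurements.
    if result is not None:
        cache[name] = result
    return result
```

For example, with readings for rooms 101 and 102 in the store, a request for /building/agg/max/101,102/100/200 is computed once from local data and served from the cache afterwards. Caching like this is only safe for periods that are already closed, since a period extending into the future could still receive new measurements.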

> 
> Suggestions?
> Yours, Junxiao
> _______________________________________________
> Ndn-interest mailing list
> Ndn-interest at lists.cs.ucla.edu
> http://www.lists.cs.ucla.edu/mailman/listinfo/ndn-interest
