[Ndn-interest] Repo vs aggregation queries

Thu Jul 18 16:00:24 PDT 2019

Dear folks

I saw a new paper that came out last week: The Role of Data Repositories in
Named Data Networking. https://ieeexplore.ieee.org/abstract/document/8756944

It explains the importance of data repositories as an architectural
component, and how a repo could be used as storage for application data.
The paper asserts that having a repo keeps data available even if the data
producer application is offline.

While the above points are all correct, I have one question: how can
aggregation queries be supported in this architecture?

I have a building sensing application.
Each data point is, for example, the temperature measurement in a certain
room at a certain time point. Using either passive or active data
insertion, this Data could be stored into the repo. Then, anyone can ask
for a single data point by expressing an Interest.
Paired with a (general purpose) namespace enumeration protocol, it's also
possible for a consumer to discover what Data are available in a repo.
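
To make this setup concrete, here is a minimal Python sketch of the two
operations a plain repo offers: fetching one data point by exact name, and
enumerating the names under a prefix. Everything in it is an assumption made
for illustration: ToyRepo is an in-memory stand-in rather than any real repo
or NDN library API, and the /building/<room>/temperature/<timestamp> naming
scheme is hypothetical.

```python
# Toy in-memory stand-in for a repo. This is NOT a real NDN library API;
# the class, its methods, and the naming scheme are hypothetical.
class ToyRepo:
    def __init__(self):
        self._store = {}  # Data name -> content

    def insert(self, name, content):
        """Store one data point under its full name."""
        self._store[name] = content

    def fetch(self, name):
        """Answer an 'Interest' for one exact name; None if absent."""
        return self._store.get(name)

    def enumerate(self, prefix):
        """List all stored names under a prefix (namespace enumeration)."""
        prefix = prefix.rstrip("/") + "/"
        return sorted(n for n in self._store if n.startswith(prefix))


repo = ToyRepo()
repo.insert("/building/room101/temperature/1563490800", 22.5)
repo.insert("/building/room101/temperature/1563490860", 22.7)
repo.insert("/building/room102/temperature/1563490800", 24.1)

single_point = repo.fetch("/building/room101/temperature/1563490800")
room101_names = repo.enumerate("/building/room101")
```

Note that neither operation looks inside the content: the repo only matches
names, which is what keeps it independent of application semantics.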

A common use case is an *aggregation query*. For example, a consumer wants
to know the maximum temperature among all the data points collected in a
set of rooms within a time period. We further assume that the possible
queries are not known in advance; for example, the time period could be
arbitrary, and not necessarily aligned to the hour/day.
In a relational database, this use case is supported by a simple SQL query.
The SQL server spends computation power, but network usage is minimal.
In a plain repo, this would require the consumer to retrieve every data
point from the repo, and run the aggregation operator locally. If the query
covers a long time period, the number of Data retrieved could be on the
order of 10^4. Isn't this a huge waste of network bandwidth?
Of course, once an aggregation has been performed by someone, the result
can be stored in a repo for future usage. But the first aggregation is very
expensive.
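
The difference in network cost can be sketched in Python as follows. This is
only a rough illustration under stated assumptions: the readings are
synthetic, the table layout (room, ts, temperature) is made up, sqlite3
stands in for "a relational database", and the plain-repo path simply counts
how many data points the consumer would have to retrieve instead of sending
real Interests.

```python
import sqlite3


def build_readings(rooms, start, count, step=60):
    """Synthetic readings (assumed data): one per room every `step` seconds."""
    rows = []
    for r_idx, room in enumerate(rooms):
        for i in range(count):
            rows.append((room, start + i * step, 20.0 + r_idx + (i % 10) * 0.5))
    return rows


def max_via_sql(rows, rooms, t0, t1):
    """Server-side aggregation: one query in, one result out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reading (room TEXT, ts INTEGER, temperature REAL)")
    conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)
    placeholders = ",".join("?" for _ in rooms)
    query = (
        "SELECT MAX(temperature) FROM reading "
        f"WHERE room IN ({placeholders}) AND ts BETWEEN ? AND ?"
    )
    (result,) = conn.execute(query, (*rooms, t0, t1)).fetchone()
    conn.close()
    return result, 1  # a single response crosses the network


def max_via_plain_repo(rows, rooms, t0, t1):
    """Consumer-side aggregation: every matching data point is retrieved."""
    fetched = [temp for room, ts, temp in rows
               if room in rooms and t0 <= ts <= t1]
    return max(fetched), len(fetched)  # one Data packet per data point


all_rows = build_readings(["room101", "room102", "room103"], start=0, count=5000)
wanted_rooms = ["room101", "room102"]
# Arbitrary period, deliberately not aligned to the hour or the day.
sql_max, sql_responses = max_via_sql(all_rows, wanted_rooms, 30, 300000)
repo_max, repo_packets = max_via_plain_repo(all_rows, wanted_rooms, 30, 300000)
print(sql_max, sql_responses, repo_max, repo_packets)
```

Both paths return the same maximum, but with these synthetic inputs the
plain-repo path retrieves roughly 10^4 data points to get it, while the
database path returns one value.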

I could add aggregation operators to the repo. However, this would require
the repo to understand application semantics, at a level much higher than
understanding data retrieval patterns.
At that level, does the "repo" stop being a repo and become a distributed
database?

On the other hand, I could have a separate "database" application answering
aggregation queries. In this case, the database application could easily
provide individual data points as well. Then, is there still any value in
having the repo, and storing every data point twice?

Suggestions?
Yours, Junxiao