[Ndn-interest] Repo vs aggregation queries

Nicolaescu, Adrian-Cristian adrian-cristian.nicolaescu.17 at ucl.ac.uk
Fri Jul 19 00:32:33 PDT 2019


Dear all,

I am working on something similar, as well.

If repositories are organized at a higher level and each covers a specific set of name suffixes (let's call them data "topics"), they should be able to coordinate easily at the network layer. The namespace under which the Edge devices produce data should be controlled by their application/IoT provider (the entity that owns the data-producing application). I would assume the repos either already have some knowledge about the provider or gain it through a query; once that knowledge is gained, the data can be served by the whole cluster of repos. Of course, aggregation, data placement and function placement are all very important issues to be dealt with, but the main issue we seem to be considering here is bandwidth, which is why it would be more efficient to process and/or compress the data after it has been used.
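As a rough sketch of the topic-to-cluster coordination I have in mind (the topic names, cluster identifiers and longest-prefix rule below are purely illustrative assumptions, written in Python):

    # Hypothetical mapping from data "topics" (name prefixes) to the repo
    # cluster responsible for serving them, resolved by longest-prefix match.
    TOPIC_TABLE = {
        ("building-A", "temperature"): "repo-cluster-1",
        ("building-A", "humidity"):    "repo-cluster-1",
        ("building-B",):               "repo-cluster-2",
    }

    def cluster_for(name_components):
        """Return the repo cluster serving the longest matching topic prefix."""
        best = None
        for prefix, cluster in TOPIC_TABLE.items():
            if tuple(name_components[:len(prefix)]) == prefix:
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, cluster)
        return best[1] if best else None

    print(cluster_for(["building-A", "temperature", "room-12", "2019-07-19"]))
    # -> repo-cluster-1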

Some further questions I am trying to (at least partially) answer in my current work are: how much of the data can be "served" (queried for specific results, or processed into averages, sums, or aggregates over time or over a range of devices) before a "freshness period" for processing expires? For how long should data be stored AFTER these kinds of queries or function executions? And what kind of data retention policies are needed, depending on the use case and the services offered?
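To make the kind of policy I mean concrete, here is a minimal sketch (the durations and field names are assumptions, not measured values):

    # Decide whether a stored data point may still be served for processing,
    # and whether it has become eligible for deletion after its last use.
    from datetime import datetime, timedelta

    FRESHNESS_PERIOD = timedelta(minutes=10)    # window for "live" processing
    RETENTION_AFTER_QUERY = timedelta(days=7)   # keep raw data after last use

    def can_serve(produced_at, now=None):
        now = now or datetime.utcnow()
        return now - produced_at <= FRESHNESS_PERIOD

    def can_delete(last_queried_at, now=None):
        now = now or datetime.utcnow()
        return now - last_queried_at > RETENTION_AFTER_QUERY

    t = datetime.utcnow() - timedelta(minutes=3)
    print(can_serve(t))    # True: still within the freshness period
    print(can_delete(t))   # False: used too recently to discard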

I am thinking that there may be a gain from keeping the original data stored at the Edge, unless most of the requests come from very far away. This could be optimised by another, higher-level entity that deals with data management and service optimisation policies (and, indeed, could monitor and maintain the efficient placement and execution of data and/or associated services within these clusters). Thus, if the data is considered "stale" for the purposes of EDRs (timing, latency and bandwidth efficiency), it could be compressed and sent to the Cloud (if need be) or deleted, depending, again, on specific restrictions set within the data itself, or higher up, in commercial, security, data-integrity and/or network policies.
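A minimal sketch of that decision, assuming three possible outcomes for "stale" Edge data (the policy fields and the compression choice are hypothetical):

    # Keep at the Edge, compress and push to the Cloud, or delete.
    import zlib

    def tier_decision(record, stale, policy):
        """Return (action, payload) for one stored record."""
        if not stale:
            return ("keep_at_edge", record)
        if policy.get("retain_in_cloud", False):
            return ("send_to_cloud", zlib.compress(record))
        return ("delete", None)

    raw = b"temperature=21.4;room=12;ts=2019-07-19T07:32:00Z"
    action, payload = tier_decision(raw, stale=True, policy={"retain_in_cloud": True})
    print(action, len(raw), "->", len(payload), "bytes")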

Considering the responses so far, I think Nick's view on this fits in pretty well with what I was just trying to explain.

I hope I was specific enough in laying out my view on this topic, and I am definitely looking forward to hearing more of your opinions and concerns on the matter.

Kind regards,
Chris Nicolaescu


________________________________
From: Ndn-interest <ndn-interest-bounces at lists.cs.ucla.edu> on behalf of Ernest McCracken via Ndn-interest <ndn-interest at lists.cs.ucla.edu>
Sent: Friday, July 19, 2019 3:59:06 AM
Cc: ndn-interest
Subject: Re: [Ndn-interest] Repo vs aggregation queries

From this discussion it almost seems like these repos should act as NDN adapters to existing storage and grid-storage solutions, providing a basic but extensible naming schema. Of course, developing that naming schema and mapping can be complex. Many newer storage solutions like redis.io<http://redis.io> are making query languages simpler and are used in enterprise systems today; Redis is used extensively by Discord, for instance.
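A rough sketch of that adapter idea, using redis-py sorted sets (the naming schema and a locally running Redis instance are assumptions):

    # The key is a name prefix ("topic"), the score is the sample timestamp,
    # so a time-range query maps directly onto ZRANGEBYSCORE.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def store(prefix, timestamp, wire_packet):
        r.zadd(prefix, {wire_packet: timestamp})

    def time_range(prefix, t_start, t_end):
        return r.zrangebyscore(prefix, t_start, t_end)

    store("/building-A/temperature/room-12", 1563521553, b"<encoded Data packet>")
    print(time_range("/building-A/temperature/room-12", 1563500000, 1563600000))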

On Thu, Jul 18, 2019 at 8:38 PM Nick Briggs via Ndn-interest <ndn-interest at lists.cs.ucla.edu<mailto:ndn-interest at lists.cs.ucla.edu>> wrote:

On Jul 18, 2019, at 5:52 PM, Junxiao Shi <shijunxiao at email.arizona.edu<mailto:shijunxiao at email.arizona.edu>> wrote:

Hi Nick


On the other hand, I could have a separate "database" application answering aggregation queries. In this case, the database application could easily provide individual data points as well. Then, is there any value to still have the repo, and store every data point twice?

I would not add aggregation operators to the repo.  I would, however, have an aggregation app serving a slightly different namespace that is able to locally access the data in the repo and provide the aggregated result to any clients requesting it.
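A minimal sketch of that split, using plain in-memory dictionaries to stand in for the repo and the aggregation app's cache (the names and the "average" operation are illustrative assumptions):

    # The repo holds raw data points; a separate aggregation app answers
    # queries under its own namespace by reading the repo locally and
    # caching the computed result, without duplicating the raw data.
    repo = {
        "/building-A/temperature/room-12/1": 21.0,
        "/building-A/temperature/room-12/2": 21.4,
        "/building-A/temperature/room-12/3": 22.1,
    }
    agg_cache = {}   # results served under the aggregation namespace

    def on_aggregation_interest(agg_name, raw_prefix):
        if agg_name in agg_cache:                  # cached aggregate, no recompute
            return agg_cache[agg_name]
        points = [v for k, v in repo.items() if k.startswith(raw_prefix)]
        result = sum(points) / len(points)         # "average" as the example op
        agg_cache[agg_name] = result
        return result

    print(on_aggregation_interest("/agg/building-A/temperature/room-12/avg",
                                  "/building-A/temperature/room-12/"))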

Would this "aggregation app" be a general purpose utility, or does it have to be tailored to each application such as building sensing?
If it's general purpose, how could the app understand the specific naming and encoding format of building sensing protocol?

I would expect it to be designed in concert with the data-logging application -- it could start off being single-purpose, but you might find that it generalizes -- in the same way that a SQL query has to know the naming and encoding of data in tables.


 That's going to be local, not network, I/O.  The aggregated response might be cached, possibly in the same repo as the raw measurements.  It doesn't require storing the data twice.

Is this "aggregation app" accessing the data directly on the disk, or does it have to send Interests to the repo via a local socket?

If using disk access, what's the benefit of having it as a separate app?

If using Interest-Data exchanges, even if the packets are local, this still has huge overhead (encoding, forwarding, processing, etc) compared to a SQL query on the database.


This is a pretty raw view of my reasoning, having thought about the problem for all of 10 minutes:

I'd design it as Interest-Data exchanges to start with, then measure the system performance to see whether it was acceptable and whether its scaling properties met my requirements; if it wasn't performing or scaling reasonably, I'd look at where the problem was and design a solution that addressed it.  I am a fan of optimizing implementation/architecture based on actual measurement -- though of course one's choices should be informed by theoretical complexity issues... but the constants matter too!
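In that spirit, a throwaway benchmark along these lines would be the starting point (the loopback socket below is only a crude stand-in for a local Interest-Data exchange through the forwarder; the names and numbers are indicative assumptions at best):

    import socket, time

    def bench(fn, n=1000):
        t0 = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - t0) / n

    store = {b"/a/b/1": b"x" * 800}   # one 800-byte "Data packet"

    def local_read():                  # in-process lookup
        return store[b"/a/b/1"]

    srv = socket.socket(); srv.bind(("127.0.0.1", 0)); srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()

    def loopback_read():               # name request + payload over loopback
        cli.sendall(b"/a/b/1\n")
        conn.sendall(store[conn.recv(64).strip()])
        return cli.recv(1024)

    print("in-memory lookup :", bench(local_read), "s/op")
    print("loopback exchange:", bench(loopback_read), "s/op")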

I don't immediately come to the same conclusion as you do about a SQL query vs an application such as I'm describing.

Remember that the repo (at least the one I worked on and with) stores everything as wire-format packets.  It happened to use B-trees with pages comparable in size to a disk page, so the I/O performance was good: many content packets (if they were small) fit in a single B-tree block, and the B-tree blocks were cached, so the overhead of reading multiple sequential items was minimized.
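For the shape of that layout, here is a small stand-in using SQLite (whose tables are themselves B-trees) to hold wire-format packets keyed by name, so a prefix query becomes one sequential range scan; the schema and names are assumptions for the sketch:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE repo (name TEXT PRIMARY KEY, packet BLOB)")
    db.executemany("INSERT INTO repo VALUES (?, ?)", [
        (f"/building-A/temperature/room-12/{i}", b"<wire-format Data>")
        for i in range(100)
    ])

    # Prefix scan: sequential B-tree pages, many small packets per page.
    rows = db.execute(
        "SELECT packet FROM repo WHERE name BETWEEN ? AND ?",
        ("/building-A/temperature/room-12/", "/building-A/temperature/room-12/\xff"),
    ).fetchall()
    print(len(rows), "packets read with one range scan")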

The only *encoding* operation should be on the aggregation results.

All of the forwarding operations should be in-memory.  I doubt that you can get zero-copy from the in-memory repo packet through to the aggregation application's buffer, but it shouldn't be massively bad.

There are analogous operations in both the repo and SQL cases -- SQL is going to be interpreting a table schema to drive accesses to table data stored on disk (and cached in memory) and decoding and applying the operations from the query etc. etc.  For both SQL and stored content objects you'll be making data representation choices that affect the speed of the operations you'll be doing (e.g., storing measurements as text or binary values).
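A quick illustration of that representation choice (the field layout is an assumption): the same measurement stored as text versus packed binary, which affects both its size and the decoding cost during aggregation.

    import struct

    as_text = b"2019-07-19T07:32:00Z,room-12,21.4"
    as_binary = struct.pack("<IHf", 1563521520, 12, 21.4)   # timestamp, room id, value

    print(len(as_text), "bytes as text,", len(as_binary), "bytes packed")
    print(struct.unpack("<IHf", as_binary)[2])               # decode the value back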

People have cared for some time that SQL databases had good performance... so a lot of time has been spent optimizing them.
Nobody has spent a lot of time optimizing repo query tools, and the supporting NDN components, but I think it would have a good payoff for many applications if someone did.

Does that give you a better understanding of my position?

Yours, Junxiao
