The EVAM Blog

Building a Real-Time Device Management System

22.11.2016

The number of connected devices in the home is projected to expand dramatically, and Internet of Things (IoT) devices more broadly are expected to explode in number and variety. One of the constraints on success in this space is effective monitoring and automated device management.

Fortunately, the Broadband Forum took a significant step forward in remote monitoring and device management with the TR-069 standard. We are no longer limited to SNMP, which was used for so many years in IP network devices. TR-069 is an application-layer protocol carried over HTTP, focused on provisioning and communication between customer premises equipment (CPE) and Auto Configuration Servers (ACS). The range of devices serviceable with TR-069 includes routers, gateways, set-top boxes, VoIP phones, and others.

This article describes how device management can be transformed from a reactive, remedial process into a proactive, real-time system.

Challenges and Architecture of Device Management

Customer premises device management is challenging: there are millions of devices, from varied manufacturers, under a mandate for low-cost, always-on service. An in-home device that requires extensive service, or requires the end user to reboot it continually, won’t fly.

TR-069 allows the Auto Configuration Server (ACS) to identify the device type, manufacturer, and serial number, and to retrieve device logs. It also lets us track the online status of the device through a “heartbeat,” along with the last configuration change and reboot. The ACS can also request that the device download an updated configuration or new firmware. TR-069 is designed to restore a device to operation after it has failed.
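To make the heartbeat idea concrete, here is a minimal sketch of the kind of record an ACS might keep per device, with a simple online check. The field names, the 5-minute interval, and the two-missed-heartbeats rule are illustrative assumptions, not TR-069 parameter names or values from the standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DeviceRecord:
    """Per-device state an ACS might track (hypothetical field names)."""
    manufacturer: str
    product_class: str
    serial_number: str
    last_heartbeat: datetime
    last_config_change: datetime
    last_reboot: datetime


def is_online(device: DeviceRecord, now: datetime,
              heartbeat_interval: timedelta = timedelta(minutes=5),
              missed_allowed: int = 2) -> bool:
    """Treat a device as online until it has missed too many heartbeats.

    Both the interval and the tolerance are assumed values for illustration.
    """
    return now - device.last_heartbeat <= heartbeat_interval * missed_allowed
```

Note that a check like this only reports a failure after the fact, which is the limitation the rest of this article addresses.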

Unfortunately, customer premises devices support a wide range of uses and configurations, which leads to many failures. Routers and residential gateways are good examples: they support wired and wireless LANs, WEP or WPA2 encryption, DHCP or static IP assignment with variable DHCP lease terms, IPv4 and IPv6, and varied routing protocols and bridges. Environmental factors also play a large role, including power quality, RF interference, overheating, and customer-driven load from services like BitTorrent.

Customer experience with most in-home devices is a story of failure, with frequent customer-initiated reboots of routers. To enable the vision of the smart home, or IoT more generally, we need to do dramatically better in device management.

Enhancing ACS Services with Scenarios and Actions

Simply put, TR-069 is a feedback-oriented design: it waits for devices to go offline before taking action. This approach maximizes failures and ultimately raises costs. What is needed is proactive pattern recognition: detecting the patterns that lead to device failure and taking steps to preserve device uptime before the failure occurs.

The causes for devices going offline are many and include software quality (memory leaks), device configuration, device loading, power quality, device overheating, and many other issues.

The model used for proactive event management has been developed over the past decade and is widely used today for real-time event management: fraud detection in banking, real-time offer management for retailers, and wireless service management for carriers worldwide. The design is low risk, works on premises or on the major public clouds, and is economical. A recent design for managing 3 million end-customer devices on a wireless network was implemented on a single 8-core Linux host with 64 GB of RAM.

In this design, scenarios are defined that recognize sequences or patterns of events that precede a device failure.  The system recognizes when the events combine to fulfill a scenario, and corresponding actions are taken.   For example, devices may require specific configurations.   When a configuration log indicates the customer has enabled unsupported features, the service provider may email the customer recommending a change to the supported configuration.
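The scenario idea can be sketched in a few lines. The following is only an illustration under stated assumptions, not EVAM's implementation: a scenario is modeled as an ordered sequence of event kinds that must occur for the same device within a time window, and the event names and action label are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    device_id: str
    kind: str         # hypothetical event name, e.g. "high_temperature"
    timestamp: float  # seconds


@dataclass
class Scenario:
    name: str
    sequence: tuple   # event kinds that must occur in this order
    window: float     # max seconds between first and last event
    action: str       # label of the action to take when fulfilled
    # device_id -> (index of next expected event, time of first event)
    _progress: dict = field(default_factory=dict, init=False, repr=False)

    def feed(self, event: Event):
        """Advance this device's progress; return the action once fulfilled."""
        idx, start = self._progress.get(event.device_id, (0, event.timestamp))
        if event.timestamp - start > self.window:
            # Too slow: forget partial progress and start over.
            self._progress.pop(event.device_id, None)
            idx, start = 0, event.timestamp
        if event.kind != self.sequence[idx]:
            return None  # unrelated event; keep any partial progress
        if idx + 1 == len(self.sequence):
            self._progress.pop(event.device_id, None)
            return self.action
        self._progress[event.device_id] = (idx + 1, start)
        return None


# Usage: three warning signs for one device within 10 minutes trigger an action.
overheat = Scenario("pre-failure", ("high_temperature", "memory_low", "packet_loss"),
                    window=600, action="push_config")
events = [Event("gw-1", "high_temperature", 0), Event("gw-2", "memory_low", 5),
          Event("gw-1", "memory_low", 60), Event("gw-1", "packet_loss", 120)]
actions = [a for e in events if (a := overheat.feed(e))]
```

Progress is tracked per device, so the stray event from gw-2 neither advances nor resets the sequence building up for gw-1.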

The architecture of such a system is shown below, with real-time event collection occurring between the ACS and the real-time event engine (illustrated here as EVAM). In step 1, the event engine collects and recognizes events in real time. When a scenario is fulfilled, the resulting actions can either go back to the ACS (step 2), such as requesting that the device download a new configuration file, or be directed through separate channels (step 3) such as email, outbound calls, or other customer-directed communication.

Proactive Device Management System
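The split between step 2 and step 3 amounts to routing each action to one of two destinations. A minimal sketch, assuming hypothetical action names and caller-supplied send functions (no real ACS or messaging API is implied):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    device_id: str
    kind: str         # hypothetical action name
    detail: str = ""


# Assumed groupings for illustration only.
ACS_ACTIONS = {"download_config", "download_firmware"}   # step 2
CUSTOMER_ACTIONS = {"email", "outbound_call"}            # step 3


def route(action: Action,
          send_to_acs: Callable[[Action], None],
          send_to_customer: Callable[[Action], None]) -> str:
    """Deliver an action to the ACS or to a customer channel; return the route."""
    if action.kind in ACS_ACTIONS:
        send_to_acs(action)
        return "acs"
    if action.kind in CUSTOMER_ACTIONS:
        send_to_customer(action)
        return "customer"
    raise ValueError(f"unknown action kind: {action.kind}")


# Usage: plain lists stand in for the two delivery channels.
acs_queue, customer_queue = [], []
route(Action("gw-1", "download_config"), acs_queue.append, customer_queue.append)
route(Action("gw-1", "email", "unsupported feature enabled"),
      acs_queue.append, customer_queue.append)
```

Keeping the delivery functions as parameters means the same routing logic works whether the channels are queues, HTTP calls, or test doubles.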

A critical aspect of this system architecture is surfacing the patterns that lead to failure; once recognized, these patterns can be added to scenarios for action.

Conclusions:

It’s easy to see that smart home device management will be enhanced with proactive event detection and device management. The design outlined in this article is proven across many industries, is affordable, and helps shift device management from purely reactive actions toward a mix that is closer to 50:50 proactive to reactive.

Author: