
Network Monitoring for Cloud-Connected IoT Devices

One of the emerging trends in network monitoring is the integration of cloud computing and Internet of Things (IoT) devices. Cloud computing refers to the delivery of computing services over the internet, such as storage, processing, and software. IoT devices are physical objects that are connected to the internet and can communicate with other devices or systems. Examples of IoT devices include smart thermostats, wearable devices, and industrial sensors.

Cloud-connected IoT devices pose new challenges and opportunities for network monitoring. On one hand, cloud computing enables IoT devices to access scalable and flexible resources and services, such as data analytics and artificial intelligence. On the other hand, cloud computing introduces additional complexity and risk to the network, such as latency, bandwidth consumption, and security threats.

Therefore, network monitoring for cloud-connected IoT devices requires a comprehensive and proactive approach that can address the following aspects:

  • Visibility: Network monitoring should provide a clear and complete view of the network topology, status, and performance of all the devices and services involved in the cloud-IoT ecosystem. This includes not only the physical devices and connections, but also the virtual machines, containers, and microservices that run on the cloud platform. Network monitoring should also be able to detect and identify any anomalies or issues that may affect the network functionality or quality.
  • Scalability: Network monitoring should be able to handle the large volume and variety of data generated by cloud-connected IoT devices. This requires a scalable and distributed architecture that can collect, store, process, and analyze data from different sources and locations. Network monitoring should also leverage cloud-based technologies, such as big data analytics and machine learning, to extract meaningful insights and patterns from the data.
  • Security: Network monitoring should ensure the security and privacy of the network and its data. This involves implementing appropriate encryption, authentication, authorization, and auditing mechanisms to protect the data in transit and at rest. Network monitoring should also monitor and alert on any potential or actual security breaches or attacks that may compromise the network or its data.
  • Automation: Network monitoring should automate as much as possible the tasks and processes involved in network management. This includes using automation tools and scripts to configure, deploy, update, and troubleshoot network devices and services. Network monitoring should also use automation techniques, such as artificial intelligence and machine learning, to perform predictive analysis, anomaly detection, root cause analysis, and remediation actions.
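As a rough illustration of the anomaly detection mentioned above, here is a minimal sketch that flags samples that deviate sharply from a rolling window of recent values. The function name, window size, threshold, and latency figures are all assumptions made for the example; production tools typically use far more robust models.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the recent window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` samples.
    """
    recent = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip perfectly flat windows, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        recent.append(value)
    return anomalies


# Hypothetical round-trip latencies (ms) reported by one IoT device.
latencies_ms = [20, 21, 19, 22, 20, 21, 250, 20, 19, 21]
print(detect_anomalies(latencies_ms))  # [6]
```

Note that a flagged sample still enters the window afterwards, so one large spike temporarily desensitizes the detector. A real deployment might exclude flagged values or use a median-based baseline instead.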

Solutions for Network Monitoring for Cloud-Connected IoT Devices

There are many solutions available for network monitoring for cloud-connected IoT devices. Some of them are native to cloud platforms or specific IoT platforms, while others are third-party or open-source solutions. Some of them are specialized for certain aspects or layers of network monitoring, while others are comprehensive or integrated solutions. Some of them are:

  • Domotz: Domotz is a cloud-based network and endpoint monitoring platform that also provides system management functions. This service is capable of monitoring security cameras as well as network devices and endpoints. Domotz can monitor cloud-connected IoT devices using SNMP or TCP protocols. It can also integrate with various cloud platforms such as AWS, Azure, and GCP.
  • Splunk Industrial for IoT: Splunk Industrial for IoT is a solution that provides end-to-end visibility into industrial IoT systems. It can collect and analyze data from various sources such as sensors, gateways, and cloud services. It can also provide dashboards, alerts, and insights into the performance, health, and security of cloud-connected IoT devices.
  • Datadog IoT Monitoring: Datadog IoT Monitoring is a solution that provides comprehensive observability for cloud-connected IoT devices. It can collect and correlate metrics, logs, traces, and events from various sources such as sensors, gateways, and cloud services. It can also provide dashboards, alerts, and insights into the performance, health, and security of cloud-connected IoT devices.
  • Senseye PdM: Senseye PdM is a solution that provides predictive maintenance for industrial IoT systems. It can collect and analyze data from various sources such as sensors, gateways, and cloud services. It can also provide dashboards, alerts, and insights into the condition, performance, and reliability of cloud-connected IoT devices.
  • SkySpark: SkySpark is a solution that provides analytics and automation for smart systems. SkySpark can collect and analyze data from various sources such as sensors, gateways, and cloud services. SkySpark can also provide dashboards, alerts, and insights into the performance, efficiency, and optimization of cloud-connected IoT devices.

Network monitoring for cloud-connected IoT devices is a vital and challenging task that requires a holistic and adaptive approach. Network monitoring can help to optimize the performance, reliability, and security of the network and its components. Network monitoring can also enable new capabilities and benefits for cloud-IoT applications, such as enhanced user experience, improved operational efficiency, and reduced costs.

Containers and Kubernetes Observability Tools and Best Practices

Containers and Kubernetes are popular technologies for developing and deploying cloud-native applications. Containers are lightweight and portable units of software that can run on any platform. Kubernetes is an open-source platform that orchestrates and manages containerized workloads and services.

Containers and Kubernetes offer many benefits, such as scalability, performance, portability, and agility. However, they also introduce new challenges for observability. Observability is the ability to measure and understand the internal state of a system based on the external outputs. Observability helps developers and operators troubleshoot issues, optimize performance, ensure reliability, and improve user experience.

Observability in containers and Kubernetes involves collecting, analyzing, and alerting on various types of data and events that reflect the state and activity of the containerized applications and the Kubernetes clusters. These data and events include metrics, logs, traces, events, alerts, dashboards, and reports.

In this article, we will explore some of the tools and best practices for observability in containers and Kubernetes.

Tools for Observability in Containers and Kubernetes

There are many tools available for observability in containers and Kubernetes. Some of them are native to Kubernetes or specific container platforms, while others are third-party or open-source solutions. Some of them are specialized for certain aspects or layers of observability, while others are comprehensive or integrated solutions. Some of them are:

  • Kubernetes Dashboard: Kubernetes Dashboard is a web-based user interface that allows users to manage and monitor Kubernetes clusters and resources. It provides information such as cluster status, node health, pod logs, resource usage, network policies, and service discovery. It also allows users to create, update, delete, or scale Kubernetes resources using graphical or YAML editors.
  • Prometheus: Prometheus is an open-source monitoring system that collects and stores metrics from various sources using a pull model. It supports a multi-dimensional data model, a flexible query language, alerting rules, and visualization tools. Prometheus is widely used for monitoring Kubernetes clusters and applications, as it can scrape metrics from Kubernetes endpoints, pods, services, and nodes. It can also integrate with other tools such as Grafana, Alertmanager, and Thanos.
  • Grafana: Grafana is an open-source visualization and analytics platform that allows users to create dashboards and panels using data from various sources. Grafana can connect to Prometheus and other data sources to display metrics in various formats such as graphs, charts, tables, maps, and more. Grafana can also support alerting, annotations, variables, templates, and other advanced features. Grafana is commonly used for visualizing Kubernetes metrics and performance.
  • EFK Stack: EFK Stack is a combination of three open-source tools: Elasticsearch, Fluentd, and Kibana. Elasticsearch is a distributed search and analytics engine that stores and indexes logs and other data. Fluentd is a data collector that collects and transforms logs and other data from various sources and sends them to Elasticsearch or other destinations. Kibana is a web-based user interface that allows users to explore and visualize data stored in Elasticsearch. EFK Stack is widely used for logging and observability in containers and Kubernetes, as it can collect and analyze logs from containers, pods, nodes, services, and other software.
  • Loki: Loki is an open-source logging system that is designed to be cost-effective and easy to operate. Loki is inspired by Prometheus and uses a similar label-based data model and query language. Logs are shipped to Loki by collection agents that attach Prometheus-style labels, often reusing the same service discovery configuration. Loki indexes only these labels and stores the log content itself in compressed chunks, which keeps storage and operating costs low. Loki can integrate with Grafana to display logs alongside metrics.

Best Practices for Observability in Containers and Kubernetes

Observability in containers and Kubernetes requires following some best practices to ensure effective, efficient, and secure observability. Here are some of them:

  • Define observability goals and requirements: Before choosing or implementing any observability tools or solutions, it is important to define the observability goals and requirements for the containerized applications and the Kubernetes clusters. These goals and requirements should align with the business objectives, the user expectations, the service level agreements (SLAs), and the compliance standards. They should also specify what data and events to collect, how to analyze them, how to alert on them, and how to visualize them.
  • Use standard formats and protocols: To ensure interoperability and compatibility among different observability tools and solutions, it is recommended to use standard formats and protocols for collecting, storing, and exchanging data and events. For example, use OpenMetrics for metrics, JSON for logs, OpenTelemetry for traces, and CloudEvents for events. These standards can help reduce complexity, overhead, and vendor lock-in in observability.
  • Leverage native Kubernetes features: Kubernetes provides some native features that can help with observability. For example, use labels and annotations to add metadata to Kubernetes resources that can be used for filtering, grouping, or querying. Use readiness probes and liveness probes to check the health status of containers. Use resource requests and limits to specify the resource requirements of containers. Use the Horizontal Pod Autoscaler (HPA) or Vertical Pod Autoscaler (VPA) to scale pods based on metrics. Use custom resource definitions (CRDs) or operators to extend the functionality of Kubernetes resources. These features can help improve the visibility, control, and optimization of containers and Kubernetes clusters.
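As one possible illustration of the "JSON for logs" recommendation, the sketch below uses Python's standard `logging` module to emit each record as a single-line JSON object, enriched with static fields such as a pod name and namespace. The `JsonFormatter` class and the field values are assumptions made for this example, not part of any particular logging product.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, static_fields=None):
        super().__init__()
        # Fields added to every record, e.g. Kubernetes metadata.
        self.static_fields = dict(static_fields or {})

    def format(self, record):
        payload = {
            # Local time of the record, formatted without a UTC offset.
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(self.static_fields)
        return json.dumps(payload, sort_keys=True)


# Hypothetical pod metadata; in a cluster these values might come from
# environment variables injected into the container.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter({"pod": "checkout-7d9f", "namespace": "shop"}))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.warning("payment retry %d", 2)
```

Because every line is a self-describing JSON object, a collector such as Fluentd can parse the fields without custom regular expressions.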

Monitoring and Observability in the Oracle Cloud

Monitoring and observability are essential practices for ensuring the availability, performance, security, and cost-efficiency of cloud-based systems and applications. Monitoring and observability involve collecting, analyzing, and alerting on various types of data and events that reflect the state and activity of the cloud environment, such as metrics, logs, traces, and user experience.

Oracle Cloud provides a comprehensive set of tools and services for monitoring and observability of its cloud resources and services. Oracle Cloud also supports integration with third-party tools and standards for monitoring and observability of hybrid and multi-cloud environments.


In this article, we will discuss some of the benefits and challenges of monitoring and observability of Oracle Cloud.

Benefits of Monitoring and Observability of Oracle Cloud

Some of the benefits of monitoring and observability of Oracle Cloud are:

  • Visibility: Oracle Cloud provides visibility into the health, performance, usage, and cost of its cloud resources and services. Users can access metrics, logs, events, alerts, dashboards, reports, and analytics from the Oracle Cloud console or APIs. Users can also use Oracle Cloud Observability and Management Platform, which provides a unified view of the observability data across Oracle Cloud and other cloud or on-premises environments.
  • Control: Oracle Cloud provides control over the configuration, management, and optimization of its cloud resources and services. Users can use policies, rules, thresholds, actions, functions, notifications, and connectors to automate monitoring and observability tasks. Users can also use Oracle Cloud Resource Manager to deploy and manage cloud resources using Terraform-based automation.
  • Security: Oracle Cloud provides security for its cloud resources and services. Users can use encryption, access control, identity management, auditing, compliance, firewall, antivirus, vulnerability scanning, and incident response to protect their cloud data and assets. Users can also use Oracle Cloud Security Advisor to assess their security posture and receive recommendations for improvement.
  • Innovation: Oracle Cloud provides innovation for its cloud resources and services. Users can use artificial intelligence (AI), machine learning (ML), natural language processing (NLP), computer vision (CV), blockchain, chatbots, digital assistants, Internet of Things (IoT), edge computing, serverless computing, microservices, containers, and Kubernetes to enhance their cloud capabilities and outcomes. Users can also use Oracle Cloud Enterprise Manager to monitor, analyze, and administer Oracle Database and Engineered Systems.

Challenges of Monitoring and Observability of Oracle Cloud

Some of the challenges of monitoring and observability of Oracle Cloud are:

  • Complexity: Oracle Cloud offers a wide range of services and features that can create complexity and confusion for users. Users need to understand and choose the appropriate tools and services for their monitoring and observability needs. Users also need to configure and manage the tools and services properly to avoid errors, misconfigurations, or inefficiencies.
  • Integration: Oracle Cloud supports integration with third-party tools and standards for monitoring and observability. However, users need to ensure compatibility, interoperability, and security of the integration solutions. Users also need to deal with potential issues such as data duplication, inconsistency, or loss.
  • Skills: Oracle Cloud requires users to have adequate skills and knowledge to use its tools and services for monitoring and observability. Users need to learn how to use the Oracle Cloud console, APIs, CLI, SDKs, and other interfaces. Users also need to learn how to use the Oracle Cloud Observability and Management Platform, Oracle Cloud Resource Manager, Oracle Cloud Security Advisor, Oracle Cloud Enterprise Manager, and other tools and services.

Monitoring and observability are essential practices for ensuring the availability, performance, security, and cost-efficiency of cloud-based systems and applications. Oracle Cloud provides a comprehensive set of tools and services for monitoring and observability of its cloud resources and services. Oracle Cloud also supports integration with third-party tools and standards for monitoring and observability of hybrid and multi-cloud environments.
However, monitoring and observability of Oracle Cloud also pose some challenges, such as complexity, integration, and skills. Users need to be aware of these challenges and address them accordingly to ensure effective, efficient, and secure monitoring and observability of Oracle Cloud.

AWS vs Azure: Serverless Observability and Monitoring

Serverless computing is a cloud service model that allows developers to run code without provisioning or managing servers. Serverless applications are composed of functions that are triggered by events and run on demand. Serverless computing offers many benefits, such as scalability, performance, cost-efficiency, and agility.

However, serverless computing also introduces new challenges for observability and monitoring. Observability is the ability to measure and understand the internal state of a system based on the external outputs. Monitoring is the process of collecting, analyzing, and alerting on the metrics and logs that indicate the health and performance of a system.

Observability and monitoring are essential for serverless applications because they help developers troubleshoot issues, optimize performance, ensure reliability, and improve user experience. However, serverless applications are more complex and dynamic than traditional applications, making them harder to observe and monitor.

Some of the challenges of serverless observability and monitoring are:

  • Lack of visibility: Serverless functions are ephemeral and stateless, meaning they are created and destroyed on demand, and do not store any data or context. This makes it difficult to track the execution flow and dependencies of serverless functions across multiple services and platforms.
  • High cardinality: Serverless functions can have many variations based on input parameters, environment variables, configuration settings, and runtime versions. This creates a high cardinality of metrics and logs that need to be collected and analyzed.
  • Distributed tracing: Serverless functions can be triggered by various sources, such as HTTP requests, messages, events, timers, or other functions. This creates a distributed tracing problem, where developers need to correlate the traces of serverless functions across different sources and services.
  • Cold starts: Serverless functions can experience cold starts, which are delays in the execution time caused by the initialization of the function code and dependencies. Cold starts can affect the performance and availability of serverless applications, especially for latency-sensitive scenarios.
  • Cost optimization: Serverless functions are billed based on the number of invocations and the execution time. Therefore, developers need to monitor the usage and cost of serverless functions to optimize their resource allocation and avoid overspending.
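The cost point above can be made concrete with a little arithmetic. The sketch below estimates a monthly bill from the invocation count and average duration, assuming (as is common for function platforms) that execution time is weighted by configured memory and billed in GB-seconds. The function name and both unit prices are illustrative placeholders, not the published rates of any provider.

```python
def estimate_monthly_cost(invocations, avg_duration_ms, memory_mb,
                          price_per_million_requests, price_per_gb_second):
    """Estimate monthly cost from invocation count and execution time."""
    # Request charge: a flat price per million invocations.
    request_cost = invocations / 1_000_000 * price_per_million_requests
    # Compute charge: total seconds scaled by configured memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    return round(request_cost + compute_cost, 2)


# Placeholder workload and placeholder prices, for illustration only.
cost = estimate_monthly_cost(
    invocations=5_000_000,
    avg_duration_ms=200,
    memory_mb=512,
    price_per_million_requests=0.20,
    price_per_gb_second=0.0000166667,
)
print(cost)  # 9.33
```

With these inputs the compute charge (500,000 GB-seconds) dominates the request charge, which is why reducing duration or memory often matters more than reducing call volume.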

AWS and Azure are two of the leading cloud providers that offer serverless computing services. AWS Lambda is the serverless platform of AWS, while Azure Functions is the serverless platform of Azure. Both platforms provide observability and monitoring features for serverless applications, but they also have some differences and limitations.

In this article, we will compare AWS Lambda and Azure Functions in terms of their observability and monitoring capabilities, including their native features and third-party software reviews and recommendations.

Native Features

Both AWS Lambda and Azure Functions provide native features for observability and monitoring serverless applications. These features include:

  • Metrics: Both platforms collect and display metrics such as invocations, errors, duration, memory usage, concurrency, and throughput for serverless functions. These metrics can be viewed on dashboards or queried using APIs or CLI tools. Metrics can also be used to create alarms or alerts based on predefined thresholds or anomalies.
  • Logs: Both platforms capture and store logs for serverless functions. These logs include information such as start and end time, request ID, status code, error messages, custom print statements, etc. Logs can be viewed on consoles or queried using APIs or CLI tools. Logs can also be streamed or exported to external services for further analysis or retention.
  • Tracing: Both platforms support distributed tracing for serverless functions. Distributed tracing allows developers to track the execution flow and latency of serverless functions across different sources and services. Tracing can help identify bottlenecks, errors, failures, or performance issues in serverless applications.

Both platforms use open standards such as OpenTelemetry or W3C Trace Context for tracing. However, there are also some differences between AWS Lambda and Azure Functions in terms of their native features for observability and monitoring.

Some of these differences are:

  • Metrics granularity: AWS Lambda provides metrics at a 1-minute granularity by default, while Azure Functions provides metrics at a 5-minute granularity by default. However, both platforms allow users to change the granularity to a lower or higher level depending on their needs.
  • Metrics aggregation: AWS Lambda aggregates metrics by function name, function version or alias (if specified), region (if specified), or globally (across all regions). Azure Functions aggregates metrics by function name (or function app name), region (if specified), or globally (across all regions).
  • Logs format: AWS Lambda logs are formatted as plain text with a timestamp prefix. Azure Functions logs are formatted as JSON objects with various fields such as timestamp, level, message, category, functionName, invocationId, etc.
  • Logs retention: AWS Lambda logs are stored in Amazon CloudWatch Logs, where log groups are kept indefinitely by default unless users set a retention period. Azure Functions logs are stored in Azure Monitor for 30 days by default (or longer if specified by users).
  • Tracing integration: AWS Lambda integrates with AWS X-Ray service for tracing. AWS X-Ray provides a web console and an API for viewing traces and analyzing the performance of serverless applications on AWS. Azure Functions integrates with Azure Application Insights service for tracing. Azure Application Insights provides a web console and an API for viewing traces and analyzing the performance of serverless applications on Azure.
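Whichever tracing backend is used, correlating functions across services relies on passing a trace identifier from caller to callee. As a minimal sketch of the W3C Trace Context standard mentioned earlier, the code below parses a `traceparent` header and derives a header for a child call that keeps the trace ID but uses a fresh span ID. The function names are hypothetical, only version `00` headers are accepted, and whether a given Lambda or Azure Functions setup actually propagates this header depends on how it is instrumented.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Simplification: only version 00 is handled; all-zero IDs are invalid.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the header for an outgoing call made within the same trace."""
    parent = parse_traceparent(header)
    if parent is None:
        return None
    new_span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

Because the trace ID is preserved across hops, a backend can stitch the spans emitted by each function into a single end-to-end trace.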

Cloud Native Security: Cloud Native Application Protection Platforms

Back in 2022, 77% of interviewed CIOs stated that their IT environment is constantly changing. We can only guess that, were the respondents asked today, this number would be 90% or higher. Detecting flaws and security vulnerabilities becomes more and more challenging in 2023, since the complexity of a typical software deployment is increasing exponentially year over year. The relatively new trend of Cloud Native Application Protection Platforms (CNAPP) is now supported by the majority of cybersecurity companies, which offer their CNAPP solutions for cloud and on-prem deployments.

The rapid growth of CNAPP is driven by cybersecurity threats, and misconfiguration is one of the most commonly reported causes of security breaches and data loss. As workloads and data move to the cloud, the required skill sets of IT and DevOps teams must also become much more specialized. The likelihood of an unintentional misconfiguration increases because the majority of seasoned IT workers still have more expertise and have received more training on-prem than in the cloud. In contrast, a young “cloud-native” DevOps professional often has very little knowledge of “traditional” security, such as network segmentation or firewall configuration, which typically results in configuration errors.

Some CNAPP vendors are proud to be “agentless”, eliminating the need to install and manage agents that can cause various issues, from machine overload to agent vulnerabilities due to security flaws and, guess what, due to the agent’s misconfiguration. Agentless monitoring has its benefits, but it is not free of risks. Any monitored device should be “open” for such monitoring, which typically comes from a remote server. If an adversary is able to fake a monitoring attempt, they can easily get access to all the monitored devices and compromise the entire network. So “agentless CNAPP” does not automatically mean a better solution than a competing security platform. Easier for IT staff to maintain? Yes, it is. Is it more secure? Probably not.

Full Stack IT Observability Will Drive Business Performance in 2023

Cisco predicts that 2023 will be shaped by a few exciting trends in technology, including network observability with business correlation. Cisco’s EVP & Chief Strategy Officer Liz Centoni is sure that

To survive and thrive, companies need to be able to tie data insights derived from normal IT operations directly to business outcomes or risk being overtaken by more innovative competitors

and we cannot agree more.

Proper intelligent monitoring of digital assets, along with distributed tracing, should be tightly connected to the business context of the enterprise. Thus, any organization can benefit from actionable business insights while improving the online and digital user experience for customers, employees, and contractors. Additionally, a fast IT response based on AI-driven analysis of the network and asset events that monitoring collects can prevent, or at least quickly remediate, the most common security threat that exists in nearly any modern digital organization: misconfiguration. 79% of firms have already experienced a data breach in the past 2 years, while 67% of them pointed to security misconfiguration as the main reason.

Misconfiguration of most software products can be detected and fixed in a timely manner when network observability and network monitoring tools collect data and apply machine learning to network events and configuration files. An enterprise should require its IT departments to reach full stack observability and connect the results with the business context. This is particularly important since we know that 99% of cloud security failures are customers’ mistakes (source: Gartner). Business context should be widely adopted as a part of the results delivered by intelligent observability and cybersecurity solutions.
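To give a sense of what automated misconfiguration detection looks like at its simplest, here is a sketch that compares a configuration snapshot against a baseline of expected values. It is a plain rule-based check rather than the machine learning approach described above, and the setting names, rules, and messages are hypothetical.

```python
def find_misconfigurations(config, rules):
    """Return a finding for every setting that differs from its expected value.

    `rules` maps a setting name to an (expected_value, message) pair.
    A setting missing from `config` is reported as well.
    """
    findings = []
    for key, (expected, message) in rules.items():
        if config.get(key) != expected:
            findings.append(f"{key}: {message}")
    return findings


# Hypothetical baseline for a cloud workload.
rules = {
    "storage_public_access": (False, "storage must not be publicly readable"),
    "encryption_at_rest": (True, "encryption at rest must be enabled"),
    "ssh_open_to_world": (False, "SSH must not be open to 0.0.0.0/0"),
}

# Hypothetical snapshot collected by a monitoring tool.
config = {
    "storage_public_access": True,
    "encryption_at_rest": True,
    "ssh_open_to_world": False,
}

print(find_misconfigurations(config, rules))
```

Attaching business context, for example which customer-facing service depends on the affected storage, is what would turn such a finding into a prioritized alert.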

Cloud Monitoring Market Size Estimations

According to a marketing study, the global IT infrastructure monitoring market is expected to grow at a 13.6% CAGR, reaching USD $64.5 in 2031. Modern IT infrastructure is becoming increasingly complex and requires new skills from IT personnel, often blurring the borders between IT staff, DevOps, and development teams. With the continued move from on-prem deployments to the enterprise cloud, IT infrastructure goes to the cloud as well, and thus IT teams have to learn basic cloud-DevOps skills, such as scripting, cloud-based scaling, event creation, and monitoring. Furthermore, no company today offers a complete monitoring solution that can monitor any network device and software component.

Thus, IT teams have to build their monitoring solutions piece by piece, using various, mostly unconnected systems developed by different, often competing vendors. For some organizations, it also comes down to compliance, such as GDPR or ISO requirements, and to SLAs that obligate the IT department to detect, report, and fix any issue with their systems in a timely manner. In this challenging multi-system and multi-device environment, network observability becomes the key to enterprise success. IT organizations keep increasing their budgets, seeking comprehensive cloud and on-prem monitoring for their systems and devices, and require employees to run network and device monitoring software on their personal devices, such as mobile phones and laptops. This trend also increases IT spending on cybersecurity solutions such as SDR and network security analysis with various SIEM tools.

Metrist raises $5.5M for eBPF-based cloud monitoring

Metrist, a startup with DevOps roots, raises $5.5M to help companies deal with cloud service outages. Metrist was founded by two DevOps veterans, Jeff Martens and Ryan Duffield, whose past experience includes working for New Relic, PagerDuty, and similar observability and monitoring companies.

Metrist Founders, Image Credit: Metrist

Metrist’s idea is not very original: negotiate outages that vendors’ SLAs do not cover. Surprisingly, there are not too many competitors in this area. Some competition for Metrist’s business comes from Parametric Insurance, which sells insurance policies that include cloud and CDN outages.

In contrast to selling insurance, Metrist is willing to play the role of the trusted arbiter in negotiating outage outcomes with vendors and the affected company.

One of the interesting parts of this story is that, according to a TechCrunch report, the Metrist team plans to run an eBPF agent to gather data on the services a customer runs. There are a few issues associated with this technical approach:

  1. Metrist is going to miss all container deployments, e.g. ECS at AWS or any K8s + Docker infrastructure. That is quite a big part of cloud infrastructure that Metrist won’t be able to observe with eBPF-based agents.
  2. On top of that, eBPF cannot see into serverless deployments, e.g. AWS Lambda functions. This further reduces the world of apps that Metrist can monitor.
  3. And there is a third factor that limits Metrist’s scale-up: most enterprises become very suspicious once they are asked to run yet another agent on their cloud VM or bare metal machine. While companies like PagerDuty or New Relic have already overcome this psychological barrier by being on the market for long enough, it could still be a showstopper for a young startup that needs to prove itself to its customers.

Having said this, we wish the Metrist team all the success.

Digital Enterprise Journal: Top 20 Vendors for Managing IT Performance in 2022

Digital Enterprise Journal recently published its analysis of IT performance markets, 24 Key Areas Shaping IT Performance Markets in 2022. Designed to help end-user organizations understand which solution is the best fit for their specific needs, it provides an in-depth analysis of which vendors, including Catchpoint, align with key user requirements for managing IT performance in relation to this year’s key trends.

Catchpoint is proud of customers like Equinix, SAP, and Cox Automotive, who tell their success stories with the company’s products:

Kelsey Waters, Senior Director of Cloud Operations, Equinix:

Catchpoint gives Equinix a more complete picture of internet visibility into what’s going on in the network, and that helps the company solve problems more quickly and communicate problems with clarity for customers. With Catchpoint, Equinix is able to identify and diagnose issues in a matter of minutes and begin to correct them before they become larger problems for end users.

Equinix is a leader in the digital infrastructure space, providing a platform that guarantees flexibility, scalability, and security. Top-tier enterprises, software as a service (SaaS), and cloud providers rely on Equinix to deliver services and expect no compromise when it comes to digital performance.

Equinix is a neutral co-location and data center provider. “The fundamental idea of Equinix was to create a place where competitive networks could come together and share data in a secure way,” explains Kelsey Waters of Cloud Operations. Equinix includes its subset, Equinix Metal. Equinix Metal provides bare metal services in a consumption-based model, similar to public clouds but in a bare metal fashion (Catchpoint is itself a customer).

Digital performance is crucial to Equinix, as they help customers scale businesses with agility and ease, without worrying about critical infrastructure. With more than 220 data centers in over 26 locations worldwide, Equinix strives to maintain 99.9% uptime. Equinix partnered with Catchpoint to:

  • Ensure service reliability.
  • Offer customers insights into observability and performance trends.
  • Maintain consistent availability and reachability.
  • Provide a full picture of the internet.

To ensure customers provide the best end-user experience, Equinix services must consistently run at peak levels. That’s why they invested in Catchpoint’s end-user observability solution to stay ahead of any network-impacting incident.