Ops checks and alert analysis

From manual console checks to AI-run ops investigations

Dutifly turns the recurring web work in your ops routine into reusable AI workflows.

Health checks, log review, cloud-resource audits, alert triage, root-cause investigation, and report generation all run across the tools you already use. Dutifly pulls the signals together, traces the evidence, and gives you a conclusion you can review instead of making you sign in, search logs, and reconcile metrics by hand.

Every resolved incident and every correction becomes part of your personal ops Skill. When a similar issue appears again, the AI starts from your team's proven playbook.

Minutes · Cross-tool health checks
Signal merge · Metrics, logs, traces
Runbook memory · Personal Skill growth
System Health: Optimal

SCENARIO

What ops teams repeat every day

Daily health check

Open Prometheus, Grafana, SLS, and cloud consoles one by one, then correlate the signals manually.

09:00

Alert storm triage

Thirty alerts arrive at once. Most are noise, and the one that matters is buried in the stream.

10:30

Incident investigation

Jump between Kibana, Jaeger, and Grafana while context keeps dropping between screens.

14:20

WORKFLOW

Let AI carry the complexity

Ops teams describe the investigation target; Dutifly breaks it into steps, gathers data across platforms, and assembles the evidence chain.

Human-approved

One-time access

An operator grants read-only access and defines the boundary for monitoring sites, logs, and cloud resources.

Grafana / Prometheus / SLS

AI-run

Intent parsing

Describe the task in natural language; the AI turns it into an investigation path.

AI-run

Signal gathering

Collect and connect clues across monitoring, log, and trace systems.

AI-run

Root-cause analysis

Produce an evidence-backed assessment and mark uncertainty.

AI-run

Preference learning

Remember your troubleshooting habits and turn human corrections into reusable Skills.

Dutifly · Ops workspace
Human review
Payment Service shows P99 latency instability between 04:32 and 04:41. Peak latency reached 2,340 ms, with slow queries concentrated on payment-db-replica-2.
I will correlate SLS error logs, Prometheus metrics, and Jaeger traces, then rule out release changes and external dependencies.
Scheduled jobs usually run at that time. Next time, exclude scheduled-job traffic before you analyze it.
Recorded and rerun
For the 04:00-05:00 window, source=scheduler traffic will be filtered automatically. After filtering, the user request path is healthy.
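
The correction in this exchange, filtering source=scheduler traffic before measuring latency, can be pictured with a small Python sketch. It is only an illustration, not Dutifly's implementation: the record fields (`ts`, `source`, `latency_ms`), the sample date, and the helper names are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from datetime import datetime, time


def p99(values):
    """Nearest-rank 99th percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def exclude_scheduler(requests, start=time(4, 0), end=time(5, 0)):
    """Drop source == 'scheduler' requests whose timestamp falls in [start, end)."""
    return [
        r for r in requests
        if not (r["source"] == "scheduler" and start <= r["ts"].time() < end)
    ]


# Hypothetical sample: 99 fast user requests plus 5 slow scheduler requests.
requests = [
    {"ts": datetime(2024, 1, 1, 4, 35), "source": "user", "latency_ms": 120}
    for _ in range(99)
] + [
    {"ts": datetime(2024, 1, 1, 4, 36), "source": "scheduler", "latency_ms": 2340}
    for _ in range(5)
]

raw_p99 = p99([r["latency_ms"] for r in requests])
user_p99 = p99([r["latency_ms"] for r in exclude_scheduler(requests)])
```

In this made-up sample the unfiltered P99 is dominated by the scheduler requests, while the filtered P99 reflects only the user request path.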

CAPABILITIES

AIOps coverage across the full workflow

Not another dashboard. An ops assistant that understands system signals.

Daily health reports

Generate cross-platform health summaries every day, with source evidence preserved.

Alert prioritization and noise reduction

Use historical patterns to separate false positives, duplicates, and genuine risk.
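
As a rough illustration of separating duplicates and known false positives from genuine risk, the Python sketch below groups alerts by a fingerprint and sets aside fingerprints previously marked as noise. The alert fields, the `(service, name)` fingerprint, and the `known_noise` set are assumptions made for this example; they do not describe how Dutifly models alerts.

```python
from collections import Counter


def fingerprint(alert):
    """Hypothetical identity for an alert: same service and same alert name."""
    return (alert["service"], alert["name"])


def triage(alerts, known_noise):
    """Collapse duplicates, then split alerts into actionable and suppressed."""
    counts = Counter(fingerprint(a) for a in alerts)
    unique = {}
    for a in alerts:
        unique.setdefault(fingerprint(a), a)  # keep the first occurrence only

    actionable, suppressed = [], []
    for fp, alert in unique.items():
        entry = {**alert, "count": counts[fp]}
        if fp in known_noise:
            suppressed.append(entry)
        else:
            actionable.append(entry)
    return actionable, suppressed


# Hypothetical alert stream and noise history.
alerts = [
    {"service": "payment", "name": "HighLatency"},
    {"service": "payment", "name": "HighLatency"},
    {"service": "cache", "name": "DiskFlap"},
    {"service": "payment", "name": "DBReplicaLag"},
]
known_noise = {("cache", "DiskFlap")}

actionable, suppressed = triage(alerts, known_noise)
```

Keeping the duplicate count on each entry preserves the evidence that an alert fired repeatedly without repeating it in the review list.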

Root-cause localization

Connect metrics, logs, and distributed traces into a verifiable investigation path.

Capacity trend forecasting

Spot CPU, memory, GC, and disk-watermark trends before they become capacity incidents.
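
One simple way to picture a disk-watermark trend is a least-squares line fitted to daily usage readings, extrapolated to the watermark. The sketch below assumes evenly spaced daily samples and an 85% watermark; both are illustrative choices, and a straight-line fit is a stand-in technique rather than Dutifly's forecasting method.

```python
def days_until_watermark(daily_usage_pct, watermark_pct=85.0):
    """Estimate days from the last sample until a linear fit reaches the watermark.

    Returns None when there are fewer than two samples or usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct)
    ) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (watermark_pct - intercept) / slope
    # Measure from the most recent sample (day index n - 1); never negative.
    return max(0.0, crossing_day - (n - 1))


# Hypothetical readings: disk usage grows 2 points per day.
remaining = days_until_watermark([60, 62, 64, 66, 68])
```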

Multi-cloud resource governance

Bring AliCloud, AWS, and GCP resources into one view of instances, utilization, and drift.

Change impact review

Compare key signals before and after releases so teams can understand impact quickly.

For this service, filter scheduler traffic during overnight batch windows
When payment-path P99 exceeds 800 ms, inspect the connection pool first

Turn every investigation into a personal Skill

Dutifly is not limited to one-off analysis. It remembers business rules, troubleshooting habits, and preferences. Manual corrections, exception rules, and final judgments become part of your personal Skill library so the next health check or incident starts with what you already know.
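
A rule such as "when payment-path P99 exceeds 800 ms, inspect the connection pool first" could be stored as plain data and consulted at the start of an investigation. The Python sketch below is one hypothetical shape for that; `SkillRule`, its fields, and `first_steps` are invented for illustration and are not Dutifly's Skill format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillRule:
    """Hypothetical record of one learned troubleshooting preference."""
    service: str
    metric: str
    threshold: float
    first_step: str


RULES = [
    SkillRule("payment", "p99_latency_ms", 800, "inspect the connection pool"),
]


def first_steps(service, metrics, rules=RULES):
    """Return the first steps whose rule matches the service and exceeds its threshold."""
    return [
        rule.first_step
        for rule in rules
        if rule.service == service
        and metrics.get(rule.metric, 0) > rule.threshold
    ]


steps = first_steps("payment", {"p99_latency_ms": 2340})
```

Storing corrections as data rather than prose keeps each one reviewable and easy to retire when it no longer applies.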

Personal Skill library · Reusable playbooks

SUPPORTED ECOSYSTEM

Connect directly to existing monitoring, logging, tracing, and alerting tools

  • Prometheus
  • Grafana
  • AliCloud SLS
  • Elasticsearch
  • Zabbix
  • Jaeger
  • PagerDuty
  • Datadog
  • Tencent CLS

FAQ

Frequently asked questions

Start with a low-risk health-check workflow, confirm permissions, evidence paths, and human approval, then expand automation step by step.

Will Dutifly change production systems directly?

No. In ops scenarios, Dutifly starts with read-only analysis, evidence organization, and recommendations. High-risk actions such as restarts, scaling, configuration changes, or rollbacks require owner approval.

Do we need to connect every monitoring platform at once?

No. It is better to start with one service, one alert workflow, or one routine health check. Once metrics, logs, and troubleshooting experience are working together, you can expand to more platforms.

What ops work is Dutifly good at?

Routine health checks, alert noise reduction, anomaly attribution, cross-platform summaries, incident reports, and team knowledge capture. Outputs stay reviewable, and critical judgment remains with the human owner.

How are data permissions controlled?

Dutifly reads only within the scope you authorize. Teams can limit platforms, services, resources, and task boundaries. Data outside that scope is not accessed proactively.
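
The idea of an authorized scope can be sketched as an allowlist that every access is checked against. The Python example below is a minimal illustration under assumed names (`Scope`, `is_allowed`, and the sample platform and service values); it does not describe Dutifly's actual permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical read-only boundary granted by an operator."""
    platforms: frozenset
    services: frozenset


def is_allowed(scope, platform, service, action="read"):
    """Permit only read actions on platforms and services inside the scope."""
    return (
        action == "read"
        and platform in scope.platforms
        and service in scope.services
    )


scope = Scope(
    platforms=frozenset({"Grafana", "Prometheus"}),
    services=frozenset({"payment"}),
)

in_scope = is_allowed(scope, "Grafana", "payment")
out_of_scope = is_allowed(scope, "Datadog", "payment")
write_attempt = is_allowed(scope, "Grafana", "payment", action="restart")
```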

Start with one read-only health check, then bring Dutifly into your current ops workflow

Pick one service, one alert path, or one routine health check as the pilot. Dutifly brings together monitoring, logs, and traces without rebuilding systems or disrupting your existing tool stack.

Read-only analysis · Human approval for risky actions · Auditable evidence