ControlUp

ControlUp provides insight for IT managers into the hypervisors, server farms, and desktops in their organization.
ControlUp consists of two components:
1. ControlUp real-time component
– An on-premises component that helps you find and fix issues through an in-OS guest view, a real-time grid interface, and assisted troubleshooting.
– Easily monitor and manage through the console’s actionable dashboard and quickly resolve issues on hypervisors, virtual and physical servers, user sessions, application processes, and more.
2. ControlUp Insights
The big data analytics platform summarizes key metrics on the state of your monitored resources every 24 hours, highlighting key activity indicators, risk factors, and performance stats.
The Insights component can help an IT manager with questions like sizing, system health over long periods of time, user experience, and application usage.
Previously only the real-time component existed, so there was no history and no deep insight into the system.
The Insights component leverages the real-time component's data to build a new product that helps the IT manager with decisions like:
- Do I need to buy more disks?
- How many licenses does the organization use for software X?
- Which applications/processes on my users' desktops are causing problems?
- Data is uploaded from the customer's data centers (the source real-time component) to AWS with little network disruption. A first snapshot of the runtime state is uploaded; from that point on, only changes are uploaded (not full records), as sketched below. This is done to minimize the effect of the new system on the customer site (network load, and minimal changes to the real-time application).
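A minimal sketch of this snapshot-then-delta scheme, in Python. The record shape and the read_runtime_state and send_to_aws callables are hypothetical stand-ins for illustration, not ControlUp's actual interfaces:

    # Illustrative only: the record keys/values and the two callables are assumptions.
    def compute_delta(previous: dict, current: dict) -> dict:
        """Return only the records that were added or changed since the last upload."""
        return {key: record for key, record in current.items()
                if previous.get(key) != record}

    def upload_cycle(previous_snapshot, read_runtime_state, send_to_aws):
        current = read_runtime_state()
        if previous_snapshot is None:
            send_to_aws(current)                      # first run: full snapshot
        else:
            send_to_aws(compute_delta(previous_snapshot, current))  # later runs: deltas only
        return current                                # becomes "previous" for the next cycle

A real delta stream would also need tombstones for records that disappeared; the sketch omits that.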
- Uploaded data covers computers (desktops/hypervisors), sessions, and processes. The data is split into 2 types:
a. Static data (a header) about the component, like names, start/end times, etc. All static changes are saved (we keep history).
b. Counter data like CPU, memory, IO, etc. We needed to generate a full record of all counters every 5 minutes even if no data was uploaded from the real-time component (no changes in the counters); see the sketch below.
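One plausible way to generate those records is to carry the last known counter values forward into every empty 5-minute slot. A sketch under that assumption; the shape of uploads and the counter names are illustrative:

    from datetime import timedelta

    FIVE_MIN = timedelta(minutes=5)

    def fill_counter_records(uploads, start, end):
        """Emit one counter record per 5-minute slot between start and end,
        reusing the last uploaded values when a slot had no upload.
        `uploads` maps slot timestamps to counter dicts, e.g. {"cpu": 12.5}."""
        records, last_known = [], None
        slot = start
        while slot < end:
            if slot in uploads:
                last_known = uploads[slot]                # fresh values arrived
            if last_known is not None:
                records.append((slot, dict(last_known)))  # full record every slot
            slot += FIVE_MIN
        return records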
- The real-time component's data is partial and contains incorrect and inconsistent records. The ETL needed to validate and fix these issues, as well as generate the missing data.
The connectivity of an entity (computer/session/process/etc.) holds very important information for the system, but this data might not be accurate, and we need to verify and correct it:
a. We might or might not get an indication that the entity was started/created.
b. We might or might not get a keep-alive indication that the entity is still running.
c. We might or might not get a stop indication that the entity is down.
The connectivity of an entity determines whether the ETL process needs to generate a 5-minute counter record for it; a sketch of that decision follows.
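A sketch of one way to make that call from whatever start/keep-alive/stop events did arrive. The event tuples and the grace period are assumptions for illustration, not the production rules:

    from datetime import timedelta

    KEEPALIVE_GRACE = timedelta(minutes=10)   # assumed threshold, not from the source

    def entity_up_in_slot(slot_start, events):
        """Decide whether to generate a 5-minute counter record for an entity.
        `events` is a list of (timestamp, kind) tuples, kind being one of
        "start", "keepalive", or "stop"; any of the three may be missing."""
        last_signal = None
        for ts, kind in sorted(events):
            if ts > slot_start:
                break
            if kind == "stop":
                last_signal = None                # explicit stop: entity is down
            else:                                 # "start" or "keepalive"
                last_signal = ts
        # Treat the entity as alive if we heard from it recently enough.
        return last_signal is not None and slot_start - last_signal <= KEEPALIVE_GRACE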
- The ETL is split into 2 subsystems:
a. A process that prepares the data: it reads the uploaded partial data and rebuilds full events (filling in missing events and fixing problems in the incoming data).
After verifying, fixing, and generating the raw events, we create aggregation data for all entities (a roll-up sketch follows this list):
i. Time-series events at 5-minute aggregation, kept for 48 hours.
ii. Time-series events at 30-minute aggregation, kept for 7 days.
iii. Time-series events at 1-hour aggregation, kept for 30 days.
iv. Time-series events at 1-day aggregation, kept for 1 year.
b. A serving Redshift cluster for the reporting tools:
i. Uploads the generated events.
ii. Makes sure we are not saving too much history, thereby managing the table and cluster sizes.
iii. Supports a front-end web application with hundreds of users.
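The tiered aggregation can be pictured as repeated roll-ups of the finer series into coarser buckets. A minimal averaging sketch, assuming (timestamp, entity_id, value) rows; the production pipeline presumably keeps more statistics per counter and runs inside the warehouse rather than in Python:

    from collections import defaultdict
    from datetime import datetime, timedelta

    def roll_up(records, bucket: timedelta):
        """Average finer-grained rows (e.g. the 5-minute series) into
        coarser buckets (30 minutes, 1 hour, or 1 day)."""
        sums, counts = defaultdict(float), defaultdict(int)
        for ts, entity_id, value in records:
            epoch = ts.timestamp()
            bucket_start = datetime.fromtimestamp(epoch - epoch % bucket.total_seconds())
            sums[(bucket_start, entity_id)] += value
            counts[(bucket_start, entity_id)] += 1
        return {key: sums[key] / counts[key] for key in sums}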
- The number of events generated by the system depends on the number of monitored machines on site. The biggest entity we monitor is the process entity. We aggregate the process events into 2 types of objects: process and application (an aggregation over processes).
a. In Windows, each process launch forks 2 processes: 1 background process and 1 real process. The background process lives for less than 1 second.
b. In the aggregation tables we save all processes (including the background ones) for the 5-minute time series. For the 30-minute, 1-hour, and 1-day series we filter out the "insignificant" processes (background and unimportant processes); see the filtering sketch after the statistics below.
c. Statistical information:
i. Size of the process tables: more than 1B events per day.
Process Time Series              Number of Records    Number of Blocks
5 min aggregation (48 hours)         2,518,912,084             235,212
30 min aggregation (7 days)          3,314,966,261             258,348
1 hour aggregation (1 month)         7,661,702,693             553,496
1 day aggregation (1 year)          11,279,814,022             756,065
ii. Size of the application tables: more than 1B events per day.
Application Time Series          Number of Records    Number of Blocks
5 min aggregation (48 hours)         1,167,399,320             103,875
30 min aggregation (1 week)          1,521,859,312              85,567
1 hour aggregation (1 month)         3,217,751,836             176,483
1 day aggregation (1 year)           1,079,023,582             102,021
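A minimal sketch of the "insignificant process" filter applied before the coarser tiers, assuming each process row carries start/end timestamps and a background flag (the field names are illustrative):

    from datetime import timedelta

    MIN_LIFETIME = timedelta(seconds=1)   # background processes live under 1 second

    def significant_processes(processes):
        """Keep only processes worth storing in the 30-minute, 1-hour,
        and 1-day tiers; the 5-minute tier keeps everything."""
        return [
            p for p in processes
            if not p["is_background"] and (p["end"] - p["start"]) >= MIN_LIFETIME
        ]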
The product and offering
Implemented big data tools to improve analytics systems and insights. Provided a part-time Redshift DBA on site, as well as DevOps support.