度量指标 资源用量 资源使用情况是你作业在GB小时内使用的资源量。
计量统计 我们将作业的资源使用量定义为任务容器大小和任务运行时间的乘积。因此,作业的资源使用量可以定义为mapper
和reducer
任务的资源使用量总和。
范例 1
2
3
4
5
6
7
8
Consider a job with:
4 mappers with runtime {12, 15, 20, 30} mins.
4 reducers with runtime {10 , 12, 15, 18} mins.
Container size of 4 GB
Then,
Resource used by all mappers: 4 * (( 12 + 15 + 20 + 30 ) / 60 ) GB Hours = 5.133 GB Hours
Resource used by all reducers: 4 * (( 10 + 12 + 15 + 18 ) / 60 ) GB Hours = 3.666 GB Hours
Total resource used by the job = 5.133 + 3.6666 = 8.799 GB Hours
浪费的资源量 这显示了作业以GB小时浪费的资源量或以浪费的资源百分比。
计量统计 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
To calculate the resources wasted, we calculate the following:
The minimum memory wasted by the tasks (Map and Reduce)
The runtime of the tasks (Map and Reduce)
The minimum memory wasted by a task is equal to the difference between the container size and maximum task memory(peak memory) among all tasks. The resources wasted by the task is then the minimum memory wasted by the task multiplied by the duration of the task. The total resource wasted by the job then will be equal to the sum of wasted resources of all the tasks.
Let us define the following for each task:
peak_memory_used := The upper bound on the memory used by the task.
runtime := The run time of the task.
The peak_memory_used for any task is calculated by finding out the maximum of physical memory(max_physical_memory) used by all the tasks and the virtual memory(virtual_memory) used by the task.
Since peak_memory_used for each task is upper bounded by max_physical_memory, we can say for each task:
peak_memory_used = Max(max_physical_memory, virtual_memory/2.1)
Where 2.1 is the cluster memory factor.
The minimum memory wasted by each task can then be calculated as:
wasted_memory = Container_size - peak_memory_used
The minimum resource wasted by each task can then be calculated as:
wasted_resource = wasted_memory * runtime
运行时间 运行时间指标显示了作业运行的总时间。
计量统计 作业运行时间是作业提交到资源管理器和作业完成时的时间差。
范例 作业的提交时间为1461837302868 ms
,结束时间为1461840952182 ms
,作业的runtime
时间是1461840952182 - 1461837302868 = 3649314 ms
,即1.01
小时。
等待时间 等待时间是作业处于等待状态消耗的时间
计量统计 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
For each task, let us define the following:
ideal_start_time := The ideal time when all the tasks should have started
finish_time := The time when the task finished
task_runtime := The runtime of the task
- Map tasks
For map tasks, we have
ideal_start_time := The job submission time
We will find the mapper task with the longest runtime ( task_runtime_max) and the task which finished last ( finish_time_last )
The total wait time of the job due to mapper tasks would be:
mapper_wait_time = finish_time_last - ( ideal_start_time + task_runtime_max)
- Reduce tasks
For reducer tasks, we have
ideal_start_time := This is computed by looking at the reducer slow start percentage (mapreduce.job.reduce.slowstart.completedmaps) and finding the finish time of the map task after which first reducer should have started
We will find the reducer task with the longest runtime ( task_runtime_max) and the task which finished last ( finish_time_last )
The total wait time of the job due to reducer tasks would be:
reducer_wait_time = finish_time_last - ( ideal_start_time + task_runtime_max)
启发式算法 Map-Reduce Mapper数据倾斜 Mapper数据倾斜启发式算法能够显示作业是否发生数据倾斜。启发式算法会将所有Mapper分成两组,第一组的平均值会小于第二组。 例如,第一组有900个Mapper作业,每个Mapper作业平均数据量为7MB,而另一份包含1200个Mapper作业,且每个Mapper作业的平均数据量是500MB。
计算 首先通过递归算法计算两组平均内存消耗,来评估作业的等级。其误差为两组平均内存消耗的差除以这俩组最小的平均内存消耗的差的值。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
Let us define the following variables,
deviation: the deviation in input bytes between two groups
num_of_tasks: the number of map tasks
file_size: the average input size of the larger group
num_tasks_severity: List of severity thresholds for the number of tasks. e.g., num_tasks_severity = {10, 20, 50, 100}
deviation_severity: List of severity threshold values for the deviation of input bytes between two groups. e.g., deviation_severity: {2, 4, 8, 16}
files_severity: The severity threshold values for the fraction of HDFS block size. e.g. files_severity = { ⅛, ¼, ½, 1}
Let us define the following functions ,
func avg(x): returns the average of a list x
func len(x): returns the length of a list x
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
We’ll compute two groups recursively based on average memory consumed by them.
Let us call the two groups: group_1 and group_2
Without loss of generality, let us assume that,
avg(group_1) > avg(group_2) and len(group_1)< len(group_2) then ,
deviation = avg(group_1) - avg(group_2) / min(avg(group_1)) - avg(group_2))
file_size = avg(group_1)
num_of_tasks = len(group_0)
The overall severity of the heuristic can be computed as,
severity = min(
getSeverity(deviation, deviation_severity)
, getSeverity(file_size,files_severity)
, getSeverity(num_of_tasks,num_tasks_severity)
)
---
误差(deviation):分成两部分后输入数据量的误差
作业数量(num_of_tasks):map作业的数量
文件大小(file_size):较大的那部分的平均输入数据量的大小
作业数量的严重度(num_tasks_severity):一个List包含了作业数量的严重度阈值,例如num_tasks_severity = {10, 20, 50, 100}
误差严重度(deviation severity):一个List包含了两部分Mapper作业输入数据差值的严重度阈值,例如deviation_severity: {2, 4, 8, 16}
文件严重度(files_severity):一个List包含了文件大小占HDFS块大小比例的严重度阈值,例如files_severity = { ⅛, ¼, ½, 1}
然后定义如下的方法,
方法 avg(x):返回List x的平均值
方法 len(x):返回List x的长度大小
方法 min(x,y):返回x和y中较小的一个
方法 getSeverity(x,y):比较x和y中的严重度阈值,返回严重度的值
接下来,根据两个部分的平均内存消耗,进行递归计算。
假设分成的两部分分别为group_1和group_2
为了不失一般性,假设
avg(group_1) > ave(group_2) and len(group_1) < len(group_2)
以及
deviation = avg(group_1) - avg(group_2) / min(avg(group_1) - avg(group_2))
file_size = avg(group_1)
num_of_tasks = len(group_0)
启发式算法的严重度可以通过下面的方法来计算:
severity = min(getSeverity(deviation, deviation_severity),getSeverity(file_size,files_severity),getSeverity(num_of_tasks,num_tasks_severity))
参数配置 阈值参数deviation_severity
、num_tasks_severity
和files_severity
能够简单的进行配置。如果想进一步了解如何配置这些参数,可以点击这里 进行查看。
Mapper GC Mapper GC会分析任务的GC效率。它会计算出GC时间占所有CPU时间的百分比。
计算 启发式算法对Mapper GC
严重度的计算按照如下过程进行。首先,计算出所有作业的平均的CPU使用时间、平均运行时间以及平均垃圾回收消耗的时间。我们要计算Mapper GC
严重度的最小值,这个值可以通过平均运行时间和平均垃圾回收时间占平均CPU总消耗时间的比例来计算。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Let us define the following variables:
avg_gc_time: average time spent garbage collecting
avg_cpu_time: average cpu time of all the tasks
avg_runtime: average runtime of all the tasks
gc_cpu_ratio: avg_gc_time/ avg_cpu_time
gc_ratio_severity: List of severity threshold values for the ratio of avg_gc_time to avg_cpu_time.
runtime_severity: List of severity threshold values for the avg_runtime.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
severity = min(getSeverity(avg_runtime, runtime_severity), getSeverity(gc_cpu_ratio, gc_ratio_severity)
参数配置 阈值参数gc_ratio_severity
和runtime_severity
也是可以简单配置的。如果想进一步了解如何配置这些参数,可以参考这里 。
Mapper内存消耗 此部分指标用来检查mapper
的内存消耗。他会检查任务的消耗内存与容器请求到的内存比例。消耗的内存指任务最大消耗物理内存快照的平均值。容器请求的内存是作业mapreduce.map/reduce.memory.mb
的配置值,是作业能请求到的最大物理内存。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Let us define the following variables,
avg_physical_memory: Average of the physical memories of all tasks.
container_memory: Container memory
container_memory_severity: List of threshold values for the average container memory of the tasks.
memory_ratio_severity: List of threshold values for the ratio of avg_plysical_memory to container_memory
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity can then be computed as,
severity = min(getSeverity(avg_physical_memory/container_memory, memory_ratio_severity)
, getSeverity(container_memory,container_memory_severity)
)
参数配置 阈值参数container_memory_severity
和memory_ratio_severity
也是可以简单配置的。如果想进一步了解如何配置这些参数,可以参考这里 。
Mapper的运行速度 这部分分析Mapper
代码的运行效率。通过这些分析可以知道mapper
是否受限于CPU,或者处理的数据量过大。这个分析能够分析mapper
运行速度快慢和处理的数据量大小之间的关系。
计算 这个启发式算法的严重度值,是mapper
作业的运行速度的严重度和mapper
作业的运行时间严重度中较小的一个。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Let us define the following variables,
median_speed: median of speeds of all the mappers. The speeds of mappers are found by taking the ratio of input bytes to runtime.
median_size: median of size of all the mappers
median_runtime: median of runtime of all the mappers.
disk_speed_severity: List of threshold values for the median_speed.
runtime_severity: List of severity threshold values for median_runtime.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
severity = min(getSeverity(median_speed, disk_speed_severity), getSeverity(median_runtime, median_runtime_severity)
参数配置 阈值参数disk_speed_severity
和runtime_severity
可以很简单的配置。如果想进一步的了解这些参数配置,可以点击这里 查看。
Mapper溢出 这个启发式算法通过分析磁盘I/O
来评判mapper
的性能。mapper
溢出比例(溢出的记录数/总输出的记录数)是衡量mapper
性能的一个重要指标:如果这个值接近2,表示几乎每个记录都溢出了,并临时写到磁盘两次(其中一次发生在内存排序缓存溢出时,另一次发生在合并所有溢出的块时)。当这些发生时表明mapper
输入输出的数据量过大了。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Let us define the following parameters,
total_spills: The sum of spills from all the map tasks.
total_output_records: The sum of output records from all the map tasks.
num_tasks: Total number of tasks.
ratio_spills: total_spills/ total_output_records
spill_severity: List of the threshold values for ratio_spills
num_tasks_severity: List of threshold values for total number of tasks.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
severity = min(getSeverity(ratio_spills, spill_severity), getSeverity(num_tasks, num_tasks_severity)
参数配置 阈值spill_severity
和num_tasks_severity
可以简单的进行配置。如果想进一步了解配置参数的详细信息,可以点击这里查看。 here .
Mapper运行时间 这部分分析mapper
的数量是否合适。通过分析结果,我们可以更好的优化任务中mapper
的数量这个参数设置。有以下两种情况发生时,这个参数就需要优化了:
Mapper
的运行时间很短。通常作业在以下情况下出现:mapper
数量过多mapper
的平均运行时间很短文件太小 大文件或不可分割文件块,通常作业在以下情况下出现:mapper
数量太少mapper
的平均运行时间太长文件过大 (个别达到 GB 级别) 计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Let us define the following variables,
avg_size: average size of input data for all the mappers
avg_time: average of runtime of all the tasks.
num_tasks: total number of tasks.
short_runtime_severity: The list of threshold values for tasks with short runtime
long_runtime_severity: The list of threshold values for tasks with long runtime.
num_tasks_severity: The list of threshold values for number of tasks.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
short_task_severity = min(getSeverity(avg_time,short_runtime_severity), getSeverity(num_tasks, num_tasks_severity))
severity = max(getSeverity(avg_size, long_runtime_severity), short_task_severity)
参数配置 阈值short_runtime_severity
、long_runtime_severity
以及num_tasks_severity
可以很简单的配置。如果想进一步了解参数配置的详细信息,可以点击这里 查看。
Reducer数据倾斜 这部分分析每个Reduce
中的数据是否存在倾斜情况。这部分分析能够发现Reducer
中是否存在这种情况,将Reduce
分为两部分,其中一部分的输入数据量是否明显大于另一部分的输入数据量。
计算 首先通过递归算法计算均值并基于每个组消耗的平均内存消耗将任务划分为两组来评估该算法的等级。误差表示为两个部分Reducer
的平均内存消耗之差除以两个部分最小内存消耗之差得到的比例。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Let us define the following variables:
deviation: deviation in input bytes between two groups
num_of_tasks: number of reduce tasks
file_size: average of larger group
num_tasks_severity: List of severity threshold values for the number of tasks.
e.g. num_tasks_severity = {10,20,50,100}
deviation_severity: List of severity threshold values for the deviation of input bytes between two groups.
e.g. deviation_severity = {2,4,8,16}
files_severity: The severity threshold values for the fraction of HDFS block size
e.g. files_severity = { ⅛, ¼, ½, 1}
Let us define the following functions :
func avg(x): returns the average of a list x
func len(x): returns the length of a list x
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
We’ll compute two groups recursively based on average memory consumed by them.
Let us call the two groups: group_1 and group_2
Without loss of generality, let us assume that:
avg(group_1) > avg(group_2) and len(group_1)< len(group_2) then ,
deviation = avg(group_1) - avg(group_2) / min(avg(group_1)) - avg(group_2))
file_size = avg(group_1)
num_of_tasks = len(group_0)
The overall severity of the heuristic can be computed as,
severity = min(getSeverity(deviation,deviation_severity),getSeverity(file_size,files_severity),getSeverity(num_of_tasks,num_tasks_severity))
参数配置 阈值deviation_severity
、num_tasks_severity
和files_severity
,可以很简单的进行配置。如果想进一步了解这些参数的配置,可以点击这里 查看。
Reducer GC 这部分分析任务的GC效率,能够计算并告诉我们GC时间占所用CPU时间的比例。
计算 首先,会计算出所有任务的平均CPU消耗时间、平均运行时间以及平均垃圾回收所消耗的时间。然后,算法会根据平均运行时间以及垃圾回收时间占平均CPU时间的比值来计算出最低的严重等级。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Let us define the following variables:
avg_gc_time: average time spent garbage collecting
avg_cpu_time: average cpu time of all the tasks
avg_runtime: average runtime of all the tasks
gc_cpu_ratio: avg_gc_time/ avg_cpu_time
gc_ratio_severity: List of severity threshold values for the ratio of avg_gc_time to avg_cpu_time.
runtime_severity: List of severity threshold values for the avg_runtime.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
severity = min(getSeverity(avg_runtime, runtime_severity), getSeverity(gc_cpu_ratio, gc_ratio_severity)
参数配置 阈值gc_ratio_severity
、runtime_severity
可以很简单的配置,如果想进一步了解参数配置的详细过程,可以点击这里 查看。
Reducer内存消耗 这部分分析显示了任务的内存利用率。算法会比较作业消耗的内存以及容器要求的内存分配。消耗的内存是指每个作业消耗的最大内存的平均值。容器需求的内存是指任务配置的mapreduce.map/reduce.memory.mb
,也就是任务能够使用最大物理内存。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Let us define the following variables,
avg_physical_memory: Average of the physical memories of all tasks.
container_memory: Container memory
container_memory_severity: List of threshold values for the average container memory of the tasks.
memory_ratio_severity: List of threshold values for the ratio of avg_physical_memory to container_memory
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity can then be computed as,
severity = min(getSeverity(avg_physical_memory/container_memory, memory_ratio_severity)
, getSeverity(container_memory,container_memory_severity)
)
参数配置 阈值container_memory_severity
和memory_ratio_severity
可以简单的进行配置。如果想进一步了解配置参数的详细信息,可以点击这里 查看。
Reducer运行时间 这部分分析Reducer
的运行效率,可以帮助我们更好的配置任务中reducer
的数量。当出现以下两种情况时,说明Reducer
的数量需要进行调优:
Reducer
过多,hadoop任务可能的表现是:Reducer
数量过多Reducer
的运行时间很短Reducer
过少,hadoop任务可能的表现是:计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Let us define the following variables,
avg_size: average size of input data for all the mappers
avg_time: average of runtime of all the tasks.
num_tasks: total number of tasks.
short_runtime_severity: The list of threshold values for tasks with short runtime
long_runtime_severity: The list of threshold values for tasks with long runtime.
num_tasks_severity: The number of tasks.
Let us define the following functions ,
func min(x,y): returns minimum of x and y
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity of the heuristic can then be computed as,
short_task_severity = min(getSeverity(avg_time,short_runtime_severity), getSeverity(num_tasks, num_tasks_severity))
severity = max(getSeverity(avg_size, long_runtime_severity), short_task_severity)
参数配置 阈值参数short_runtime_severity
、long_runtime_severity
以及num_tasks_severity
可以很简单的配置,如果想进一步了解参数配置的详细过程,可以点击这里 查看。
清洗&排序 这部分分析reducer
消耗的总时间以及reducer
在进行清洗和排序时消耗的时间,通过这些分析,可以评估reducer
的执行效率。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Let’s define following variables,
avg_exec_time: average time spent in execution by all the tasks.
avg_shuffle_time: average time spent in shuffling.
avg_sort_time: average time spent in sorting.
runtime_ratio_severity: List of threshold values for the ratio of twice of average shuffle or sort time to average execution time.
runtime_severity: List of threshold values for the runtime for shuffle or sort stages.
The overall severity can then be found as,
severity = max(shuffle_severity, sort_severity)
where shuffle_severity and sort_severity can be found as:
shuffle_severity = min(getSeverity(avg_shuffle_time, runtime_severity), getSeverity(avg_shuffle_time*2/avg_exec_time, runtime_ratio_severity))
sort_severity = min(getSeverity(avg_sort_time, runtime_severity), getSeverity(avg_sort_time*2/avg_exec_time, runtime_ratio_severity))
参数配置 阈值参数avg_exec_time
、avg_shuffle_time
和avg_sort_time
可以很简单的进行配置。更多关于参数配置的相信信息可以点击这里 查看。
Spark Spark的事件日志限制 Spark
事件日志处理器当前无法处理很大的日志文件。Dr-Elephant
需要花很长的时间去处理一个很大的Spark
时间日志文件,期间很可能会影响Dr-Elephant
本身的稳定运行。因此,目前我们设置了一个日志大小限制(100MB),如果超过这个大小,会新起一个进程去处理这个日志。
计算 如果数据被限流了,那么启发式算法将评估为最严重等级CRITICAL
,否则,就没有评估等级。
Spark负载均衡处理器 和Map/Reduce
任务的执行机制不同,Spark
应用在启动后会一次性分配它所需要的所有资源,直到整个任务结束才会释放这些资源。根据这个机制,对Spark
的处理器的负载均衡就显得非常重要,可以避免集群中个别节点压力过大。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Let us define the following variables:
peak_memory: List of peak memories for all executors
durations: List of durations of all executors
inputBytes: List of input bytes of all executors
outputBytes: List of output bytes of all executors.
looser_metric_deviation_severity: List of threshold values for deviation severity, loose bounds.
metric_deviation_severity: List of threshold values for deviation severity, tight bounds.
Let us define the following functions :
func getDeviation(x): returns max(|maximum-avg|, |minimum-avg|)/avg, where
x = list of values
maximum = maximum of values in x
minimum = minimum of values in x
avg = average of values in x
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
func max(x,y): returns the maximum value of x and y.
func Min (l): returns the minimum of a list l.
The overall severity can be found as,
severity = Min ( getSeverity(getDeviation(peak_memory), looser_metric_deviation_severity),
getSeverity(getDeviation(durations), metric_deviation_severity),
getSeverity(getDeviation(inputBytes), metric_deviation_severity),
getSeverity(getDeviation(outputBytes), looser_metric_deviation_severity).
)
参数配置 阈值参数looser_metric_deviation_severity
和metric_deviation_severity
可以简单的进行配置。如果想进一步了解参数配置的详细过程,可以点击这里 查看。
Spark任务运行时间 这部分启发式算法对Spark
任务的运行时间进行调优分析。每个Spark
应用程序可以拆分成多个任务,每个任务又可以拆分成多个运行阶段。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Let us define the following variables,
avg_job_failure_rate: Average job failure rate
avg_job_failure_rate_severity: List of threshold values for average job failure rate
Let us define the following variables for each job,
single_job_failure_rate: Failure rate of a single job
single_job_failure_rate_severity: List of threshold values for single job failure rate.
The severity of the job can be found as maximum of single_job_failure_rate_severity for all jobs and avg_job_failure_rate_severity.
i.e. severity = max(getSeverity(single_job_failure_rate, single_job_failure_rate_severity),
getSeverity(avg_job_failure_rate, avg_job_failure_rate_severity)
)
where single_job_failure_rate is computed for all the jobs.
参数配置 阈值参数single_job_failure_rate_severity
和avg_job_failure_rate_severity
可以很简单的进行配置。更多详细信息,可以点击这里 查看。
Spark内存限制 目前,Spark
应用程序缺少动态资源分配的功能。与Map/Reduce
任务不同,能够为每个map/reduce
进程分配所需要的资源,并且在执行过程中逐步释放占用的资源。而Spark
在应用程序执行时,会一次性的申请所需要的所有资源,直到任务结束才释放这些资源。过多的内存使用会对集群节点的稳定性产生影响。所以,我们需要限制Spark
应用程序能使用的最大内存比例。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Let us define the following variables,
total_executor_memory: total memory of all the executors
total_storage_memory: total memory allocated for storage by all the executors
total_driver_memory: total driver memory allocated
peak_memory: total memory used at peak
mem_utilization_severity: The list of threshold values for the memory utilization.
total_memory_severity_in_tb: The list of threshold values for total memory.
Let us define the following functions ,
func max(x,y): Returns maximum of x and y.
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity can then be computed as,
severity = max(getSeverity(total_executor_memory,total_memory_severity_in_tb),
getSeverity(peak_memory/total_storage_memory, mem_utilization_severity)
)
参数配置 阈值参数total_memory_severity_in_tb
和mem_utilization_severity
可以很简单的配置。进一步了解,可以点击这里 查看。
Spark阶段运行时间 与Spark
任务运行时间一样,Spark
应用程序可以分为多个任务,每个任务又可以分为多个运行阶段。
计算 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
Let us define the following variable for each spark job,
stage_failure_rate: The stage failure rate of the job
stagge_failure_rate_severity: The list of threshold values for stage failure rate.
Let us define the following variables for each stage of a spark job,
task_failure_rate: The task failure rate of the stage
runtime: The runtime of a single stage
single_stage_tasks_failure_rate_severity: The list of threshold values for task failure of a stage
stage_runtime_severity_in_min: The list of threshold values for stage runtime.
Let us define the following functions ,
func max(x,y): returns the maximum value of x and y.
func getSeverity(x,y): Compares value x with severity threshold values in y and returns the severity.
The overall severity can be found as:
severity_stage = max(getSeverity(task_failure_rate, single_stage_tasks_faioure_rate_severity),
getSeverity(runtime, stage_runtime_severity_in_min)
)
severity_job = getSeverity(stage_failure_rate,stage_failure_rate_severity)
severity = max(severity_stage, severity_job)
where task_failure_rate is computed for all the tasks.
参数配置 阈值参数single_stage_tasks_failure_rate_severity
、stage_runtime_severity_in_min
和stage_failure_rate_severity
可以很简单的配置。进一步了解,请点击这里 。
本章篇幅较长,一些专有名词及参数功能,可以在Dr-Elephant
的Dashboard
中查。