编程技术网

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Matthew Farwel · spark · 2022-5-7 16:40 · 7 views



There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:

  • The Spark Driver node (sparkDriverCount)
  • The number of worker nodes available to a Spark cluster (numWorkerNodes)
  • The number of Spark executors (numExecutors)
  • The DataFrame being operated on by all workers/executors, concurrently (dataFrame)
  • The number of rows in the dataFrame (numDFRows)
  • The number of partitions on the dataFrame (numPartitions)
  • And finally, the number of CPU cores available on each worker node (numCpuCoresPerWorker)

I believe that all Spark clusters have one-and-only-one Spark Driver, and then 0+ worker nodes. If I'm wrong about that, please begin by correcting me! Assuming I'm more or less correct about that, let's lock in a few variables here. Let's say we have a Spark cluster with 1 Driver and 4 Worker nodes, and each Worker Node has 4 CPU cores on it (so a total of 16 CPU cores). So the "given" here is:

sparkDriverCount = 1
numWorkerNodes = 4
numCpuCores = numWorkerNodes * numCpuCoresPerWorker = 4 * 4 = 16

Given that as the setup, I'm wondering how to determine a few things. Specifically:

  • What is the relationship between numWorkerNodes and numExecutors? Is there some known/generally-accepted ratio of workers to executors? Is there a way to determine numExecutors given numWorkerNodes (or any other inputs)?
  • Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions? How does one calculate the 'optimal' number of partitions based on the size of the dataFrame?
  • I've heard from other engineers that a general 'rule of thumb' is: numPartitions = numWorkerNodes * numCpuCoresPerWorker, any truth to that? In other words, it prescribes that one should have 1 partition per CPU core.
Solution

Yes, a Spark application has one and only one Driver.

What is the relationship between numWorkerNodes and numExecutors?

A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker.
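As an illustration of that worker/executor split, a cluster like the one in the question (4 workers, 4 cores each) could host two executor processes per worker. The exact numbers and the script name here are my own assumptions, not prescriptions from the answer:

```shell
# Hypothetical submission for the 4-worker, 16-core cluster above:
# 8 executors (2 per worker) with 2 cores each uses all 16 cores.
spark-submit \
  --num-executors 8 \
  --executor-cores 2 \
  --executor-memory 4G \
  my_script.py
```

Note that numExecutors (8) is indeed >= numWorkerNodes (4); the cluster manager decides on which worker each executor process actually lands.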

So `numWorkerNodes <= numExecutors`.

Is there any ratio for them?

Personally, having worked both in a fake cluster, where my laptop was the Driver and a virtual machine on the very same laptop was the worker, and in an industrial cluster of >10k nodes, I didn't need to care about that, since the cluster manager seems to take care of it.

I just use:

--num-executors 64

when I launch/submit my script, and Spark, I guess, knows how many workers it needs to summon (of course, by taking into account other parameters as well, and the nature of the machines).

Thus, personally, I don't know any such ratio.


Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions?

I am not aware of one, but as a rule of thumb you could rely on the product of the number of executors and the number of cores per executor, and then multiply that by 3 or 4. Of course this is a heuristic. In Python it would look like this:

from pyspark import SparkContext

sc = SparkContext(appName="smeeb-App")
# total parallelism = executor instances * cores per executor
total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))
dataset = sc.textFile(input_path, total_cores * 3)  # ~3x as many partitions as cores
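The multiplier-of-3 heuristic above can also be written as a tiny plain-Python helper. This is a sketch; the function name and default multiplier are my own choices, not Spark API:

```python
def heuristic_partitions(num_executors, cores_per_executor, multiplier=3):
    """Rule-of-thumb partition count: total cores times 3 (or 4)."""
    return num_executors * cores_per_executor * multiplier

# 16 executors with 4 cores each:
print(heuristic_partitions(16, 4))  # 192
```

Bumping `multiplier` to 4 gives the upper end of the rule of thumb (256 for the same cluster).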

How does one calculate the 'optimal' number of partitions based on the size of the DataFrame?

That's a great question. Of course it's hard to answer, and it depends on your data, cluster, etc., but here is the discussion I had with myself:

Too few partitions and you will have enormous chunks of data, especially when you are dealing with big data, thus putting your application under memory stress.

Too many partitions and you will put much pressure on the bookkeeping, since all the metadata that has to be generated increases significantly as the number of partitions increases (temp files are maintained, etc.). *

So what you want is to find a sweet spot for the number of partitions, which is one of the parts of fine-tuning your application. :)
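One common way to hunt for that sweet spot is to target a fixed amount of data per partition (~128 MB is a frequently quoted figure, matching the classic HDFS block size). This sketch is my own illustration of that idea, not something from the answer above:

```python
import math

def partitions_for_size(total_size_bytes,
                        target_partition_bytes=128 * 1024 * 1024,
                        min_partitions=1):
    """Pick a partition count so each partition holds ~target_partition_bytes."""
    return max(min_partitions, math.ceil(total_size_bytes / target_partition_bytes))

# A 10 GB dataset at ~128 MB per partition:
print(partitions_for_size(10 * 1024**3))  # 80
```

You could then call something like df.repartition(partitions_for_size(estimated_size)), with the caveat that estimating a DataFrame's in-memory size is itself non-trivial.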

Is the 'rule of thumb' numPartitions = numWorkerNodes * numCpuCoresPerWorker true?

Ah, I was writing the heuristic above before seeing this. So this is already answered, but take into account the difference between a worker and an executor.


* I just failed because of this today: Prepare my bigdata with Spark via Python, where using too many partitions caused Active tasks to be a negative number in the Spark UI.

