Python pyspark DataFrame.dot usage and code examples

This article briefly describes the usage of pyspark.pandas.DataFrame.dot.

Usage: DataFrame.dot(other: Series) → Series

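Compute the matrix multiplication between the DataFrame and other.

This method computes the matrix product between the DataFrame and the values of another Series.

It can also be called using self @ other in Python >= 3.5.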
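The `@` form works because Python (3.5 and later) dispatches `self @ other` to the `__matmul__` method of the left operand. The sketch below illustrates that protocol with a hypothetical `Vector` class that is not part of pyspark.pandas; it only assumes that a `dot` method exists and forwards the operator to it.

```python
class Vector:
    """Hypothetical minimal class used for illustration only."""

    def __init__(self, values):
        self.values = list(values)

    def dot(self, other):
        # Inner product of two equal-length vectors.
        if len(self.values) != len(other.values):
            raise ValueError("lengths must match")
        return sum(a * b for a, b in zip(self.values, other.values))

    def __matmul__(self, other):
        # Python calls this method when evaluating `self @ other`.
        return self.dot(other)


v = Vector([1, 2, 3])
w = Vector([4, 5, 6])
result_method = v.dot(w)    # explicit method call
result_operator = v @ w     # operator form, routed through __matmul__
print(result_method, result_operator)
```

Both spellings go through the same code path here, which is why the method and the operator give identical results.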

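Note

Because of the nature of big data, this method relies on an expensive operation. Internally, it generates one row per value and then groups twice, which is very costly. To prevent misuse, the input length is limited by the 'compute.max_rows' option by default, and a ValueError is raised when the DataFrame exceeds that limit.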

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import option_context
>>> with option_context(
...     'compute.max_rows', 1000, "compute.ops_on_diff_frames", True
... ):
...     psdf = ps.DataFrame({'a': range(1001)})
...     psser = ps.Series([2], index=['a'])
...     psdf.dot(psser)
Traceback (most recent call last):
  ...
ValueError: Current DataFrame has more then the given limit 1000 rows.
Please set 'compute.max_rows' by using 'pyspark.pandas.config.set_option'
to retrieve to retrieve more than 1000 rows. Note that, before changing the
'compute.max_rows', this operation is considerably expensive.
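The cost described in the note can be pictured with a small pure-Python sketch. It is a conceptual illustration under stated assumptions, not the pyspark.pandas implementation: the hypothetical helper `dot_long_format` expands the frame into one record per cell, multiplies each record by the Series value with the same label, and then groups by row to sum the products. The number of intermediate records grows as rows × columns, which hints at why a row limit is useful.

```python
from collections import defaultdict


def dot_long_format(rows, series):
    """Hypothetical helper.

    rows:   list of dicts mapping column label -> value (one dict per row)
    series: dict mapping label -> value
    Returns the per-row products and the number of intermediate records.
    """
    # Step 1: one record per cell, i.e. (row position, column label, value).
    records = [
        (i, col, val) for i, row in enumerate(rows) for col, val in row.items()
    ]
    # Step 2: match each record to the Series entry with the same label.
    products = [(i, val * series[col]) for i, col, val in records]
    # Step 3: group by row position and sum the products.
    totals = defaultdict(int)
    for i, product in products:
        totals[i] += product
    return [totals[i] for i in range(len(rows))], len(records)


rows = [{0: 0, 1: 1, 2: -2, 3: -1}, {0: 1, 1: 1, 2: 1, 3: 1}]
series = {0: 1, 1: 1, 2: 2, 3: 1}
result, n_records = dot_long_format(rows, series)
print(result, n_records)
```

With 2 rows and 4 columns the sketch already creates 8 intermediate records; a frame with 1001 rows and many columns would create proportionally more.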

Parameters:

other: Series: The other object to compute the matrix product with.

Returns:

Series: The matrix product between self and other, returned as a Series.

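Notes:

The dimensions of the DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of the DataFrame and the index of other must contain the same values, because they are aligned before the multiplication.

The dot method of Series computes the inner product, rather than the matrix product computed here.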
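The two notes above can be sketched in plain Python. This is an illustration under assumptions, not library code: `aligned_dot` and `inner` are hypothetical helpers, labels are modeled as dict keys, and the error message is made up for the example. The sketch shows that values are paired by label rather than by position, that mismatched labels are rejected, and that a Series-with-Series product is a single scalar while a DataFrame-with-Series product has one value per row.

```python
def aligned_dot(columns, row, series):
    """Hypothetical helper: product of one frame row with a labeled Series.

    columns: column labels, in the frame's column order
    row:     the row's values, in the same order as `columns`
    series:  dict mapping label -> value
    """
    if set(columns) != set(series):
        # Illustrative message; the real library's wording may differ.
        raise ValueError("labels of the frame columns and the series differ")
    # Pair values by label, so the order of `series` does not matter.
    return sum(value * series[label] for label, value in zip(columns, row))


def inner(a, b):
    """Hypothetical helper: inner product of two labeled Series (a scalar)."""
    return sum(a[label] * b[label] for label in a)


columns = [0, 1, 2, 3]
frame_rows = [[0, 1, -2, -1], [1, 1, 1, 1]]
series = {0: 1, 1: 1, 2: 2, 3: 1}
shuffled = {1: 1, 0: 1, 2: 2, 3: 1}  # same label/value pairs, reordered

matrix_product = [aligned_dot(columns, r, series) for r in frame_rows]
shuffled_product = [aligned_dot(columns, r, shuffled) for r in frame_rows]
scalar_product = inner(series, series)
print(matrix_product, shuffled_product, scalar_product)
```

Passing a Series whose labels do not match the columns, for example `{0: 1, 1: 1}`, makes `aligned_dot` raise `ValueError` in this sketch.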

Examples:

>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psdf = ps.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> psser = ps.Series([1, 1, 2, 1])
>>> psdf.dot(psser)
0   -4
1    5
dtype: int64

Note that shuffling the objects does not change the result.

>>> psser2 = psser.reindex([1, 0, 2, 3])
>>> psdf.dot(psser2)
0   -4
1    5
dtype: int64
>>> psdf @ psser2
0   -4
1    5
dtype: int64
>>> reset_option("compute.ops_on_diff_frames")

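Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.pandas.DataFrame.dot at spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.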
