pyspark.sql.functions.count_min_sketch
- pyspark.sql.functions.count_min_sketch(col, eps, confidence, seed)
Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for estimating item frequencies in sub-linear space.
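To make the roles of eps, confidence and seed concrete, here is a minimal pure-Python sketch of the data structure. It is an illustration only, not Spark's implementation: the class name ToyCountMinSketch, the blake2b-based hashing, and the sizing rule (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)) are assumptions chosen for the example. The sizing rule is a common convention for count-min sketches; Spark may size its table differently.

```python
import hashlib
import math


class ToyCountMinSketch:
    """Illustrative count-min sketch; NOT Spark's implementation."""

    def __init__(self, eps: float, confidence: float, seed: int = 0):
        # Assumed sizing rule: a smaller eps gives more columns,
        # a higher confidence gives more rows (hash functions).
        self.width = math.ceil(2 / eps)
        self.depth = math.ceil(-math.log(1 - confidence) / math.log(2))
        self.seed = seed
        self.total = 0
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, item, row: int) -> int:
        # One deterministic hash per row, derived from the seed and row index.
        key = f"{self.seed}:{row}:{item!r}".encode()
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count: int = 1) -> None:
        # Increment one counter in every row.
        self.total += count
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item) -> int:
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available estimate and never undercounts.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


sketch = ToyCountMinSketch(eps=0.5, confidence=0.5, seed=1)
for value in [1, 2, 1]:
    sketch.add(value)

print(sketch.width, sketch.depth)  # 4 1
print(sketch.estimate(1), sketch.estimate(2))
```

The estimates are upper bounds on the true counts: with such a coarse eps and a single row, distinct values may share a counter, which is why no exact output is shown for the last line.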
New in version 3.5.0.
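- Parameters
col
target column to compute on
eps
relative error of the sketch; must be positive
confidence
confidence of the estimate; must be positive and less than 1.0
seed
random seed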
- Returns
Column
count-min sketch of the column
Examples
>>> from pyspark.sql.functions import count_min_sketch, hex, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch(df.data, lit(0.5), lit(0.5), lit(1)).alias('sketch'))
>>> df.select(hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
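The hex string above can be inspected without a JVM. The following sketch decodes it with Python's struct module, assuming a big-endian layout of version (int), total count (long), depth (int), width (int), one hash coefficient (long) per row, and then depth × width counters (long). That layout is an assumption about Spark's internal serialization rather than a documented contract, and decode_sketch is a hypothetical helper written for this example. On the JVM, the supported route is org.apache.spark.util.sketch.CountMinSketch.readFrom.

```python
import struct

# The hex string from the example, split into 32-bit words for readability.
HEX_SKETCH = (
    "00000001"             # version
    "00000000" "00000003"  # total count
    "00000001"             # depth (rows)
    "00000004"             # width (columns)
    "00000000" "5D8D6AB9"  # hash coefficient for row 0
    "00000000" "00000000"  # counter [0][0]
    "00000000" "00000002"  # counter [0][1]
    "00000000" "00000001"  # counter [0][2]
    "00000000" "00000000"  # counter [0][3]
)


def decode_sketch(raw: bytes) -> dict:
    """Decode a serialized sketch under the ASSUMED layout described above."""
    header = ">iqii"  # big-endian, no padding: int, long, int, int
    version, total, depth, width = struct.unpack_from(header, raw, 0)
    offset = struct.calcsize(header)

    hash_coeffs = struct.unpack_from(f">{depth}q", raw, offset)
    offset += 8 * depth

    flat = struct.unpack_from(f">{depth * width}q", raw, offset)
    table = [list(flat[r * width:(r + 1) * width]) for r in range(depth)]

    return {
        "version": version,
        "total": total,
        "depth": depth,
        "width": width,
        "hash_coeffs": hash_coeffs,
        "table": table,
    }


decoded = decode_sketch(bytes.fromhex(HEX_SKETCH))
print(decoded["depth"], decoded["width"], decoded["total"])  # 1 4 3
print(decoded["table"])  # [[0, 2, 1, 0]]
```

Read this way, the sketch holds three items in a single row of four counters: one counter holds 2 and another holds 1, which matches the input in which the value 1 appears twice and the value 2 once.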