The Double data type in PySpark represents double-precision floating-point numbers. It is one of the standard numeric data types available in PySpark, alongside Integer, Long, Float, and Decimal.
Creating Double Columns
In PySpark, you can create a DataFrame with a Double column by defining a schema that declares the column as DoubleType(). For example:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Create a SparkSession
spark = SparkSession.builder.appName("DoubleTypeExample").getOrCreate()

# Define the schema so that "Age" is a DoubleType column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", DoubleType(), True),
])

# Create a DataFrame with a Double column
data = [("John", 25.5), ("Alice", 30.2), ("Bob", 35.9)]
df = spark.createDataFrame(data, schema)
df.show()
```

Note that the schema must be passed as the second argument to createDataFrame(); its third positional argument is a sampling ratio, not a data type.
This will create a DataFrame with two columns – “Name” of StringType and “Age” of DoubleType.
Performing Operations on Double Columns
You can perform various mathematical operations on columns of type Double in PySpark. Let’s say we have a DataFrame named df:
```
+-----+----+
| Name| Age|
+-----+----+
| John|25.5|
|Alice|30.2|
|  Bob|35.9|
+-----+----+
```
1. Mathematical Operations:
You can perform arithmetic operations like addition, subtraction, multiplication, and division on columns of type Double using the built-in functions provided by PySpark:
```python
from pyspark.sql.functions import col

# Addition
df.withColumn("Age_plus_5", col("Age") + 5).show()

# Subtraction
df.withColumn("Age_minus_10", col("Age") - 10).show()

# Multiplication
df.withColumn("Age_times_2", col("Age") * 2).show()

# Division
df.withColumn("Age_divided_by_3", col("Age") / 3).show()
```
2. Aggregation Functions:
You can also use the aggregation functions available in PySpark, such as sum(), avg(), max(), and min(), on columns of type Double:
```python
from pyspark.sql.functions import col, sum, avg, max, min

# Sum of all ages
df.select(sum(col("Age"))).show()

# Average age
df.select(avg(col("Age"))).show()

# Maximum age
df.select(max(col("Age"))).show()

# Minimum age
df.select(min(col("Age"))).show()
```
The Double data type in PySpark lets you work with double-precision floating-point numbers and is well suited to mathematical operations and aggregations on numerical data. By understanding how to create and manipulate Double columns, you can effectively analyze and process numeric data in PySpark.