What Is the Double Data Type in PySpark?

Heather Bennett

The Double data type in PySpark (DoubleType) represents double-precision floating-point numbers, stored as 8-byte (64-bit) values. It is one of the standard numeric data types available in PySpark, along with Integer, Long, Float, and Decimal.
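To make "double precision" concrete, here is a minimal sketch in plain Python, with no Spark required. It relies on the fact that a Python float is itself a 64-bit double, which is why PySpark infers DoubleType for Python float values. The variable names are illustrative only.

```python
import struct
import sys

# Pack a float using the standard-size double format ("<d"): 8 bytes = 64 bits.
size_in_bytes = len(struct.pack("<d", 25.5))

# Number of bits in the significand (precision) of a double.
mantissa_bits = sys.float_info.mant_dig

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so this comparison is False even though it looks true on paper.
sum_is_exact = (0.1 + 0.2 == 0.3)

print(size_in_bytes, mantissa_bits, sum_is_exact)
```

The last line is the practical takeaway: Double values are approximations, so exact equality checks on computed results can be surprising. When exact decimal arithmetic matters (for example, currency), Decimal is usually the safer choice.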

Creating Double Columns

In PySpark, you can create a DataFrame with a Double column by declaring that column as DoubleType() in the DataFrame's schema. For example:


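from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Create a SparkSession
spark = SparkSession.builder.appName("DoubleTypeExample").getOrCreate()

# Declare the schema explicitly so that "Age" is a DoubleType column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", DoubleType(), True),
])

# Create a DataFrame with a Double column
data = [("John", 25.5), ("Alice", 30.2), ("Bob", 35.9)]
df = spark.createDataFrame(data, schema)
df.show()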

This will create a DataFrame with two columns – “Name” of StringType and “Age” of DoubleType.

Performing Operations on Double Columns

You can perform various mathematical operations on columns of type Double in PySpark. Let’s say we have a DataFrame named df:


+-----+----+
| Name| Age|
+-----+----+
| John|25.5|
|Alice|30.2|
|  Bob|35.9|
+-----+----+

1. Mathematical Operations:

You can perform arithmetic operations like addition, subtraction, multiplication, and division on columns of type Double by applying the standard arithmetic operators to Column objects:


from pyspark.sql.functions import col

# Addition
df.withColumn("Age_plus_5", col("Age") + 5).show()

# Subtraction
df.withColumn("Age_minus_10", col("Age") - 10).show()

# Multiplication
df.withColumn("Age_times_2", col("Age") * 2).show()

# Division
df.withColumn("Age_divided_by_3", col("Age") / 3).show()

2. Aggregation Functions:

You can also use the aggregation functions available in PySpark, such as sum(), avg(), max(), and min(), on columns of type Double:


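# Import the module under an alias so that sum, max, and min
# do not shadow Python's built-in functions of the same name
from pyspark.sql import functions as F

# Sum of all ages
df.select(F.sum(F.col("Age"))).show()

# Average age
df.select(F.avg(F.col("Age"))).show()

# Maximum age
df.select(F.max(F.col("Age"))).show()

# Minimum age
df.select(F.min(F.col("Age"))).show()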

Conclusion

The Double data type in PySpark lets you store and compute with double-precision floating-point numbers. It supports the usual arithmetic operations and aggregations on numeric data. By understanding how to create and manipulate Double columns, you can effectively analyze and process numeric data in PySpark.
