This Jupyter notebook demonstrates how to use PySpark to create, manipulate, and transform a DataFrame and its underlying RDD, including converting string values to an integer representation using custom logic.
- Imports PySpark libraries and creates a SparkContext and SQLContext, as sketched below.
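
A minimal setup sketch, assuming the older SparkContext/SQLContext entry points the notebook describes (on modern PySpark a SparkSession would replace both). The sketches in this summary build on one another.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Older-style entry points, matching the notebook's description;
# SparkSession.builder is the modern equivalent.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
```
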
- Defines a schema with three columns: `sales` (float), `employee` (string), `ID` (integer).
- Constructs a single-row DataFrame with sample data (see the sketch below).
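
A sketch of the schema and the one-row DataFrame; the sample values `(100.0, "Bob", 1)` are illustrative placeholders, not taken from the notebook.

```python
from pyspark.sql.types import (StructType, StructField,
                               FloatType, StringType, IntegerType)

# Three columns matching the schema described above.
schema = StructType([
    StructField("sales", FloatType(), True),
    StructField("employee", StringType(), True),
    StructField("ID", IntegerType(), True),
])

# Hypothetical sample row; the notebook's actual values may differ.
df = sqlContext.createDataFrame([(100.0, "Bob", 1)], schema)
```
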
- Defines a Python function, `toInt`, that takes a string, converts each character to its ASCII code, concatenates the codes into a single string, and returns the result as an integer.
- Registers this function as a Spark UDF (`colsInt`) for use in SQL and DataFrame transformations, as sketched below.
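
A plausible implementation of `toInt` and its registration, continuing from the setup above. The `IntegerType` return type is an assumption; note that long strings overflow 32 bits, so `LongType` or `StringType` would be safer in practice.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def toInt(s):
    # "Bob" -> "66" + "111" + "98" -> 6611198: each character becomes its
    # ASCII code, the codes are concatenated, and the digit string is
    # parsed as a single integer.
    return int("".join(str(ord(ch)) for ch in s))

# Register once for DataFrame use and once under the name "colsInt"
# so it is also callable from SQL queries.
colsInt = udf(toInt, IntegerType())
sqlContext.udf.register("colsInt", toInt, IntegerType())
```
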
- Adds a new column `semployee` to the DataFrame in which each `employee` value is transformed to its integer encoding via the UDF.
- Uses `show()` to print the DataFrame with the new column, illustrating the transformation (sketched below).
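
Continuing the sketch, adding and displaying the encoded column; the variable name `df_enc` is a hypothetical choice that keeps the original `df` unchanged for the SQL and RDD steps.

```python
# New column "semployee" holds the integer encoding of "employee".
df_enc = df.withColumn("semployee", colsInt(df["employee"]))
df_enc.show()
```
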
- Registers the original DataFrame as a temporary SQL table.
- Runs a SQL query to select all columns plus the integer-encoded employee via the UDF, storing the result in a new DataFrame and displaying it, as sketched below.
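
A sketch of the SQL path; the table name `employees` is illustrative, not confirmed by the notebook.

```python
# Older API matching the SQLContext style above;
# createOrReplaceTempView is the modern equivalent.
df.registerTempTable("employees")

# Select every original column plus the UDF-encoded employee.
df2 = sqlContext.sql("SELECT *, colsInt(employee) AS semployee FROM employees")
df2.show()
```
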
- Converts the DataFrame into an RDD (Resilient Distributed Dataset).
- Defines a function `toIntEmployee` that applies the same string-to-integer conversion logic to RDD rows, building a new Row object with the encoded value.
- Applies this function with `map` to each row of the RDD.
- Collects and prints all results from the RDD, displaying the custom-encoded employee values (see the sketch below).
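
A sketch of the RDD path, reusing `toInt` from the earlier sketch. Replacing the `employee` string with its encoding inside a rebuilt Row is one plausible reading of the step described; the field names follow the schema above.

```python
from pyspark.sql import Row

rdd = df.rdd  # the DataFrame's underlying RDD of Row objects

def toIntEmployee(row):
    # Build a new Row whose employee field holds the integer encoding.
    return Row(sales=row.sales,
               employee=toInt(row.employee),
               ID=row.ID)

# Transform every row, then pull the results back to the driver and print.
for r in rdd.map(toIntEmployee).collect():
    print(r)
```
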