Description
High Performance Spark
Best Practices for Scaling and Optimizing Apache Spark
Authors: Karau Holden, Warren Rachel
Language: English
Approximate price: 44.97 €
In Print (Delivery period: 12 days).
Publication date: 06-2017
358 p. · 18.1x23.3 cm · Paperback
Description
Apache Spark is amazing when everything clicks. But if you haven’t seen
the performance improvements you expected, or still don’t feel confident
enough to use Spark in production, this practical book is for you. Authors
Holden Karau and Rachel Warren demonstrate performance optimizations to
help your Spark queries run faster and handle larger data sizes, while
using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
. How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
. The choice between data joins in Core Spark and Spark SQL
. Techniques for getting the most out of standard RDD transformations
. How to work around performance issues in Spark’s key/value pair paradigm
. Writing high-performance Spark code without Scala or the JVM
. How to test for functionality and performance when applying suggested improvements
. Using Spark MLlib and Spark ML machine learning libraries
. Spark’s Streaming components and external community packages
Chapter 1 - Introduction to High Performance Spark
. What Is Spark and Why Performance Matters
. What You Can Expect to Get from This Book
. Spark Versions
. Why Scala?
. Conclusion
Chapter 2 - How Spark Works
. How Spark Fits into the Big Data Ecosystem
. Spark Model of Parallel Computing: RDDs
. Spark Job Scheduling
. The Anatomy of a Spark Job
. Conclusion
Chapter 3 - DataFrames, Datasets, and Spark SQL
. Getting Started with the SparkSession (or HiveContext or SQLContext)
. Spark SQL Dependencies
. Basics of Schemas
. DataFrame API
. Data Representation in DataFrames and Datasets
. Data Loading and Saving Functions
. Datasets
. Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
. Query Optimizer
. Debugging Spark SQL Queries
. JDBC/ODBC Server
. Conclusion
Chapter 4 - Joins (SQL and Core)
. Core Spark Joins
. Spark SQL Joins
. Conclusion
Chapter 5 - Effective Transformations
. Narrow Versus Wide Transformations
. What Type of RDD Does Your Transformation Return?
. Minimizing Object Creation
. Iterator-to-Iterator Transformations with mapPartitions
. Set Operations
. Reducing Setup Overhead
. Reusing RDDs
. Conclusion
Chapter 6 - Working with Key/Value Data
. The Goldilocks Example
. Actions on Key/Value Pairs
. What’s So Dangerous About the groupByKey Function
. Choosing an Aggregation Operation
. Multiple RDD Operations
. Partitioners and Key/Value Data
. Dictionary of OrderedRDDOperations
. Secondary Sort and repartitionAndSortWithinPartitions
. Straggler Detection and Unbalanced Data
. Conclusion
Chapter 7 - Going Beyond Scala
. Beyond Scala within the JVM
. Beyond Scala, and Beyond the JVM
. Calling Other Languages from Spark
. The Future
. Conclusion
Chapter 8 - Testing and Validation
. Unit Testing
. Getting Test Data
. Property Checking with ScalaCheck
. Integration Testing
. Verifying Performance
. Job Validation
. Conclusion
Chapter 9 - Spark MLlib and ML
. Choosing Between Spark MLlib and Spark ML
. Working with MLlib
. Working with Spark ML
. General Serving Considerations
. Conclusion
Chapter 10 - Spark Components and Packages
. Stream Processing with Spark
. GraphX
. Using Community Packages and Libraries
. Conclusion
Appendix - Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
. Spark Tuning and Cluster Sizing
. Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
. Serialization Options