Spark SQL Techniques
Pivot and Unpivot
Rows-to-columns (行转列) and columns-to-rows (列转行)
Columns-to-rows (unpivot) usually takes several fields (columns) of the same kind in a table and collapses them, so that the values of a new field are the previous field names.
Rows-to-columns (pivot) usually aggregates the data under one field of the table and spreads it out into multiple new fields.
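As a sketch of both directions (the `sales` and `wide_sales` tables, and their `year`, `quarter`, `amount` columns, are hypothetical names for illustration):

```sql
-- Pivot (行转列): one row per year, one column per quarter
SELECT * FROM sales
PIVOT (
  SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

-- Unpivot (列转行): turn the quarter columns back into (quarter, amount) rows;
-- the new quarter field's values are the previous column names
SELECT year, quarter, amount
FROM wide_sales
LATERAL VIEW stack(4,
  'Q1', Q1, 'Q2', Q2, 'Q3', Q3, 'Q4', Q4
) AS quarter, amount;
```

The PIVOT clause is available in Spark SQL since 2.4; for the reverse direction there is no UNPIVOT keyword in older versions, so the `stack` generator is the usual workaround.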
Both techniques rely on the SQL keywords and Spark functions below.
SQL Keywords
WITH
Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-cte.html
The WITH clause introduces a Common Table Expression (CTE).
Syntax:
WITH common_table_expression [ , ... ]
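A minimal sketch of a CTE: the WITH clause names an intermediate result set that the main query can then reference like a table (the `employees` table and its columns are hypothetical):

```sql
-- Name the aggregated result dept_total, then filter it in the main query
WITH dept_total AS (
  SELECT dept, SUM(salary) AS total_salary
  FROM employees
  GROUP BY dept
)
SELECT dept, total_salary
FROM dept_total
WHERE total_salary > 100000;
```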
LATERAL VIEW
Reference: https://spark.apache.org/docs/3.1.1/sql-ref-syntax-qry-select-lateral-view.html#content
Syntax:
LATERAL VIEW [ OUTER ] generator_function ( expression [ , ... ] ) [ table_alias ] AS column_alias [ , ... ]
This clause is placed after the original table or view; for each original output row, it appends the rows produced by generator_function. The original table then gains extra columns, becoming a wide table that can speed up lookups.
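A short sketch of the clause (the `person` table and its array column `hobbies` are hypothetical):

```sql
-- Each row of person carries an array column hobbies;
-- LATERAL VIEW explode() joins one generated row per array element
SELECT p.name, h.hobby
FROM person p
LATERAL VIEW explode(p.hobbies) h AS hobby;

-- With OUTER, rows whose array is empty or NULL are kept, with hobby = NULL
SELECT p.name, h.hobby
FROM person p
LATERAL VIEW OUTER explode(p.hobbies) h AS hobby;
```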
Spark Functions
explode(expr)
Reference: https://spark.apache.org/docs/latest/api/sql/index.html#explode
Usually used to split the elements of an array or map into separate rows. A column_name can also be passed as the parameter of the explode function.
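For example, exploding an array literal and a map literal:

```sql
-- Each array element becomes its own row
SELECT explode(array(10, 20, 30)) AS val;
-- 3 rows: 10, 20, 30

-- Each map entry becomes a (key, value) row
SELECT explode(map('a', 1, 'b', 2)) AS (key, value);
-- 2 rows: ('a', 1), ('b', 2)
```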
row_number()
Reference: https://spark.apache.org/docs/latest/api/sql/#row_number
Usually used with an OVER (PARTITION BY ...) window clause. E.g. SELECT row_number() OVER (PARTITION BY col1 ORDER BY col2) FROM table1
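A common use of this pattern is keeping only the first row per group (deduplication / top-1 per group); `table1`, `col1`, and `col2` follow the example above:

```sql
-- Number rows within each col1 group, ordered by col2,
-- then keep only the first row of each group
SELECT *
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY col1 ORDER BY col2) AS rn
  FROM table1
) t
WHERE rn = 1;
```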
nvl(expr1, expr2)
Equivalent to Scala's Option(expr1).getOrElse(expr2): returns expr1 if it is not NULL, otherwise expr2.
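For example:

```sql
SELECT nvl(NULL, 'default');    -- 'default'
SELECT nvl('value', 'default'); -- 'value'
```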