Data Loading into Bigquery
For this content, listen to the Data Warehouse of Part 2 of the Course Preparing for Google Cloud Certification: Cloud Data Engineer specialized course and organize the lecture contents in my own expression. Write it in a Closed Book and fill in the missing content while reviewing it again.
Data Loading
To make a query request to Bigquery, you must be able to access data. There are several methods, among which there is a method of loading data into Bigquery..
EL ,ELT , ETL
There are three data loading methods. There are EL, ELT and ETL. EL is loading without a separate transformation. Not going through transformation means that the data is already in usable form. ELT and ETL are methods used when transformation is required. It is the difference between Transformation loading and before loading. In the case of Bigquery, it is a data warehouse specialized for reading. In other words, it can be said that it is not specialized for UPDATE. Not sure how it is now, but as far as I know, Bigquery doesn't allow more than 1500 updates per table. If you think you need transformation before loading, you can use ETL.
EL in Bigquery
As mentioned earlier, EL in Bigquery does not require transformation, and if data is usable, just load it.
supported format
Supported format supports various formats such as CSV, JSON, PARQUET, etc. CSV and JSON automatically determine the schema with the autodetect schema option. However, it is recommended that you perform manual verification. AVR, ORC, and PARQUET are self-describing formats. No need to use auto-detection. You don't need to provide an inline-schema. Bigquery can determine the schema. In addition to this, the user can manually define the schema. However, this is not recommended.
Bigquery supports the gzip format. However, Bigquery's performance is good when loading uncompressed files. Similarly, Cloud Storage shows better performance when uploading multiple large files than uploading multiple small files.
There are cases where you have to choose between a time-sensitive scenario and a scenario where there are bandwidth and time constraints and loading uncompressed files. You should do a quick loading test to see which alternative works best.
DTS(data transfer system)
Big query supports data transfer system. Loading files in various formats has been confirmed above. Big query can load data through API. You can load data into Bigquery from anywhere you can run your code. For example, in Google Cloud, we have compute engine, container on kubernetess engine, app engine, and cloud function. You can load by using the API of these services. However, this requires you to create a custom data preprocess function. Therefore, in reality, Bigquery loads data using APIs in GCP’s dataproc and datflow. DataProc and Dataflow are services that process data.
Accroding to what I know, Hadoop, Spark corrspond to Dataproc, and Apache Airflow correspond to Dataflow.
DataProc and Dataflow provide a process function so we don’t have to rewrite the process function. (The description may be inaccurate. If there are any inaccuracies, please give feedback) When loading through other SaaS (Software as a service), the pre-built transformation function performs transformation before loading. (I thought it was ETL, but in lectures, I classify it as EL category.)
From Cloud Storage
Upload from other on-premises to Cloud Storage. Then , You can load from Cloud Storage to Bigquery.
gsutil -m cp *.csv gs:://mybucket
bq load ...
The process of loading with Bigquery can be automated as a scheduled query. The query should be written in standard SQL. This SQL includes data definition language (DDL) and data manipulation language (DML). Query string and destination table are parameterized and performed regularly every day.
. Bigquery keeps query history in cache for 7 days. Therefore, you can query a snapshot of data for a specific time period. In other words, you can easily revert to previous data. Data can be recovered only when there is no table with the same ID in the dataset. In the case of the Streaming Pipeline, you cannot recover because you can create an empty table.
From Saas
DTS regularly provides transformation, scheduling
Backfill in DTS
For example, let’s assume a scenario where a banker sends a receipt to a data warehouse.
The bank transmits data periodically, but there may be cases where the banker cannot send the record by that time.
In this case, it should be possible to detect that not all records have been entered into the data. , Requests to fill in these missing data are called data backfills.
DTS in Bigquery
Since DTS is a managed service, there is no need to worry about overhead maintaining, secure, etc. The old data transfer service required a lot of code, but DTS in Bigquery doesn’t. The core of DTS is that automatic and scheduled transfer is possible. Using this, you can transfer data from other on-premises.
How DTS work?
Bigquery can perform data transfers from over 100 different SaaS applications.
You do not need to use Cloud Storage and you can transfer from Saas using DTS.
ELT in Bigquery
This time, let’s look at ELT in Bigquery.
Loading data and Transforming in Bigquery
The Transformation process has been added to the photo shown from EL . Loading from other on-premises or other cloud services to Cloud storage remains the same.
You can transform data with DML. However, as mentioned above, DML has a daily quota of 1500 per table.
Bigquery can also perform transformations by creating a new table with CREATE or REPLACE TABLE.
Fedearated query is possible in Bigquery, but it is better in terms of performance to load query with Bigquery. However, it has the disadvantage of incurring additional storage cost.
It can also be loaded from other cloud services in GCP. However, export is possible only for cloud storage. (According to the GCP PDE test standards, yes, but I am not sure now.)
Custom Transformation
What if the transformation requires more transformation than Bigquery provides? All we have to do is write a Transformation function. Bigquery provides user defined function (UDF) functionality. Because Bigquery optimizes standard sql to perform faster, it is recommended to write standard sql. Except for Sql, only Javascript is possible. (I am not familiar with JavaScript, so I will pass it.)
ETL in Bigquery
The lecture did not deal with the relevant examples separately. In my opinion, ETL is similar to EL except for the process of T (Transformation).
Leave a comment