Sqoop 사용

Classic 환경에서 이용 가능합니다.

Sqoop(SQL to Hadoop)은 Hadoop과 관계형 데이터베이스 간에 데이터를 전송할 수 있도록 설계된 오픈소스 소프트웨어입니다.
간단한 CLI로 Oracle, MySQL 등의 RDBMS 특정 테이블 또는 특정 조건에 맞는 데이터를 HDFS로 쉽게 옮길 수 있으며, Hive, Pig, HBase 등으로 바로 옮겨 확인할 수 있습니다. 반대로 HDFS에 저장되어 있는 데이터를 RDBMS로 옮길 수도 있습니다.

Sqoop은 2009년 첫 버전이 나온 후 2012년에 Apache Top Level Project가 되어 지속적으로 발전 중입니다. 현재 Apache Sqoop은 Sqoop 1, Sqoop 2의 두 가지 버전으로 발전하고 있습니다. Sqoop 1은 클라이언트 방식이고, Sqoop 2는 기존 Sqoop 1 방식에 서버 사이드 방식이 추가된 버전입니다.

네이버 클라우드 플랫폼의 Cloud Hadoop에서 지원하는 클러스터 Type에는 Sqoop이 기본적으로 설치되어 있습니다.
이 가이드에서는 Sqoop을 사용하여 MySQL 데이터를 Cloud Hadoop로 복사하거나 반대로 보내는 방법에 대하여 설명합니다.
그 외의 Sqoop의 사용법은 Apache Sqoop Documentation를 참고해 주십시오.

참고

이 가이드는 MySQL 5버전을 기준으로 설명하고 있으며, MySQL 8 버전으로 마이그레이션 시에는 커넥터의 버전 업그레이드가 필요합니다.

8 버전의 커넥터 버전은 곧 업데이트 예정입니다.

사전 작업

Cloud Hadoop에서 Sqoop을 사용하기 위해서는 다음과 같은 사전 작업이 필요합니다.

항목	설명	가이드
ACG 및 ACL 설정	Cloud Hadoop 클러스터 서버와 RDBMS 서버 간의 통신 가능하도록 설정	방화벽 설정(ACG)
RDBMS	원본 데이터가 저장되어 있는 RDBMS

사전 작업 후 다음 단계를 차례대로 진행해 주십시오.

1. Hadoop 클러스터 접속

작업하고자 하는 Cloud Hadoop 클러스터의 마스터 노드에 접속해 주십시오.
클러스터 마스터 노드에 접속 방법은 SSH로 클러스터 노드에 접속을 참고해 주십시오.

2. MySQL 접속 및 가져올 대상 데이터 확인

아래 명령어를 실행하여 원격 MySQL 서버에 접속할 수 있도록 MySQL Client가 설치되어 있는지 확인해 주십시오.
- 기본적으로 마스터 노드(2대)에는 MySQL Client가 설치되어 있습니다.
```
[sshuser@m-001-example-1fx8-hd ~]$ mysql -V
 mysql  Ver 14.14 Distrib 5.6.41, for Linux (x86_64) using  EditLine wrapper
```

아래 명령어를 실행한 후, MySQL 서버에 연결되었는지 확인해 주십시오.

mysql -u [클러스터 관리자 계정명] -p

접속 예시

[sshuser@m-001-example-1fx8-hd ~]$ mysql -u example -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 37
Server version: 5.6.35-log MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

주의

연결이 제대로 되지 않을 때는 MySQL DB 포트로 ACG가 잘 열려 있는지 확인하고, MySQL DB 자체의 ACL 리스트를 확인합니다.

사용자의 MySQL DB 서버에서 Hadoop 클러스터 HDFS로 가져올 데이터를 확인해 주십시오.

아래 예시는 sqoop이라는 데이터베이스에 SQOOP_TEST 테이블이 미리 생성된 상태입니다.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sqoop              |
+--------------------+
4 rows in set (0.00 sec)

mysql> use sqoop
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| SQOOP_TEST      |
+-----------------+
1 row in set (0.00 sec)

mysql> select * from SQOOP_TEST;
+--------------+--------+-----------+------+---------------------+------+------+--------+
| SQOOP_TESTNO | ENAME  | JOB       | MGR  | HIREDATE            | SAL  | COMM | DEPTNO |
+--------------+--------+-----------+------+---------------------+------+------+--------+
|         7369 | SMITH  | CLERK     | 7902 | 1980-12-17 00:00:00 |  800 | NULL |     20 |
|         7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 00:00:00 | 1600 |  300 |     30 |
|         7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 00:00:00 | 1250 |  500 |     30 |
|         7566 | JONES  | MANAGER   | 7839 | 1981-04-02 00:00:00 | 2975 | NULL |     20 |
|         7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 00:00:00 | 1250 | 1400 |     30 |
|         7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 00:00:00 | 2850 | NULL |     30 |
|         7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 00:00:00 | 2450 | NULL |     10 |
|         7788 | SCOTT  | ANALYST   | 7566 | 1987-04-19 00:00:00 | 3000 | NULL |     20 |
|         7839 | KING   | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000 | NULL |     10 |
|         7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 00:00:00 | 1500 |    0 |     30 |
|         7876 | ADAMS  | CLERK     | 7788 | 1987-05-23 00:00:00 | 1100 | NULL |     20 |
|         7900 | JAMES  | CLERK     | 7698 | 1981-12-03 00:00:00 |  950 | NULL |     30 |
|         7902 | FORD   | ANALYST   | 7566 | 1981-12-03 00:00:00 | 3000 | NULL |     20 |
|         7934 | MILLER | CLERK     | 7782 | 1982-01-23 00:00:00 | 1300 | NULL |     10 |
+--------------+--------+-----------+------+---------------------+------+------+--------+
14 rows in set (0.00 sec)

3. Sqoop 명령어로 데이터 가져오기

sudo su - 명령어를 실행하여 root 계정으로 전환해 주십시오.

[sshuser@m-001-example-1fx8-hd ~]$ sudo su -
[root@m-001-example-1fx8-hd ~]#

import Sqoop 명령어를 이용하여 데이터를 임포트해 주십시오.
- import 명령어로 특정 데이터베이스, 특정 테이블, 쿼리 수행 결과 등 가져올 수 있습니다.

sqoop import --connect jdbc:mysql://[마스터 노드 Private IP]:3306/[데이터베이스이름] --username [클러스터 관리자 계정명] --password [클러스터 관리자 패스워드] --table [대상테이블]

import 명령어로 특정 데이터베이스, 특정 테이블, 쿼리 수행 결과 등 가져올 수 있습니다. Sqoop import의 일반적인 옵션은 아래 표를 참고해 주십시오.

인수	설명
`--connect <jdbc-uri>`	JDBC 연결 문자열 지정
`--connection-manager <class-name>`	사용할 연결 관리자 클래스 지정
`--driver <class-name>`	사용할 JDBC 드라이버 클래스를 수동으로 지정할 때 사용
`--hadoop-home <dir>`	$ HADOOP_HOME 지정
`--help`	도움말
`-P`	콘솔에서 비밀번호 읽기
`--password <password>`	인증 암호 설정
`--username <username>`	인증 사용자 이름 설정
`--verbose`	작업하는 동안 더 많은 정보를 출력할 때 사용
`--connection-param-file <filename>`	연결 매개 변수를 제공하는 선택적 특성 파일 지정

Sqoop은 MapReduce 작업을 통해서 데이터를 가져옵니다. 아래는 가장 일반적인 방법으로 특정 데이터베이스의 특정 테이블을 가져오는 예제입니다.

[root@m-001-example-1fx8-hd ~]# sqoop import --connect jdbc:mysql://[마스터 노드 Private IP]:3306/sqoop \
--username [클러스터 관리자 계정명] \
-P --table sqoop_test \
--hive-import --create-hive-table --hive-table sqoop_workspace.sqoop_test -m 1
...

18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
18/08/03 09:06:42 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/03 09:06:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`SQOOP_TESTNO`), MAX(`SQOOP_TESTNO`) FROM `SQOOP_TEST`
18/08/03 09:06:42 INFO db.IntegerSplitter: Split size: 141; Num splits: 4 from: 7369 to: 7934
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: number of splits:4
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532345019075_0002
18/08/03 09:06:43 INFO impl.YarnClientImpl: Submitted application application_1532345019075_0002
18/08/03 09:06:43 INFO mapreduce.Job: The url to track the job: http://m-002-obj001.hd:8088/proxy/application_1532345019075_0002/
18/08/03 09:06:43 INFO mapreduce.Job: Running job: job_1532345019075_0002
18/08/03 09:06:49 INFO mapreduce.Job: Job job_1532345019075_0002 running in uber mode : false
18/08/03 09:06:49 INFO mapreduce.Job:  map 0% reduce 0%
18/08/03 09:06:57 INFO mapreduce.Job:  map 75% reduce 0%
18/08/03 09:07:38 INFO mapreduce.Job:  map 100% reduce 0%
18/08/03 09:07:40 INFO mapreduce.Job: Job job_1532345019075_0002 completed successfully
18/08/03 09:07:40 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=683688
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                S3A: Number of bytes read=497
                S3A: Number of bytes written=853
                S3A: Number of read operations=44
                S3A: Number of large read operations=0
                S3A: Number of write operations=36
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=277580
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=55516
                Total vcore-milliseconds taken by all map tasks=55516
                Total megabyte-milliseconds taken by all map tasks=170545152
        Map-Reduce Framework
                Map input records=14
                Map output records=14
                Input split bytes=497
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=248
                CPU time spent (ms)=5690
                Physical memory (bytes) snapshot=1087877120
                Virtual memory (bytes) snapshot=12255760384
                Total committed heap usage (bytes)=702545920
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=853
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 62.4343 seconds (0 bytes/sec)
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Retrieved 14 records.

주의

import 받을 테이블은 반드시 Primary Key를 가지고 있어야 합니다. Primary Key를 가지고 있지 않다면 아래와 같은 에러가 발생합니다.

18/08/03 09:00:25 ERROR tool.ImportTool: Error during import: No primary key could be found for table SQOOP_TEST. Please specify one with --split-by or perform a sequential import with '-m 1'.

4. Hive에서 HDFS 데이터 확인

Hive를 통해 테이블이 잘 저장되어 있는지 확인해 주십시오.
- Hive 사용 방법은 Hive 사용 가이드를 참고해 주십시오.
import한 테이블이 잘 저장되어 있는지 확인하는 쿼리를 작성하고 실행해 주십시오.

SELECT * FROM sqoop_workspace.sqoop_test;

자주 하는 질문을 먼저 확인해 보십시오.

Q. Cloud hadoop 서비스의 마스터 노드에서 Sqoop Client를 사용하려고 하는데 jdbc 드라이버를 datanode에도 추가해야 하나요?
A. jdbc 드라이버를 datanode에 설치하실 필요는 없으며, Sqoop 커맨드를 사용하시는 서버에 jdbc 드라이버가 있으면 사용할 수 있습니다.