[PySpark] ์–ด๋””์—, ์™œ ์“ธ๊นŒ? | ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์™€ ML ํ™œ์šฉ ๊ฐ€์ด๋“œ

2025. 9. 16. 15:07ยท๐Ÿ’ป Programming/Distributed Computing
๋ฐ˜์‘ํ˜•

1. ๊ฐœ์š”

PySpark๋Š” Apache Spark์˜ Python API์ด๋ฉฐ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์‚ฐ ํ™˜๊ฒฝ์—์„œ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ํ‘œ์ค€ ๋„๊ตฌ์ด๋‹ค. ML ๋ฆฌ์„œ์น˜ ์—”์ง€๋‹ˆ์–ด์—๊ฒŒ PySpark๋Š” ํŒ๋‹ค์Šค์™€ ์œ ์‚ฌํ•œ ๋ฌธ๋ฒ•์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ž์›์„ ํ™œ์šฉํ•ด ํ™•์žฅ๋œ ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์ ์—์„œ ์œ ์šฉํ•˜๋‹ค. ์ด ๊ธ€์—์„œ๋Š” PySpark๋ฅผ ์–ด๋””์—, ์™œ ์‚ฌ์šฉํ•˜๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ์‹ค๋ฌด์ž๊ฐ€ ์–ด๋–ค ๋ถ€๋ถ„์„ ์ค‘์ ์ ์œผ๋กœ ์ดํ•ดํ•ด์•ผ ํ•˜๋Š”์ง€ ์„ค๋ช…ํ•œ๋‹ค.

 

2. PySpark๋Š” ์–ด๋””์— ์“ฐ๋Š”๊ฐ€

PySpark๋Š” ๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ML ์—”์ง€๋‹ˆ์–ด์—๊ฒŒ๋„ ๋ฐ์ดํ„ฐ ์ค€๋น„์™€ ์ „์ฒ˜๋ฆฌ์— ํ•ต์‹ฌ ๋„๊ตฌ๋กœ ํ™œ์šฉ๋œ๋‹ค. ํŠนํžˆ ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹์„ ๋‹ค๋ฃฐ ๋•Œ Pandas๋กœ๋Š” ์ฒ˜๋ฆฌํ•˜๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„ PySpark๊ฐ€ ํ•„์š”ํ•˜๋‹ค. ๋ฐ์ดํ„ฐ ๋ถ„์„๊ณผ ML ์‹คํ—˜์„ ์œ„ํ•ด์„œ๋Š” ๋น ๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•˜๊ณ  ํšจ์œจ์ ์œผ๋กœ ์ €์žฅํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•œ๋ฐ, PySpark๋Š” ์ด ๊ณผ์ •์„ ๋ถ„์‚ฐ ํ™˜๊ฒฝ์—์„œ ์•ˆ์ •์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

์‹ค๋ฌด์—์„œ PySpark๋ฅผ ์ž์ฃผ ํ™œ์šฉํ•˜๋Š” ์˜์—ญ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • ๋Œ€๊ทœ๋ชจ ๋กœ๊ทธ ๋ฐ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ
  • ํ•™์Šต/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ๊ณผ ์Šคํ”Œ๋ฆฟ
  • ๋‹ค์ค‘ ์กฐ์ธ ๋ฐ ํ†ต๊ณ„ ์ง‘๊ณ„
  • ์ด๋ฏธ์ง€/๋น„๋””์˜ค ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ์ธํผ๋Ÿฐ์Šค ๊ฒฐ๊ณผ ๋Œ€๋Ÿ‰ ์ง‘๊ณ„ ๋ฐ ์ง€ํ‘œ ์‚ฐ์ถœ

 

3. PySpark๋ฅผ ์™œ ์“ฐ๋Š”๊ฐ€

๋ฐ์ดํ„ฐ์™€ ML ๊ด€์ ์—์„œ PySpark์˜ ๊ฐ€์น˜๋Š” ํฌ๊ฒŒ ๋„ค ๊ฐ€์ง€๋กœ ์š”์•ฝํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฒซ์งธ, ํ™•์žฅ์„ฑ์ด๋‹ค. ๋‹จ์ผ ๋จธ์‹  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ดˆ๊ณผํ•˜๋Š” ๋ฐ์ดํ„ฐ๋„ ํด๋Ÿฌ์Šคํ„ฐ์— ๋ถ„์‚ฐํ•˜์—ฌ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‘˜์งธ, ์ตœ์ ํ™”๋œ ์‹คํ–‰ ์—”์ง„์ด๋‹ค. Catalyst Optimizer์™€ Tungsten ์‹คํ–‰ ์—”์ง„์„ ํ†ตํ•ด ๋ณต์žกํ•œ ์กฐ์ธ๊ณผ ์ง‘๊ณ„๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค. ์…‹์งธ, ํ‘œ์ค€ํ™”๋œ ์ƒํƒœ๊ณ„ ์—ฐ๊ณ„์„ฑ์ด๋‹ค. Parquet, ORC ๊ฐ™์€ ํฌ๋งท๊ณผ Hive Metastore, Presto/Trino ๊ฐ™์€ ๋ถ„์„ ํˆด๊ณผ ๋งค๋„๋Ÿฝ๊ฒŒ ์—ฐ๋™๋œ๋‹ค. ๋„ท์งธ, ๊ฐœ๋ฐœ ํŽธ์˜์„ฑ์ด๋‹ค. ํŒ๋‹ค์Šค์™€ ์œ ์‚ฌํ•œ API์™€ SQL ํ˜ผ์šฉ์ด ๊ฐ€๋Šฅํ•ด Python ์ค‘์‹ฌ ์›Œํฌํ”Œ๋กœ์šฐ์— ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋…น์•„๋“ ๋‹ค.

ํ•ต์‹ฌ ์žฅ์ ์„ ์ •๋ฆฌํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ
  • Catalyst Optimizer ๊ธฐ๋ฐ˜ ์„ฑ๋Šฅ ์ตœ์ ํ™”
  • ํ‘œ์ค€ ํฌ๋งท ๋ฐ ๋ฐ์ดํ„ฐ ์ƒํƒœ๊ณ„ ์—ฐ๋™
  • Python ์นœํ™”์  API์™€ ๊ฐœ๋ฐœ ํŽธ์˜์„ฑ

 

4. ์‹ค๋ฌด์ž๊ฐ€ ์•Œ์•„์•ผ ํ•  PySpark ๋™์ž‘ ์›๋ฆฌ

ML ์—”์ง€๋‹ˆ์–ด์—๊ฒŒ ์ค‘์š”ํ•œ ๊ฒƒ์€ Spark ๋‚ด๋ถ€ ์•„ํ‚คํ…์ฒ˜ ์ „์ฒด๊ฐ€ ์•„๋‹ˆ๋ผ, ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ๊ณ  ์ตœ์ ํ™”ํ•ด์•ผ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๊ธฐ๋ณธ ์›๋ฆฌ์ด๋‹ค.

  • Lazy evaluation: Transformation์€ ์ฆ‰์‹œ ์‹คํ–‰๋˜์ง€ ์•Š๊ณ  Action ํ˜ธ์ถœ ์‹œ Job์ด ์‹คํ–‰๋œ๋‹ค.
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ ๊ฒ€ํ•  ๋•Œ๋Š” show(), count() ๊ฐ™์€ ๊ฐ€๋ฒผ์šด Action๋งŒ ์‚ฌ์šฉ
    • ์—ฌ๋Ÿฌ ๋ฒˆ ์žฌ์‚ฌ์šฉํ•  DataFrame์€ cache()๋‚˜ persist() ํ›„ ํ•œ ๋ฒˆ๋งŒ Action์œผ๋กœ ๋ฌผ๋ฆฌํ™”
  • Job → Stage → Task ๊ตฌ์กฐ: ์กฐ์ธ์ด๋‚˜ ์ง‘๊ณ„ ๊ฐ™์€ ์—ฐ์‚ฐ์€ ์…”ํ”Œ์„ ๋ฐœ์ƒ์‹œํ‚ค๋ฉฐ Stage๋กœ ๋ถ„๋ฆฌ๋œ๋‹ค.
    • ํ•„ํ„ฐ/์ปฌ๋Ÿผ ์„ ํƒ์€ ์กฐ์ธ ์ „์— ์ ์šฉํ•ด ์…”ํ”Œ ์ „์— ๋ฐ์ดํ„ฐ๋Ÿ‰ ์ตœ์†Œํ™”
    • ์ž‘์€ ์ฐธ์กฐ ํ…Œ์ด๋ธ”์€ broadcast() ์กฐ์ธ์œผ๋กœ ์…”ํ”Œ ์ค„์ด๊ธฐ
  • Narrow vs Wide Transformation: map, filter๋Š” ์…”ํ”Œ์ด ์—†์ง€๋งŒ, groupBy, join์€ ์…”ํ”Œ์„ ์œ ๋ฐœํ•œ๋‹ค.
    • wide ์—ฐ์‚ฐ(์ง‘๊ณ„/์กฐ์ธ)์€ ์ตœ๋Œ€ํ•œ ์•ž๋‹จ์—์„œ ๋ฐ์ดํ„ฐ ์ค„์ธ ๋’ค ์‹คํ–‰
  • Partition๊ณผ ํŒŒ์ผ ํฌ๊ธฐ ๊ด€๋ฆฌ: Parquet ์ถœ๋ ฅ์€ 128~512MB ํŒŒ์ผ ํฌ๊ธฐ๋ฅผ ๊ถŒ์žฅํ•œ๋‹ค.
    • ๊ฒฐ๊ณผ ์ €์žฅ ์ „ repartition()์œผ๋กœ ํŒŒํ‹ฐ์…˜ ์ˆ˜ ์กฐ์ •
    • ์ž‘์€ ํŒŒ์ผ์ด ๋งŽ์•„์ง€๋ฉด coalesce()๋กœ ๋ณ‘ํ•ฉ
    • ์ถœ๋ ฅ ํฌ๊ธฐ์— ๋งž์ถฐ ์ ์ ˆํ•œ ํŒŒํ‹ฐ์…˜ ์ˆ˜ ์‚ฐ์ • (์˜ˆ: 200GB ÷ 256MB ≈ 800 ํŒŒํ‹ฐ์…˜)

์ด ์›๋ฆฌ๋“ค์„ ์ดํ•ดํ•˜๋ฉด ๋ถˆํ•„์š”ํ•œ ์…”ํ”Œ๊ณผ ์ž‘์€ ํŒŒ์ผ ๋ฌธ์ œ๋ฅผ ์ค„์ด๊ณ , ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ์˜ˆ๋ฐฉํ•  ์ˆ˜ ์žˆ๋‹ค.

 

5. ์‹ค๋ฌด ํ™œ์šฉ ์˜ˆ์‹œ

PySpark๋Š” ๋ฐ์ดํ„ฐ ์ค€๋น„ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ์‹ค์šฉ์ ์œผ๋กœ ์“ฐ์ธ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, RDS ๊ฐ™์€ DB์—์„œ ์›์ฒœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์–ด์™€ ํ•™์Šต์šฉ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•˜๋Š” ๊ณผ์ •์„ ์ƒ๊ฐํ•ด๋ณด์ž.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, broadcast, expr

spark = (SparkSession.builder.appName("etl-orders").getOrCreate())

# 1. ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ (JDBC)
orders = (spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable","orders")
  .option("user", DB_USER)
  .option("password", DB_PASS)
  .option("partitionColumn","id")
  .option("lowerBound",1).option("upperBound",10_000_000).option("numPartitions",128)
  .load()
  .select("id","user_id","category_id","price","created_at")
  .filter(col("created_at") >= expr("date_sub(current_date(), 30)")))

# 2. ์ฐธ์กฐ ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ (Parquet)
cats = spark.read.parquet("s3://bucket/dim/categories/").select("id","display_name")

# 3. ์กฐ์ธ๊ณผ ์ „์ฒ˜๋ฆฌ
joined = (orders.alias("o")
  .join(broadcast(cats.alias("c")), col("o.category_id") == col("c.id"), "left")
  .select(col("o.*"), col("c.display_name").alias("category")))

# 4. ์ง‘๊ณ„ ๋ฐ ์ €์žฅ
mart = (joined.groupBy("category")
  .agg(expr("count(*) as n"), expr("avg(price) as avg_price"), expr("sum(price) as revenue"))
  .orderBy(col("revenue").desc()))

(mart.repartition(64)
  .write.mode("overwrite")
  .parquet("s3://bucket/mart/orders_30d/"))

spark.stop()

์ด ์˜ˆ์‹œ๋Š” DB์—์„œ ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ → ์กฐ์ธ ๋ฐ ์ „์ฒ˜๋ฆฌ → ์ง‘๊ณ„ → Parquet ์ €์žฅ๊นŒ์ง€ ์ด์–ด์ง€๋Š” ์ „ํ˜•์ ์ธ ML ๋ฐ์ดํ„ฐ ์ค€๋น„ ์›Œํฌํ”Œ๋กœ์šฐ์ด๋‹ค. ๋…ธํŠธ๋ถ์—์„œ ๋จผ์ € ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•˜๊ณ , ๊ฒ€์ฆ๋œ ์ฝ”๋“œ๋ฅผ spark-submit์œผ๋กœ ์ œ์ถœํ•ด ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

 

6. ๊ฒฐ๋ก 

PySpark๋Š” Data / ML ์—”์ง€๋‹ˆ์–ด๊ฐ€ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฅผ ์•ˆ์ •์ ์œผ๋กœ ์ „์ฒ˜๋ฆฌํ•˜๊ณ  ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ์„ฑํ•  ๋•Œ ํ•„์ˆ˜์ ์ธ ๋„๊ตฌ์ด๋‹ค. ํŒ๋‹ค์Šค์˜ ์ง๊ด€์„ฑ๊ณผ Spark์˜ ํ™•์žฅ์„ฑ์„ ๊ฒฐํ•ฉํ•ด, ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋‹จ๊ณ„์—์„œ ๋ณ‘๋ชฉ์„ ์ค„์ด๊ณ  ์—ฐ๊ตฌ์™€ ์‹คํ—˜์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค. ์ค‘์š”ํ•œ ๊ฒƒ์€ DataFrame API์™€ ๊ธฐ๋ณธ ์ตœ์ ํ™” ์›๋ฆฌ๋ฅผ ์ˆ™์ง€ํ•˜๊ณ , ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ์•ˆ์ •์ ์œผ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์ˆ˜์ค€๊นŒ์ง€ ์ตํžˆ๋Š” ๊ฒƒ์ด๋‹ค.

๋ฐ˜์‘ํ˜•

'๐Ÿ’ป Programming > Distributed Computing' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Ray] ๋ถ„์‚ฐ ์‹คํ–‰ ํ”„๋ ˆ์ž„์›Œํฌ Ray ์„ค๋ช…  (0) 2025.09.17
[PySpark] Spark Job ์‹คํ–‰ ๊ฐ€์ด๋“œ: Ad-hoc vs Batch  (0) 2025.09.17
[PySpark] ์ฃผ์š” ์—ฐ์‚ฐ ๊ฐ€์ด๋“œ: Transformation, Action  (0) 2025.09.16
[PySpark] ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ณธ๊ธฐ: ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด  (0) 2025.09.16
[PySpark] ์ž์ฃผ ์“ฐ๋Š” ๊ธฐ๋Šฅ ๋ฉ”์„œ๋“œ ์ •๋ฆฌ  (0) 2025.05.12
'๐Ÿ’ป Programming/Distributed Computing' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [PySpark] Spark Job ์‹คํ–‰ ๊ฐ€์ด๋“œ: Ad-hoc vs Batch
  • [PySpark] ์ฃผ์š” ์—ฐ์‚ฐ ๊ฐ€์ด๋“œ: Transformation, Action
  • [PySpark] ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ณธ๊ธฐ: ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด
  • [PySpark] ์ž์ฃผ ์“ฐ๋Š” ๊ธฐ๋Šฅ ๋ฉ”์„œ๋“œ ์ •๋ฆฌ
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (216)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • etc. (3)
      • ๐Ÿ› Research (78)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (8)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • etc. (7)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[PySpark] ์–ด๋””์—, ์™œ ์“ธ๊นŒ? | ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์™€ ML ํ™œ์šฉ ๊ฐ€์ด๋“œ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”