Data Version Control

🦉 ML Experiments and Data Management with Git

DVC cannot version data on its own. DVC generates the metadata and configuration files used to track data, and git tracks those DVC-generated files to provide the actual version control.

Install

Using pip:

pip install dvc

Depending on the type of the remote storage you plan to use, you might need to install optional dependencies: [s3], [gdrive], [gs], [azure], [ssh], [hdfs], [webdav], [oss]. Use [all] to include them all.
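
As a small illustration of the extras mentioned above, here is a sketch that builds the pip requirement for a WebDAV plus SSH setup. The helper name dvc_requirement is made up for this note; only the extras names come from the list above. It prints the command instead of running it.

```shell
# Hypothetical helper: build a pip requirement from a comma-separated list of extras.
dvc_requirement() {
    printf 'dvc[%s]\n' "$1"
}

# Keep the requirement quoted: unquoted, some shells treat [ ] as a glob pattern.
requirement="$(dvc_requirement webdav,ssh)"
echo "pip install \"$requirement\""
```

This prints pip install "dvc[webdav,ssh]", which can then be run as-is.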

Using snap:

snap install --classic dvc

Initializing a repository

Move to the folder that contains .git, then run

dvc init

This adds the following files.

  • .dvc/.gitignore
  • .dvc/config
  • .dvcignore
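
To confirm that the initialization went through, something like the following could be used. This is only a sketch: check_dvc_init is a made-up helper name, and it checks for nothing beyond the three files listed above.

```shell
# Hypothetical helper: succeed only if the files listed above exist under "$1"
# (defaults to the current directory).
check_dvc_init() {
    repo="${1:-.}"
    for f in .dvc/.gitignore .dvc/config .dvcignore; do
        if [ ! -f "$repo/$f" ]; then
            echo "missing: $f"
            return 1
        fi
    done
    echo "dvc init looks complete"
}
```

Usage: run check_dvc_init from the repository root right after dvc init.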

Adding a remote

-d is the same as the --default option.

## SSH/SFTP
dvc remote add -d yournas ssh://yournas/volume5/20TB_DATA/DVC/ddrm

## WebDAV Over TLS
dvc remote add -d yournas webdavs://yournas:5006/20TB_DATA/DVC/ddrm

Configuring a remote

Settings changed with modify are saved to the .dvc/config file.

With the --local option, the Git-ignored local configuration file (located in .dvc/config.local) is used instead.

Adding connection details (user settings, not repository settings)

  • dvc remote modify --local {remote-name} user {username} - set the username
  • dvc remote modify --local {remote-name} password {password} - set the password
  • dvc remote modify --local {remote-name} port {port-number} - change the port number
  • dvc remote modify {remote-name} ask_password true - whether to prompt for the password when connecting, true or false

When connecting with an SSH key:

  • dvc remote modify --local {remote-name} keyfile {/path/to/keyfile}
  • dvc remote modify {remote-name} ask_passphrase true
  • dvc remote modify --local {remote-name} passphrase mypassphrase

SSL verification:

  • dvc remote modify {remote-name} ssl_verify false - skip verification

Quick start

Not sure this is complete, but after dvc init proceed as follows.

INFORMATION

The directory named by {upload-path} in the remote storage URL below must already exist.

.dvc/config file:

[core]
    remote = yournas
    autostage = true
['remote "yournas"']
    url = webdavs://yournas:5006/20TB_DATA/DVC/{upload-path}
    ask_password = true
    ssl_verify = false

Update the hosts file and add the user information:

echo "{IP} yournas" | sudo tee -a /etc/hosts
dvc remote modify --local yournas user ${USER}
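
Running the hosts update twice leaves a duplicate line behind. A possible way around that, as a sketch only: host_line_if_missing is a made-up helper, and 192.0.2.10 is a placeholder address standing in for {IP}. The helper prints the line only when the name is not mapped yet, so the output can be piped into sudo tee -a.

```shell
# Hypothetical helper: print "IP NAME" only if NAME is not yet mapped in the hosts file.
host_line_if_missing() {
    hosts_file="$1"; ip="$2"; name="$3"
    # Look for NAME as a whole field (after the address) on a non-comment line.
    if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        printf '%s %s\n' "$ip" "$name"
    fi
}

# Usage (placeholder address):
#   host_line_if_missing /etc/hosts 192.0.2.10 yournas | sudo tee -a /etc/hosts
```

When the name is already present the helper prints nothing, so tee appends nothing.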

.dvc/config.local file:

['remote "yournas"']
    user = yourid

.dvc/.gitignore file:

/config.local
/tmp
/cache

Then add the files you want to track. For example, for the cvp/assets/*.sqlite files:

dvc add cvp/assets/*.sqlite

Whether you also add the related files with git is up to you. After that, run

dvc commit
dvc push

and you are done.

Adding tracking

Add the file or directory you want to track.

dvc add ./ddrm/assets/checkpoints/

The output looks like this:

100% Adding...|██████|1/1 [00:04,  4.35s/file]

To track the changes with git, run:

        git add ddrm/assets/checkpoints.dvc

To enable auto staging, run:

        dvc config core.autostage true

DVC itself keeps no history, so add the generated file with git:

git add ddrm/assets/checkpoints.dvc

Removing tracking

dvc remove {data-file-to-remove.dvc}

Removing files from remote storage

WARNING

Needs verification

Delete files from remote storage that are not tracked in the dvc workspace

dvc gc -w -c -r {remote-name} -f
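
Since this one is unverified and destructive, a more cautious variant may be worth keeping around. The sketch below spells the same flags out in long form and leaves off -f (--force), so DVC should still ask for confirmation before deleting anything. gc_remote is a made-up wrapper name, and the flag descriptions are my reading of the dvc gc help text.

```shell
# Hypothetical wrapper: the flags above in long form, without --force.
gc_remote() {
    remote="$1"
    # --workspace: keep only data referenced by the current workspace
    # --cloud:     also clean remote storage, not just the local cache
    # --remote:    which remote to clean
    dvc gc --workspace --cloud --remote "$remote"
}
```

Usage: gc_remote yournas. Note that with only --workspace, data referenced solely by other branches or older commits is presumably removed from the remote as well.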

Checking status

Output is based on the *.dvc files.

dvc status

But this output is hard to make sense of.

Adding "data" gives output that looks somewhat like git status.

dvc data status

For example, the output looks like this:

Not in cache:
  (use "dvc fetch <file>..." to download files)
        ddrm/assets/checkpoints/mobile_sam.pt
        ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco-cam1-epoch_300.pth
        ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco_2nd-epoch_300.pth

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        deleted: ddrm/assets/checkpoints/mobile_sam.pt
        deleted: ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco-cam1-epoch_300.pth
        deleted: ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco_2nd-epoch_300.pth
(there are other changes not tracked by dvc, use "git status" to see)

Updating tracked files (commit)

dvc add starts tracking a file that is not yet tracked. dvc commit "confirms" the modifications to a file that is already tracked and updates the cache. (It does not upload anything to the server.)

dvc commit

Git works in the order git add -> git commit, but dvc commit should be thought of separately from that flow.

Example

Right after the data is modified, check the data status:

$ opy-dvc data status
DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: demos/
(there are other changes not tracked by dvc, use "git status" to see)

The "modified:" part is printed in yellow. Just follow the hints in the output.

  • To apply the changes: "dvc commit <file>..."
  • To discard the changes: "dvc checkout <file>..."

Here we commit.

$ opy-dvc commit
outputs ['demos'] of stage: 'demos.dvc' changed. Are you sure you want to commit it? [y/n] y

If anything changed, it asks whether to commit. Enter y.

$ opy-dvc data status
DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: demos/
(there are other changes not tracked by dvc, use "git status" to see)

The "modified:" part is now printed in green. To upload to the server, push.

$ opy-dvc push
Collecting                                                                                                                       |267 [00:00,  976entry/s]
Pushing
183 files pushed
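
The steps in this example can be rolled into one function. This is a rough sketch rather than the recommended workflow: publish_data_change is a made-up name, and it assumes the target is a .dvc file such as demos.dvc that is also recorded in git. dvc commit will still ask for confirmation interactively.

```shell
# Hypothetical helper: confirm a data change locally, record it in git, then upload it.
publish_data_change() {
    target="$1"     # a .dvc file, e.g. demos.dvc
    message="$2"    # git commit message
    dvc commit "$target" &&         # update the DVC cache and the .dvc file
    git add "$target" &&            # stage the updated .dvc file
    git commit -m "$message" &&     # history lives in git, not in DVC
    dvc push "$target"              # upload the cached data to the remote
}
```

Usage: publish_data_change demos.dvc "update demos". Each step runs only if the previous one succeeded.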

Synchronizing with the remote

Download from the DVC remote:

dvc pull

Upload to the DVC remote:

dvc push

Reverting to a previous version

dvc checkout

dvc checkout is used to switch between versions of data and models. Let's see how to switch versions.

Suppose you change some hyperparameters in train.py and train a new model.

Suppose that after training, model_output contains the new model file.

dvc add model_output
git commit model_output.dvc -m "dvc: Model Update vx.x"
dvc push
git push origin master

As above, push the new version of the data/model to the dvc remote.

git checkout <commit>
dvc checkout

To switch to an earlier version of the data, check out the commit you want as above, then restore the dvc contents as they were at that commit.
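
The two commands can be wrapped as below. The fallback to dvc pull is my own addition, on the assumption that dvc checkout exits with an error when the requested version is not in the local cache; restore_data_version is a made-up name.

```shell
# Hypothetical helper: switch code and data to the state recorded at a given commit.
restore_data_version() {
    commit="$1"
    git checkout "$commit" &&
    # Assumption: dvc checkout fails if the data is missing from the local cache;
    # in that case dvc pull downloads it from the remote and checks it out.
    { dvc checkout || dvc pull; }
}
```

Usage: restore_data_version <commit>, with the same <commit> as above.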

Cache

Is A TTY

DVC_IGNORE_ISATTY=1 dvc push

Data Management

Remote Storage

DVC remotes provide access to external storage locations so that data and ML models can be tracked and shared.

Usually, these are shared between the devices or team members working on a project.

For example, you can download data artifacts created by a colleague without spending time and resources regenerating them locally.

ddrm repository initialization example

Using sftp with a Synology DiskStation (NAS) kept failing with errors during mkdirs, so I installed the WebDAV package and switched to the TLS port (5006).

dvc init
dvc remote add -d yournas webdavs://yournas:5006/20TB_DATA/DVC/ddrm
dvc config core.autostage true
dvc add ./ddrm/assets/checkpoints/
git rm -r --cached ddrm/assets/checkpoints
git add ddrm/assets/checkpoints.dvc

Troubleshooting

Client error '401 Unauthorized' for url '..'

Collecting
Fetching
ERROR: unexpected error - received 401 (Unauthorized): Client error '401 Unauthorized' for url 'https://yournas:5006/20TB_DATA/DVC/ddrm/files/md5/00'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/401

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

At minimum, the user ID must be added.

dvc remote modify --local yournas user ${USER}

If needed, have it prompt for the password as well.

dvc remote modify yournas ask_password true

unknown module name (_ssl.c:2633)

When using the webdavs protocol and running dvc push:

ERROR: unexpected error - [CONF: UNKNOWN_MODULE_NAME] unknown module name (_ssl.c:2633): [CONF: UNKNOWN_MODULE_NAME] unknown module name (_ssl.c:2633)

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Sigh.. -_-; Just install dvc with pip and use that.

output '...' is already tracked by SCM (e.g. Git)

Running dvc add ddrm/assets/checkpoints/ gave:

Computing md5 for a large file '/home/yourid/Project/ddrm/ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco-cam1-epoch_300.pth'. This is only done once.
Computing md5 for a large file '/home/yourid/Project/ddrm/ddrm/assets/checkpoints/rtmdet-ins_x_8xb16-300e_coco_2nd-epoch_300.pth'. This is only done once.
Adding...
ERROR:  output 'ddrm/assets/checkpoints' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'ddrm/assets/checkpoints'
            git commit -m "stop tracking ddrm/assets/checkpoints"

The named file/folder must first be removed from SCM version control.

git rm -r --cached 'ddrm/assets/checkpoints'
git commit -m "stop tracking ddrm/assets/checkpoints"
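
Putting this together with the dvc add step that triggered the error, the whole hand-over from Git to DVC might look like the following. This is a sketch: move_to_dvc is a made-up name, and it stages only the generated .dvc file, as the dvc add output earlier in these notes suggests.

```shell
# Hypothetical helper: stop tracking a path in Git, then track it with DVC instead.
move_to_dvc() {
    path="${1%/}"    # strip a trailing slash so "$path.dvc" is well-formed
    git rm -r --cached "$path" &&            # untrack in Git, keep the files on disk
    git commit -m "stop tracking $path" &&
    dvc add "$path" &&                       # creates "$path.dvc"
    git add "$path.dvc"                      # let Git version the pointer file
}
```

Usage: move_to_dvc ddrm/assets/checkpoints/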

See also

Favorite site