The previous article gave a brief introduction to block devices. Although the main goal of this series is to introduce Flashcache, this installment still will not get to that topic, because we first need to understand what device mapper is.
Suppose a host has several disks installed. A single disk is limited in both capacity and performance, so if multiple disks can be combined into one logical whole, then from this host's perspective we have achieved "cloud storage" in the simplest sense. There are many ways to accomplish this: hardware RAID cards, the replica mechanism of today's popular distributed file systems, and so on. The Linux kernel also recognized this need, and device mapper appeared in 2.6. Of course, device mapper does more than satisfy this one need; it also supports multipath I/O, among other things.
3. Device Mapper
Simply put, Device Mapper is a mechanism for combining multiple block devices into one logical block device.
The Device Mapper implementation is organized into three main layers: the mapped device exposed upward as the logical device, the mapping table, and the underlying target devices.
(Figure source: reference [1])
The kernel itself ships with several target device types (the interface is declared in linux/include/linux/device-mapper.h), such as linear, striped, mirror, and snapshot.
Target type is a modular plugin interface that allows custom types to be defined. Flashcache uses exactly this interface to define a new target type: it treats the SSD and the ordinary disk as two underlying target devices, designs the cache mapping rules between them, and logically combines them into a new kind of block device.
3.1 mapped_device
mapped_device defines the logical device; as far as the kernel is concerned, the logical device can be treated as an ordinary block_device.
struct mapped_device {
	struct request_queue *queue;
	struct gendisk *disk;
	char name[16];
	void *interface_ptr;
	struct workqueue_struct *wq;
	struct dm_table *map;
	struct bio_set *bs;
	struct block_device *bdev;
	make_request_fn *saved_make_request_fn;
	...
};
3.2 dm_table
dm_table describes the mapping between the logical device and the physical devices.
struct dm_table {
	struct mapped_device *md;
	atomic_t holders;
	unsigned type;
	unsigned int num_targets;
	struct dm_target *targets;
	struct list_head devices;
	fmode_t mode;
	...
};
3.3 dm_target
dm_target defines a concrete target device.
struct dm_target {
	struct dm_table *table;
	struct target_type *type;
	void *private;
	...
};
3.4 target_type
target_type defines a type of target device. It contains several particularly important hooks, such as ctr (constructor), dtr (destructor), map (I/O mapping), and end_io (I/O completion):
struct target_type {
	uint64_t features;
	const char *name;
	struct module *module;
	unsigned version[3];
	dm_ctr_fn ctr;
	dm_dtr_fn dtr;
	dm_map_fn map;
	dm_map_request_fn map_rq;
	dm_endio_fn end_io;
	dm_request_endio_fn rq_end_io;
	dm_flush_fn flush;
	dm_presuspend_fn presuspend;
	dm_postsuspend_fn postsuspend;
	dm_preresume_fn preresume;
	dm_resume_fn resume;
	dm_status_fn status;
	dm_message_fn message;
	dm_ioctl_fn ioctl;
	dm_merge_fn merge;
	dm_busy_fn busy;
	dm_iterate_devices_fn iterate_devices;
	dm_io_hints_fn io_hints;

	/* For internal device-mapper use. */
	struct list_head list;
};
3.5 dm_register_target
The dm_register_target function registers a new target type.
int dm_register_target(struct target_type *tt)
{
	...
	list_add(&tt->list, &_targets);
	...
}
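To make the plugin interface concrete, here is a minimal sketch of a hypothetical pass-through target type, in the style of the ~2.6.32 API shown above. All names (passthrough_ctr and friends) are invented for illustration, and signatures such as dm_get_device vary across kernel versions; this is a sketch, not a buildable module.

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>

/* Hypothetical target: forwards every bio to one underlying device. */
struct passthrough_ctx {
	struct dm_dev *dev;
};

/* ctr: parse the dmsetup arguments and grab the underlying device. */
static int passthrough_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct passthrough_ctx *ctx;

	if (argc != 1) {
		ti->error = "expected exactly one device path";
		return -EINVAL;
	}
	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;
	/* dm_get_device's signature differs between kernel versions. */
	if (dm_get_device(ti, argv[0], 0, ti->len,
			  dm_table_get_mode(ti->table), &ctx->dev)) {
		kfree(ctx);
		ti->error = "device lookup failed";
		return -EINVAL;
	}
	ti->private = ctx;
	return 0;
}

static void passthrough_dtr(struct dm_target *ti)
{
	struct passthrough_ctx *ctx = ti->private;

	dm_put_device(ti, ctx->dev);
	kfree(ctx);
}

/* map: redirect the bio to the underlying device; a real target would
 * also adjust bio->bi_sector here. */
static int passthrough_map(struct dm_target *ti, struct bio *bio,
			   union map_info *map_context)
{
	struct passthrough_ctx *ctx = ti->private;

	bio->bi_bdev = ctx->dev->bdev;
	return DM_MAPIO_REMAPPED;
}

static struct target_type passthrough_target = {
	.name    = "passthrough",
	.version = {1, 0, 0},
	.module  = THIS_MODULE,
	.ctr     = passthrough_ctr,
	.dtr     = passthrough_dtr,
	.map     = passthrough_map,
};

static int __init passthrough_init(void)
{
	return dm_register_target(&passthrough_target);
}
```

Once such a module is loaded, dmsetup could instantiate the target by name ("passthrough"), which is exactly how flashcache_create wires up Flashcache's own target type.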
3.6 dm_io
dm-io provides synchronous and asynchronous I/O services for device mapper.
To use dm-io you must fill in a dm_io_region structure (called io_region before 2.6.26), which describes the region the I/O operates on. A read is usually issued against a single dm_io_region, while a write can target a group of dm_io_region structures.
struct dm_io_region {
	struct block_device *bdev;
	sector_t sector;
	sector_t count;		/* If this is zero the region is ignored. */
};
dm-io supports four dm_io_mem_type memory types (older kernel versions only had the first three; Flashcache mainly uses DM_IO_BVEC):
enum dm_io_mem_type {
	DM_IO_PAGE_LIST,	/* Page list */
	DM_IO_BVEC,		/* Bio vector */
	DM_IO_VMA,		/* Virtual memory area */
	DM_IO_KMEM,		/* Kernel memory */
};

struct dm_io_memory {
	enum dm_io_mem_type type;

	union {
		struct page_list *pl;
		struct bio_vec *bvec;
		void *vma;
		void *addr;
	} ptr;

	unsigned offset;
};
dm-io wraps the request in a dm_io_request structure. If dm_io_notify.fn is set, the I/O is asynchronous; otherwise it is synchronous.
struct dm_io_request {
	int bi_rw;			/* READ|WRITE - not READA */
	struct dm_io_memory mem;	/* Memory to use for io */
	struct dm_io_notify notify;	/* Synchronous if notify.fn is NULL */
	struct dm_io_client *client;	/* Client memory handler */
};
Before using the dm-io service you must create a dm_io_client structure with the dm_io_client_create function (dm_io_get before 2.6.22), which allocates a memory pool for dm-io's execution. When you are done with dm-io, call dm_io_client_destroy (dm_io_put before 2.6.22) to release the pool.
struct dm_io_client {
	mempool_t *pool;
	struct bio_set *bios;
};
The dm_io function executes the actual I/O request.
int dm_io(struct dm_io_request *io_req, unsigned num_regions,
	  struct dm_io_region *where, unsigned long *sync_error_bits)
{
	int r;
	struct dpages dp;

	r = dp_init(io_req, &dp);
	if (r)
		return r;

	if (!io_req->notify.fn)
		return sync_io(io_req->client, num_regions, where,
			       io_req->bi_rw, &dp, sync_error_bits);

	return async_io(io_req->client, num_regions, where, io_req->bi_rw,
			&dp, io_req->notify.fn, io_req->notify.context);
}
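As an illustration of the synchronous path, the following sketch (kernel-module context; the client, bdev, and buf arguments are assumed to be set up by the caller, and the dm_io_client_create signature also differs across versions) reads 4KB from the start of a device into kernel memory:

```c
#include <linux/dm-io.h>

/* Hypothetical helper: synchronously read 8 sectors (4KB) into buf. */
static int read_one_page(struct dm_io_client *client,
			 struct block_device *bdev, void *buf)
{
	unsigned long error_bits = 0;

	struct dm_io_region where = {
		.bdev   = bdev,
		.sector = 0,		/* start of the device */
		.count  = 8,		/* 8 sectors = 4KB */
	};

	struct dm_io_request req = {
		.bi_rw  = READ,
		.mem    = {
			.type     = DM_IO_KMEM,	/* buf is kernel memory */
			.ptr.addr = buf,
		},
		.notify = { .fn = NULL },	/* NULL fn => synchronous */
		.client = client,
	};

	/* Blocks until the read completes; per-region errors land in error_bits. */
	return dm_io(&req, 1, &where, &error_bits);
}
```

Setting notify.fn to a callback instead would turn the same request into an asynchronous one, with dm_io returning immediately.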
3.7 kcopyd
kcopyd provides an asynchronous copy service that can copy a group of consecutive sectors from one block device to one or more block devices. Obviously this is much needed by target types such as mirror and snapshot; in fact, Flashcache also uses it to implement its caching mechanism.
Before using kcopyd you must create a dm_kcopyd_client with the dm_kcopyd_client_create function (called kcopyd_client_create before 2.6.26).
struct dm_kcopyd_client {
	spinlock_t lock;
	struct page_list *pages;
	unsigned int nr_pages;
	unsigned int nr_free_pages;

	struct dm_io_client *io_client;

	wait_queue_head_t destroyq;
	atomic_t nr_jobs;

	mempool_t *job_pool;

	struct workqueue_struct *kcopyd_wq;
	struct work_struct kcopyd_work;

	spinlock_t job_lock;
	struct list_head complete_jobs;
	struct list_head io_jobs;
	struct list_head pages_jobs;
};
dm_kcopyd_client maintains three job queues: pages_jobs for jobs waiting for memory pages, io_jobs for jobs that have obtained pages and are waiting to issue I/O, and complete_jobs for jobs waiting for I/O completion. All three queues are protected by the job_lock spinlock, and each consists of kcopyd_job structures.
struct kcopyd_job {
	struct dm_kcopyd_client *kc;
	struct list_head list;
	unsigned long flags;

	int read_err;
	unsigned long write_err;

	int rw;			/* READ or WRITE */
	struct dm_io_region source;

	unsigned int num_dests;
	struct dm_io_region dests[DM_KCOPYD_MAX_REGIONS];

	sector_t offset;
	unsigned int nr_pages;
	struct page_list *pages;

	dm_kcopyd_notify_fn fn;
	void *context;

	struct mutex lock;
	atomic_t sub_jobs;
	sector_t progress;
};
The dm_kcopyd_copy function performs the actual copy. Since kcopyd is an asynchronous task, a completion callback of type dm_kcopyd_notify_fn must be defined in advance.
int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
		   unsigned int num_dests, struct dm_io_region *dests,
		   unsigned int flags, dm_kcopyd_notify_fn fn, void *context)
{
	struct kcopyd_job *job;

	job = mempool_alloc(kc->job_pool, GFP_NOIO);

	job->kc = kc;
	job->flags = flags;
	job->read_err = 0;
	job->write_err = 0;
	job->rw = READ;
	job->source = *from;
	job->num_dests = num_dests;
	memcpy(&job->dests, dests, sizeof(*dests) * num_dests);
	job->offset = 0;
	job->nr_pages = 0;
	job->pages = NULL;
	job->fn = fn;
	job->context = context;

	if (job->source.count < SUB_JOB_SIZE)
		dispatch_job(job);
	else {
		mutex_init(&job->lock);
		job->progress = 0;
		split_job(job);
	}

	return 0;
}
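Putting the pieces together, here is a hedged sketch of a caller that copies the first 1024 sectors from one device to another and blocks until the asynchronous copy finishes. Function names (copy_done, copy_region) are invented for illustration; a real user like Flashcache would continue doing useful work and handle errors in the callback instead of waiting:

```c
#include <linux/completion.h>
#include <linux/dm-kcopyd.h>

/* Invoked from kcopyd's workqueue when the copy completes. */
static void copy_done(int read_err, unsigned long write_err, void *context)
{
	complete((struct completion *)context);
}

static int copy_region(struct dm_kcopyd_client *kc,
		       struct block_device *src, struct block_device *dst)
{
	struct completion done;
	struct dm_io_region from = { .bdev = src, .sector = 0, .count = 1024 };
	struct dm_io_region to   = { .bdev = dst, .sector = 0, .count = 1024 };

	init_completion(&done);
	/* One source, one destination, default flags. */
	dm_kcopyd_copy(kc, &from, 1, &to, 0, copy_done, &done);
	wait_for_completion(&done);
	return 0;
}
```

Passing more than one entry in the dests array would write the same source sectors to several devices at once, which is exactly what the mirror target needs.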
4. Device Mapper Tools
The operating system provides some tools for managing device mapper; first confirm that the relevant packages are installed.
$ rpm -qa | grep device-mapper
device-mapper-libs-1.02.62-3.el6.x86_64
device-mapper-1.02.62-3.el6.x86_64
device-mapper-event-libs-1.02.62-3.el6.x86_64
device-mapper-event-1.02.62-3.el6.x86_64
4.1 dmsetup
The main Device Mapper tool is dmsetup, which can create, modify, delete, and inspect DM devices. Flashcache's creation and loading tools, flashcache_create and flashcache_load, are just wrappers that call dmsetup underneath.
The syntax for creating a new dm device is:
dmsetup create dm_device start_sector nr_sectors target argument
Different target types take different arguments; the ctr function defines how the dm device is constructed from those arguments. If the argument list is complex, the parameters can also be written to a file, and the file passed to the create command instead.
dmsetup create dm_device file_name
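As a concrete illustration (the device name and paths are examples), the kernel's built-in linear target can expose an existing partition as a new logical device:

```shell
# Table format: start_sector nr_sectors target_type target_args
# Map all sectors of /dev/sdb1 into a new logical device named "mylinear".
SECTORS=$(blockdev --getsz /dev/sdb1)
echo "0 $SECTORS linear /dev/sdb1 0" | dmsetup create mylinear
# The new device then appears as /dev/mapper/mylinear.
```

flashcache_create builds an analogous table line, except the target type is flashcache and the arguments name both the SSD and the disk.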
Here is the information for an already-created Flashcache device:
$ sudo dmsetup info
Name:              cachedev
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1

$ sudo dmsetup status
cachedev: 0 3093559296 flashcache stats:
	reads(731952), writes(554131825)
	read hits(663370), read hit percent(90)
	write hits(429687559) write hit percent(77)
	dirty write hits(185872576) dirty write hit percent(33)
	replacement(31918), write replacement(13045343)
	write invalidates(0), read invalidates(1)
	pending enqueues(13604453), pending inval(13604453)
	metadata dirties(347477454), metadata cleans(347477257)
	metadata batch(665458755) metadata ssd writes(29495864)
	cleanings(347468476) fallow cleanings(536679)
	no room(7246713) front merge(286174542) back merge(57729678)
	disk reads(68585), disk writes(368002849) ssd reads(347815066) ssd writes(562902592)
	uncached reads(1098), uncached writes(20849965), uncached IO requeue(0)
	uncached sequential reads(0), uncached sequential writes(0)
	pid_adds(2), pid_dels(2), pid_drops(0) pid_expiry(0)

$ sudo dmsetup table
cachedev: 0 3093559296 flashcache conf:
	ssd dev (/dev/fioa), disk dev (/dev/sdb1) cache mode(WRITE_BACK)
	capacity(306408M), associativity(512), data block size(4K) metadata block size(4096b)
	skip sequential thresh(0K)
	total blocks(78440448), cached blocks(77048243), cache percent(98)
	dirty blocks(102), dirty percent(0)
	nr_queued(0)
Size Hist: 1024:2 4096:555006431
To be continued.
References:
[1]. Understanding Device-mapper in Linux 2.6 Kernel [Oracle Support ID 456239.1]
[2]. The Device Mapper mechanism in the Linux kernel
[3]. Device-mapper Resource Page
[4]. Linux kernel documentation: dm-io.txt
[5]. Linux kernel documentation: kcopyd.txt