[OpenSolaris][kernel] How the Solaris kernel loads device drivers

Source: Internet. Time: 2024/04/30 02:16
Solaris main():

1. High level boot sequence
===========================

-> lgrp_setup()            // Setup the first lgroup, and home t0

-> startup()            // Machine-dependent startup code
                // In a 32-bit OS, boot loads the kernel text at 0xfe800000 and kernel data
                // at 0xfec00000.  On a 64-bit OS, kernel text and data are loaded at
                // 0xffffffff.fe800000 and 0xffffffff.fec00000 respectively.  Those
                // addresses are fixed in the binary at link time.
--> progressbar_init()        // Initialize a rectangle area for progress bar
--> startup_init()
--> startup_xen_version()
--> startup_memlist()        // Build the memlists and other kernel essential memory system data structures
--> startup_kmem()        // Layout the kernel's part of address space and initialize kmem allocator
--> startup_vm()        // Finish initializing the VM system, now that we are no longer relying on the boot time memory allocators.
--> startup_pci_bios()        // Retrieve information from the bios needed for system configuration early during startup.
--> startup_modules()        // setup_ddi() is called at this point to create a kernel device tree.
                // First, create rootnex and then invoke bus specific code to probe devices.
--> startup_bios_disk()        // ???
--> startup_end()        // configure() is called at this point to set up devices.
                // setx86isalist() is called to set the isa_list string to the defined instruction sets we support.
                // psm_install() is called to call the prober() of each psm module
                // (*picinitf)() is called to enable interrupts. picinitf is set to mach_picinit() in ./i86pc/os/mp_machdep.c. Note on the IOMMU code: this looks like the best place to initialize and enable the Intel IOMMU. That code should not go between the first and second pass of pci_enumerate(), since the PCI device tree can change during the second pass (phantom roots, subtractive PPB and so on).
--> progressbar_start()        // progress bar related, ignore it.

-> segkmem_gc()            // kernel memory segment driver related

-> callb_init()            // Init all callb tables in the system
-> callout_init()         // Initialize all callout tables.  Called at boot time just before clkstart().
-> timer_init()            // timer_init() allocates the internal data structures used by i_timeout(), i_untimeout() and the timer. Timer must be initialized before cyclic starts
-> cbe_init()            // initialize cyclic back end
-> clock_tick_init_pre()        // Clock tick initialization
-> clock_init()            // initialize clock system
-> init_mstate()
-> init_cpu_mstate()        // On some platforms, clkinitf() changes the timing source that gethrtime_unscaled() uses to generate timestamps.  cbe_init() calls clkinitf(), so re-initialize the microstate counters after the timesource has been chosen.
-> lgrp_plat_probe()

-> (**initptr)()        // Call all system initialization functions. Initialization functions are saved in init_tbl[]. For each initptr, call the function.

-> vm_init()            // vm subsystem related initialization
-> physio_bufs_init()        // initialize buffer pool for raw I/O requests
-> XXX                // Drop the interrupt level and allow interrupts. At this point the DDI guarantees that interrupts are enabled
-> vfs_mountroot()        // Mount the root file system. errorq_init(), cpu_kstat_init(CPU), and ddi_walk_devs(ddi_root_node(), pm_adjust_timestamps, NULL) are called after the root file system is mounted.

-> post_startup()

2. setup_ddi
============

invoking stack:
main()
->startup()
-->startup_modules()
--->setup_ddi()

(1) impl_ddi_init_nodeid()
Keep a sorted free list of available nodeids. Allocating a nodeid won't cause memory allocation. Freeing a nodeid does cause memory allocation. The list node is defined as

struct available {
        uint32_t nodeid;
        uint32_t count;
        struct available *next;
        struct available *prev;
};

A mutex, nodeid_lock, protects the nodeid list. The list head is pointed to by nhead. impl_ddi_init_nodeid() initializes the list and the mutex.
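
The range-based free list described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the struct layout and the name nhead come from the text, while nodeid_init(), nodeid_alloc() and nodeid_free() are hypothetical names, locking with nodeid_lock is left out, and adjacent ranges on both sides of a freed id are not coalesced.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct available {
	uint32_t nodeid;		/* first free id in this range */
	uint32_t count;			/* number of free ids in the range */
	struct available *next;
	struct available *prev;
};

static struct available *nhead;		/* sorted list of free ranges */

/* Start with one range covering [first, first + count). */
static void
nodeid_init(uint32_t first, uint32_t count)
{
	nhead = calloc(1, sizeof (*nhead));
	assert(nhead != NULL);
	nhead->nodeid = first;
	nhead->count = count;
}

/* Take the lowest free id. No memory is allocated on this path. */
static int
nodeid_alloc(uint32_t *idp)
{
	struct available *np = nhead;

	if (np == NULL)
		return (-1);
	*idp = np->nodeid++;
	if (--np->count == 0) {		/* range exhausted: unlink it */
		nhead = np->next;
		if (nhead != NULL)
			nhead->prev = NULL;
		free(np);
	}
	return (0);
}

/* Return an id: extend a neighbouring range, else allocate a new node. */
static void
nodeid_free(uint32_t id)
{
	struct available *np, *prev = NULL, *new;

	for (np = nhead; np != NULL && np->nodeid < id; np = np->next)
		prev = np;
	if (prev != NULL && prev->nodeid + prev->count == id) {
		prev->count++;		/* grow the previous range upward */
		return;
	}
	if (np != NULL && id + 1 == np->nodeid) {
		np->nodeid = id;	/* grow the next range downward */
		np->count++;
		return;
	}
	new = calloc(1, sizeof (*new));	/* the allocating path */
	assert(new != NULL);
	new->nodeid = id;
	new->count = 1;
	new->next = np;
	new->prev = prev;
	if (np != NULL)
		np->prev = new;
	if (prev != NULL)
		prev->next = new;
	else
		nhead = new;
}
```

Keeping ranges rather than individual ids is what lets allocation avoid memory allocation: taking an id only shrinks an existing range, while freeing an id that touches no existing range needs a new list node.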

(2) impl_create_root_class()
Create classes and major number bindings for the name of my root. Called immediately before 'loadrootmodules'. rootname = 'i86pc', platform = 'i86pc'

(3) create_devinfo_tree()
Initialize the dev_info memory cache.

Allocate a dev_info for the root and hold it.

Add the root dip to the devimap list, which is defined as
struct devi_nodeid_list {
        kmutex_t dno_lock;              /* Protects other fields */
        struct devi_nodeid *dno_head;   /* list of devi nodeid elements */
        struct devi_nodeid *dno_free;   /* Free list */
        uint_t dno_list_length;         /* number of dips in list */
};
each node is an instance of
struct devi_nodeid {
        pnode_t nodeid;
        dev_info_t *dip;
        struct devi_nodeid *next;
};

Bind the root node and set the node state to DS_BOUND. Each driver is represented by a struct devnames:
struct devnames {
        char            *dn_name;       /* Name of this driver */
        int             dn_flags;       /* per-driver flags, see below */
        struct par_list *dn_pl;         /* parent list, for making devinfos */
        kmutex_t        dn_lock;        /* Per driver lock (see below) */
        dev_info_t      *dn_head;       /* Head of instance list */
        int             dn_instance;    /* Next instance no. to assign */
        void            *dn_inlist;     /* instance # nodes for this driver */
        ddi_prop_list_t *dn_global_prop_ptr; /* per-driver global properties */
        kcondvar_t      dn_wait;        /* for ddi_hold_installed_driver */
        kthread_id_t    dn_busy_thread; /* for debugging only */
        struct mperm    *dn_mperm;      /* minor permissions */
        struct mperm    *dn_mperm_wild; /* default minor permission */
        struct mperm    *dn_mperm_clone; /* minor permission, clone use */
};
All of these structures live in the array starting at devnamesp, indexed by the driver's major number.

Record that devinfos have been made for "rootnex". di_dfs() is used to read the PROM because, unlike ddi_walk_devs(), it doesn't get the next sibling until the function returns. On x86 there is no PROM, so the device tree is created by probing PCI config space; impl_setup_ddi() is called in this case. That function creates some child nodes (ramdisk, isa) and calls get_boot_properties() to read in the properties from boot (VGA properties are obtained separately by get_vga_properties()). Then impl_bus_initialprobe() is called to do the bus-dependent probes. It modloads the PROM simulator and lets it probe to verify the existence and type of PCI support. On a Xen system, xpv_autoconfig and pci_autoconfigure are modloaded; otherwise only pci_autoconfigure is loaded. Each module hooks its own probe function into the bus_probes list. After the modload, each bus probe function in the bus_probes list is invoked.

Now the system device tree looks like:

                      rootnex
                         |
     --------------------------------------------
     |            |              |              |
  ramdisk        pci            isa            xpv
                  |                             |
           ---------------               ---------------
           pci device tree               xpv device tree

At this point only the dev_info tree has been created; device drivers are not loaded yet. ndi_devi_bind_driver() binds a driver to a given device, which means the dip node is put on the per-driver list. It holds the parent dip's lock and calls i_ndi_config_node() to do the binding. If it fails to bind the driver, it returns an appropriate error. Some drivers may want to know if they actually failed to bind.

Each dev_info node is in exactly one of the following states:
/*
 * Definitions for node state.
 *
 * NOTE: DS_ATTACHED and DS_READY should only be used by the devcfg.c state
 * model code itself, other code should use i_ddi_devi_attached() to avoid
 * logic errors associated with transient DS_READY->DS_ATTACHED->DS_READY
 * state changes while the node is attached.
 */
typedef enum {
        DS_INVAL = -1,
        DS_PROTO = 0,
        DS_LINKED,      /* in orphan list */
        DS_BOUND,       /* in per-driver list */
        DS_INITIALIZED, /* bus address assigned */
        DS_PROBED,      /* device known to exist */
        DS_ATTACHED,    /* don't use, see NOTE above: driver attached */
        DS_READY        /* don't use, see NOTE above: post attach complete */
} ddi_node_state_t;
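
The NOTE in the comment above can be illustrated with a small sketch. The enum is copied from the listing; struct demo_node and demo_node_attached() are hypothetical names used only here, and the ">= DS_ATTACHED" comparison is my assumption about how a check like i_ddi_devi_attached() stays immune to the transient DS_READY->DS_ATTACHED->DS_READY flips.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
	DS_INVAL = -1,
	DS_PROTO = 0,
	DS_LINKED,	/* in orphan list */
	DS_BOUND,	/* in per-driver list */
	DS_INITIALIZED,	/* bus address assigned */
	DS_PROBED,	/* device known to exist */
	DS_ATTACHED,	/* driver attached */
	DS_READY	/* post attach complete */
} ddi_node_state_t;

/* Hypothetical miniature of a devinfo node: only the state field. */
struct demo_node {
	ddi_node_state_t state;
};

/*
 * True for both DS_ATTACHED and DS_READY, so a caller never has to
 * compare against either value directly.
 */
static int
demo_node_attached(const struct demo_node *np)
{
	assert(np != NULL);
	return (np->state >= DS_ATTACHED);
}
```

Because the states are ordered, a single comparison covers every state at or beyond "driver attached", which is why outside code is told not to test for DS_ATTACHED or DS_READY by name.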

(4) e_ddi_instance_init()

Background: instance node tree

The instance tree is parallel to the dev_info tree and is rooted at e_ddi_inst_state.ins_root.
/*
 * This plus devnames defines the entire software state of the instance world.
 */
typedef struct in_softstate {
        in_node_t       *ins_root;      /* the root of our instance tree */
        in_drv_t        *ins_no_major;  /* majorless drv entries */
        /*
         * Used to serialize access to data structures
         */
        void            *ins_thread;
        kmutex_t        ins_serial;
        kcondvar_t      ins_serial_cv;
        int             ins_busy;
        char            ins_dirty;      /* need flush */
} in_softstate_t;
static in_softstate_t e_ddi_inst_state;

Each node is read and built from /etc/path_to_inst. The node is defined as,
/*
 * Each node has one or more in_drv entries hanging from it.
 * (It will have more than one if it has been driven by more than one driver
 * over its lifetime.  This can happen due to a generic name
 * or to a "compatible" name giving a more specific driver).
 */
typedef struct in_node {
        char            *in_node_name;  /* devi_node_name of this node  */
        char            *in_unit_addr;  /* address part of name         */
        struct in_node  *in_child;      /* children of this node        */
        struct in_node  *in_sibling;    /* "peers" of this node */
        struct in_drv   *in_drivers;    /* drivers bound to this node   */
        struct in_node  *in_parent;     /* parent of this node          */
} in_node_t;
typedef struct in_drv {
        char            *ind_driver_name; /* canonical name of driver   */
        int             ind_instance;     /* current instance number    */
        int             ind_state;        /* see below                  */
        /*
         * The following field is used to link instance numbers for the
         * same driver off of devnamesp or in_no_major or in_no_instance
         */
        struct in_drv   *ind_next;        /* next for this driver       */
        struct in_drv   *ind_next_drv;    /* next driver this node      */
        struct in_node  *ind_node;        /* node that these hang on    */
} in_drv_t;

e_ddi_instance_init() builds the instance node tree during boot. It reads the file /etc/path_to_inst and adds each line to the tree as a node.
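
To make the "one line, one node" idea concrete, here is a minimal user-space sketch that parses a single /etc/path_to_inst line of the form "physical path" instance "driver". It is not the kernel's parser: struct inst_entry and parse_inst_line() are hypothetical names, the buffer sizes are arbitrary, and the device path used in the usage note is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct inst_entry {
	char path[256];		/* physical device path */
	int  instance;		/* instance number */
	char driver[64];	/* driver binding name */
};

/* Parse one line of the form: "<path>" <instance> "<driver>". */
static int
parse_inst_line(const char *line, struct inst_entry *ep)
{
	assert(line != NULL && ep != NULL);
	if (line[0] == '#')		/* comment line */
		return (-1);
	if (sscanf(line, " \"%255[^\"]\" %d \"%63[^\"]\"",
	    ep->path, &ep->instance, ep->driver) != 3)
		return (-1);
	return (0);
}
```

For a line such as "/pci@0,0/pci-ide@1f,1" 0 "pci-ide", the three fields correspond to the path that locates the in_node, the ind_instance value, and the ind_driver_name of the in_drv entry hanging off that node.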

(5) impl_ddi_callback_init()

callbacks are handled using a L1/L2 cache. The L1 cache
comes out of kmem_cache_alloc and can expand/shrink dynamically. If
we can't get callbacks from the L1 cache [because pageout is doing
I/O at the time freemem is 0], we allocate callbacks out of the
L2 cache. The L2 cache is static and depends on the memory size.
[We might also count the number of devices at probe time and
allocate one structure per device and adjust for deferred attach]

/*
 * Callback definitions
 */
struct ddi_callback {
        struct ddi_callback     *c_nfree;
        struct ddi_callback     *c_nlist;
        int                     (*c_call)();
        int                     c_count;
        caddr_t                 c_arg;
        size_t                  c_size;
};

/*
 * callback free list
 */
static int ncallbacks;
static int nc_low = 170;
static int nc_med = 512;
static int nc_high = 2048;
static struct ddi_callback *callbackq;
static struct ddi_callback *callbackqfree;
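
The two-level scheme in the comment above can be sketched in user space like this. Everything here is an assumption made for illustration: calloc() stands in for kmem_cache_alloc(), the l1_exhausted flag simulates the "freemem is 0" case, DEMO_L2_SIZE replaces the memory-dependent sizing (nc_low/nc_med/nc_high), and the from_l2 flag is my own way of telling the two sources apart on free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define	DEMO_L2_SIZE	4	/* hypothetical; the kernel sizes L2 from memory */

struct demo_callback {
	struct demo_callback *c_nfree;	/* next entry on the L2 free list */
	int from_l2;			/* set when the entry belongs to L2 */
};

static struct demo_callback l2_pool[DEMO_L2_SIZE];
static struct demo_callback *l2_free;
static int l1_exhausted;		/* knob: simulate freemem == 0 */

/* Thread every static L2 entry onto the free list. */
static void
demo_callback_init(void)
{
	int i;

	l2_free = NULL;
	for (i = 0; i < DEMO_L2_SIZE; i++) {
		l2_pool[i].c_nfree = l2_free;
		l2_pool[i].from_l2 = 1;
		l2_free = &l2_pool[i];
	}
}

/* Try the dynamic L1 source first, then fall back to the static L2 pool. */
static struct demo_callback *
demo_callback_alloc(void)
{
	struct demo_callback *cp = NULL;

	if (!l1_exhausted)
		cp = calloc(1, sizeof (*cp));	/* stands in for L1 */
	if (cp == NULL && l2_free != NULL) {
		cp = l2_free;
		l2_free = cp->c_nfree;
	}
	return (cp);
}

/* Give an entry back to whichever level it came from. */
static void
demo_callback_free(struct demo_callback *cp)
{
	assert(cp != NULL);
	if (cp->from_l2) {
		cp->c_nfree = l2_free;
		l2_free = cp;
	} else {
		free(cp);
	}
}
```

The point of the static L2 pool is that it needs no allocator at all, so a callback can still be queued when the dynamic cache cannot grow.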

(6) log_event_init()
Allocate and initialize log_event data structures.

(7) fm_init()
Initialize the fm architecture

(8) i_ddi_load_drvconf()
Load the driver.conf files for all drivers and attach each driver's driver.conf info to its devnames entry.

Background: devnames array, the soul of the drivers

A LINKED dip is hooked onto the "dn_head" of orphanlist (common/os/modctl.c), while a BOUND dip is collected by devnamesp (common/os/modctl.c). devnamesp is indexed by the driver's major number. The following code retrieves the devnames entry for a major number:
struct devnames *dnp = &devnamesp[m];

struct devnames, listed in full under create_devinfo_tree() above, defines a parallel structure to the devops list.

"int devcnt" (common/io/conf.c) defines the device count the system enumerated during boot.

Use hwc_parse() (the primary kernel interface) to parse driver.conf files. The entries in drv/*.conf fall into two categories: driver-global and node-specific. The former are linked together and pointed to by the dn_global_prop_ptr member of struct devnames. The latter are placed on the dn_pl member of struct devnames and are used for making dev_info nodes.

(9) ldi_init()
Layered Driver Interface. The LDI is a set of DDI/DKI interfaces that enables a kernel module to access other devices in the system. The LDI also enables you to determine which devices are currently being used by kernel modules.

The LDI includes two categories of interfaces:
   Kernel interfaces. User applications use system calls to open, read, and write to devices that are
   managed by a device driver within the kernel. Kernel modules can use the LDI kernel interfaces
   to open, read, and write to devices that are managed by another device driver within the kernel.
   For example, a user application might use read(2) and a kernel module might use ldi_read(9F)
   to read the same device.
   User interfaces. The LDI user interfaces can provide information to user processes regarding
   which devices are currently being used by other devices in the kernel.

(10) i_ddi_devices_init()
(11) i_ddi_read_devices_files()
These two functions are skipped for now.

3. configure()
==============

Configure the hardware on the system. Called before the rootfs is mounted.

(1) fpu_probe()
Try and figure out what kind of FP capabilities we have, and set up the control registers accordingly.

(2) check_driver_disable()
Check for disabled drivers.

for each driver in devnamesp
    if the disable-$drv_name property is set to "true" on the root dev_info
        devnamesp[major].dn_flags |= DN_DRIVER_REMOVED;
    endif
endfor
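
The pseudocode above can be turned into a small user-space sketch. The names here are hypothetical stand-ins: struct demo_devnames keeps only two fields of struct devnames, root_prop_exists() replaces the real property lookup on the root node, the value of DN_DRIVER_REMOVED is a placeholder, and the sketch treats the mere presence of the property as "set".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define	DN_DRIVER_REMOVED	0x1	/* placeholder value for this sketch */

struct demo_devnames {
	const char *dn_name;	/* NULL marks an unused major number */
	int dn_flags;
};

/* Hypothetical stand-in for a boolean property lookup on the root node. */
static int
root_prop_exists(const char *const *props, int nprops, const char *name)
{
	int i;

	for (i = 0; i < nprops; i++) {
		if (strcmp(props[i], name) == 0)
			return (1);
	}
	return (0);
}

/* Flag every driver whose "disable-<name>" property is present. */
static int
demo_check_driver_disable(struct demo_devnames *dnp, int devcnt,
    const char *const *props, int nprops)
{
	char propname[64];
	int major, disabled = 0;

	assert(dnp != NULL);
	for (major = 0; major < devcnt; major++) {
		if (dnp[major].dn_name == NULL)
			continue;
		(void) snprintf(propname, sizeof (propname), "disable-%s",
		    dnp[major].dn_name);
		if (root_prop_exists(props, nprops, propname)) {
			dnp[major].dn_flags |= DN_DRIVER_REMOVED;
			disabled++;
		}
	}
	return (disabled);
}
```

Since the table is indexed by major number, the loop variable doubles as the major, and flagging the devnames entry is enough to keep later bind attempts away from that driver.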

This suggests that if we want to hide a PCI device in dom0, we can set the corresponding property on ddi_root_node().

(3) i_ddi_init_root()
Initialize and attach the root node. The root node is the first one to be attached, so the process is somewhat "handcrafted".

->impl_ddi_sunbus_initchild(top_devinfo)
Initialize root node. Initialize some members of struct dev_info: devi_addr, devi_addr_buf, devi_parent_data and so on.

->ndi_hold_driver(top_devinfo)
Load the rootnex driver module and set devi_ops of top_devinfo. Two structures relate to driver modules: devnamesp and devopsp. Both are arrays indexed by major number. The former is an array of struct devnames (listed under create_devinfo_tree() in section 2). The latter is an array of
struct dev_ops  {
        int             devo_rev;       /* Driver build version         */
        int             devo_refcnt;    /* device reference count       */

        int             (*devo_getinfo)(dev_info_t *dip,
                            ddi_info_cmd_t infocmd, void *arg, void **result);
        int             (*devo_identify)(dev_info_t *dip);
        int             (*devo_probe)(dev_info_t *dip);
        int             (*devo_attach)(dev_info_t *dip, ddi_attach_cmd_t cmd);
        int             (*devo_detach)(dev_info_t *dip, ddi_detach_cmd_t cmd);
        int             (*devo_reset)(dev_info_t *dip, ddi_reset_cmd_t cmd);

        struct cb_ops   *devo_cb_ops;   /* cb_ops pointer for leaf drivers   */
        struct bus_ops  *devo_bus_ops;  /* bus_ops pointer for nexus drivers */
        int             (*devo_power)(dev_info_t *dip, int component,
                            int level);
};
The rootnex module is loaded here.

->e_ddi_assign_instance(top_devinfo)
Assign an instance number for top_devinfo. Look up an instance number for a dev_info node, and assign one if it does not have one.

->devi_attach(top_devinfo, DDI_ATTACH)
The rootnex_attach() is executed here.

->ndi_hold_devi(top_devinfo)
Hold top_devinfo forever.

->i_ddi_set_node_state(top_devinfo, DS_READY)
Set the node state to DS_READY. The node states are the ddi_node_state_t values listed at the end of step (3) in section 2.

->i_ndi_make_spec_children(top_devinfo, 0)
Expand the .conf children of root. I don't understand the .conf spec yet, so this is skipped for now.

->pm_init_locks()
Initialize the pm-related locks.

->i_ddi_attach_pseudo_node("options")
->i_ddi_attach_pseudo_node(DEVI_PSEUDO_NEXNAME)
->i_ddi_attach_pseudo_node("clone")
->i_ddi_attach_pseudo_node("scsi_vhci")
Attach pseudo dip's.

Summary:

At this point top_devinfo and some pseudo dip nodes have been attached. Here, ATTACHED means that the dip node state is DS_ATTACHED, the driver is loaded, attach() has been called, and possibly i_ndi_make_spec_children() has been called. What still puzzles me:
a. the details of how a driver is loaded for a dip node
b. the details of how a driver is loaded when it is not for any real hardware device and hence has no dip node in the original device tree
c. what is i_ndi_make_spec_children() meant to do? Does it have any relationship with the pseudo nodes?

(4) impl_bus_reprobe()
reprogram devices not set up by firmware (BIOS).

This part is interesting. "bus_probes" is a list of bus probers. The prober functions are called here with reprogram=1. On the x86 platform, we only care about the PCI bus, and pci_enumerate() is the prober function for it. This function is invoked twice: the first time with reprogram=0, to set up the PCI portion of the device tree; the second time to reprogram devices not set up by the BIOS.
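
The "register once, run in two passes" pattern can be sketched as follows. All names are hypothetical (struct demo_bus_probe, demo_register_bus_probe(), demo_run_bus_probes(), demo_pci_enumerate()); the only things taken from the text are that probers sit on a list and that each one receives the reprogram flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_bus_probe {
	struct demo_bus_probe *next;
	void (*probe)(int reprogram);
};

static struct demo_bus_probe *demo_bus_probes;

/* Hook a probe function onto the list (what a bus module does at load). */
static void
demo_register_bus_probe(void (*probe)(int))
{
	struct demo_bus_probe *bp = calloc(1, sizeof (*bp));

	assert(bp != NULL);
	bp->probe = probe;
	bp->next = demo_bus_probes;
	demo_bus_probes = bp;
}

/* Run every registered probe for the given pass. */
static void
demo_run_bus_probes(int reprogram)
{
	struct demo_bus_probe *bp;

	for (bp = demo_bus_probes; bp != NULL; bp = bp->next)
		bp->probe(reprogram);
}

/* Hypothetical prober that only records which passes it has seen. */
static int demo_passes_seen[2];

static void
demo_pci_enumerate(int reprogram)
{
	demo_passes_seen[reprogram != 0]++;
}
```

In this picture the first pass corresponds to calling demo_run_bus_probes(0) during setup_ddi(), and the second to calling demo_run_bus_probes(1) from configure(), with the same list of probers walked both times.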

Let's walk through pci_enumerate() with reprogram = 0 and then with reprogram = 1.

**************************

>> pci_enumerate(0)
->pci_setup_tree()

pci_bios_nbus : int, the max number of pci buses
pci_bus_res: struct pci_bus_resource *, pci bus resource maps, indexed by the bus number

struct pci_bus_resource {
        struct memlist *io_ports;       /* available free io res */
        struct memlist *io_ports_used;  /* used io res */
        struct memlist *mem_space;      /* available free mem res */
        struct memlist *mem_space_used; /* used mem res */
        struct memlist *pmem_space; /* available free prefetchable mem res */
        struct memlist *pmem_space_used; /* used prefetchable mem res */
        struct memlist *bus_space;      /* available free bus res */
                        /* bus_space_used not needed; can read from regs */
        dev_info_t *dip;        /* devinfo node */
        void *privdata;         /* private data for configuration */
        uchar_t par_bus;        /* parent bus number */
        uchar_t sub_bus;        /* highest bus number beyond this bridge */
        uchar_t root_addr;      /* legacy peer bus address assignment */
        uchar_t num_cbb;        /* # of CardBus Bridges on the bus */
        boolean_t io_reprogram; /* need io reprog on this bus */
        boolean_t mem_reprogram;        /* need mem reprog on this bus */
        boolean_t subtractive;  /* subtractive PPB */
};

num_root_bus: int, count of root buses

Summary:
All PCI device enumeration information is kept in the pci_bus_res array, which is indexed by bus number. The dev nodes for the devices located on a bus are allocated, bound, and hooked onto privdata. process_devfunc() is called for each device in this pass: PCI configuration space is read and parsed, and properties are added to the dev_info nodes.

When the first pass finished, the device nodes are bound to the specific drivers, but still not attached.

**************************

>> pci_enumerate(1)
->pci_reprogram()

Summary:
This pass reprograms devices that haven't been set up by firmware, for example phantom roots and the subtractive PPB. It also rearranges the resource usage of each PCI hierarchy.

4. init_tbl
===========

./common/conf/param.c
void    (*init_tbl[])(void) = {
        system_taskq_init,    // Create global system dynamic task queue
        binit,            // Initialize the buffer I/O system
        space_init,        // Allocate tunable structures at runtime
        dnlc_init,        // Initialize the directory cache
        vfsinit,        // initialize all loaded vfs's
        finit,            // initialize file-cache
        strinit,
        serializer_init,
        softcall_init,
        ttyinit,
        as_init,
        pvn_init,
        anon_init,
        segvn_init,
        flk_init,
        pg_init,
        pg_cmt_class_init,
        pg_cpu0_init,
        schedctl_init,
        fdb_init,
        deadman_init,
        clock_timer_init,
        clock_realtime_init,
        clock_highres_init,
        0
};
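
How main() walks this table (the "(**initptr)()" step in section 1) can be sketched in user space. The table shape, a zero-terminated array of void (*)(void), matches the listing above; demo_first_init(), demo_second_init(), demo_init_tbl and run_init_table() are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for real entries such as binit() or vfsinit(). */
static int init_calls;
static void demo_first_init(void)  { init_calls += 1; }
static void demo_second_init(void) { init_calls += 10; }

/* NULL-terminated table, same shape as init_tbl[] in param.c. */
static void (*demo_init_tbl[])(void) = {
	demo_first_init,
	demo_second_init,
	NULL
};

/* Call every function in the table, in order; return how many ran. */
static int
run_init_table(void (**initptr)(void))
{
	int n = 0;

	assert(initptr != NULL);
	for (; *initptr != NULL; initptr++) {
		(**initptr)();
		n++;
	}
	return (n);
}
```

The terminating 0 entry is what ends the loop, so adding a new subsystem initializer only requires inserting its function before the terminator, in the position where its dependencies are already satisfied.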















