From 95270b7ecb6e66e42892546590e8bbf44d405ea3 Mon Sep 17 00:00:00 2001 From: Matthew Dillon Date: Tue, 31 Jan 2017 20:14:05 -0800 Subject: [PATCH] kernel - Many fixes for vkernel support, plus a few main kernel fixes REAL KERNEL * The big enchilada is that the main kernel's thread switch code has a small timing window where it clears the PM_ACTIVE bit for the cpu while switching between two threads. However, it *ALSO* checks and avoids loading the %cr3 if the two threads have the same pmap. This results in a situation where an invalidation on the pmap in another cpu may not have visibility to the cpu doing the switch, and yet the cpu doing the switch also decides not to reload %cr3 and so does not invalidate the TLB either. The result is a stale TLB and bad things happen. For now just unconditionally load %cr3 until I can come up with code to handle the case. This bug is very difficult to reproduce on a normal system, it requires a multi-threaded program doing nasty things (munmap, etc) on one cpu while another thread is switching to a third thread on some other cpu. * KNOTE after handling the vkernel trap in postsig() instead of before. * Change the kernel's pmap_inval_smp() code to take a 64-bit npgs argument instead of a 32-bit npgs argument. This fixes situations that crop up when a process uses more than 16TB of address space (see the first worked example at the end of this mail). * Add an lfence to the pmap invalidation code that I think might be needed. * Handle some wrap/overflow cases in pmap_scan() related to the use of large address spaces (see the second worked example at the end of this mail). * Fix an unnecessary invltlb in pmap_clearbit() for unmanaged PTEs. * Test PG_RW after locking the pv_entry to handle potential races. * Add bio_crc to struct bio. This field is only used for debugging for now but may come in useful later. * Add some global debug variables in the pmap_inval_smp() and related paths. Refactor the npgs handling. * Load the tsc_target field after waiting for completion of the previous invalidation op instead of before. Also add a conservative mfence() in the invalidation path before loading the info fields. * Remove the global pmap_inval_bulk_count counter. * Adjust swtch.s to always reload the user process %cr3, with an explanation. FIXME LATER! * Add some test code to vm/swap_pager.c which double-checks that the page being paged out does not get corrupted during the operation. This code is #if 0'd. * We must hold an object lock around the swp_pager_meta_ctl() call in swp_pager_async_iodone(). I think. * Reorder when PG_SWAPINPROG is cleared. Finish the I/O before clearing the bit. * Change the vm_map_growstack() API to pass a vm_map in instead of curproc. * Use atomic ops for vm_object->generation counts, since objects can be locked shared. VKERNEL * Unconditionally save the FP state after returning from VMSPACE_CTL_RUN. This solves a severe FP corruption bug in the vkernel due to calls it makes into libc (which uses %xmm registers all over the place). This is not a complete fix. We need a formal userspace/kernelspace FP abstraction. Right now the vkernel doesn't have a kernelspace FP abstraction so if a kernel thread switches preemptively bad things happen. * The kernel tracks and locks pv_entry structures to interlock pte's. The vkernel never caught up, and does not really have a pv_entry or placemark mechanism. The vkernel's pmap really needs a complete re-port from the real-kernel pmap code. Until then, we use poor hacks. * Use the vm_page's spinlock to interlock pte changes. * Make sure that PG_WRITEABLE is set or cleared with the vm_page spinlock held. 
* Have pmap_clearbit() acquire the pmobj token for the pmap in the iteration. This appears to be necessary, currently, as most of the rest of the vkernel pmap code also uses the pmobj token. * Fix bugs in the vkernel's swapu32() and swapu64(). * Change pmap_page_lookup() and pmap_unwire_pgtable() to fully busy the page. Note however that a page table page is currently never soft-busied. Also other vkernel code that busies a page table page. * Fix some sillycode in a pmap->pm_ptphint test. * Don't inherit e.g. PG_M from the previous pte when overwriting it with a pte of a different physical address. * Change the vkernel's pmap_clear_modify() function to clear VPTE_RW (which also clears VPTE_M), and not just VPTE_M. Formally we want the vkernel to be notified when a page becomes modified and it won't be unless we also clear VPTE_RW and force a fault. <--- I may change this back after testing. * Wrap pmap_replacevm() with a critical section. * Scrap the old grow_stack() code. vm_fault() and vm_fault_page() handle it (vm_fault_page() just now got the ability). * Properly flag VM_FAULT_USERMODE. --- sys/kern/kern_sig.c | 17 ++-- sys/platform/pc64/include/pmap_inval.h | 2 +- sys/platform/pc64/x86_64/mp_machdep.c | 1 + sys/platform/pc64/x86_64/pmap.c | 67 +++++++++---- sys/platform/pc64/x86_64/pmap_inval.c | 96 +++++++++++-------- sys/platform/pc64/x86_64/swtch.s | 21 +++- sys/platform/pc64/x86_64/trap.c | 17 ++-- sys/platform/vkernel64/include/pmap.h | 1 + sys/platform/vkernel64/include/pmap_inval.h | 4 +- sys/platform/vkernel64/include/proc.h | 3 +- sys/platform/vkernel64/platform/copyio.c | 9 +- sys/platform/vkernel64/platform/pmap.c | 110 ++++++++++----------- sys/platform/vkernel64/platform/pmap_inval.c | 137 ++++++++++++++------------- sys/platform/vkernel64/x86_64/trap.c | 17 +++- sys/platform/vkernel64/x86_64/vm_machdep.c | 12 --- sys/sys/bio.h | 1 + sys/vm/swap_pager.c | 111 ++++++++++++++++++---- sys/vm/vm_fault.c | 10 +- sys/vm/vm_map.c | 19 +++- sys/vm/vm_map.h | 2 +- sys/vm/vm_object.c | 16 ++-- sys/vm/vm_page.c | 9 +- sys/vm/vm_pageout.c | 26 ++--- 23 files changed, 427 insertions(+), 281 deletions(-) diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c index aed6d2cb54..294e67a2ee 100644 --- a/sys/kern/kern_sig.c +++ b/sys/kern/kern_sig.c @@ -911,7 +911,6 @@ trapsignal(struct lwp *lp, int sig, u_long code) vkernel_trap(lp, tf); } - if ((p->p_flags & P_TRACED) == 0 && SIGISMEMBER(p->p_sigcatch, sig) && !SIGISMEMBER(lp->lwp_sigmask, sig)) { lp->lwp_ru.ru_nsignals++; @@ -2163,12 +2162,14 @@ issignal(struct lwp *lp, int maytrace, int *ptokp) } /* - * Take the action for the specified signal - * from the current set of pending signals. + * Take the action for the specified signal from the current set of + * pending signals. + * + * haveptok indicates whether the caller is holding p->p_token. If the + * caller is, we are responsible for releasing it. - * - * haveptok indicates whether the caller is holding - * p->p_token. If the caller is, we are responsible - * for releasing it. + * + * This routine can only be called from the top-level trap from usermode. + * It is expecting to be able to modify the top-level stack frame. 
*/ void postsig(int sig, int haveptok) @@ -2182,8 +2183,6 @@ postsig(int sig, int haveptok) KASSERT(sig != 0, ("postsig")); - KNOTE(&p->p_klist, NOTE_SIGNAL | sig); - /* * If we are a virtual kernel running an emulated user process * context, switch back to the virtual kernel context before @@ -2195,6 +2194,8 @@ postsig(int sig, int haveptok) vkernel_trap(lp, tf); } + KNOTE(&p->p_klist, NOTE_SIGNAL | sig); + spin_lock(&lp->lwp_spin); lwp_delsig(lp, sig, haveptok); spin_unlock(&lp->lwp_spin); diff --git a/sys/platform/pc64/include/pmap_inval.h b/sys/platform/pc64/include/pmap_inval.h index 079d28dbc8..b3ee64be67 100644 --- a/sys/platform/pc64/include/pmap_inval.h +++ b/sys/platform/pc64/include/pmap_inval.h @@ -54,7 +54,7 @@ typedef struct pmap_inval_bulk { long count; } pmap_inval_bulk_t; -pt_entry_t pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, +pt_entry_t pmap_inval_smp(pmap_t pmap, vm_offset_t va, vm_pindex_t npgs, pt_entry_t *ptep, pt_entry_t npte); int pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, pt_entry_t opte, pt_entry_t npte); diff --git a/sys/platform/pc64/x86_64/mp_machdep.c b/sys/platform/pc64/x86_64/mp_machdep.c index d042241ca2..552e610d44 100644 --- a/sys/platform/pc64/x86_64/mp_machdep.c +++ b/sys/platform/pc64/x86_64/mp_machdep.c @@ -1182,6 +1182,7 @@ loop: cpu_lfence(); CPUMASK_ORMASK(cpumask, smp_invmask); /*cpumask = smp_active_mask;*/ /* XXX */ + cpu_lfence(); if (pmap_inval_intr(&cpumask, toolong) == 0) { /* diff --git a/sys/platform/pc64/x86_64/pmap.c b/sys/platform/pc64/x86_64/pmap.c index 204d0f0a72..c5f615c71b 100644 --- a/sys/platform/pc64/x86_64/pmap.c +++ b/sys/platform/pc64/x86_64/pmap.c @@ -3723,6 +3723,8 @@ pmap_scan(struct pmap_scan_info *info, int smp_inval) info->stop = 0; if (pmap == NULL) return; + if (info->sva == info->eva) + return; if (smp_inval) { info->bulk = &info->bulk_core; pmap_inval_bulk_init(&info->bulk_core, pmap); @@ -3847,9 +3849,13 @@ fast_skip: /* * Nominal scan case, RB_SCAN() for PD pages and iterate from * there. + * + * WARNING! eva can overflow our standard ((N + mask) >> bits) + * bounds, resulting in a pd_pindex of 0. To solve the + * problem we use an inclusive range. */ info->sva_pd_pindex = pmap_pd_pindex(info->sva); - info->eva_pd_pindex = pmap_pd_pindex(info->eva + NBPDP - 1); + info->eva_pd_pindex = pmap_pd_pindex(info->eva - PAGE_SIZE); if (info->sva >= VM_MAX_USER_ADDRESS) { /* @@ -3859,9 +3865,11 @@ fast_skip: bzero(&dummy_pv, sizeof(dummy_pv)); dummy_pv.pv_pindex = info->sva_pd_pindex; spin_lock(&pmap->pm_spin); - while (dummy_pv.pv_pindex < info->eva_pd_pindex) { + while (dummy_pv.pv_pindex <= info->eva_pd_pindex) { pmap_scan_callback(&dummy_pv, info); ++dummy_pv.pv_pindex; + if (dummy_pv.pv_pindex < info->sva_pd_pindex) /*wrap*/ + break; } spin_unlock(&pmap->pm_spin); } else { @@ -3881,6 +3889,10 @@ fast_skip: /* * WARNING! pmap->pm_spin held + * + * WARNING! eva can overflow our standard ((N + mask) >> bits) + * bounds, resulting in a pd_pindex of 0. To solve the + * problem we use an inclusive range. 
*/ static int pmap_scan_cmp(pv_entry_t pv, void *data) @@ -3888,7 +3900,7 @@ pmap_scan_cmp(pv_entry_t pv, void *data) struct pmap_scan_info *info = data; if (pv->pv_pindex < info->sva_pd_pindex) return(-1); - if (pv->pv_pindex >= info->eva_pd_pindex) + if (pv->pv_pindex > info->eva_pd_pindex) return(1); return(0); } @@ -4282,6 +4294,15 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) info.func = pmap_remove_callback; info.arg = NULL; pmap_scan(&info, 1); +#if 0 + cpu_invltlb(); + if (eva - sva < 1024*1024) { + while (sva < eva) { + cpu_invlpg((void *)sva); + sva += PAGE_SIZE; + } + } +#endif } static void @@ -4342,17 +4363,14 @@ pmap_remove_callback(pmap_t pmap, struct pmap_scan_info *info, } } else if (sharept == 0) { /* - * Unmanaged page table (pt, pd, or pdp. Not pte). + * Unmanaged pte (pte_placemark is non-NULL) * * pt_pv's wire_count is still bumped by unmanaged pages * so we must decrement it manually. * * We have to unwire the target page table page. - * - * It is unclear how we can invalidate a segment so we - * invalidate -1 which invlidates the tlb. */ - pte = pmap_inval_bulk(info->bulk, (vm_offset_t)-1, ptep, 0); + pte = pmap_inval_bulk(info->bulk, va, ptep, 0); if (pte & pmap->pmap_bits[PG_W_IDX]) atomic_add_long(&pmap->pm_stats.wired_count, -1); atomic_add_long(&pmap->pm_stats.resident_count, -1); @@ -4579,7 +4597,10 @@ again: } #endif if (pbits != cbits) { - if (!pmap_inval_smp_cmpset(pmap, (vm_offset_t)-1, + vm_offset_t xva; + + xva = (sharept) ? (vm_offset_t)-1 : va; + if (!pmap_inval_smp_cmpset(pmap, xva, ptep, pbits, cbits)) { goto again; } @@ -4853,6 +4874,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, atomic_add_long(&pt_pv->pv_pmap->pm_stats. resident_count, 1); } + if (newpte & pmap->pmap_bits[PG_RW_IDX]) + vm_page_flag_set(m, PG_WRITEABLE); } else { /* * Entering a managed page. Our pte_pv takes care of the @@ -4869,6 +4892,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, pmap_page_stats_adding(m); TAILQ_INSERT_TAIL(&m->md.pv_list, pte_pv, pv_list); vm_page_flag_set(m, PG_MAPPED); + if (newpte & pmap->pmap_bits[PG_RW_IDX]) + vm_page_flag_set(m, PG_WRITEABLE); vm_page_spin_unlock(m); if (pt_pv && opa && @@ -4907,9 +4932,6 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, cpu_invlpg((void *)va); } - if (newpte & pmap->pmap_bits[PG_RW_IDX]) - vm_page_flag_set(m, PG_WRITEABLE); - /* * Cleanup */ @@ -5394,7 +5416,7 @@ pmap_clearbit(vm_page_t m, int bit_index) * related while we hold the vm_page spin lock. * * *pte can be zero due to this race. Since we are clearing - * bits we basically do no harm when this race ccurs. + * bits we basically do no harm when this race occurs. */ if (bit_index != PG_RW_IDX) { vm_page_spin_lock(m); @@ -5440,14 +5462,7 @@ restart: pmap = pv->pv_pmap; /* - * Skip pages which do not have PG_RW set. - */ - pte = pmap_pte_quick(pv->pv_pmap, pv->pv_pindex << PAGE_SHIFT); - if ((*pte & pmap->pmap_bits[PG_RW_IDX]) == 0) - continue; - - /* - * Lock the PV + * We must lock the PV to be able to safely test the pte. */ if (pv_hold_try(pv)) { vm_page_spin_unlock(m); @@ -5458,6 +5473,16 @@ restart: pv_drop(pv); goto restart; } + + /* + * Skip pages which do not have PG_RW set. 
+ */ + pte = pmap_pte_quick(pv->pv_pmap, pv->pv_pindex << PAGE_SHIFT); + if ((*pte & pmap->pmap_bits[PG_RW_IDX]) == 0) { + pv_put(pv); + goto restart; + } + KKASSERT(pv->pv_pmap == pmap && pv->pv_m == m); for (;;) { pt_entry_t nbits; diff --git a/sys/platform/pc64/x86_64/pmap_inval.c b/sys/platform/pc64/x86_64/pmap_inval.c index 4d50ff77be..36e635d60a 100644 --- a/sys/platform/pc64/x86_64/pmap_inval.c +++ b/sys/platform/pc64/x86_64/pmap_inval.c @@ -87,7 +87,7 @@ struct pmap_inval_info { pt_entry_t npte; enum { INVDONE, INVSTORE, INVCMPSET } mode; int success; - int npgs; + vm_pindex_t npgs; cpumask_t done; cpumask_t mask; #ifdef LOOPRECOVER @@ -107,13 +107,16 @@ extern cpumask_t smp_in_mask; #endif extern cpumask_t smp_smurf_mask; #endif -static long pmap_inval_bulk_count; static int pmap_inval_watchdog_print; /* must always default off */ +static int pmap_inval_force_allcpus; +static int pmap_inval_force_nonopt; -SYSCTL_LONG(_machdep, OID_AUTO, pmap_inval_bulk_count, CTLFLAG_RW, - &pmap_inval_bulk_count, 0, ""); SYSCTL_INT(_machdep, OID_AUTO, pmap_inval_watchdog_print, CTLFLAG_RW, &pmap_inval_watchdog_print, 0, ""); +SYSCTL_INT(_machdep, OID_AUTO, pmap_inval_force_allcpus, CTLFLAG_RW, + &pmap_inval_force_allcpus, 0, ""); +SYSCTL_INT(_machdep, OID_AUTO, pmap_inval_force_nonopt, CTLFLAG_RW, + &pmap_inval_force_nonopt, 0, ""); static void pmap_inval_init(pmap_t pmap) @@ -256,7 +259,7 @@ _checksigmask(pmap_inval_info_t *info, const char *file, int line) * ptep must be NULL if npgs > 1 */ pt_entry_t -pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, +pmap_inval_smp(pmap_t pmap, vm_offset_t va, vm_pindex_t npgs, pt_entry_t *ptep, pt_entry_t npte) { globaldata_t gd = mycpu; @@ -268,6 +271,7 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, /* * Initialize invalidation for pmap and enter critical section. + * This will enter a critical section for us. */ if (pmap == NULL) pmap = &kernel_pmap; @@ -276,32 +280,39 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, /* * Shortcut single-cpu case if possible. */ - if (CPUMASK_CMPMASKEQ(pmap->pm_active, gd->gd_cpumask)) { + if (CPUMASK_CMPMASKEQ(pmap->pm_active, gd->gd_cpumask) && + pmap_inval_force_nonopt == 0) { /* * Convert to invltlb if there are too many pages to * invlpg on. */ - if (npgs > MAX_INVAL_PAGES) { - npgs = 0; - va = (vm_offset_t)-1; - } - - /* - * Invalidate the specified pages, handle invltlb if requested. - */ - while (npgs) { - --npgs; - if (ptep) { + if (npgs == 1) { + if (ptep) opte = atomic_swap_long(ptep, npte); - ++ptep; - } if (va == (vm_offset_t)-1) - break; - cpu_invlpg((void *)va); - va += PAGE_SIZE; - } - if (va == (vm_offset_t)-1) + cpu_invltlb(); + else + cpu_invlpg((void *)va); + } else if (va == (vm_offset_t)-1 || npgs > MAX_INVAL_PAGES) { + if (ptep) { + while (npgs) { + opte = atomic_swap_long(ptep, npte); + ++ptep; + --npgs; + } + } cpu_invltlb(); + } else { + while (npgs) { + if (ptep) { + opte = atomic_swap_long(ptep, npte); + ++ptep; + } + cpu_invlpg((void *)va); + va += PAGE_SIZE; + --npgs; + } + } pmap_inval_done(pmap); return opte; @@ -316,7 +327,6 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, * from a lost IPI. Set to 1/16 second for now. 
*/ info = &invinfo[cpu]; - info->tsc_target = rdtsc() + (tsc_frequency * LOOPRECOVER_TIMEOUT1); /* * We must wait for other cpus which may still be finishing up a @@ -338,6 +348,7 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, cpu_pause(); } KKASSERT(info->mode == INVDONE); + cpu_mfence(); /* * Must set our cpu in the invalidation scan mask before @@ -346,6 +357,7 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, */ ATOMIC_CPUMASK_ORBIT(smp_invmask, cpu); + info->tsc_target = rdtsc() + (tsc_frequency * LOOPRECOVER_TIMEOUT1); info->va = va; info->npgs = npgs; info->ptep = ptep; @@ -357,6 +369,8 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, info->mode = INVSTORE; tmpmask = pmap->pm_active; /* volatile (bits may be cleared) */ + if (pmap_inval_force_allcpus) + tmpmask = smp_active_mask; cpu_ccfence(); CPUMASK_ANDMASK(tmpmask, smp_active_mask); @@ -388,7 +402,7 @@ pmap_inval_smp(pmap_t pmap, vm_offset_t va, int npgs, cpu_disable_intr(); ATOMIC_CPUMASK_COPY(info->done, tmpmask); - /* execution can begin here due to races */ + /* execution can begin here on other cpus due to races */ /* * Pass our copy of the done bits (so they don't change out from @@ -437,7 +451,8 @@ pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, /* * Shortcut single-cpu case if possible. */ - if (CPUMASK_CMPMASKEQ(pmap->pm_active, gd->gd_cpumask)) { + if (CPUMASK_CMPMASKEQ(pmap->pm_active, gd->gd_cpumask) && + pmap_inval_force_nonopt == 0) { if (atomic_cmpset_long(ptep, opte, npte)) { if (va == (vm_offset_t)-1) cpu_invltlb(); @@ -457,7 +472,6 @@ pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, * pmap_inval*() command and create confusion below. */ info = &invinfo[cpu]; - info->tsc_target = rdtsc() + (tsc_frequency * LOOPRECOVER_TIMEOUT1); /* * We must wait for other cpus which may still be finishing @@ -475,6 +489,7 @@ pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, cpu_pause(); } KKASSERT(info->mode == INVDONE); + cpu_mfence(); /* * Must set our cpu in the invalidation scan mask before @@ -483,6 +498,7 @@ pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, */ ATOMIC_CPUMASK_ORBIT(smp_invmask, cpu); + info->tsc_target = rdtsc() + (tsc_frequency * LOOPRECOVER_TIMEOUT1); info->va = va; info->npgs = 1; /* unused */ info->ptep = ptep; @@ -495,6 +511,8 @@ pmap_inval_smp_cmpset(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep, info->success = 0; tmpmask = pmap->pm_active; /* volatile */ + if (pmap_inval_force_allcpus) + tmpmask = smp_active_mask; cpu_ccfence(); CPUMASK_ANDMASK(tmpmask, smp_active_mask); CPUMASK_ORBIT(tmpmask, cpu); @@ -609,13 +627,11 @@ pmap_inval_bulk_flush(pmap_inval_bulk_t *bulk) { if (bulk == NULL) return; - if (bulk->count > 0) - pmap_inval_bulk_count += (bulk->count - 1); if (bulk->va_beg != bulk->va_end) { if (bulk->va_beg == (vm_offset_t)-1) { pmap_inval_smp(bulk->pmap, bulk->va_beg, 1, NULL, 0); } else { - long n; + vm_pindex_t n; n = (bulk->va_end - bulk->va_beg) >> PAGE_SHIFT; pmap_inval_smp(bulk->pmap, bulk->va_beg, n, NULL, 0); @@ -627,7 +643,7 @@ pmap_inval_bulk_flush(pmap_inval_bulk_t *bulk) } /* - * Called with a critical section held and interrupts enabled. + * Called from Xinvl with a critical section held and interrupts enabled. 
*/ int pmap_inval_intr(cpumask_t *cpumaskp, int toolong) @@ -656,9 +672,15 @@ pmap_inval_intr(cpumask_t *cpumaskp, int toolong) info = &invinfo[n]; /* + * Checkout cpu (cpu) for work in the target cpu info (n) + * + * if (n == cpu) - check our cpu for a master operation + * if (n != cpu) - check other cpus for a slave operation + * * Due to interrupts/races we can catch a new operation - * in an older interrupt. A fence is needed once we detect - * the (not) done bit. + * in an older interrupt in other cpus. + * + * A fence is needed once we detect the (not) done bit. */ if (!CPUMASK_TESTBIT(info->done, cpu)) continue; @@ -693,7 +715,7 @@ pmap_inval_intr(cpumask_t *cpumaskp, int toolong) */ if (CPUMASK_TESTBIT(info->mask, cpu)) { /* - * Other cpu indicate to originator that they + * Other cpus indicate to originator that they * are quiesced. */ ATOMIC_CPUMASK_NANDBIT(info->mask, cpu); @@ -715,7 +737,7 @@ pmap_inval_intr(cpumask_t *cpumaskp, int toolong) * we can follow up with our own invalidation. */ vm_offset_t va = info->va; - int npgs; + vm_pindex_t npgs; if (va == (vm_offset_t)-1 || info->npgs > MAX_INVAL_PAGES) { @@ -799,7 +821,7 @@ pmap_inval_intr(cpumask_t *cpumaskp, int toolong) * (asynchronously). */ vm_offset_t va = info->va; - int npgs; + vm_pindex_t npgs; if (va == (vm_offset_t)-1 || info->npgs > MAX_INVAL_PAGES) { diff --git a/sys/platform/pc64/x86_64/swtch.s b/sys/platform/pc64/x86_64/swtch.s index 9b0b66b3a7..5a0ceb7a4b 100644 --- a/sys/platform/pc64/x86_64/swtch.s +++ b/sys/platform/pc64/x86_64/swtch.s @@ -417,14 +417,21 @@ ENTRY(cpu_heavy_restore) /* * Restore the MMU address space. If it is the same as the last * thread we don't have to invalidate the tlb (i.e. reload cr3). - * YYY which naturally also means that the PM_ACTIVE bit had better - * already have been set before we set it above, check? YYY + * + * XXX Temporary cludge, do NOT do this optimization! The problem + * is that the pm_active bit for the cpu had dropped for a small + * period of time, just a few cycles, but even one cycle is long + * enough for some other cpu doing a pmap invalidation to not see + * our cpu. + * + * When that happens, and we don't invltlb (by loading %cr3), we + * wind up with a stale TLB. */ movq TD_PCB(%rax),%rdx /* RDX = PCB */ movq %cr3,%rsi /* RSI = current CR3 */ movq PCB_CR3(%rdx),%rcx /* RCX = desired CR3 */ cmpq %rsi,%rcx - je 4f + /*je 4f*/ 2: #if defined(SWTCH_OPTIM_STATS) decl _swtch_optim_stats @@ -829,7 +836,13 @@ ENTRY(cpu_lwkt_restore) testq %r14,%r14 jne 1f /* yes, borrow %cr3 from old thread */ #endif - movq KPML4phys,%rcx /* YYY borrow but beware desched/cpuchg/exit */ + /* + * Don't reload %cr3 if it hasn't changed. Since this is a LWKT + * thread (a kernel thread), and the kernel_pmap always permanently + * sets all pm_active bits, we don't have the same problem with it + * that we do with process pmaps. 
+ */ + movq KPML4phys,%rcx movq %cr3,%rdx cmpq %rcx,%rdx je 1f diff --git a/sys/platform/pc64/x86_64/trap.c b/sys/platform/pc64/x86_64/trap.c index fe5c42f29b..c5ea6749cc 100644 --- a/sys/platform/pc64/x86_64/trap.c +++ b/sys/platform/pc64/x86_64/trap.c @@ -502,20 +502,18 @@ trap(struct trapframe *frame) case T_PAGEFLT: /* page fault */ i = trap_pfault(frame, TRUE); - if (frame->tf_rip == 0) { #ifdef DDB + if (frame->tf_rip == 0) { /* used for kernel debugging only */ while (freeze_on_seg_fault) tsleep(p, 0, "freeze", hz * 20); -#endif } +#endif if (i == -1 || i == 0) goto out; - - - if (i == SIGSEGV) + if (i == SIGSEGV) { ucode = SEGV_MAPERR; - else { + } else { i = SIGSEGV; ucode = SEGV_ACCERR; } @@ -745,9 +743,10 @@ trap(struct trapframe *frame) } /* - * Virtual kernel intercept - if the fault is directly related to a - * VM context managed by a virtual kernel then let the virtual kernel - * handle it. + * Fault from user mode, virtual kernel interecept. + * + * If the fault is directly related to a VM context managed by a + * virtual kernel then let the virtual kernel handle it. */ if (lp->lwp_vkernel && lp->lwp_vkernel->ve) { vkernel_trap(lp, frame); diff --git a/sys/platform/vkernel64/include/pmap.h b/sys/platform/vkernel64/include/pmap.h index 21fb44db11..defab3037c 100644 --- a/sys/platform/vkernel64/include/pmap.h +++ b/sys/platform/vkernel64/include/pmap.h @@ -215,6 +215,7 @@ void pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma); void pmap_unmapdev (vm_offset_t, vm_size_t); void pmap_release(struct pmap *pmap); void pmap_interlock_wait (struct vmspace *); +int pmap_track_modified(pmap_t pmap, vm_offset_t va); struct vm_page *pmap_use_pt (pmap_t, vm_offset_t); diff --git a/sys/platform/vkernel64/include/pmap_inval.h b/sys/platform/vkernel64/include/pmap_inval.h index db8fcb132a..597c13b1e7 100644 --- a/sys/platform/vkernel64/include/pmap_inval.h +++ b/sys/platform/vkernel64/include/pmap_inval.h @@ -51,9 +51,9 @@ void pmap_inval_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); void pmap_inval_pte_quick(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); void pmap_inval_pde(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); void pmap_inval_pde_quick(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); -vpte_t pmap_clean_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); +vpte_t pmap_clean_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va, + vm_page_t m); vpte_t pmap_clean_pde(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); -vpte_t pmap_setro_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); vpte_t pmap_inval_loadandclear(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va); #endif diff --git a/sys/platform/vkernel64/include/proc.h b/sys/platform/vkernel64/include/proc.h index 34a34734ac..ed7e48e7bc 100644 --- a/sys/platform/vkernel64/include/proc.h +++ b/sys/platform/vkernel64/include/proc.h @@ -42,11 +42,10 @@ * in md_regs so emulation and other code can modify it for the return. 
*/ struct trapframe; +struct vm_map; struct mdproc { struct trapframe *md_regs; /* registers on current frame */ }; -int grow_stack(struct proc *p, u_long sp); /* XXX swildner */ - #endif /* !_MACHINE_PROC_H_ */ diff --git a/sys/platform/vkernel64/platform/copyio.c b/sys/platform/vkernel64/platform/copyio.c index 19ebcd5d33..6c3bf1ed84 100644 --- a/sys/platform/vkernel64/platform/copyio.c +++ b/sys/platform/vkernel64/platform/copyio.c @@ -63,6 +63,7 @@ casu64(volatile uint64_t *p, uint64_t oldval, uint64_t newval) &error, &busy); if (error) return -1; + KKASSERT(m->busy == 0); kva = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); dest = (uint64_t *)(kva + ((vm_offset_t)p & PAGE_MASK)); @@ -101,6 +102,7 @@ casu32(volatile u_int *p, u_int oldval, u_int newval) &error, &busy); if (error) return -1; + KKASSERT(m->busy == 0); kva = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); dest = (u_int *)(kva + ((vm_offset_t)p & PAGE_MASK)); @@ -138,6 +140,7 @@ swapu64(volatile uint64_t *p, uint64_t val) &error, &busy); if (error) return -1; + KKASSERT(m->busy == 0); kva = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); res = atomic_swap_long((uint64_t *)(kva + ((vm_offset_t)p & PAGE_MASK)), @@ -145,7 +148,7 @@ swapu64(volatile uint64_t *p, uint64_t val) if (busy) vm_page_wakeup(m); else - vm_page_dirty(m); + vm_page_unhold(m); return res; } @@ -170,6 +173,7 @@ swapu32(volatile uint32_t *p, uint32_t val) &error, &busy); if (error) return -1; + KKASSERT(m->busy == 0); kva = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); res = atomic_swap_int((u_int *)(kva + ((vm_offset_t)p & PAGE_MASK)), @@ -177,7 +181,7 @@ swapu32(volatile uint32_t *p, uint32_t val) if (busy) vm_page_wakeup(m); else - vm_page_dirty(m); + vm_page_unhold(m); return res; } @@ -301,6 +305,7 @@ copyout(const void *kaddr, void *udaddr, size_t len) &error, &busy); if (error) break; + KKASSERT(m->busy == 0); n = PAGE_SIZE - ((vm_offset_t)udaddr & PAGE_MASK); if (n > len) n = len; diff --git a/sys/platform/vkernel64/platform/pmap.c b/sys/platform/vkernel64/platform/pmap.c index 175625c76e..8885b83ec4 100644 --- a/sys/platform/vkernel64/platform/pmap.c +++ b/sys/platform/vkernel64/platform/pmap.c @@ -674,7 +674,7 @@ pmap_init2(void) * XXX User and kernel address spaces are independant for virtual kernels, * this function only applies to the kernel pmap. */ -static int +int pmap_track_modified(pmap_t pmap, vm_offset_t va) { if (pmap != &kernel_pmap) @@ -953,7 +953,7 @@ pmap_qenter(vm_offset_t beg_va, vm_page_t *m, int count) vm_offset_t va; end_va = beg_va + count * PAGE_SIZE; - KKASSERT(beg_va >= KvaStart && end_va < KvaEnd); + KKASSERT(beg_va >= KvaStart && end_va <= KvaEnd); for (va = beg_va; va < end_va; va += PAGE_SIZE) { pt_entry_t *ptep; @@ -1017,7 +1017,7 @@ pmap_page_lookup(vm_object_t object, vm_pindex_t pindex) vm_page_t m; ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); - m = vm_page_lookup_busy_wait(object, pindex, FALSE, "pplookp"); + m = vm_page_lookup_busy_wait(object, pindex, TRUE, "pplookp"); return(m); } @@ -1048,8 +1048,12 @@ pmap_init_proc(struct proc *p) * wire_count, so the page cannot go away. The page representing the page * table is passed in unbusied and must be busied if we cannot trivially * unwire it. + * + * XXX NOTE! This code is not usually run because we do not currently + * implement dynamic page table page removal. The page in + * its parent assumes at least 1 wire count, so no call to this + * function ever sees a wire count less than 2. 
*/ -#include static int pmap_unwire_pgtable(pmap_t pmap, vm_offset_t va, vm_page_t m) { @@ -1060,7 +1064,7 @@ pmap_unwire_pgtable(pmap_t pmap, vm_offset_t va, vm_page_t m) if (vm_page_unwire_quick(m) == 0) return 0; - vm_page_busy_wait(m, FALSE, "pmuwpt"); + vm_page_busy_wait(m, TRUE, "pmuwpt"); KASSERT(m->queue == PQ_NONE, ("_pmap_unwire_pgtable: %p->queue != PQ_NONE", m)); @@ -1247,7 +1251,7 @@ pmap_puninit(pmap_t pmap) if ((p = pmap->pm_pdirm) != NULL) { KKASSERT(pmap->pm_pml4 != NULL); pmap_kremove((vm_offset_t)pmap->pm_pml4); - vm_page_busy_wait(p, FALSE, "pgpun"); + vm_page_busy_wait(p, TRUE, "pgpun"); vm_page_unwire(p, 0); vm_page_flag_clear(p, PG_MAPPED | PG_WRITEABLE); vm_page_free(p); @@ -1289,8 +1293,8 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * page-table pages. Those pages are zero now, and * might as well be placed directly into the zero queue. */ - if (vm_page_busy_try(p, FALSE)) { - vm_page_sleep_busy(p, FALSE, "pmaprl"); + if (vm_page_busy_try(p, TRUE)) { + vm_page_sleep_busy(p, TRUE, "pmaprl"); return 1; } @@ -1368,7 +1372,8 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) pmap, p, (void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(p)), p->pindex, NUPT_TOTAL, NUPD_TOTAL, NUPDP_TOTAL); } - if (pmap->pm_ptphint && (pmap->pm_ptphint->pindex == p->pindex)) + + if (pmap->pm_ptphint == p) pmap->pm_ptphint = NULL; /* @@ -1535,6 +1540,8 @@ pmap_release(struct pmap *pmap) } } while (info.error); + pmap->pm_ptphint = NULL; + KASSERT((pmap->pm_stats.wired_count == (pmap->pm_pdirm != NULL)), ("pmap_release: dangling count %p %ld", pmap, pmap->pm_stats.wired_count)); @@ -1854,6 +1861,7 @@ pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t mpte, vm_page_t m, m->md.pv_list_count++; TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); pv = pv_entry_rb_tree_RB_INSERT(&pmap->pm_pvroot, pv); + vm_page_flag_set(m, PG_MAPPED); KKASSERT(pv == NULL); } @@ -2281,9 +2289,6 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot) pt_m = pmap_hold_pt_page(pde, sva); for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++, sva += PAGE_SIZE) { - pt_entry_t pbits; - vm_page_t m; - /* * Clean managed pages and also check the accessed * bit. Just remove write perms for unmanaged @@ -2291,24 +2296,7 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot) * access will force a fault rather then setting * the modified bit at an unexpected time. 
*/ - if (*pte & VPTE_MANAGED) { - pbits = pmap_clean_pte(pte, pmap, sva); - m = NULL; - if (pbits & VPTE_A) { - m = PHYS_TO_VM_PAGE(pbits & VPTE_FRAME); - vm_page_flag_set(m, PG_REFERENCED); - atomic_clear_long(pte, VPTE_A); - } - if (pbits & VPTE_M) { - if (pmap_track_modified(pmap, sva)) { - if (m == NULL) - m = PHYS_TO_VM_PAGE(pbits & VPTE_FRAME); - vm_page_dirty(m); - } - } - } else { - pbits = pmap_setro_pte(pte, pmap, sva); - } + pmap_clean_pte(pte, pmap, sva, NULL); } vm_page_unhold(pt_m); } @@ -2340,7 +2328,6 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, pt_entry_t origpte, newpte; vm_paddr_t opa; vm_page_t mpte; - int spun = 0; if (pmap == NULL) return; @@ -2372,7 +2359,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, */ pa = VM_PAGE_TO_PHYS(m); origpte = pmap_inval_loadandclear(pte, pmap, va); - /*origpte = pmap_clean_pte(pte, pmap, va);*/ + /*origpte = pmap_clean_pte(pte, pmap, va, NULL);*/ opa = origpte & VPTE_FRAME; if (origpte & VPTE_PS) @@ -2407,6 +2394,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, } else { KKASSERT((m->flags & (PG_FICTITIOUS|PG_UNMANAGED))); } + vm_page_spin_lock(m); goto validate; } @@ -2418,11 +2406,13 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, /* * Mapping has changed, invalidate old range and fall through to - * handle validating new mapping. + * handle validating new mapping. Don't inherit anything from + * oldpte. */ if (opa) { int err; err = pmap_remove_pte(pmap, NULL, origpte, va); + origpte = 0; if (err) panic("pmap_enter: pte vanished, va: 0x%lx", va); } @@ -2445,10 +2435,12 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, vm_page_spin_lock(m); pmap_insert_entry(pmap, va, mpte, m, pv); pa |= VPTE_MANAGED; - vm_page_flag_set(m, PG_MAPPED); - spun = 1; /* vm_page_spin_unlock(m); */ + } else { + vm_page_spin_lock(m); } + } else { + vm_page_spin_lock(m); } /* @@ -2469,7 +2461,6 @@ validate: newpte |= VPTE_WIRED; // if (pmap != &kernel_pmap) newpte |= VPTE_U; - if (newpte & VPTE_RW) vm_page_flag_set(m, PG_WRITEABLE); KKASSERT((newpte & VPTE_MANAGED) == 0 || (m->flags & PG_MAPPED)); @@ -2479,8 +2470,7 @@ validate: kprintf("pmap [M] race @ %016jx\n", va); atomic_set_long(pte, VPTE_M); } - if (spun) - vm_page_spin_unlock(m); + vm_page_spin_unlock(m); if (mpte) vm_page_wakeup(mpte); @@ -2959,6 +2949,8 @@ pmap_clearbit(vm_page_t m, int bit) pv_entry_t pv; pt_entry_t *pte; pt_entry_t pbits; + vm_object_t pmobj; + pmap_t pmap; if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) { if (bit == VPTE_RW) @@ -2970,20 +2962,37 @@ pmap_clearbit(vm_page_t m, int bit) * Loop over all current mappings setting/clearing as appropos If * setting RO do we need to clear the VAC? */ - vm_page_spin_lock(m); restart: + vm_page_spin_lock(m); TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { /* + * Need the pmap object lock(?) 
+ */ + pmap = pv->pv_pmap; + pmobj = pmap->pm_pteobj; + + if (vm_object_hold_try(pmobj) == 0) { + refcount_acquire(&pmobj->hold_count); + vm_page_spin_unlock(m); + vm_object_lock(pmobj); + vm_object_drop(pmobj); + goto restart; + } + + /* * don't write protect pager mappings */ if (bit == VPTE_RW) { - if (!pmap_track_modified(pv->pv_pmap, pv->pv_va)) + if (!pmap_track_modified(pv->pv_pmap, pv->pv_va)) { + vm_object_drop(pmobj); continue; + } } #if defined(PMAP_DIAGNOSTIC) if (pv->pv_pmap == NULL) { kprintf("Null pmap (cb) at va: 0x%lx\n", pv->pv_va); + vm_object_drop(pmobj); continue; } #endif @@ -3007,14 +3016,7 @@ restart: * the page. */ pbits = pmap_clean_pte(pte, pv->pv_pmap, - pv->pv_va); - if (pbits & VPTE_M) { - if (pmap_track_modified(pv->pv_pmap, - pv->pv_va)) { - vm_page_dirty(m); - goto restart; - } - } + pv->pv_va, m); } else if (bit == VPTE_M) { /* * We must invalidate the real-kernel pte @@ -3035,7 +3037,7 @@ restart: * the caller doesn't want us to update * the dirty status of the VM page. */ - pmap_clean_pte(pte, pv->pv_pmap, pv->pv_va); + pmap_clean_pte(pte, pv->pv_pmap, pv->pv_va, m); panic("shouldn't be called"); } else { /* @@ -3045,6 +3047,7 @@ restart: atomic_clear_long(pte, bit); } } + vm_object_drop(pmobj); } if (bit == VPTE_RW) vm_page_flag_clear(m, PG_WRITEABLE); @@ -3141,14 +3144,17 @@ pmap_is_modified(vm_page_t m) } /* - * Clear the modify bits on the specified physical page. + * Clear the modify bits on the specified physical page. For the vkernel + * we really need to clean the page, which clears VPTE_RW and VPTE_M, in + * order to ensure that we take a fault on the next write to the page. + * Otherwise the page may become dirty without us knowing it. * * No other requirements. */ void pmap_clear_modify(vm_page_t m) { - pmap_clearbit(m, VPTE_M); + pmap_clearbit(m, VPTE_RW); } /* @@ -3271,7 +3277,6 @@ pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) struct vmspace *oldvm; struct lwp *lp; - crit_enter(); oldvm = p->p_vmspace; if (oldvm != newvm) { if (adjrefs) @@ -3283,7 +3288,6 @@ pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) if (adjrefs) vmspace_rel(oldvm); } - crit_exit(); } /* diff --git a/sys/platform/vkernel64/platform/pmap_inval.c b/sys/platform/vkernel64/platform/pmap_inval.c index 7242026663..2d9abfb208 100644 --- a/sys/platform/vkernel64/platform/pmap_inval.c +++ b/sys/platform/vkernel64/platform/pmap_inval.c @@ -76,6 +76,8 @@ #include #include +#include + extern int vmm_enabled; /* @@ -270,24 +272,67 @@ pmap_inval_pde_quick(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va) } /* - * These carefully handle interactions with other cpus and return - * the original vpte. Clearing VPTE_RW prevents us from racing the - * setting of VPTE_M, allowing us to invalidate the TLB (the real cpu's - * pmap) and get good status for VPTE_M. + * This is really nasty. + * + * (1) The vkernel interlocks pte operations with the related vm_page_t + * spin-lock (and doesn't handle unmanaged page races). + * + * (2) The vkernel must also issu an invalidation to the real cpu. It + * (nastily) does this while holding the spin-lock too. * - * By using an atomic op we can detect if the real PTE is writable by - * testing whether VPTE_M was set. If it wasn't set, the real PTE is - * already read-only and we do not have to waste time invalidating it - * further. 
+ * In addition, atomic ops must be used to properly interlock against + * other cpus and the real kernel (which could be taking a fault on another + * cpu and will adjust VPTE_M and VPTE_A appropriately). * - * clean: clear VPTE_M and VPTE_RW - * setro: clear VPTE_RW - * load&clear: clear entire field + * The atomicc ops do a good job of interlocking against other cpus, but + * we still need to lock the pte location (which we use the vm_page spin-lock + * for) to avoid races against PG_WRITEABLE and other tests. + * + * Cleaning the pte involves clearing VPTE_M and VPTE_RW, synchronizing with + * the real host, and updating the vm_page appropriately. + * + * If the caller passes a non-NULL (m), the caller holds the spin-lock, + * otherwise we must acquire and release the spin-lock. (m) is only + * applicable to managed pages. */ vpte_t -pmap_clean_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va) +pmap_clean_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va, + vm_page_t m) { vpte_t pte; + int spin = 0; + + /* + * Acquire (m) and spin-lock it. + */ + while (m == NULL) { + pte = *ptep; + if ((pte & VPTE_V) == 0) + return pte; + if ((pte & VPTE_MANAGED) == 0) + break; + m = PHYS_TO_VM_PAGE(pte & VPTE_FRAME); + vm_page_spin_lock(m); + + pte = *ptep; + if ((pte & VPTE_V) == 0) { + vm_page_spin_unlock(m); + m = NULL; + continue; + } + if ((pte & VPTE_MANAGED) == 0) { + vm_page_spin_unlock(m); + m = NULL; + continue; + } + if (m != PHYS_TO_VM_PAGE(pte & VPTE_FRAME)) { + vm_page_spin_unlock(m); + m = NULL; + continue; + } + spin = 1; + break; + } if (vmm_enabled == 0) { for (;;) { @@ -306,62 +351,18 @@ pmap_clean_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va) pte = *ptep & ~(VPTE_RW | VPTE_M); guest_sync_addr(pmap, ptep, &pte); } - return pte; -} - -#if 0 - -vpte_t -pmap_clean_pde(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va) -{ - vpte_t pte; - pte = *ptep; - if (pte & VPTE_V) { - atomic_clear_long(ptep, VPTE_RW); - if (vmm_enabled == 0) { - atomic_clear_long(ptep, VPTE_RW); - pmap_inval_cpu(pmap, va, PAGE_SIZE); - pte = *ptep | (pte & VPTE_RW); - atomic_clear_long(ptep, VPTE_M); - } else { - pte &= ~(VPTE_RW | VPTE_M); - guest_sync_addr(pmap, ptep, &pte); + if (m) { + if (pte & VPTE_A) { + vm_page_flag_set(m, PG_REFERENCED); + atomic_clear_long(ptep, VPTE_A); } - } - return(pte); -} - -#endif - -/* - * This is an odd case and I'm not sure whether it even occurs in normal - * operation. Turn off write access to the page, clean out the tlb - * (the real cpu's pmap), and deal with any VPTE_M race that may have - * occured. - * - * VPTE_M is not cleared. If we accidently removed it due to the swap - * we throw it back into the pte. - */ -vpte_t -pmap_setro_pte(volatile vpte_t *ptep, struct pmap *pmap, vm_offset_t va) -{ - vpte_t pte; - - if (vmm_enabled == 0) { - for (;;) { - pte = *ptep; - cpu_ccfence(); - if ((pte & VPTE_RW) == 0) - break; - if (atomic_cmpset_long(ptep, pte, pte & ~VPTE_RW)) { - pmap_inval_cpu(pmap, va, PAGE_SIZE); - break; - } + if (pte & VPTE_M) { + if (pmap_track_modified(pmap, va)) + vm_page_dirty(m); } - } else { - pte = *ptep & ~(VPTE_RW | VPTE_M); - guest_sync_addr(pmap, ptep, &pte); + if (spin) + vm_page_spin_unlock(m); } return pte; } @@ -406,11 +407,13 @@ cpu_invltlb(void) madvise((void *)KvaStart, KvaEnd - KvaStart, MADV_INVAL); } +/* + * Invalidate the TLB on all cpus. Instead what the vkernel does is + * ignore VM_PROT_NOSYNC on pmap_enter() calls. 
+ */ void smp_invltlb(void) { - /* XXX must invalidate the tlb on all cpus */ - /* at the moment pmap_inval_pte_quick */ /* do nothing */ } diff --git a/sys/platform/vkernel64/x86_64/trap.c b/sys/platform/vkernel64/x86_64/trap.c index 7652f1d39c..70c27c4324 100644 --- a/sys/platform/vkernel64/x86_64/trap.c +++ b/sys/platform/vkernel64/x86_64/trap.c @@ -846,6 +846,7 @@ trap_pfault(struct trapframe *frame, int usermode, vm_offset_t eva) */ PHOLD(lp->lwp_proc); +#if 0 /* * Grow the stack if necessary */ @@ -855,15 +856,16 @@ trap_pfault(struct trapframe *frame, int usermode, vm_offset_t eva) * a growable stack region, or if the stack * growth succeeded. */ - if (!grow_stack (lp->lwp_proc, va)) { + if (!grow_stack (map, va)) { rv = KERN_FAILURE; PRELE(lp->lwp_proc); goto nogo; } +#endif fault_flags = 0; if (usermode) - fault_flags |= VM_FAULT_BURST; + fault_flags |= VM_FAULT_BURST | VM_FAULT_USERMODE; if (ftype & VM_PROT_WRITE) fault_flags |= VM_FAULT_DIRTY; else @@ -1351,6 +1353,7 @@ go_user(struct intrframe *frame) * be faster because the cost of taking a #NM fault through * the vkernel to the real kernel is astronomical. */ + crit_enter(); tf->tf_xflags &= ~PGEX_FPFAULT; if (mdcpu->gd_npxthread != curthread) { if (mdcpu->gd_npxthread) @@ -1393,6 +1396,16 @@ go_user(struct intrframe *frame) gd->gd_flags &= ~GDF_VIRTUSER; frame->if_xflags |= PGEX_U; + + /* + * Immediately save the user FPU state. The vkernel is a + * user program and libraries like libc will use the FP + * unit. + */ + if (mdcpu->gd_npxthread == curthread) { + npxsave(mdcpu->gd_npxthread->td_savefpu); + } + crit_exit(); #if 0 kprintf("GO USER %d trap %ld EVA %08lx RIP %08lx RSP %08lx XFLAGS %02lx/%02lx\n", r, tf->tf_trapno, tf->tf_addr, tf->tf_rip, tf->tf_rsp, diff --git a/sys/platform/vkernel64/x86_64/vm_machdep.c b/sys/platform/vkernel64/x86_64/vm_machdep.c index 865db000c5..477448a599 100644 --- a/sys/platform/vkernel64/x86_64/vm_machdep.c +++ b/sys/platform/vkernel64/x86_64/vm_machdep.c @@ -298,18 +298,6 @@ cpu_thread_exit(void) panic("cpu_thread_exit: lwkt_switch() unexpectedly returned"); } -int -grow_stack(struct proc *p, u_long sp) -{ - int rv; - - rv = vm_map_growstack (p, sp); - if (rv != KERN_SUCCESS) - return (0); - - return (1); -} - /* * Used by /dev/kmem to determine if we can safely read or write * the requested KVA range. Some portions of kernel memory are diff --git a/sys/sys/bio.h b/sys/sys/bio.h index bb502dc22b..bb40a3c2e7 100644 --- a/sys/sys/bio.h +++ b/sys/sys/bio.h @@ -69,6 +69,7 @@ struct bio { biodone_t *bio_done; /* MPSAFE caller completion function */ off_t bio_offset; /* Logical offset relative to device */ void *bio_driver_info; + uint32_t bio_crc; /* Caller-specific */ int bio_flags; union { void *ptr; diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index ef0be6ca6b..934c565f96 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -108,6 +108,7 @@ #include #include +#include #include "opt_swap.h" #include #include @@ -681,7 +682,7 @@ swap_pager_condfree_callback(struct swblock *swap, void *data) * into a VM object. Checks whether swap has been assigned to * the page and sets PG_SWAPPED as necessary. * - * No requirements. + * (m) must be busied by caller and remains busied on return. */ void swap_pager_page_inserted(vm_page_t m) @@ -871,9 +872,9 @@ swap_pager_haspage(vm_object_t object, vm_pindex_t pindex) * calls us in a special-case situation * * NOTE!!! If the page is clean and the swap was valid, the caller - * should make the page dirty before calling this routine. 
This routine - * does NOT change the m->dirty status of the page. Also: MADV_FREE - * depends on it. + * should make the page dirty before calling this routine. + * This routine does NOT change the m->dirty status of the page. + * Also: MADV_FREE depends on it. * * The page must be busied. * The caller can hold the object to avoid blocking, else we might block. @@ -1439,7 +1440,12 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) } /* - * mreq is left bussied after completion, but all the other pages + * Disallow speculative reads prior to the PG_SWAPINPROG test. + */ + cpu_lfence(); + + /* + * mreq is left busied after completion, but all the other pages * are freed. If we had an unrecoverable read error the page will * not be valid. */ @@ -1649,6 +1655,40 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, bp->b_cmd = BUF_CMD_WRITE; bio->bio_caller_info1.index = SWBIO_WRITE; +#if 0 + /* PMAP TESTING CODE (useful, keep it in but #if 0'd) */ + bio->bio_crc = iscsi_crc32(bp->b_data, bp->b_bcount); + { + uint32_t crc = 0; + for (j = 0; j < n; ++j) { + vm_page_t mm = bp->b_xio.xio_pages[j]; + char *p = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mm)); + crc = iscsi_crc32_ext(p, PAGE_SIZE, crc); + } + if (bio->bio_crc != crc) { + kprintf("PREWRITE MISMATCH-A " + "bdata=%08x dmap=%08x bdata=%08x (%d)\n", + bio->bio_crc, + crc, + iscsi_crc32(bp->b_data, bp->b_bcount), + bp->b_bcount); +#ifdef _KERNEL_VIRTUAL + madvise(bp->b_data, bp->b_bcount, MADV_INVAL); +#endif + crc = 0; + for (j = 0; j < n; ++j) { + vm_page_t mm = bp->b_xio.xio_pages[j]; + char *p = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mm)); + crc = iscsi_crc32_ext(p, PAGE_SIZE, crc); + } + kprintf("PREWRITE MISMATCH-B " + "bdata=%08x dmap=%08x\n", + iscsi_crc32(bp->b_data, bp->b_bcount), + crc); + } + } +#endif + /* * asynchronous */ @@ -1761,6 +1801,24 @@ swp_pager_async_iodone(struct bio *bio) if (bp->b_xio.xio_npages) object = bp->b_xio.xio_pages[0]->object; +#if 0 + /* PMAP TESTING CODE (useful, keep it in but #if 0'd) */ + if (bio->bio_caller_info1.index & SWBIO_WRITE) { + if (bio->bio_crc != iscsi_crc32(bp->b_data, bp->b_bcount)) { + kprintf("SWAPOUT: BADCRC %08x %08x\n", + bio->bio_crc, + iscsi_crc32(bp->b_data, bp->b_bcount)); + for (i = 0; i < bp->b_xio.xio_npages; ++i) { + vm_page_t m = bp->b_xio.xio_pages[i]; + if (m->flags & PG_WRITEABLE) + kprintf("SWAPOUT: " + "%d/%d %p writable\n", + i, bp->b_xio.xio_npages, m); + } + } + } +#endif + /* * remove the mapping for kernel virtual */ @@ -1798,15 +1856,21 @@ swp_pager_async_iodone(struct bio *bio) * up too because we cleared PG_SWAPINPROG and * someone may be waiting for that. * - * NOTE: for reads, m->dirty will probably - * be overridden by the original caller of - * getpages so don't play cute tricks here. + * NOTE: For reads, m->dirty will probably + * be overridden by the original caller + * of getpages so don't play cute tricks + * here. * * NOTE: We can't actually free the page from - * here, because this is an interrupt. It - * is not legal to mess with object->memq - * from an interrupt. Deactivate the page - * instead. + * here, because this is an interrupt. + * It is not legal to mess with + * object->memq from an interrupt. + * Deactivate the page instead. + * + * WARNING! The instant PG_SWAPINPROG is + * cleared another cpu may start + * using the mreq page (it will + * check m->valid immediately). */ m->valid = 0; @@ -1842,15 +1906,17 @@ swp_pager_async_iodone(struct bio *bio) * do have backing store (the vnode). 
*/ vm_page_busy_wait(m, FALSE, "swadpg"); + vm_object_hold(m->object); swp_pager_meta_ctl(m->object, m->pindex, SWM_FREE); vm_page_flag_clear(m, PG_SWAPPED); + vm_object_drop(m->object); if (m->object->type == OBJT_SWAP) { vm_page_dirty(m); vm_page_activate(m); } - vm_page_flag_clear(m, PG_SWAPINPROG); vm_page_io_finish(m); + vm_page_flag_clear(m, PG_SWAPINPROG); vm_page_wakeup(m); } } else if (bio->bio_caller_info1.index & SWBIO_READ) { @@ -1873,15 +1939,21 @@ swp_pager_async_iodone(struct bio *bio) */ /* - * NOTE: can't call pmap_clear_modify(m) from an - * interrupt thread, the pmap code may have to map - * non-kernel pmaps and currently asserts the case. + * NOTE: Can't call pmap_clear_modify(m) from an + * interrupt thread, the pmap code may have to + * map non-kernel pmaps and currently asserts + * the case. + * + * WARNING! The instant PG_SWAPINPROG is + * cleared another cpu may start + * using the mreq page (it will + * check m->valid immediately). */ /*pmap_clear_modify(m);*/ m->valid = VM_PAGE_BITS_ALL; vm_page_undirty(m); - vm_page_flag_clear(m, PG_SWAPINPROG); vm_page_flag_set(m, PG_SWAPPED); + vm_page_flag_clear(m, PG_SWAPINPROG); /* * We have to wake specifically requested pages @@ -1915,12 +1987,15 @@ swp_pager_async_iodone(struct bio *bio) * * When using the swap to cache clean vnode pages * we do not mess with the page dirty bits. + * + * NOTE! Nobody is waiting for the key mreq page + * on write completion. */ vm_page_busy_wait(m, FALSE, "swadpg"); if (m->object->type == OBJT_SWAP) vm_page_undirty(m); - vm_page_flag_clear(m, PG_SWAPINPROG); vm_page_flag_set(m, PG_SWAPPED); + vm_page_flag_clear(m, PG_SWAPINPROG); if (vm_page_count_severe()) vm_page_deactivate(m); vm_page_io_finish(m); diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c index e8a5ac6ac8..c356c329de 100644 --- a/sys/vm/vm_fault.c +++ b/sys/vm/vm_fault.c @@ -357,7 +357,7 @@ RetryFault: { if (result == KERN_INVALID_ADDRESS && growstack && map != &kernel_map && curproc != NULL) { - result = vm_map_growstack(curproc, vaddr); + result = vm_map_growstack(map, vaddr); if (result == KERN_SUCCESS) { growstack = 0; ++retry; @@ -724,7 +724,8 @@ vm_fault_page_quick(vm_offset_t va, vm_prot_t fault_type, * * If busyp is not NULL then *busyp will be set to TRUE if this routine * decides to return a busied page (aka VM_PROT_WRITE), or FALSE if it - * does not (VM_PROT_WRITE not specified or busyp is NULL). + * does not (VM_PROT_WRITE not specified or busyp is NULL). If busyp is + * NULL the returned page is only held. * * If the caller has no intention of writing to the page's contents, busyp * can be passed as NULL along with VM_PROT_WRITE to force a COW operation @@ -735,9 +736,6 @@ vm_fault_page_quick(vm_offset_t va, vm_prot_t fault_type, * If the page cannot be faulted writable and VM_PROT_WRITE was specified, an * error will be returned. * - * The page will either be held or busied on returned depending on what this - * routine sets *busyp to. It will only be held if busyp is NULL. - * * No requirements. */ vm_page_t @@ -821,7 +819,7 @@ RetryFault: { if (result == KERN_INVALID_ADDRESS && growstack && map != &kernel_map && curproc != NULL) { - result = vm_map_growstack(curproc, vaddr); + result = vm_map_growstack(map, vaddr); if (result == KERN_SUCCESS) { growstack = 0; ++retry; diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c index 64f5c7d048..206406329f 100644 --- a/sys/vm/vm_map.c +++ b/sys/vm/vm_map.c @@ -1953,7 +1953,6 @@ vm_map_madvise(vm_map_t map, vm_offset_t start, vm_offset_t end, * various clipping operations. 
Otherwise we only need a read-lock * on the map. */ - count = vm_map_entry_reserve(MAP_RESERVE_COUNT); switch(behav) { @@ -3257,7 +3256,7 @@ vm_map_split(vm_map_entry_t entry) nobject->backing_object = bobject; if (useshadowlist) { bobject->shadow_count++; - bobject->generation++; + atomic_add_int(&bobject->generation, 1); LIST_INSERT_HEAD(&bobject->shadow_head, nobject, shadow_list); vm_object_clear_flag(bobject, OBJ_ONEMAPPING); /*XXX*/ @@ -3728,13 +3727,14 @@ vm_map_stack (vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize, * No requirements. */ int -vm_map_growstack (struct proc *p, vm_offset_t addr) +vm_map_growstack (vm_map_t map, vm_offset_t addr) { vm_map_entry_t prev_entry; vm_map_entry_t stack_entry; vm_map_entry_t new_stack_entry; - struct vmspace *vm = p->p_vmspace; - vm_map_t map = &vm->vm_map; + struct vmspace *vm; + struct lwp *lp; + struct proc *p; vm_offset_t end; int grow_amount; int rv = KERN_SUCCESS; @@ -3742,6 +3742,15 @@ vm_map_growstack (struct proc *p, vm_offset_t addr) int use_read_lock = 1; int count; + /* + * Find the vm + */ + lp = curthread->td_lwp; + p = curthread->td_proc; + KKASSERT(lp != NULL); + vm = lp->lwp_vmspace; + KKASSERT(map == &vm->vm_map); + count = vm_map_entry_reserve(MAP_RESERVE_COUNT); Retry: if (use_read_lock) diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h index 4f848ed98e..527843084e 100644 --- a/sys/vm/vm_map.h +++ b/sys/vm/vm_map.h @@ -609,7 +609,7 @@ void vm_init2 (void); int vm_uiomove (vm_map_t, vm_object_t, off_t, int, vm_offset_t, int *); int vm_map_stack (vm_map_t, vm_offset_t, vm_size_t, int, vm_prot_t, vm_prot_t, int); -int vm_map_growstack (struct proc *p, vm_offset_t addr); +int vm_map_growstack (vm_map_t map, vm_offset_t addr); vm_offset_t vmspace_swap_count (struct vmspace *vmspace); vm_offset_t vmspace_anonymous_count (struct vmspace *vmspace); void vm_map_set_wired_quick(vm_map_t map, vm_offset_t addr, vm_size_t size, int *); diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c index 88745786a4..6576142798 100644 --- a/sys/vm/vm_object.c +++ b/sys/vm/vm_object.c @@ -389,7 +389,7 @@ _vm_object_allocate(objtype_t type, vm_pindex_t size, vm_object_t object) object->backing_object = NULL; object->backing_object_offset = (vm_ooffset_t)0; - object->generation++; + atomic_add_int(&object->generation, 1); object->swblock_count = 0; RB_INIT(&object->swblock_root); vm_object_lock_init(object); @@ -1080,7 +1080,7 @@ skip: if (object->flags & OBJ_ONSHADOW) { LIST_REMOVE(object, shadow_list); temp->shadow_count--; - temp->generation++; + atomic_add_int(&temp->generation, 1); vm_object_clear_flag(object, OBJ_ONSHADOW); } object->backing_object = NULL; @@ -1950,7 +1950,7 @@ vm_object_shadow(vm_object_t *objectp, vm_ooffset_t *offset, vm_size_t length, LIST_INSERT_HEAD(&source->shadow_head, result, shadow_list); source->shadow_count++; - source->generation++; + atomic_add_int(&source->generation, 1); vm_object_set_flag(result, OBJ_ONSHADOW); } /* cpu localization twist */ @@ -2380,7 +2380,7 @@ vm_object_collapse(vm_object_t object, struct vm_object_dealloc_list **dlistp) if (object->flags & OBJ_ONSHADOW) { LIST_REMOVE(object, shadow_list); backing_object->shadow_count--; - backing_object->generation++; + atomic_add_int(&backing_object->generation, 1); vm_object_clear_flag(object, OBJ_ONSHADOW); } @@ -2412,7 +2412,7 @@ vm_object_collapse(vm_object_t object, struct vm_object_dealloc_list **dlistp) LIST_REMOVE(backing_object, shadow_list); bbobj->shadow_count--; - bbobj->generation++; + atomic_add_int(&bbobj->generation, 1); 
vm_object_clear_flag(backing_object, OBJ_ONSHADOW); } @@ -2424,7 +2424,7 @@ vm_object_collapse(vm_object_t object, struct vm_object_dealloc_list **dlistp) LIST_INSERT_HEAD(&bbobj->shadow_head, object, shadow_list); bbobj->shadow_count++; - bbobj->generation++; + atomic_add_int(&bbobj->generation, 1); vm_object_set_flag(object, OBJ_ONSHADOW); } @@ -2512,7 +2512,7 @@ vm_object_collapse(vm_object_t object, struct vm_object_dealloc_list **dlistp) if (object->flags & OBJ_ONSHADOW) { LIST_REMOVE(object, shadow_list); backing_object->shadow_count--; - backing_object->generation++; + atomic_add_int(&backing_object->generation, 1); vm_object_clear_flag(object, OBJ_ONSHADOW); } @@ -2533,7 +2533,7 @@ vm_object_collapse(vm_object_t object, struct vm_object_dealloc_list **dlistp) LIST_INSERT_HEAD(&bbobj->shadow_head, object, shadow_list); bbobj->shadow_count++; - bbobj->generation++; + atomic_add_int(&bbobj->generation, 1); vm_object_set_flag(object, OBJ_ONSHADOW); } else { diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 21c8abd228..863543907d 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -1189,7 +1189,7 @@ vm_page_insert(vm_page_t m, vm_object_t object, vm_pindex_t pindex) if (m->object != NULL) panic("vm_page_insert: already inserted"); - object->generation++; + atomic_add_int(&object->generation, 1); /* * Record the object/offset pair in this page and add the @@ -1263,10 +1263,9 @@ vm_page_remove(vm_page_t m) --object->resident_page_count; --mycpu->gd_vmtotal.t_rm; m->object = NULL; + atomic_add_int(&object->generation, 1); vm_page_spin_unlock(m); - object->generation++; - vm_object_drop(object); } @@ -2825,6 +2824,8 @@ vm_page_dontneed(vm_page_t m) * Because vm_pages can overlap buffers m->busy can be > 1. m->busy is only * adjusted while the vm_page is PG_BUSY so the flash will occur when the * busy bit is cleared. + * + * The caller must hold the page BUSY when making these two calls. */ void vm_page_io_start(vm_page_t m) @@ -3165,7 +3166,7 @@ vm_page_set_invalid(vm_page_t m, int base, int size) bits = vm_page_bits(base, size); m->valid &= ~bits; m->dirty &= ~bits; - m->object->generation++; + atomic_add_int(&m->object->generation, 1); } /* diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c index 071101ddb0..41eaa9d4a2 100644 --- a/sys/vm/vm_pageout.c +++ b/sys/vm/vm_pageout.c @@ -237,12 +237,8 @@ PQAVERAGE(int n) /* * vm_pageout_clean_helper: * - * Clean the page and remove it from the laundry. The page must not be - * busy on-call. - * - * We set the busy bit to cause potential page faults on this page to - * block. Note the careful timing, however, the busy bit isn't set till - * late and we cannot do anything that will mess with the page. + * Clean the page and remove it from the laundry. The page must be busied + * by the caller and will be disposed of (put away, flushed) by this routine. */ static int vm_pageout_clean_helper(vm_page_t m, int vmflush_flags) @@ -256,16 +252,7 @@ vm_pageout_clean_helper(vm_page_t m, int vmflush_flags) object = m->object; /* - * It doesn't cost us anything to pageout OBJT_DEFAULT or OBJT_SWAP - * with the new swapper, but we could have serious problems paging - * out other object types if there is insufficient memory. - * - * Unfortunately, checking free memory here is far too late, so the - * check has been moved up a procedural level. - */ - - /* - * Don't mess with the page if it's busy, held, or special + * Don't mess with the page if it's held or special. * * XXX do we really need to check hold_count here? 
hold_count * isn't supposed to mess with vm_page ops except prevent the @@ -452,9 +439,10 @@ vm_pageout_flush(vm_page_t *mc, int count, int vmflush_flags) vm_object_pip_add(object, count); vm_pager_put_pages(object, mc, count, - (vmflush_flags | - ((object == &kernel_object) ? VM_PAGER_PUT_SYNC : 0)), - pageout_status); + (vmflush_flags | + ((object == &kernel_object) ? + VM_PAGER_PUT_SYNC : 0)), + pageout_status); for (i = 0; i < count; i++) { vm_page_t mt = mc[i]; -- 2.11.4.GIT
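
Two of the arithmetic issues addressed above are easy to see in isolation. First, the 16TB limit behind widening the pmap_inval_smp() npgs argument: with 4K pages, 16TB is exactly 2^32 pages, so any larger range no longer fits in a 32-bit page count. The sketch below is a stand-alone user-space illustration only, not kernel code; the 17TB range and the 4K page size are example assumptions.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT      12                      /* assume 4K pages */

int
main(void)
{
        uint64_t va_beg = 0;
        uint64_t va_end = 17ULL << 40;          /* a 17TB range, > 16TB */
        uint64_t npgs = (va_end - va_beg) >> PAGE_SHIFT;
        uint32_t npgs32 = (uint32_t)npgs;       /* width of the old int npgs */

        printf("page count in 64 bits: %llu\n", (unsigned long long)npgs);
        printf("page count in 32 bits: %u (truncated)\n", (unsigned)npgs32);
        return 0;
}

vm_pindex_t is a 64-bit type on x86_64, so passing the count as vm_pindex_t preserves the full value.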
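
Second, the pmap_scan() wrap case: rounding the exclusive end address up with (eva + NBPDP - 1) can carry past 2^64 when the range ends near the top of the address space, which collapses the computed end pindex to 0 and makes an exclusive-bound scan cover nothing. Backing off one page and scanning an inclusive range avoids the wrap. The sketch below uses a simplified stand-in for pmap_pd_pindex() (a plain va >> 30; the real function also folds in the pindex offsets of the other page-table levels) and hypothetical top-of-address-space values.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE       4096ULL
#define NBPDP           (1ULL << 30)    /* bytes covered by one pd page */

/* simplified stand-in for pmap_pd_pindex(): which 1GB slot covers va */
static uint64_t
pd_index(uint64_t va)
{
        return (va >> 30);
}

int
main(void)
{
        uint64_t sva = 0xFFFFFFFF00000000ULL;   /* example range at top of VA */
        uint64_t eva = 0xFFFFFFFFFFFFF000ULL;   /* exclusive end */

        uint64_t end_excl = pd_index(eva + NBPDP - 1);  /* old form: wraps */
        uint64_t end_incl = pd_index(eva - PAGE_SIZE);  /* new form: no wrap */

        printf("start index:         %llu\n", (unsigned long long)pd_index(sva));
        printf("exclusive end index: %llu (wrapped to 0)\n",
               (unsigned long long)end_excl);
        printf("inclusive end index: %llu\n", (unsigned long long)end_incl);

        /*
         * A scan written as 'while (i < end_excl)' never executes here,
         * while 'while (i <= end_incl)' covers the whole range.
         */
        return 0;
}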