Mailing List Archive

Overlapping ioremap() calls, set_memory_*() semantics
At kernel summit, during the semantics of ioremap() session, Paul
mentioned we'd write something up to help get some notes out on what
we need to do and help clarify things. I've run into an issue (just a
warning) with a user on some odd system that I suspect may be the
result of a driver using overlapping ioremap() calls on conflicting
memory regions, so I'm a bit interested to see a resolution to some of
these lingering discussions now.

Although we spoke about quite a few things, I raised in particular the
'semantics of overlapping ioremap()' calls as one item of interest we
should address across architectures. At least on x86 it seems we would
not get an error if this is done; in fact it's expected behavior.
Toshi had determined we could not enable erroring out on overlapping
ioremap() calls given we have a few users that use it intentionally,
for instance the /dev/mem setup code. I had suggested long ago that
one possible resolution was for us to add an API that *enables*
overlapping ioremap() calls, and only use it in select locations in
the kernel. This means we only have to convert a few users to that
call to whitelist such semantics, and by default we'd disable
overlapping calls. To kick things off -- is this strategy agreeable
for all other architectures?

The problem is that without this it remains up to the driver developer
to avoid overlapping calls, and a user may just get sporadic errors if
this happens. As another problem case, set_memory_*() will not fail on
MMIO even though set_memory_*() is designed only for RAM. If the above
strategy on avoiding overlapping is agreeable, could the next step, or
an orthogonal step, be to error out on set_memory_*() on IO memory?

Luis
Re: Overlapping ioremap() calls, set_memory_*() semantics
* Luis R. Rodriguez <mcgrof@kernel.org> wrote:

> At kernel summit, during the semantics of ioremap() session, Paul
> mentioned we'd write something up to help get some notes out on what
> we need to do and help clarify things. I've run into an issue (just a
> warning) with a user on some odd system that I suspect may be the
> result of a driver using overlapping ioremap() calls on conflicting
> memory regions, so I'm a bit interested to see a resolution to some of
> these lingering discussions now.
>
> Although we spoke of quite a bit of things, I raised in particular the
> 'semantics of overlapping ioremap()' calls as one item of interest we
> should address across architectures. At least on x86 it seems we would
> not get an error if this is done and in fact it's expected behavior;
> Toshi had determined we could not enable erroring out on overlapping
> ioremap() calls given we have a few users that use it intentionally,
> for instance the /dev/mem setup code. I had suggested long ago then
> that one possible resolution was for us to add an API that *enables*
> overlapping ioremap() calls, and only use it on select locations in
> the kernel. This means we only have to convert a few users to that
> call to whitelist such semantics, and by default we'd disable
> overlapping calls. To kick things off -- is this strategy agreeable
> for all other architectures?

So I'd say that since ioremap() in itself is fragile enough, we should work
towards eliminating overlapping ranges.

The thing is, the whole vmap_area logic is based around non-overlapping ranges,
sorted into the vmap_area_root rbtree.

Just check the logic in mm/vmalloc.c::alloc_vmap_area(): it's based on finding
holes in the kernel-virtual allocations. 'Overlapping ranges' is very much not
part of that logic, at least to my understanding.

How are overlapping ioremap()s even possible with that logic? The allocator
searches for holes, not allowing for overlaps. What am I missing?

Could you outline a specific case where it's done intentionally - and the purpose
behind that intention?

> The problem is that without this it remains up to the developer of the driver to
> avoid overlapping calls, and a user may just get sporadic errors if this
> happens. As another problem case, set_memory_*() will not fail on MMIO even
> though set_memory_*() is designed only for RAM. If the above strategy on avoiding
> overlapping is agreeable, could the next step, or an orthogonal step be to error
> out on set_memory_*() on IO memory?

So how do drivers set up WC or WB MMIO areas? Does none of our upstream video
drivers do that?

Thanks,

Ingo
Re: Overlapping ioremap() calls, set_memory_*() semantics
On Fri, 2016-03-04 at 10:44 +0100, Ingo Molnar wrote:
> * Luis R. Rodriguez <mcgrof@kernel.org> wrote:
>
> > At kernel summit, during the semantics of ioremap() session, Paul
> > mentioned we'd write something up to help get some notes out on what
> > we need to do and help clarify things. I've run into an issue (just a
> > warning) with a user on some odd system that I suspect may be the
> > result of a driver using overlapping ioremap() calls on conflicting
> > memory regions, so I'm a bit interested to see a resolution to some of
> > these lingering discussions now.
> >
> > Although we spoke of quite a bit of things, I raised in particular the
> > 'semantics of overlapping ioremap()' calls as one item of interest we
> > should address across architectures. At least on x86 it seems we would
> > not get an error if this is done and in fact it's expected behavior;
> > Toshi had determined we could not enable erroring out on overlapping
> > ioremap() calls given we have a few users that use it intentionally,
> > for instance the /dev/mem setup code. I had suggested long ago then
> > that one possible resolution was for us to add an API that *enables*
> > overlapping ioremap() calls, and only use it on select locations in
> > the kernel. This means we only have to convert a few users to that
> > call to whitelist such semantics, and by default we'd disable
> > overlapping calls. To kick things off -- is this strategy agreeable
> > for all other architectures?
>
> So I'd say that since ioremap() in itself is fragile enough, we should
> work towards eliminating overlapping ranges.
>
> The thing is, the whole vmap_area logic is based around non-overlapping
> ranges, sorted into the vmap_area_root rbtree.
>
> Just check the logic in mm/vmalloc.c::alloc_vmap_area(): it's based on
> finding holes in the kernel-virtual allocations. 'Overlapping ranges' is
> very much not part of that logic, at least to my understanding.
>
> How are overlapping ioremap()s even possible with that logic? The
> allocator searches for holes, not allowing for overlaps. What am I
> missing?
>
> Could you outline a specific case where it's done intentionally - and the
> purpose behind that intention?

The term "overlapping" is a bit misleading.  This is "alias" mapping -- a
physical address range is mapped to multiple virtual address ranges.  There
is no overlapping in the VMAs.

Such alias mappings are used by multiple modules.  For instance, a PMEM
range is mapped to the kernel and user spaces.  /dev/mem is another example
that creates a user space mapping to a physical address where other
mappings may already exist.

Hence, alias mapping itself is a supported use-case.  However, alias
mapping with different cache types is not, as it causes undefined behavior.
Therefore, the PAT module protects against this case by tracking the cache
types used for mapping physical ranges.  When a different cache type is
requested, is_new_memtype_allowed() checks if the request needs to be
failed or can be changed to the existing type.

I agree that the current implementation is fragile, and some interfaces
skip such checks altogether, e.g. vm_insert_pfn().


> > The problem is that without this it remains up to the developer of the
> > driver to avoid overlapping calls, and a user may just get sporadic
> > errors if this happens.  As another problem case, set_memor_*() will
> > not fail on MMIO even though set_memor_*() is designed only for RAM. If
> > the above strategy on avoiding overlapping is agreeable, could the next
> > step, or an orthogonal step be to error out on set_memory_*() on IO
> > memory?
>
> So how do drivers set up WC or WB MMIO areas? Does none of our upstream
> video drivers do that?

Drivers use the ioremap family with the right cache type when mapping MMIO
ranges, e.g. ioremap_wc().  They do not need to change the type of MMIO.
RAM is different since it's already mapped with WB at boot-time.
set_memory_*() allows us to change the type from WB, and put it back to
WB.
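As a pseudocode-level sketch (not a real driver, names made up) of that
distinction:

```c
/* Hypothetical driver snippet: MMIO picks its cache type at map time,
 * while RAM is already WB in the direct map, so set_memory_*() flips
 * it temporarily and must restore WB before the page is freed. */
static int example_setup(phys_addr_t bar_phys, size_t bar_len,
			 struct page *page, int nr_pages)
{
	/* MMIO: choose the cache type when creating the mapping. */
	void __iomem *regs = ioremap_wc(bar_phys, bar_len);
	if (!regs)
		return -ENOMEM;
	/* ... poke device registers through regs ... */
	iounmap(regs);

	/* RAM: mapped WB at boot; flip to WC for the duration, then restore. */
	unsigned long vaddr = (unsigned long)page_address(page);
	set_memory_wc(vaddr, nr_pages);
	/* ... write-combined work on the buffer ... */
	set_memory_wb(vaddr, nr_pages);
	return 0;
}
```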

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics
On Fri, Mar 4, 2016 at 10:18 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Fri, 2016-03-04 at 10:44 +0100, Ingo Molnar wrote:
>> * Luis R. Rodriguez <mcgrof@kernel.org> wrote:
>>
>> Could you outline a specific case where it's done intentionally - and the
>> purpose behind that intention?
>
> The term "overlapping" is a bit misleading.

To the driver author it's overlapping: it's overlapping on the physical address.

> This is "alias" mapping -- a
> physical address range is mapped to multiple virtual address ranges. There
> is no overlapping in VMA.

Meanwhile this is what happens behind the scenes, and because of this
a driver developer may end up with two different offsets for
overlapping physical address ranges; so long as they use the right
offset they are *supposed* to get the intended cache type effects.
The resulting cache type will depend on the architecture and the
cache type requested in the ioremap call. Furthermore, modifiers are
possible, such as MTRR (fortunately now deprecated for kernel driver
use), and those effects are now documented in
Documentation/x86/pat.txt in a table for PAT/non-PAT systems, which
I'll paste here for convenience:

----------------------------------------------------------------------
MTRR Non-PAT   PAT    Linux ioremap value        Effective memory type
----------------------------------------------------------------------
                                                  Non-PAT |  PAT
     PAT
     |PCD
     ||PWT
     |||
WC   000      WB      _PAGE_CACHE_MODE_WB            WC   |   WC
WC   001      WC      _PAGE_CACHE_MODE_WC            WC*  |   WC
WC   010      UC-     _PAGE_CACHE_MODE_UC_MINUS      WC*  |   UC
WC   011      UC      _PAGE_CACHE_MODE_UC            UC   |   UC
----------------------------------------------------------------------

> Such alias mappings are used by multiple modules. For instance, a PMEM
> range is mapped to the kernel and user spaces. /dev/mem is another example
> that creates a user space mapping to a physical address where other
> mappings may already exist.

The above examples are IMHO perhaps valid uses, and for these perhaps
we should add an API that enables overlapping on physical ranges. I
believe this given that for regular device drivers, at least in my
review of write-combining users, there were only a few exceptions I
had found that were using overlapping ranges. The above table was
used to devise a strategy to help phase out MTRR in consideration of
the overlapping uses in drivers. Below is a list of what was found:

* atyfb: I ended up splitting the overlapping ranges; it was quite a
bit of work, but I did this mostly to demo the effort needed on a
device driver that had already used an overlapping range. To maintain
the effective write-combining behavior on non-PAT systems we added
ioremap_uc(), which would only be used on the MMIO area; meanwhile the
framebuffer area used ioremap_wc(), and the arch_phys_wc_add() call
was then used to trigger write-combining on non-PAT systems per the
above table.
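Roughly, the resulting split looks like this (pseudocode-level sketch,
struct and field names hypothetical -- the real atyfb code differs):

```c
/* Hypothetical probe fragment: registers and framebuffer get separate
 * mappings with separate cache types instead of one big overlapping
 * ioremap() whose attributes MTRR used to patch up. */
struct my_par {
	void __iomem *mmio;
	int wc_cookie;
};

static int example_probe(struct my_par *par, struct fb_info *info,
			 phys_addr_t mmio_phys, size_t mmio_len,
			 phys_addr_t fb_phys, size_t fb_len)
{
	/* Registers: strongly uncached even where MTRR would impose WC. */
	par->mmio = ioremap_uc(mmio_phys, mmio_len);

	/* Framebuffer: write-combined via PAT where available. */
	info->screen_base = ioremap_wc(fb_phys, fb_len);

	/* On non-PAT systems this adds an MTRR WC entry; no-op with PAT. */
	par->wc_cookie = arch_phys_wc_add(fb_phys, fb_len);

	if (!par->mmio || !info->screen_base)
		return -ENOMEM;
	return 0;
}
```

On teardown the driver would call arch_phys_wc_del(par->wc_cookie) and
iounmap() both ranges.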

* ipath: while a split on ipath was possible, the changes are quite
significant, way more significant than on atyfb. Fortunately work on
this driver triggered a discussion of eventually removing this device
driver, so in v4.3-rc2 it was moved to staging to phase it out. For
completeness though, I'll explain why addressing the overlap was hard:
apart from changing the driver to use different offset bases in
different regions, the driver also has a debug userspace filemap for
the entire register map; the code there would need to be modified to
use the right virtual address base depending on the virtual address
accessed. The qib driver already has logic which could fortunately be
mimicked for this, but still - this is considerable work. One side
hack I had hoped for was that overlapping ioremap*() calls with
different page attribute types would work even if the 2nd returned
__iomem address was not used; this does not work though, as only using
the right virtual address ensures the right caching attribute is used.
Since this driver was being phased out, we decided to annotate that it
requires PAT to be disabled and warn the user if they booted with PAT
enabled.

* ivtv: the driver does not have the PCI space mapped out separately,
and in fact it does not do the math for the framebuffer itself;
instead it lets the device's own CPU do that and assumes where it's
at, see ivtvfb_get_framebuffer() and CX2341X_OSD_GET_FRAMEBUFFER -- it
has a getter but no setter. It's not clear whether the firmware would
make a split easy. As with ipath, we decided to just require PAT to be
disabled. This was a reasonable compromise as the hardware is really
old and we estimated perhaps only a few lost souls were using this
device driver, and for them it should be reasonable to disable PAT.

Fortunately, in retrospect it was only old hardware devices that used
overlapping ioremap() calls, but we can't be sure things will remain
that way. It used to be that only framebuffer devices used
write-combining; these days we are seeing infiniband drivers (ipath
was one, qib another) use write-combining on PIO TX for some sort of
blast feature, and I would not be surprised if in the future other
device drivers found a quirky use for this. The ivtv case, where the
firmware hides the exact write-combining ranges from us, is the
*worst* example we can expect -- something we should just hope no one
ever does again.

> Hence, alias mapping itself is a supported use-case. However, alias
> mapping with different cache types is not as it causes undefined behavior.

Just as some types of modifiers do at times -- such as was the case with MTRR.

In retrospect, based on a cursory review during the phasing out of
MTRR, we only have ipath left (and that will be removed soon) and we
just don't know what ivtv does. If we wanted a more in-depth analysis
we might be able to use something like Coccinelle to inspect for
overlapping calls, but that may be a bit of a challenge in and of
itself. It may be easier to phase that behavior out by first warning
against it for a few cycles, and then just laying the hammer down and
saying 'thou shalt not do this', only enabling exceptions with a
special API.

> Therefore, PAT module protects from this case by tracking cache types used
> for mapping physical ranges. When a different cache type is requested,
> is_new_memtype_allowed() checks if the request needs to be failed or can be
> changed to the existing type.

Do we have a map of what is allowed for PAT and non-PAT, similar to
the one for MTRR effects above? It would seem we should want something
similar for other architectures? But if there isn't much valid use
for it, why not just rule this sort of stuff out and only make it an
exception, with only a few uses being allowed?

> I agree that the current implementation is fragile, and some interfaces
> skip such check at all, ex. vm_insert_pfn().

Ouch.

>> > The problem is that without this it remains up to the developer of the
>> > driver to avoid overlapping calls, and a user may just get sporadic
>> > errors if this happens. As another problem case, set_memory_*() will
>> > not fail on MMIO even though set_memory_*() is designed only for RAM. If
>> > the above strategy on avoiding overlapping is agreeable, could the next
>> > step, or an orthogonal step be to error out on set_memory_*() on IO
>> > memory?
>>
>> So how do drivers set up WC or WB MMIO areas? Does none of our upstream
>> video drivers do that?
>
> Drivers use ioremap family with a right cache type when mapping MMIO
> ranges, ex. ioremap_wc(). They do not need to change the type to MMIO.
> RAM is different since it's already mapped with WB at boot-time.
> set_memory_*() allows us to change the type from WB, and put it back to
> WB.

Shouldn't set_memory_*() fail on IO memory? It does not today.

Luis
Re: Overlapping ioremap() calls, set_memory_*() semantics
On Fri, 2016-03-04 at 10:51 -0800, Luis R. Rodriguez wrote:
> On Fri, Mar 4, 2016 at 10:18 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Fri, 2016-03-04 at 10:44 +0100, Ingo Molnar wrote:
> > > * Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > >
> > > Could you outline a specific case where it's done intentionally - and
> > > the purpose behind that intention?
> >
> > The term "overlapping" is a bit misleading.
>
> To the driver author it's overlapping: it's overlapping on the physical
> address.

Right, and I just wanted to state that there is no overlapping in the VMAs.

> > This is "alias" mapping -- a physical address range is mapped to
> > multiple virtual address ranges.  There is no overlapping in VMA.
>
> Meanwhile this is what happens behind the scenes, and because of this
> a driver developer may end up with two different offsets for
> overlapping physical addresses ranges, so long as they use the right
> offset they are *supposed* to get then the intended cache type
> effects. The resulting cache type effect will depend on architecture
> and intended cache type on the ioremap call. Furthermore there are
> modifiers possible, such as was MTRR (and fortunately now deprecated
> on kernel driver use), and those effects are now documented in
> Documentation/x86/pat.txt on a table for PAT/not-PAT systems, which
> I'll paste here for convenience:

Thanks for updating the documentation!

> ----------------------------------------------------------------------
> MTRR Non-PAT   PAT    Linux ioremap value        Effective memory type
> ----------------------------------------------------------------------
>                                                   Non-PAT |  PAT
>      PAT
>      |PCD
>      ||PWT
>      |||
> WC   000      WB      _PAGE_CACHE_MODE_WB            WC   |   WC
> WC   001      WC      _PAGE_CACHE_MODE_WC            WC*  |   WC
> WC   010      UC-     _PAGE_CACHE_MODE_UC_MINUS      WC*  |   UC

In the above case, effective PAT is also WC.

> WC   011      UC      _PAGE_CACHE_MODE_UC            UC   |   UC
> ----------------------------------------------------------------------
>
> > Such alias mappings are used by multiple modules.  For instance, a PMEM
> > range is mapped to the kernel and user spaces.  /dev/mem is another
> > example that creates a user space mapping to a physical address where
> > other mappings may already exist.
>
> The above examples are IMHO perhaps valid uses, and for these perhaps
> we should add an API that enables overlapping on physical ranges. I
> believe this given that for regular device drivers, at least in review
> of write-combining, there were only exceptions I had found that were
> using overlapping ranges. The above table was used to devise a
> strategy to help phase MTRR in consideration for the overlapping uses
> on drivers. Below is a list of what was found:
>
>   * atyfb: I ended up splitting the overlapping ranges, it was quite a
> bit of work, but I did this mostly to demo the effort needed on a
> device driver after a device driver already had used an overlapping
> range. To maintain the effective write-combining effects for non-PAT
> systems we added ioremap_uc() which can be used on the MMIO range,
> this would only be used on the MMIO area, meanwhile the framebuffer
> area used ioremap_wc(), the arch_phys_wc_add() call was then used to
> trigger write-combining on non-PAT systems due to the above tables.
>
>  * ipath: while a split on ipath was possible the changes are quite
> significant, way more significant than what was on atyfb. Fortunately
> work on this driver for this triggered a discussion of eventually
> removing this device driver, so in v4.3-rc2 it was moved to staging to
> phase it out. For completeness though I'll explain why addressing the
> overlap was hard: apart from changing the driver to use different
> offset bases in different regions the driver also has a debug
> userspace filemap for the entire register map, the code there would
> need to be modified to use the right virtual address base depending on
> the virtual address accessed. The qib driver already has logic which
> could be mimicked for this fortunately, but still - this is
> considerable work. One side hack I had hoped for was that overlapping
> ioremap*() calls with different page attribute types would work even
> if the 2nd returned __iomem address was not used, fortunately this
> does not work though, only using the right virtual address would
> ensure the right caching technique is used. Since this driver was
> being phased out, we decided to annotate it required PAT to be
> disabled and warn the user if they booted with PAT enabled.
>
>   * ivtv: the driver does not have the PCI space mapped out
> separately, and in fact it actually does not do the math for the
> framebuffer, instead it lets
> the device's own CPU do that and assume where its at, see
> ivtvfb_get_framebuffer() and CX2341X_OSD_GET_FRAMEBUFFER, it has a get
> but not a setter. Its not clear if the firmware would make a split
> easy. As with ipath, we decided to just require PAT disabled. This was
> a reasonable compromise as the hardware was really old and we
> estimated perhaps only a few lost souls were using this device driver,
> and for them it should be reasonable to disable PAT.
>
> Fortunately in retrospect it was only old hardware devices that used
> overlapping ioremap() calls, but we can't be sure things will remain
> that way. It used to be that only framebuffer devices used
> write-combining, these days we are seeing infiniband (ipath was one,
> qib another) use write-combining for PIO TX for some sort of blast
> feature, I would not be surprised if in the future other device
> drivers found a quirky use for this. The ivtv case is the *worst*
> example we can expect where the firmware hides from us the exact
> ranges for write-combining, that we should somehow just hope no one
> will ever do again.
>
> > Hence, alias mapping itself is a supported use-case.  However, alias
> > mapping with different cache types is not as it causes undefined
> > behavior.
>
> Just as-is at times some types of modifiers, such as was with MTRR.
>
> In retrospect, based on a cursory review through the phasing of MTRR,
> we only have ipath left (and that will be removed soon) and we just
> don't know what ivtv does. If we wanted a more in-depth analysis we
> might be able to use something like coccinelle to inspect for
> overlapping calls, but that may be a bit of a challenge in and of
> itself. It may be easier to phase that behavior out by first warning
> against it for a few cycles, and then just laying a hammer down and
> saying 'thou shalt not do this', and only enable exceptions with a
> special API.

This is great work!  I agree that we should phase out those old drivers
written to work with MTRR.

> > Therefore, PAT module protects from this case by tracking cache types
> > used for mapping physical ranges.  When a different cache type is
> > requested, is_new_memtype_allowed() checks if the request needs to be
> > failed or can be changed to the existing type.
>
> Do we have a map of what is allowed for PAT and non-PAT, similar to
> the one for MTRR effects above?

This type check and tracking code are part of the PAT code, so there is no
such check in the non-PAT case.

> It would seem we should want something similar for other architectures ?

As far as I know, other architectures do not support alias maps with
different cache types, either.  So, I agree that this is a common issue.

> But if there isn't much valid use for it, why not just rule this sort of
> stuff out and only make it an exception, by only having a few uses being
> allowed?

I agree that it should fail in general.  One exception is a WB request, which
is allowed to be modified to any other type.  I think this is needed for
/dev/mem, which maps with WB without any knowledge of a target range.

> > I agree that the current implementation is fragile, and some interfaces
> > skip such check at all, ex. vm_insert_pfn().
>
> Ouch.
>
> > > > The problem is that without this it remains up to the developer of
> > > > the driver to avoid overlapping calls, and a user may just get
> > > > sporadic errors if this happens.  As another problem case,
> > > > set_memory_*() will not fail on MMIO even though set_memory_*() is
> > > > designed only for RAM. If the above strategy on avoiding
> > > > overlapping is agreeable, could the next step, or an orthogonal
> > > > step be to error out on set_memory_*() on IO memory?
> > >
> > > So how do drivers set up WC or WB MMIO areas? Does none of our
> > > upstream video drivers do that?
> >
> > Drivers use ioremap family with a right cache type when mapping MMIO
> > ranges, ex. ioremap_wc().  They do not need to change the type to MMIO.
> >  RAM is different since it's already mapped with WB at boot-time.
> >  set_memory_*() allows us to change the type from WB, and put it back
> > to WB.
>
> Shouldn't set_memory_*() fail on IO memory? It does not today.

Agreed.  It should have had such check to begin with.

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics
* Toshi Kani <toshi.kani@hpe.com> wrote:

> > So I'd say that since ioremap() in itself is fragile enough, we should work
> > towards eliminating overlapping ranges.
> >
> > The thing is, the whole vmap_area logic is based around non-overlapping
> > ranges, sorted into the vmap_area_root rbtree.
> >
> > Just check the logic in mm/vmalloc.c::alloc_vmap_area(): it's based on finding
> > holes in the kernel-virtual allocations. 'Overlapping ranges' is very much not
> > part of that logic, at least to my understanding.
> >
> > How are overlapping ioremap()s even possible with that logic? The allocator
> > searches for holes, not allowing for overlaps. What am I missing?
> >
> > Could you outline a specific case where it's done intentionally - and the
> > purpose behind that intention?
>
> The term "overlapping" is a bit misleading. [...]

A bit? It was totally misleading ...

You meant virtual aliases for the same physical address, and those of course are
allowed, as long as the cache attributes are compatible; that is what the whole
memtype infrastructure is about, as you yourself note:

> [...]  This is "alias" mapping -- a physical address range is mapped to multiple
> virtual address ranges.  There is no overlapping in VMA.
>
> Such alias mappings are used by multiple modules.  For instance, a PMEM range is
> mapped to the kernel and user spaces.  /dev/mem is another example that creates
> a user space mapping to a physical address where other mappings may already
> exist.
>
> Hence, alias mapping itself is a supported use-case.  However, alias mapping
> with different cache types is not as it causes undefined behavior.  Therefore,
> PAT module protects from this case by tracking cache types used for mapping
> physical ranges.  When a different cache type is requested,
> is_new_memtype_allowed() checks if the request needs to be failed or can be
> changed to the existing type.

So where is the problem? The memtype implementation and hence most ioremap() users
are supposed to be safe. set_memory_*() APIs are supposed to be safe as well, as
they too go via the memtype API.

> I agree that the current implementation is fragile, and some interfaces skip
> such check at all, ex. vm_insert_pfn().

Most of those are really just low level interfaces for special cases that skip the
memtype infrastructure.

Thanks,

Ingo
Re: Overlapping ioremap() calls, set_memory_*() semantics
* Luis R. Rodriguez <mcgrof@kernel.org> wrote:

> On Fri, Mar 4, 2016 at 10:18 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Fri, 2016-03-04 at 10:44 +0100, Ingo Molnar wrote:
> >> * Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> >>
> >> Could you outline a specific case where it's done intentionally - and the
> >> purpose behind that intention?
> >
> > The term "overlapping" is a bit misleading.
>
> To the driver author it's overlapping: it's overlapping on the physical address.

Guys, please stop this crazy talk - when the same physical address is mapped to
multiple virtual addresses it's called 'aliasing'; the term 'overlapping' is used
when memory ranges overlap ...

Thanks,

Ingo
Re: Overlapping ioremap() calls, set_memory_*() semantics
On Sat, 2016-03-05 at 12:40 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hpe.com> wrote:
>
> > > So I'd say that since ioremap() in itself is fragile enough, we
> > > should work towards eliminating overlapping ranges.
> > >
> > > The thing is, the whole vmap_area logic is based around non-
> > > overlapping ranges, sorted into the vmap_area_root rbtree.
> > >
> > > Just check the logic in mm/vmalloc.c::alloc_vmap_area(): it's based
> > > on finding holes in the kernel-virtual allocations. 'Overlapping
> > > ranges' is very much not part of that logic, at least to my
> > > understanding.
> > >
> > > How are overlapping ioremap()s even possible with that logic? The
> > > allocator searches for holes, not allowing for overlaps. What am I
> > > missing?
> > >
> > > Could you outline a specific case where it's done intentionally - and
> > > the purpose behind that intention?
> >
> > The term "overlapping" is a bit misleading. [...]
>
> A bit? It was totally misleading ...
>
> You meant virtual aliases for the same physical address, and those of
> course are allowed, as long the cache attributes are compatible, that is
> what the whole memtype infrastructure is about, as you yourself note:
>
> > [...]  This is "alias" mapping -- a physical address range is mapped to
> > multiple virtual address ranges.  There is no overlapping in VMA.
> >
> > Such alias mappings are used by multiple modules.  For instance, a PMEM
> > range is mapped to the kernel and user spaces.  /dev/mem is another
> > example that creates a user space mapping to a physical address where
> > other mappings may already exist.
> >
> > Hence, alias mapping itself is a supported use-case.  However, alias
> > mapping with different cache types is not as it causes undefined
> > behavior.  Therefore, PAT module protects from this case by tracking
> > cache types used for mapping physical ranges.  When a different cache
> > type is requested, is_new_memtype_allowed() checks if the request needs
> > to be failed or can be changed to the existing type.
>
> So where is the problem? The memtype implementation and hence most
> ioremap() users are supposed to be safe. set_memory_*() APIs are supposed
> to be safe as well, as they too go via the memtype API.

Let me try to summarize...

The original issue Luis brought up was that drivers written to work with
MTRR may create a single ioremap range covering multiple cache attributes,
since MTRR can overwrite the cache attribute of a certain range.
Converting such drivers to the PAT-based ioremap interfaces, i.e.
ioremap_wc() and ioremap_nocache(), requires a separate ioremap map for
each cache attribute, which can be challenging as it may result in
overlapping ioremap ranges (in his term) with different cache attributes.

So, Luis asked about the 'semantics of overlapping ioremap()' calls.
Hence, I responded that alias mapping itself is supported, but aliasing
with a different cache attribute is not.  We have checks in place to
detect such a condition.  Overlapping ioremap calls with a different
cache attribute either fail or get redirected to the existing cache
attribute on x86.

> > I agree that the current implementation is fragile, and some interfaces
> > skip such check at all, ex. vm_insert_pfn().
>
> Most of those are really just low-level interfaces for cases that skip the memtype infrastructure.

Yes, and I'm just stating the fact that some pfn map use-cases, such as
mmap, are not tracked in memtype.  For example, drm_gem_mmap()
-> drm_gem_mmap_obj() sets its VMA as WC attribute.  i915_gem_fault() then
creates a WC map with vm_insert_pfn() at fault.

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
* Toshi Kani <toshi.kani@hpe.com> wrote:

> > So where is the problem? The memtype implementation and hence most
> > ioremap() users are supposed to be safe. set_memory_*() APIs are supposed
> > to be safe as well, as they too go via the memtype API.
>
> Let me try to summarize...
>
> The original issue Luis brought up was that drivers written to work with
> MTRR may create a single ioremap range covering multiple cache attributes
> since MTRR can overwrite cache attribute of a certain range.  Converting
> such drivers with PAT-based ioremap interfaces, i.e. ioremap_wc() and
> ioremap_nocache(), requires a separate ioremap map for each cache
> attribute, which can be challenging as it may result in overlapping ioremap
> ranges (in his term) with different cache attributes.
>
> So, Luis asked about 'semantics of overlapping ioremap()' calls.  Hence, I
> responded that aliasing mapping itself is supported, but alias with
> different cache attribute is not.  We have checks in place to detect such
> condition.  Overlapping ioremap calls with a different cache attribute
> either fails or gets redirected to the existing cache attribute on x86.

Ok, fair enough!

So to go back to the original suggestion from Luis, I've quoted it, but with a
s/overlapping/aliased substitution:

> I had suggested long ago then that one possible resolution was for us to add an
> API that *enables* aliased ioremap() calls, and only use it on select locations
> in the kernel. This means we only have to convert a few users to that call to
> white list such semantics, and by default we'd disable aliased calls. To kick
> things off -- is this strategy agreeable for all other architectures?

I'd say that since the overwhelming majority of ioremap() calls are not aliased,
ever, making it 'harder' to accidentally alias is probably a good idea.

The memtype infrastructure of physical memory ranges in that case acts as a
security measure, to avoid unintended (not just physically incompatible) aliasing.
I suspect it would be helpful during driver development as well.

What extra API are you thinking about to enable aliasing in the few cases where
it's justified?

The other problem listed:

> As another problem case, set_memory_*() will not fail on MMIO even though
> set_memory_*() is designed only for RAM.

So what does this mean exactly? Having WB caching on an MMIO area is not good,
but UC, WC and WT surely are still sensible in some cases, right?

> [...] If the above strategy on avoiding aliasing is agreeable, could the next
> step, or an orthogonal step be to error out on set_memory_*() on IO memory?

Well, do we have drivers that nevertheless change caching attributes on MMIO
areas?

Basically if ioremap_uc() and ioremap_wc() are allowed on MMIO areas, then I see no
reason in principle why it should be invalid to change the area from UC to WC
after it has been ioremap()ed.

Thanks,

Ingo
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hpe.com> wrote:
>
> > > So where is the problem? The memtype implementation and hence most
> > > ioremap() users are supposed to be safe. set_memory_*() APIs are
> > > supposed
> > > to be safe as well, as they too go via the memtype API.
> >
> > Let me try to summarize...
> >
> > The original issue Luis brought up was that drivers written to work
> > with MTRR may create a single ioremap range covering multiple cache
> > attributes since MTRR can overwrite cache attribute of a certain range.
> >  Converting such drivers with PAT-based ioremap interfaces, i.e.
> > ioremap_wc() and ioremap_nocache(), requires a separate ioremap map for
> > each cache attribute, which can be challenging as it may result in
> > overlapping ioremap ranges (in his term) with different cache
> > attributes.
> >
> > So, Luis asked about 'semantics of overlapping ioremap()' calls.  Hence,
> > I responded that aliasing mapping itself is supported, but alias with
> > different cache attribute is not.  We have checks in place to detect
> > such condition.  Overlapping ioremap calls with a different cache
> > attribute either fails or gets redirected to the existing cache
> > attribute on x86.
>
> Ok, fair enough!
>
> So to go back to the original suggestion from Luis, I've quoted it, but
> with a s/overlapping/aliased substitution:
>
> > I had suggested long ago then that one possible resolution was for us
> > to add an API that *enables* aliased ioremap() calls, and only use it
> > on select locations in the kernel. This means we only have to convert a
> > few users to that call to white list such semantics, and by default
> > we'd disable aliased calls. To kick things off -- is this strategy
> > agreeable for all other architectures?
>
> I'd say that since the overwhelming majority of ioremap() calls are not
> aliased, ever, thus making it 'harder' to accidentally alias is probably
> a good idea.

Did you mean 'aliased' or 'aliased with different cache attribute'?  The
former check might be too strict.

> The memtype infrastructure of physical memory ranges in that case acts
> as a security measure, to avoid unintended (not just physically
> incompatible) aliasing. I suspect it would be helpful during driver
> development as well.

The memtype infrastructure does not track caller interfaces.  So, it will
check against any map, i.e. kernel & user maps.  I assume a kernel map is
created before a user map, though.

> What extra API are you thinking about to enable aliasing in the few cases
> where it's justified?

I'll defer this for Luis...

> the other problem listed:
>
> > > As another problem case, set_memory_*() will not fail on MMIO even
> > > though set_memory_*() is designed only for RAM.
>
> So what does this mean exactly? Having WB caching on MMIO area is not
> good, but UC, WC and WB sure is still sensible in some cases, right?

I responded to Luis in other email that:
| Drivers use ioremap family with a right cache type when mapping MMIO
| ranges, ex. ioremap_wc().  They do not need to change the type to MMIO.
| RAM is different since it's already mapped with WB at boot-time.
| set_memory_*() allows us to change the type from WB, and put it back to
| WB.

> > [...] If the above strategy on avoiding aliasing is agreeable, could
> > the next step, or an orthogonal step be to error out on set_memory_*()
> > on IO memory?
>
> Well, do we have drivers that nevertheless change caching attributes on
> MMIO areas?

Not sure.  We will need to check all callers of set_memory_xx() if we
change it to fail on MMIO.

> Basically if ioremap_uc() and ioremap_wc() is allowed on MMIO areas, then
> I see no reason in principle why it should be invalid to change the area
> from UC to WC after it has been ioremap()ed.

The current implementation does not support MMIO.
 - It does not track the cache attribute correctly for MMIO since it uses
__pa().
 - It only supports the attribute transition {WB -> NewType -> WB} for RAM.
 RAM is tracked differently in that WB is treated as "no map".  So, this
transition does not cause a conflict on RAM.  This will cause a conflict
on MMIO when it is tracked correctly.

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
* Toshi Kani <toshi.kani@hpe.com> wrote:

> On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> > * Toshi Kani <toshi.kani@hpe.com> wrote:
> >
> > > > So where is the problem? The memtype implementation and hence most
> > > > ioremap() users are supposed to be safe. set_memory_*() APIs are
> > > > supposed
> > > > to be safe as well, as they too go via the memtype API.
> > >
> > > Let me try to summarize...
> > >
> > > The original issue Luis brought up was that drivers written to work
> > > with MTRR may create a single ioremap range covering multiple cache
> > > attributes since MTRR can overwrite cache attribute of a certain range.
> > >  Converting such drivers with PAT-based ioremap interfaces, i.e.
> > > ioremap_wc() and ioremap_nocache(), requires a separate ioremap map for
> > > each cache attribute, which can be challenging as it may result in
> > > overlapping ioremap ranges (in his term) with different cache
> > > attributes.
> > >
> > > So, Luis asked about 'semantics of overlapping ioremap()' calls.  Hence,
> > > I responded that aliasing mapping itself is supported, but alias with
> > > different cache attribute is not.  We have checks in place to detect
> > > such condition.  Overlapping ioremap calls with a different cache
> > > attribute either fails or gets redirected to the existing cache
> > > attribute on x86.
> >
> > Ok, fair enough!
> >
> > So to go back to the original suggestion from Luis, I've quoted it, but
> > with a s/overlapping/aliased substitution:
> >
> > > I had suggested long ago then that one possible resolution was for us
> > > to add an API that *enables* aliased ioremap() calls, and only use it
> > > on select locations in the kernel. This means we only have to convert a
> > > few users to that call to white list such semantics, and by default
> > > we'd disable aliased calls. To kick things off -- is this strategy
> > > agreeable for all other architectures?
> >
> > I'd say that since the overwhelming majority of ioremap() calls are not
> > aliased, ever, thus making it 'harder' to accidentally alias is probably
> > a good idea.
>
> Did you mean 'aliased' or 'aliased with different cache attribute'?  The former
> check might be too strict.

I'd say even 'same attribute' aliasing is probably relatively rare.

And 'different but compatible cache attribute' is in fact more of a sign that the
driver author does the aliasing for a valid _reason_: to have two different types
of access methods to the same piece of physical address space...

> > The memtype infrastructure of physical memory ranges in that case acts as a
> > security measure, to avoid unintended (not just physically incompatible)
> > aliasing. I suspect it would be helpful during driver development as well.
>
> The memtype infrastructure does not track caller interfaces.  So, it will check
> against any map, i.e. kernel & user map.  I assume a kernel map is created
> before user map, though.

I don't understand this, could you give me a specific example?

> > the other problem listed:
> >
> > > As another problem case, set_memory_*() will not fail on MMIO even
> > > though set_memory_*() is designed only for RAM.
> >
> > So what does this mean exactly? Having WB caching on MMIO area is not
> > good, but UC, WC and WB sure is still sensible in some cases, right?
>
> I responded to Luis in other email that:
> | Drivers use ioremap family with a right cache type when mapping MMIO
> | ranges, ex. ioremap_wc().  They do not need to change the type to MMIO.
> | RAM is different since it's already mapped with WB at boot-time.
> | set_memory_*() allows us to change the type from WB, and put it back to
> | WB.

So there's two different uses here:

- set_memory_x() to set the caching attribute
- set_memory_x() to set any of the RWX access attributes

I'd in fact suggest we split these two uses to avoid confusion: keep
set_memory_x() APIs for the RWX access-attribute uses, and introduce
a new API that sets caching attributes:

set_cache_attr_wc()
set_cache_attr_uc()
set_cache_attr_wb()
set_cache_attr_wt()

it has the same arguments, so it's basically just a rename initially.

And at that point we could definitely argue that set_cache_attr_*() APIs should
probably generate a warning for _RAM_, because they mostly make sense for MMIO
type of physical addresses, right? Regular RAM should always be WB.

Are there cases where we change the caching attribute of RAM for valid reasons,
outside of legacy quirks?

> > Basically if ioremap_uc() and ioremap_wc() is allowed on MMIO areas, then
> > I see no reason in principle why it should be invalid to change the area
> > from UC to WC after it has been ioremap()ed.
>
> The current implementation does not support MMIO.
>  - It does not track cache attribute correctly for MMIO since it uses
> __pa().

Hm, so __pa() works on MMIO addresses as well - at least on x86. The whole memtype
tree is physical address based - which too is MMIO compatible in principle.

The only problem would be if it were fundamentally struct page * based - but it's
not AFAICS. We have a few APIs that work on struct page * arrays, but those are
just vectored helper APIs AFAICS.

What am I missing?

>  - It only supports attribute transition of {WB -> NewType -> WB} for RAM.
>  RAM is tracked differently in that WB is treated as "no map".  So, this
> transition does not cause a conflict on RAM.  This will cause a conflict
> on MMIO when it is tracked correctly.   

That looks like a bug?

Thanks,

Ingo
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Mon, Mar 7, 2016 at 9:03 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> Let me try to summarize...
>
> The original issue Luis brought up was that drivers written to work with
> MTRR may create a single ioremap range covering multiple cache attributes
> since MTRR can overwrite cache attribute of a certain range. Converting
> such drivers with PAT-based ioremap interfaces, i.e. ioremap_wc() and
> ioremap_nocache(), requires a separate ioremap map for each cache
> attribute, which can be challenging as it may result in overlapping ioremap
> ranges (in his term) with different cache attributes.
>
> So, Luis asked about 'semantics of overlapping ioremap()' calls. Hence, I
> responded that aliasing mapping itself is supported, but alias with
> different cache attribute is not. We have checks in place to detect such
> condition. Overlapping ioremap calls with a different cache attribute
> either fails or gets redirected to the existing cache attribute on x86.

A little off-topic, but someone reminded me recently: most recent CPUs
have self-snoop. It's poorly documented, but on self-snooping CPUs, I
think that a lot of the aliasing issues go away. We may be able to
optimize the code quite a bit on these CPUs.

I also wonder whether we can drop a bunch of the memtype tracking.
After all, if we have aliases of different types on a self-snooping
CPU and /dev/mem is locked down hard enough, we could maybe get away
with letting self-snoop handle all the conflicts.

(We could also make /dev/mem always do UC if it would help.)

--Andy
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
Hi Ingo,

My apologies for the delay...

On Wed, 2016-03-09 at 10:15 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hpe.com> wrote:
>
> > On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > >
> > > > > So where is the problem? The memtype implementation and hence
> > > > > most ioremap() users are supposed to be safe. set_memory_*() APIs
> > > > > are supposed to be safe as well, as they too go via the memtype
> > > > > API.
> > > >
> > > > Let me try to summarize...
> > > >
> > > > The original issue Luis brought up was that drivers written to work
> > > > with MTRR may create a single ioremap range covering multiple cache
> > > > attributes since MTRR can overwrite cache attribute of a certain
> > > > range.  Converting such drivers with PAT-based ioremap interfaces,
> > > > i.e. ioremap_wc() and ioremap_nocache(), requires a separate
> > > > ioremap map for each cache attribute, which can be challenging as
> > > > it may result in overlapping ioremap ranges (in his term) with
> > > > different cache attributes.
> > > >
> > > > So, Luis asked about 'semantics of overlapping ioremap()' calls.
> > > >  Hence, I responded that aliasing mapping itself is supported, but
> > > > alias with different cache attribute is not.  We have checks in
> > > > place to detect such condition.  Overlapping ioremap calls with a
> > > > different cache attribute either fails or gets redirected to the
> > > > existing cache attribute on x86.
> > >
> > > Ok, fair enough!
> > >
> > > So to go back to the original suggestion from Luis, I've quoted it,
> > > but with a s/overlapping/aliased substitution:
> > >
> > > > I had suggested long ago then that one possible resolution was for
> > > > us to add an API that *enables* aliased ioremap() calls, and only
> > > > use it on select locations in the kernel. This means we only have
> > > > to convert a few users to that call to white list such semantics,
> > > > and by default we'd disable aliased calls. To kick things off -- is
> > > > this strategy agreeable for all other architectures?
> > >
> > > I'd say that since the overwhelming majority of ioremap() calls are
> > > not aliased, ever, thus making it 'harder' to accidentally alias is
> > > probably a good idea.
> >
> > Did you mean 'aliased' or 'aliased with different cache attribute'?
> >  The former check might be too strict.
>
> I'd say even 'same attribute' aliasing is probably relatively rare.
>
> And 'different but compatible cache attribute' is in fact more of a sign
> that the driver author does the aliasing for a valid _reason_: to have
> two different types of access methods to the same piece of physical
> address space...

Right.  So, if we change ioremap() to fail on aliased cases, it'd be easier
to start with the different-cache-attribute case first.  That case should be
rare enough that we can manage to identify such callers and make them use a
new API as necessary.  If we went ahead and failed all aliased cases, it'd
be challenging to manage without a regression or two.

> > > The memtype infrastructure of physical memory ranges in that case
> > > acts as a security measure, to avoid unintended (not just physically
> > > incompatible) aliasing. I suspect it would be helpful during driver
> > > development as well.
> >
> > The memtype infrastructure does not track caller interfaces.  So, it
> > will check against any map, i.e. kernel & user map.  I assume a
> > kernel map is created before user map, though.
>
> I don't understand this, could you give me a specific example?

Multiple pfn map interfaces use the memtype tracking.  So, when ioremap()
detects an aliased map in memtype, that map may have been created by
ioremap(), remap_pfn_range(), or any pfn map that is tracked.  So, strictly
speaking, we cannot check against other ioremap() calls alone.

> > > the other problem listed:
> > >
> > > > As another problem case, set_memory_*() will not fail on MMIO even
> > > > though set_memory_*() is designed only for RAM.
> > >
> > > So what does this mean exactly? Having WB caching on MMIO area is not
> > > good, but UC, WC and WB sure is still sensible in some cases, right?
> >
> > I responded to Luis in other email that:
> > > Drivers use ioremap family with a right cache type when mapping MMIO
> > > ranges, ex. ioremap_wc().  They do not need to change the type to
> > > MMIO. RAM is different since it's already mapped with WB at boot-
> > > time. set_memory_*() allows us to change the type from WB, and put it
> > > back to WB.
>
> So there's two different uses here:
>
>  - set_memory_x() to set the caching attribute
>  - set_memory_x() to set any of the RWX access attributes
>
> I'd in fact suggest we split these two uses to avoid confusion: keep 
> set_memory_x() APIs for the RWX access attributes uses, and introduce
> a new API that sets caching attributes:
>
> set_cache_attr_wc()
> set_cache_attr_uc()
> set_cache_attr_wb()
> set_cache_attr_wt()
>
> it has the same arguments, so it's basically just a rename initially.

I think the "set_memory_" prefix implies that their target is regular
memory only.

> And at that point we could definitely argue that set_cache_attr_*() APIs
> should probably generate a warning for _RAM_, because they mostly make
> sense for MMIO type of physical addresses, right? Regular RAM should
> always be WB.
>
> Are there cases where we change the caching attribute of RAM for valid
> reasons, outside of legacy quirks?

ati_create_page_map() is one example: it gets a RAM page
with __get_free_page() and changes it to UC by calling set_memory_uc().
Since RAM is already mapped with WB, the driver has to change its attribute
when it needs a different cache type.  For MMIO, drivers should be able to
create a map with the right attribute since they own the ranges.

> > > Basically if ioremap_uc() and ioremap_wc() is allowed on MMIO areas,
> > > then I see no reason in principle why it should be invalid to change
> > > the area from UC to WC after it has been ioremap()ed.
> >
> > The current implementation does not support MMIO.
> >  - It does not track cache attribute correctly for MMIO since it uses
> > __pa().
>
> Hm, so __pa() works on mmio addresses as well - at least on x86. The
> whole mmtype tree is physical address based - which too is MMIO
> compatible in principle.
>
> The only problem would be if it was fundamentally struct page * based -
> but it's not AFAICS. We have a few APIs that work on struct page *
> arrays, but those are just vectored helper APIs AFAICS.
>
> What am I missing?

I do not think __pa() returns the right physical address for an ioremap'd
virtual address, since that virtual address is allocated from the vmalloc
range.  But we can simply use slow_virt_to_phys() instead.

> >  - It only supports attribute transition of {WB -> NewType -> WB} for
> > RAM.  RAM is tracked differently in that WB is treated as "no map".  So,
> > this transition does not cause a conflict on RAM.  This will cause a
> > conflict on MMIO when it is tracked correctly.   
>
> That looks like a bug?

This is by design since set_memory_xx was introduced for RAM only.  If we
extend it to MMIO, then we need to change how memtype manages MMIO.

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Thu, 2016-03-10 at 22:47 -0800, Andy Lutomirski wrote:
> On Mon, Mar 7, 2016 at 9:03 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > Let me try to summarize...
> >
> > The original issue Luis brought up was that drivers written to work
> > with MTRR may create a single ioremap range covering multiple cache
> > attributes since MTRR can overwrite cache attribute of a certain
> > range.  Converting such drivers with PAT-based ioremap interfaces, i.e.
> > ioremap_wc() and ioremap_nocache(), requires a separate ioremap map for
> > each cache attribute, which can be challenging as it may result in
> > overlapping ioremap ranges (in his term) with different cache
> > attributes.
> >
> > So, Luis asked about 'semantics of overlapping ioremap()' calls.  Hence,
> > I responded that aliasing mapping itself is supported, but alias with
> > different cache attribute is not.  We have checks in place to detect
> > such condition.  Overlapping ioremap calls with a different cache
> > attribute either fails or gets redirected to the existing cache
> > attribute on x86.
>
> A little off-topic, but someone reminded me recently: most recent CPUs
> have self-snoop.  It's poorly documented, but on self-snooping CPUs, I
> think that a lot of the aliasing issues go away.  We may be able to
> optimize the code quite a bit on these CPUs.

Interesting.  I wonder how much we can rely on this feature.  Yes, looking at
the Intel SDM, it is indeed poorly documented. :-(

> I also wonder whether we can drop a bunch of the memtype tracking.
> After all, if we have aliases of different types on a self-snooping
> CPU and /dev/mem is locked down hard enough, we could maybe get away
> with letting self-snoop handle all the conflicts.
>
> (We could also make /dev/mem always do UC if it would help.)

It'd be interesting to know how it performs on an aliased map when it works
correctly... 

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Fri, Mar 11, 2016 at 2:36 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Thu, 2016-03-10 at 22:47 -0800, Andy Lutomirski wrote:
>> On Mon, Mar 7, 2016 at 9:03 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > Let me try to summarize...
>> >
>> > The original issue Luis brought up was that drivers written to work
>> > with MTRR may create a single ioremap range covering multiple cache
>> > attributes since MTRR can overwrite cache attribute of a certain
>> > range. Converting such drivers with PAT-based ioremap interfaces, i.e.
>> > ioremap_wc() and ioremap_nocache(), requires a separate ioremap map for
>> > each cache attribute, which can be challenging as it may result in
>> > overlapping ioremap ranges (in his term) with different cache
>> > attributes.
>> >
>> > So, Luis asked about 'semantics of overlapping ioremap()' calls. Hence,
>> > I responded that aliasing mapping itself is supported, but alias with
>> > different cache attribute is not. We have checks in place to detect
>> > such condition. Overlapping ioremap calls with a different cache
>> > attribute either fails or gets redirected to the existing cache
>> > attribute on x86.
>>
>> A little off-topic, but someone reminded me recently: most recent CPUs
>> have self-snoop. It's poorly documented, but on self-snooping CPUs, I
>> think that a lot of the aliasing issues go away. We may be able to
>> optimize the code quite a bit on these CPUs.
>
> Interesting. I wonder how much we can rely on this feature. Yes, by
> looking at Intel SDM, it is indeed poorly documented. :-(

Any Intel people want to give us a hint?

--Andy
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Fri, Mar 11, 2016 at 03:13:52PM -0700, Toshi Kani wrote:
> Hi Ingo,
>
> My apology for the delay...
>
> On Wed, 2016-03-09 at 10:15 +0100, Ingo Molnar wrote:
> > * Toshi Kani <toshi.kani@hpe.com> wrote:
> >
> > > On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > >
> > > > > > So where is the problem? The memtype implementation and hence
> > > > > > most ioremap() users are supposed to be safe. set_memory_*() APIs
> > > > > > are supposed to be safe as well, as they too go via the memtype
> > > > > > API.
> > > > >
> > > > > Let me try to summarize...
> > > > >
> > > > > The original issue Luis brought up was that drivers written to work
> > > > > with MTRR may create a single ioremap range covering multiple cache
> > > > > attributes since MTRR can overwrite cache attribute of a certain
> > > > > range.  Converting such drivers with PAT-based ioremap interfaces,
> > > > > i.e. ioremap_wc() and ioremap_nocache(), requires a separate
> > > > > ioremap map for each cache attribute, which can be challenging as
> > > > > it may result in overlapping ioremap ranges (in his term) with
> > > > > different cache attributes.
> > > > >
> > > > > So, Luis asked about 'semantics of overlapping ioremap()' calls.
> > > > >  Hence, I responded that aliasing mapping itself is supported, but
> > > > > alias with different cache attribute is not.  We have checks in
> > > > > place to detect such condition.  Overlapping ioremap calls with a
> > > > > different cache attribute either fails or gets redirected to the
> > > > > existing cache attribute on x86.
> > > >
> > > > Ok, fair enough!
> > > >
> > > > So to go back to the original suggestion from Luis, I've quoted it,
> > > > but with a s/overlapping/aliased substitution:
> > > >
> > > > > I had suggested long ago then that one possible resolution was for
> > > > > us to add an API that *enables* aliased ioremap() calls, and only
> > > > > use it on select locations in the kernel. This means we only have
> > > > > to convert a few users to that call to white list such semantics,
> > > > > and by default we'd disable aliased calls. To kick things off -- is
> > > > > this strategy agreeable for all other architectures?
> > > >
> > > > I'd say that since the overwhelming majority of ioremap() calls are
> > > > not aliased, ever, thus making it 'harder' to accidentally alias is
> > > > probably a good idea.
> > >
> > > Did you mean 'aliased' or 'aliased with different cache attribute'?
> > >  The former check might be too strict.
> >
> > I'd say even 'same attribute' aliasing is probably relatively rare.
> >
> > And 'different but compatible cache attribute' is in fact more of a sign
> > that the driver author does the aliasing for a valid _reason_: to have
> > two different types of access methods to the same piece of physical
> > address space...
>
> Right.  So, if we change to fail ioremap() on aliased cases, it'd be easier
> to start with the different attribute case first.  This case should be rare
> enough that we can manage to identify such callers and make them use a new
> API as necessary.  If we go ahead to fail any aliased cases, it'd be
> challenging to manage without a regression or two.

From my experience on the ioremap_wc() crusade, I found that aliasing with
different cache types would have been needed in only 3 drivers.
Of these 3, for the atyfb driver I did the proper split into MMIO and
framebuffer, but that was significant work. I did this work to demo and
document such a conversion. It wasn't easy. The other two, ivtv and ipath,
we left as requiring "nopat" to be used. The ipath driver is on its way out
of the kernel now through staging, and ivtv, well, I am not aware of a
single human being claiming to use it. The architecture of ivtv actually
prohibits us from ever using PAT for write-combining on the framebuffer, as
the firmware is the only one who knows the write-combining area and hides
it from us.

We might be able to use tools like Coccinelle to hunt for the use of
aliasing with different cache attribute types in drivers and do a full
assessment, but I really think that will be hard to accomplish.

If we can learn anything from the ioremap_wc() crusade, I'd say it's that
the need for aliasing with different cache types obviously implies we
should disable such drivers with PAT, as what we'd really need is a proper
split in maps, and history shows the split can be really hard. It sounded
like you guys were confirming that we currently do not allow aliasing with
different attributes on x86 -- is that the case for all architectures?

If aliasing with different cache attributes is not allowed on x86, and if
it's also rare for other architectures, that just leaves the hunt for
valid aliasing uses. That may still be hard to hunt for, but I also
suspect it is rare.

> > > > The memtype infrastructure of physical memory ranges in that case
> > > > acts as a security measure, to avoid unintended (not just physically
> > > > incompatible) aliasing. I suspect it would be helpful during driver
> > > > development as well.
> > >
> > > The memtype infrastructure does not track caller interfaces.  So, it
> > > will check against any map, i.e. kernel & user map.  I assume a
> > > kernel map is created before user map, though.
> >
> > I don't understand this, could you give me a specific example?
>
> Multiple pfn map interfaces use the memtype.  So, when ioremap() detects an
> aliased map in the memtype, it may be created by ioremap(),
> remap_pfn_range(), or any pfn map that is tracked.  So, strictly speaking,
> we cannot check against other ioremap().
>
> > > > the other problem listed:
> > > >
> > > > > As another problem case, set_memory_*() will not fail on MMIO even
> > > > > though set_memory_*() is designed only for RAM.
> > > >
> > > > So what does this mean exactly? Having WB caching on MMIO area is not
> > > > good, but UC, WC and WB sure is still sensible in some cases, right?
> > >
> > > I responded to Luis in other email that:
> > > > Drivers use ioremap family with a right cache type when mapping MMIO
> > > > ranges, ex. ioremap_wc().  They do not need to change the type to
> > > > MMIO. RAM is different since it's already mapped with WB at boot-
> > > > time. set_memory_*() allows us to change the type from WB, and put it
> > > > back to WB.
> >
> > So there's two different uses here:
> >
> >  - set_memory_x() to set the caching attribute
> >  - set_memory_x() to set any of the RWX access attributes
> >
> > I'd in fact suggest we split these two uses to avoid confusion: keep 
> > set_memory_x() APIs for the RWX access attributes uses, and introduce
> > a new API that sets caching attributes:
> >
> > set_cache_attr_wc()
> > set_cache_attr_uc()
> > set_cache_attr_wb()
> > set_cache_attr_wt()
> >
> > it has the same arguments, so it's basically just a rename initially.
>
> I think the "set_memory_" prefix implies that their target is regular
> memory only.

I did not find any driver using set_memory_wc() on MMIO, which is a good
thing, as it seems that does not work even though it returns no error. I'm
not sure whether the other set_memory_*() calls are used on MMIO, but I
would suspect they are not. A manual hunt may suffice to rule these out.

I guess what I'm trying to say is I am not sure we have a need for
set_cache_attr_*() APIs, unless of course we find such valid use.

> > And at that point we could definitely argue that set_cache_attr_*() APIs
> > should probably generate a warning for _RAM_, because they mostly make
> > sense for MMIO type of physical addresses, right? Regular RAM should
> > always be WB.
> >
> > Are there cases where we change the caching attribute of RAM for valid
> > reasons, outside of legacy quirks?
>
> ati_create_page_map() is one example that it gets a RAM page
> by __get_free_page(), and changes it to UC by calling set_memory_uc().

Should we instead have an API that lets it ask for RAM of the UC type?
That would seem a bit cleaner. BTW, do you happen to know *why* it needs
UC RAM?

> Since RAM is already mapped with WB, the driver has to change its attribute
> when it needs a different cache type.  For MMIO, drivers should be able to
> create a map with the right attribute since they own the ranges.

From my own work so far I tend to agree here, but that's just because
I've seen how badly drivers can behave; they can quickly convince
themselves that hacks will do for things which actually need a proper
architectural solution in the driver or firmware. In the worst case we
have insane firmware, but I think that's probably really uncommon, and
likely its design was flawed because PAT was probably rare at that time.

> > > > Basically if ioremap_uc() and ioremap_wc() is allowed on MMIO areas,
> > > > then I see no reason in principle why it should be invalid to change
> > > > the area from UC to WC after it has been ioremap()ed.
> > >
> > > The current implementation does not support MMIO.
> > >  - It does not track cache attribute correctly for MMIO since it uses
> > > __pa().
> >
> > Hm, so __pa() works on MMIO addresses as well - at least on x86. The
> > whole memtype tree is physical address based - which too is MMIO
> > compatible in principle.
> >
> > The only problem would be if it was fundamentally struct page * based -
> > but it's not AFAICS. We have a few APIs that work on struct page *
> > arrays, but those are just vectored helper APIs AFAICS.
> >
> > What am I missing?
>
> I do not think __pa() returns the right physical address for an ioremap'd
> virtual address since its virtual address is allocated from the vmalloc
> range.  But we can simply use slow_virt_to_phys(), instead.
>
> > >  - It only supports attribute transition of {WB -> NewType -> WB} for
> > > RAM.  RAM is tracked differently in that WB is treated as "no map".  So,
> > > this transition does not cause a conflict on RAM.  This will cause a
> > > conflict on MMIO when it is tracked correctly.
> >
> > That looks like a bug?
>
> This is by design since set_memory_xx was introduced for RAM only.  If we
> extend it to MMIO, then we need to change how memtype manages MMIO.

I'd be afraid to *want* to support this on MMIO as I would only expect
hacks from drivers.

Luis
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Wed, 2016-03-16 at 02:45 +0100, Luis R. Rodriguez wrote:
> On Fri, Mar 11, 2016 at 03:13:52PM -0700, Toshi Kani wrote:
> > On Wed, 2016-03-09 at 10:15 +0100, Ingo Molnar wrote:
> > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > >
> > > > On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> > > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > >
 :
> > > > Did you mean 'aliased' or 'aliased with different cache attribute'?
> > > >  The former check might be too strict.
> > >
> > > I'd say even 'same attribute' aliasing is probably relatively rare.
> > >
> > > And 'different but compatible cache attribute' is in fact more of a
> > > sign that the driver author does the aliasing for a valid _reason_:
> > > to have two different types of access methods to the same piece of
> > > physical address space...
> >
> > Right.  So, if we change to fail ioremap() on aliased cases, it'd be
> > easier to start with the different attribute case first.  This case
> > should be rare enough that we can manage to identify such callers and
> > make them use a new API as necessary.  If we go ahead to fail any
> > aliased cases, it'd be challenging to manage without a regression or
> > two.
>
> From my experience on the ioremap_wc() crusade, I found that aliasing
> with different cache types was needed in only 3 drivers. For one of
> these, atyfb, I did the proper split into MMIO and framebuffer maps, but
> that was significant work.  I did this work to demo and document such a
> conversion. It wasn't easy. The other two, ivtv and ipath, we left as
> requiring "nopat" to be used. The ipath driver is on its way out of the
> kernel now through staging, and ivtv, well, I am not aware of a single
> human being claiming to use it. The architecture of ivtv actually
> prohibits us from ever using PAT for write-combining on the framebuffer,
> as the firmware is the only one who knows the write-combining area and
> hides it from us.

At a glance, there are 863 references to ioremap(), 329 references to
ioremap_nocache(), and only 68 references to ioremap_wc() on x86.  There
are many more ioremap callers with UC mappings than WC mappings, and it is
hard to say that they never get aliased.

> We might be able to use tools like Coccinelle to perhaps hunt for
> the use of aliasing on drivers with different cache attribute types
> to do a full assessment but I really think that will be really hard
> to accomplish.
>
> If we can learn anything from the ioremap_wc() crusade I'd say it's that
> the need for aliasing with different cache types obviously implies we
> should disable such drivers with PAT, as what we'd really need is a proper
> split in maps, but history shows the split can be really hard. It sounded
> like you guys were confirming we currently do not allow aliasing with
> different attributes on x86; is that the case for all architectures?
>
> If aliasing with different cache attributes is not allowed for x86, and
> if it's also rare for other architectures, that just leaves the hunt for
> valid aliasing uses. That may still be hard to hunt for, but I also
> suspect it may be rare.

Yes, I'd fail the different cache attribute case if we are to place a
stricter check.

 :
> >
> > I think the "set_memory_" prefix implies that their target is regular
> > memory only.
>
> I did not find any driver using set_memory_wc() on MMIO, which is a good
> thing, as it seems that does not work even though it returns no error.
> I'm not sure whether the other set_memory_*() calls are used on MMIO, but
> I would suspect they are not. A manual hunt may suffice to rule these out.

It's good to know that you did not find any case on MMIO.  The thing is,
set_memory_wc() actually works on MMIO today... This is because __pa()
returns a bogus address, which skips the alias check in the memtype.

> I guess what I'm trying to say is I am not sure we have a need for
> set_cache_attr_*() APIs, unless of course we find such valid use.
>
> > > And at that point we could definitely argue that set_cache_attr_*()
> > > APIs should probably generate a warning for _RAM_, because they
> > > mostly make sense for MMIO type of physical addresses, right? Regular
> > > RAM should always be WB.
> > >
> > > Are there cases where we change the caching attribute of RAM for
> > > valid reasons, outside of legacy quirks?
> >
> > ati_create_page_map() is one example that it gets a RAM page
> > by __get_free_page(), and changes it to UC by calling set_memory_uc().
>
> Should we instead have an API that lets it ask for RAM and of UC type?
> That would seem a bit cleaner. BTW do you happen to know *why* it needs
> UC RAM types?

This RAM page is then shared between the graphics card and the CPU.  I
think this is because the graphics card cannot snoop the cache.

> > 
 :
> > > >  - It only supports attribute transition of {WB -> NewType -> WB}
> > > > for RAM.  RAM is tracked differently in that WB is treated as "no
> > > > map".  So, this transition does not cause a conflict on RAM.  This
> > > > will cause a conflict on MMIO when it is tracked correctly.
> > >
> > > That looks like a bug?
> >
> > This is by design since set_memory_xx was introduced for RAM only.  If
> > we extend it to MMIO, then we need to change how memtype manages MMIO.
>
> I'd be afraid to *want* to support this on MMIO as I would only expect
> hacks from drivers.

Agreed, with the hope that they are not used on MMIO already...

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Wed, 9 Mar 2016, Ingo Molnar wrote:

> > > So to go back to the original suggestion from Luis, I've quoted it, but
> > > with a s/overlapping/aliased substitution:
> > >
> > > > I had suggested long ago then that one possible resolution was for us
> > > > to add an API that *enables* aliased ioremap() calls, and only use it
> > > > on select locations in the kernel. This means we only have to convert a
> > > > few users to that call to white list such semantics, and by default
> > > > we'd disable aliased calls. To kick things off -- is this strategy
> > > > agreeable for all other architectures?
> > >
> > > I'd say that since the overwhelming majority of ioremap() calls are not
> > > aliased, ever, thus making it 'harder' to accidentally alias is probably
> > > a good idea.
> >
> > Did you mean 'aliased' or 'aliased with different cache attribute'?  The former
> > check might be too strict.
>
> I'd say even 'same attribute' aliasing is probably relatively rare.

Please note that aliased cached mappings (of any kind, not necessarily
from `ioremap') cause a lot of headache (read: handling trouble) on
architectures such as MIPS, which support virtually indexed caches that
suffer from cache aliasing. There is a risk of data corruption if the
same physical memory location is accessed through different virtual
mappings, as not all hardware catches the duplicate cache entries
created in such a case.

We handle it in software for user mappings (although I keep having a
feeling something always keeps escaping, due to the vast diversity of
cache configurations possible), however I don't think we do for `ioremap',
so disallowing aliased `ioremap' mappings by default sounds like a good
idea to me.

FWIW,

Maciej
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Mon, Mar 21, 2016 at 05:38:40PM +0000, Maciej W. Rozycki wrote:
> On Wed, 9 Mar 2016, Ingo Molnar wrote:
>
> > > > So to go back to the original suggestion from Luis, I've quoted it, but
> > > > with a s/overlapping/aliased substitution:
> > > >
> > > > > I had suggested long ago then that one possible resolution was for us
> > > > > to add an API that *enables* aliased ioremap() calls, and only use it
> > > > > on select locations in the kernel. This means we only have to convert a
> > > > > few users to that call to white list such semantics, and by default
> > > > > we'd disable aliased calls. To kick things off -- is this strategy
> > > > > agreeable for all other architectures?
> > > >
> > > > I'd say that since the overwhelming majority of ioremap() calls are not
> > > > aliased, ever, thus making it 'harder' to accidentally alias is probably
> > > > a good idea.
> > >
> > > Did you mean 'aliased' or 'aliased with different cache attribute'?  The former
> > > check might be too strict.
> >
> > I'd say even 'same attribute' aliasing is probably relatively rare.
>
> Please note that aliased cached mappings (of any kind, not necessarily
> from `ioremap') cause a lot of headache (read: handling trouble) on
> architectures such as MIPS, which support virtually indexed caches that
> suffer from cache aliasing. There is a risk of data corruption if the
> same physical memory location is accessed through different virtual
> mappings, as not all hardware catches the duplicate cache entries
> created in such a case.
>
> We handle it in software for user mappings (although I keep having a
> feeling something always keeps escaping, due to the vast diversity of
> cache configurations possible), however I don't think we do for `ioremap',
> so disallowing aliased `ioremap' mappings by default sounds like a good
> idea to me.

Great, well let's do the work then.

Luis
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Thu, Mar 17, 2016 at 04:44:53PM -0600, Toshi Kani wrote:
> On Wed, 2016-03-16 at 02:45 +0100, Luis R. Rodriguez wrote:
> > On Fri, Mar 11, 2016 at 03:13:52PM -0700, Toshi Kani wrote:
> > > On Wed, 2016-03-09 at 10:15 +0100, Ingo Molnar wrote:
> > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > >
> > > > > On Tue, 2016-03-08 at 13:16 +0100, Ingo Molnar wrote:
> > > > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > > >
>  :
> > > > > Did you mean 'aliased' or 'aliased with different cache attribute'?
> > > > >  The former check might be too strict.
> > > >
> > > > I'd say even 'same attribute' aliasing is probably relatively rare.
> > > >
> > > > And 'different but compatible cache attribute' is in fact more of a
> > > > sign that the driver author does the aliasing for a valid _reason_:
> > > > to have two different types of access methods to the same piece of
> > > > physical address space...
> > >
> > > Right.  So, if we change to fail ioremap() on aliased cases, it'd be
> > > easier to start with the different attribute case first.  This case
> > > should be rare enough that we can manage to identify such callers and
> > > make them use a new API as necessary.  If we go ahead to fail any
> > > aliased cases, it'd be challenging to manage without a regression or
> > > two.
> >
> > From my experience on the ioremap_wc() crusade, I found that aliasing
> > with different cache types was needed in only 3 drivers. For one of
> > these, atyfb, I did the proper split into MMIO and framebuffer maps,
> > but that was significant work.  I did this work to demo and document
> > such a conversion. It wasn't easy. The other two, ivtv and ipath, we
> > left as requiring "nopat" to be used. The ipath driver is on its way
> > out of the kernel now through staging, and ivtv, well, I am not aware
> > of a single human being claiming to use it. The architecture of ivtv
> > actually prohibits us from ever using PAT for write-combining on the
> > framebuffer, as the firmware is the only one who knows the
> > write-combining area and hides it from us.
>
> At a glance, there are 863 references to ioremap(), 329 references to
> ioremap_nocache(), and only 68 references to ioremap_wc() on x86.  There
> are many more ioremap callers with UC mappings than WC mappings, and it is
> hard to say that they never get aliased.

We need to start somewhere. If we really want to vet / white-list aliasing
we will probably need semantic analysis as well as manual vetting, and
finally a phase where we WARN on uses we did not white-list.

> > We might be able to use tools like Coccinelle to perhaps hunt for
> > the use of aliasing on drivers with different cache attribute types
> > to do a full assessment but I really think that will be really hard
> > to accomplish.
> >
> > If we can learn anything from the ioremap_wc() crusade I'd say it's that
> > the need for aliasing with different cache types obviously implies we
> > should disable such drivers with PAT, as what we'd really need is a
> > proper split in maps, but history shows the split can be really hard. It
> > sounded like you guys were confirming we currently do not allow aliasing
> > with different attributes on x86; is that the case for all architectures?
> >
> > If aliasing with different cache attributes is not allowed for x86, and
> > if it's also rare for other architectures, that just leaves the hunt for
> > valid aliasing uses. That may still be hard to hunt for, but I also
> > suspect it may be rare.
>
> Yes, I'd fail the different cache attribute case if we are to place more
> strict check.

OK, it seems this is a good starting point. How can we get a general
architecture consensus that aliasing with different cache attributes is a
terrible idea? Perhaps a patch to WARN/error out and let architectures opt
in to this piece of code?

> > > I think the "set_memory_" prefix implies that their target is regular
> > > memory only.
> >
> > I did not find any driver using set_memory_wc() on MMIO, which is a good
> > thing, as it seems that does not work even though it returns no error.
> > I'm not sure whether the other set_memory_*() calls are used on MMIO,
> > but I would suspect they are not. A manual hunt may suffice to rule
> > these out.
>
> It's good to know that you did not find any case on MMIO.  The thing is,
> set_memory_wc() actually works on MMIO today... This is because __pa()
> returns a bogus address, which skips the alias check in the memtype.

Ingo, are you happy with that? I honestly do not see the need for
set_memory_wc() in the cases I reviewed; I think the write-combining case
can simply be addressed with ioremap_wc().

> > I guess what I'm trying to say is I am not sure we have a need for
> > set_cache_attr_*() APIs, unless of course we find such valid use.
> >
> > > > And at that point we could definitely argue that set_cache_attr_*()
> > > > APIs should probably generate a warning for _RAM_, because they
> > > > mostly make sense for MMIO type of physical addresses, right? Regular
> > > > RAM should always be WB.
> > > >
> > > > Are there cases where we change the caching attribute of RAM for
> > > > valid reasons, outside of legacy quirks?
> > >
> > > ati_create_page_map() is one example that it gets a RAM page
> > > by __get_free_page(), and changes it to UC by calling set_memory_uc().
> >
> > Should we instead have an API that lets it ask for RAM of the UC type?
> > That would seem a bit cleaner. BTW, do you happen to know *why* it needs
> > UC RAM?
>
> This RAM page is then shared between the graphics card and the CPU.  I
> think this is because the graphics card cannot snoop the cache.

Was this reason alone sufficient to open such APIs broadly for RAM?

> > > > >  - It only supports attribute transition of {WB -> NewType -> WB}
> > > > > for RAM.  RAM is tracked differently in that WB is treated as "no
> > > > > map".  So, this transition does not cause a conflict on RAM.  This
> > > > > will cause a conflict on MMIO when it is tracked correctly.
> > > >
> > > > That looks like a bug?
> > >
> > > This is by design since set_memory_xx was introduced for RAM only.  If
> > > we extend it to MMIO, then we need to change how memtype manages MMIO.
> >
> > I'd be afraid to *want* to support this on MMIO as I would only expect
> > hacks from drivers.
>
> Agreed, with the hope that they are not used on MMIO already...

OK we'll need to review this.

Luis
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
On Wed, 2016-04-13 at 23:16 +0200, Luis R. Rodriguez wrote:
> On Thu, Mar 17, 2016 at 04:44:53PM -0600, Toshi Kani wrote:
> > On Wed, 2016-03-16 at 02:45 +0100, Luis R. Rodriguez wrote:
> > > On Fri, Mar 11, 2016 at 03:13:52PM -0700, Toshi Kani wrote:
:
> > >
> > > If aliasing with different cache attributes is not allowed for x86,
> > > and if it's also rare for other architectures, that just leaves the
> > > hunt for valid aliasing uses. That may still be hard to hunt for, but
> > > I also suspect it may be rare.
> >
> > Yes, I'd fail the different cache attribute case if we are to place
> > a stricter check.
>
> OK, it seems this is a good starting point. How can we get a general
> architecture consensus that aliasing with different cache attributes is a
> terrible idea? Perhaps a patch to WARN/error out and let architectures
> opt in to this piece of code?

I expect aliasing with different cache attributes is a bad idea on most
architectures.  Given the fact that track_pfn_remap(), track_pfn_insert(),
etc. are only implemented on x86, I suspect that other architectures would
not be able to implement such check easily, though.  

On x86, ioremap() and remap_pfn_range() already fail on a conflicting cache
type if it is not allowed by the rule defined in is_new_memtype_allowed().
This exception handling is necessary for remap_pfn_range() called by
/dev/mem, but I do not think it is necessary for ioremap(). I think we can
start by adding a warning message to ioremap().


> > > > > Are there cases where we change the caching attribute of RAM for
> > > > > valid reasons, outside of legacy quirks?
> > > >
> > > > ati_create_page_map() is one example that it gets a RAM page
> > > > by __get_free_page(), and changes it to UC by calling set_memory_uc
> > > > ().
> > >
> > > Should we instead have an API that lets it ask for RAM and of UC
> > > type? That would seem a bit cleaner. BTW do you happen to know *why*
> > > it needs UC RAM types?
> >
> > This RAM page is then shared between graphic card and CPU.  I think
> > this is because graphic card cannot snoop the cache.
>
> Was this reason alone sufficient to open such APIs broadly for RAM?

According to commit 75cbade8ea3, such APIs were introduced because drivers
previously had to deal with low-level stuff.  So, I think we need to keep
them as long as we have such drivers...

Thanks,
-Toshi
Re: Overlapping ioremap() calls, set_memory_*() semantics [ In reply to ]
* Toshi Kani <toshi.kani@hpe.com> wrote:

> On x86, ioremap() and remap_pfn_range() already fail on a conflicting cache
> type if it is not allowed by the rule defined in is_new_memtype_allowed().
>  This exception handling is necessary for remap_pfn_range() called by
> /dev/mem, but I do not think it's necessary for ioremap().  I think we can
> start from adding a warning message to ioremap().

Agreed, we should warn and map out the situation before doing any
behavioral change that isn't a fix.

Thanks,

Ingo