Mailing List Archive

Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Mar 1, 2012, at 17:10, William Herrin <bill@herrin.us> wrote:
> It took you 50 lines of code to do
> 'socket=connect("www.google.com",80,TCP);' and you still managed to
> produce a version which, due to the timeout on dead addresses, is
> worthless for any kind of interactive program like a web browser. And
> because that code isn't found in a system library, every single
> application programmer has to write it all over again.
>
> I'm a fan of Rube Goldberg machines but that was ridiculous.

I'm thinking for this to work it would have to be 2 separate calls:

Call 1 being to the resolver (using lwres, the system resolver, or
whatever you want to use) and returning an array of struct addrinfo,
the same as getaddrinfo() does currently. If applications need
TTL/SRV/$NEWRR awareness it would be implemented here.

Call 2 would be a "happy eyeballs" connect syscall (mconnect? In the
spirit of sendmmsg) which accepts an array of struct addrinfo and
returns an fd. In the case of O_NONBLOCK it would return a dummy fd
(as non-blocking connects do currently) then once one of the
connections finishes handshake the kernel connects it to the FD and
signals writable to trigger select/poll/epoll. This allows developers
to keep using the same loops (and most of the APIs) they're already
comfortable with, keeps DNS out of the kernel, but hopefully provides
a better and easier to use connect() experience, for SOCK_STREAM at
least.

It's not as neat as a single connect() accepting a name, but seems to
be a happy medium and provides a standardized/predictable connect()
experience without breaking existing APIs.
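The boilerplate that the hypothetical call 2 would absorb is the getaddrinfo()-plus-connect() loop that every application currently writes by hand. A minimal sequential sketch (no happy-eyeballs staggering; the helper name connect_first() is invented for illustration):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try each address the resolver returns until one connect() succeeds.
 * This is the per-application boilerplate a kernel- or library-level
 * "connect to the best of these" call would subsume. */
static int connect_first(const char *host, const char *serv)
{
    struct addrinfo hints, *res0, *res;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, serv, &hints, &res0) != 0)
        return -1;

    for (res = res0; res != NULL; res = res->ai_next) {
        fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, res->ai_addr, res->ai_addrlen) == 0)
            break;                      /* connected */
        close(fd);
        fd = -1;                        /* dead address: try the next */
    }
    freeaddrinfo(res0);
    return fd;
}

int main(void)
{
    /* Self-contained check: listen on an ephemeral loopback port,
     * then reach it through the helper above. */
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    assert(bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == 0);
    assert(listen(lfd, 1) == 0);
    assert(getsockname(lfd, (struct sockaddr *)&sin, &len) == 0);

    char portstr[8];
    snprintf(portstr, sizeof(portstr), "%u", ntohs(sin.sin_port));

    int fd = connect_first("127.0.0.1", portstr);
    assert(fd >= 0);
    printf("connected\n");
    close(fd);
    close(lfd);
    return 0;
}
```

Note that this sequential loop is exactly the version that blocks on dead addresses; the proposed mconnect() would race the attempts instead.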

~Matt
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
In message <596196444196086313@unknownmsgid>, Matt Addison writes:
> On Mar 1, 2012, at 17:10, William Herrin <bill@herrin.us> wrote:
> > It took you 50 lines of code to do
> > 'socket=connect("www.google.com",80,TCP);' and you still managed to
> > produce a version which, due to the timeout on dead addresses, is
> > worthless for any kind of interactive program like a web browser. And
> > because that code isn't found in a system library, every single
> > application programmer has to write it all over again.
> >
> > I'm a fan of Rube Goldberg machines but that was ridiculous.
>
> I'm thinking for this to work it would have to be 2 separate calls:
>
> Call 1 being to the resolver (using lwres, system resolver, or
> whatever you want to use) and returning an array of struct addrinfo-
> same as gai does currently. If applications need TTL/SRV/$NEWRR
> awareness it would be implemented here.
>
> Call 2 would be a "happy eyeballs" connect syscall (mconnect? In the
> spirit of sendmmsg) which accepts an array of struct addrinfo and
> returns an fd. In the case of O_NONBLOCK it would return a dummy fd
> (as non-blocking connects do currently) then once one of the
> connections finishes handshake the kernel connects it to the FD and
> signals writable to trigger select/poll/epoll. This allows developers
> to keep using the same loops (and most of the APIs) they're already
> comfortable with, keeps DNS out of the kernel, but hopefully provides
> a better and easier to use connect() experience, for SOCK_STREAM at
> least.
>
> It's not as neat as a single connect() accepting a name, but seems to
> be a happy medium and provides a standardized/predictable connect()
> experience without breaking existing APIs.
>
> ~Matt

And you can do the same in userland with kqueue and similar.

int
connectxx(struct addrinfo *res0, int *fd, int *timeout, void **state);

Return values:
0            *fd is a connected socket.
EINPROGRESS  Wait on *fd with a timeout of 'timeout' nanoseconds.
ETIMEDOUT    The connect failed.

If 'timeout' or 'state' is NULL, the call blocks.
On subsequent calls, pass res0 as NULL.

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Thu, Mar 1, 2012 at 8:47 PM, Owen DeLong <owen@delong.com> wrote:
> On Mar 1, 2012, at 5:15 PM, William Herrin wrote:
>> On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
>>> There's no need to
>>> break the current functionality of the underlying system calls and
>>> libc functions which would be needed by any such library anyway.
>>
>> Owen,
>>
>> Point to one sentence written by anybody in this entire thread in
>> which breaking current functionality was proposed.
>>
> When you said that:
>
> connect(char *name, uint16_t port) should work
>
> That can't work without breaking the existing functionality of the connect() system call.

You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
stopped and thought to myself, "I wonder if I should change that to
'connectbyname' instead just to make it clear that I'm not replacing
the existing connect() call?" But then I thought, "No, there's a
thousand ways someone determined to misunderstand what I'm saying will
find to misunderstand it. To someone who wants to understand my point,
this is crystal clear."

-Bill


--
William D. Herrin ................ herrin@dirtside.com  bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Mar 1, 2012, at 9:34 PM, William Herrin wrote:

> On Thu, Mar 1, 2012 at 8:47 PM, Owen DeLong <owen@delong.com> wrote:
>> On Mar 1, 2012, at 5:15 PM, William Herrin wrote:
>>> On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
>>>> There's no need to
>>>> break the current functionality of the underlying system calls and
>>>> libc functions which would be needed by any such library anyway.
>>>
>>> Owen,
>>>
>>> Point to one sentence written by anybody in this entire thread in
>>> which breaking current functionality was proposed.
>>>
>> When you said that:
>>
>> connect(char *name, uint16_t port) should work
>>
>> That can't work without breaking the existing functionality of the connect() system call.
>
> You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
> stopped and thought to myself, "I wonder if I should change that to
> 'connectbyname' instead just to make it clear that I'm not replacing
> the existing connect() call?" But then I thought, "No, there's a
> thousand ways someone determined to misunderstand what I'm saying will
> find to misunderstand it. To someone who wants to understand my point,
> this is crystal clear."

I'm all for additional library functionality built on top of what exists that does what you want.

As I said, there are many such libraries out there to do that.

If someone wants to add it to libc, more power to them. I'm not the libc maintainer.

I just don't want connect() to stop working the way it does, or for getaddrinfo() to stop
working the way it does.

Since you were hell-bent on calling the existing mechanisms broken rather than
conceding that the current process is not broken but could stand some
improvements in the library (http://owend.corp.he.net/ipv6; I even say as much myself),
it was not entirely clear that you did not intend to replace connect() rather than
augment the current capabilities with additional, more abstract functions with
different names.

Owen
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
In a message written on Thu, Mar 01, 2012 at 05:02:30PM -0800, Owen DeLong wrote:
> Then push for better written abstraction libraries. There's no need to
> break the current functionality of the underlying system calls and
> libc functions which would be needed by any such library anyway.

Agree in part and disagree in part.

I think where the Open Source community has fallen behind in the
last decade is application level libraries. Open source pioneered
cross platform libraries (libX11, libresolv, libm) in the early
days and the benefit was they worked darn near exactly the same on
all platforms. It made programming and porting easier and led to
growth in the ecosystem.

Today that mantle has been taken up by Apple and Microsoft. In
Objective-C, for example, I can in one line of code say "retrieve
this URL", and the libraries know about DNS, IPv4 vs. IPv6, happy
eyeballs algorithms, multi-threading parts so that the user doesn't
wait, and so on. Typical application programs on these platforms
never make any of the system calls that have been discussed in this
thread.

Unfortunately the open source world lacks even these basic enhancements.
Library work in many areas has stagnated, and in the areas where it is
progressing it's often done in a way that makes the same library (by name)
perform differently on different operating systems! Plenty of people
have done research finding rampant file copying and duplication of code,
and that's a bad sign:

http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/
http://www.solidsourceit.com/blog/?p=4
http://pages.cs.wisc.edu/~shanlu/paper/TSE-CPMiner.pdf

I can't find it now, but there was a paper a few years back that looked
for a hash or CRC algorithm, because they were easy to identify in source
by the fixed, unique constants they used. In the Linux kernel alone there
were something like 10 implementations; widened to all the software in the
application repository, there were something like 10,000 instances of
(nearly) the same code!

Now, where I disagree: better libraries mean not just better ones
at a high level (fetch me this URL), but better ones at a lower level.
For instance libresolv discussed here is old and creaky. It was
designed for a different time. Many folks doing DNS work have moved
on to libldns from Unbound because libresolv does not do what they
need with respect to DNSSEC or IPv4/IPv6 issues.

I think the entire community needs to come together with a strong bit of
emphasis on libraries, standardizing them, making them ship with the
base OS so programmers can count on them, and rolling in new stuff that
needs to be in them on a timely basis. Apple and Microsoft do it with
their (mostly closed) platforms, open source can do it better.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen@delong.com> wrote:
> On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
>> You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
>> stopped and thought to myself, "I wonder if I should change that to
>> 'connectbyname' instead just to make it clear that I'm not replacing
>> the existing connect() call?" But then I thought, "No, there's a
>> thousand ways someone determined to misunderstand what I'm saying will
>> find to misunderstand it. To someone who wants to understand my point,
>> this is crystal clear."

"Hyperbole." If I had remembered the word, I could have skipped the
long description.

> I'm all for additional library functionality
> I just don't want connect() to stop working the way it does or for getaddrinfo() to stop
> working the way it does.

Good. Let's move on.


First question: who actually maintains the standard for the C sockets
API these days? Is it a POSIX standard?

Next, we have a set of APIs with which, given sufficient caution and
skill (which is rarely the case), it's possible to string together a
reasonable process that starts with some kind of name in a text
string and ends with established communication with a remote server,
for any sort of name and any sort of protocol. These APIs are complete,
but we repeatedly see certain kinds of errors committed while using
them.

Is there a common set of activities an application programmer intends
to perform 9 times out of 10 when using getaddrinfo+connect? I think
there is, and it has the following functionality:

Create a [stream] to one of the hosts satisfying [name] + [service]
within [timeout] and return a [socket].

Does anybody disagree? Here's my reasoning:

Better than 9 times out of 10 it's a stream, and usually a TCP stream at
that. Connect also designates a receiver for a connectionless protocol
like UDP, but its use for that has always been a little peculiar since
the protocol doesn't actually connect. And indeed, sendto() can
designate a different receiver for each packet sent through the
socket.

Name + Service. If TCP, a hostname and a port.

Sometimes you want to start multiple connection attempts in parallel
or have some not-quite-threaded process implement its own scheduler
for dealing with multiple connections at once, but that's the
exception. Usually the only reason for dealing with connect() in
non-blocking mode is that you want to implement sensible error recovery
with timeouts.

And the timeout: the directive that control should be returned to the
caller no later than X. If it would take more than X to complete, then
fail instead.



Next item: how would this work under the hood?

Well, you have two tasks: find a list of candidate endpoints from the
name, and establish a connection to one of them.

Find the candidates: ask all available name services in parallel
(hosts, NIS, DNS, etc). Finished when:

1. All services have responded negative (failure)

2. You have a positive answer and all services which have not yet
answered are at a lower priority (e.g. hosts answers, so you don't
need to wait for NIS and DNS).

3. You have a positive answer from at least one name service and half
of the requested timeout has expired.

4. The full timeout has expired (failure).

Cache the knowledge somewhere along with TTLs (locally defined if the
name service doesn't explicitly provide one). This may well be the
first of a series of connection requests for the same host. If cached,
TTL-valid knowledge is already known for this name from a particular
service, don't ask that service again.

We also need to let the app tell us to deprioritize a particular result
later on. Why? Let's say I get an HTTP connection to a host but then
that connection times out. If the app is managing the address list, it
can try again to another address for the same name. We're now hiding
that detail from the app, so we need a callback for the app to tell
us, "when I try again, avoid giving me this answer because it didn't
turn out to work."


So, now we have a list of addresses with valid TTLs as of the start of
our connection attempt. Next step: start the connection attempt.

Pick the "first" address (chosen by whatever the ordering rules are)
and send the connection request packet and let the OS do its normal
retry schedule. Wait one second (system or sysctl configurable) or
until the previous connection request was either accepted or rejected,
whichever is shorter. If not connected yet, background it, pick the
next address and send a connection request. Repeat until one
connection request has been issued to every possible destination
address for the name.

Finished when:

1. Any of the pending connection requests completes (others are aborted).

2. The timeout is reached (all pending requests aborted).

Once a connection is established, this should be cached alongside the
address and its TTL so that next time around that address can be tried
first.

Thoughts?

The idea here, of course, is that any application which uses this
function to make its connections should, at an operations level, do a
good job handling both multiple addresses with one of them unreachable
as well as host renumbering that relies on the DNS TTL.



> Since you were hell bent on calling the existing mechanisms broken rather than
> conceding the point that the current process is not broken, but, could stand some
> improvements in the library

I hold that if an architecture encourages a certain implementation
mistake largely to the exclusion of correct implementations, then that
architecture is in some way broken. That error may be in a particular
component, or it could be that the components themselves are correct
but there is a missing component, or the components are strung
together in a way that doesn't work right. Regardless of the exact
cause, there is an architecture-level mistake which is the root cause
of the consistently broken implementations.


Regards,
Bill Herrin


--
William D. Herrin ................ herrin@dirtside.com  bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Mar 1, 2012, at 10:01 AM, Michael Thomas wrote:

> The real issue is that gethostbyxxx has been inadequate for a very
> long time. Moving it across the kernel boundary solves nothing and
> most likely causes even more trouble: what if I want, say, asynchronous
> name resolution? What if I want to use SRV records? What if a new DNS
> RR comes around -- do i have do recompile the kernel? It's for these
> reasons and probably a whole lot more that connect just confuses the
> actual issues.

<software-developer-hat-on>

My experience is that these calls are expensive and require a lot of work to get a true result. Some systems also have interim caching that happens as well (e.g. NSCD).

When building software that did a lot of DNS lookups at once, I had to build my own internal cache to maintain performance. Startup costs were expensive, but the work of maintaining the cache spaced out over time and became less of an issue.

I ended up caching these entries for 1 hour by default.

</hat ?xml-fail>

- jared
Re: dns and software, was Re: Reliable Cloud host ? [ In reply to ]
On Mar 2, 2012, at 10:12 AM, William Herrin wrote:

> On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen@delong.com> wrote:
>> On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
>>> You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I
>>> stopped and thought to myself, "I wonder if I should change that to
>>> 'connectbyname' instead just to make it clear that I'm not replacing
>>> the existing connect() call?" But then I thought, "No, there's a
>>> thousand ways someone determined to misunderstand what I'm saying will
>>> find to misunderstand it. To someone who wants to understand my point,
>>> this is crystal clear."
>
> "Hyperbole." If I had remembered the word, I could have skipped the
> long description.
>
>> I'm all for additional library functionality
>> I just don't want connect() to stop working the way it does or for getaddrinfo() to stop
>> working the way it does.
>
> Good. Let's move on.
>
>
> First question: who actually maintains the standard for the C sockets
> API these days? Is it a POSIX standard?
>

Well, some of it seems to be documented in RFCs, but I think what you're wanting doesn't require additions to the sockets library, per se. In fact, I think making it part of that would be a mistake. As I said, this should be a
higher-level library.

For example, in Perl, you have Socket (and Socket6), but, you also have several other abstraction libraries such as Net::HTTP.

While there's no hierarchical naming scheme for the functions in libc, if you look at the source for any of the open source libc libraries out there, you'll find a definite hierarchy.

POSIX certainly controls one standard. The GNU libc maintainers control the standard for the libc that accompanies GCC to the best of my knowledge. I would suggest that is probably the best place to start since I think anything that gains acceptance there will probably filter to the others fairly quickly.

> Next, we have a set of APIs which, with sufficient caution and skill
> (which is rarely the case) it's possible to string together a
> reasonable process which starts with a some kind of name in a text
> string and ends with established communication with a remote server
> for any sort of name and any sort of protocol. These APIs are complete
> but we repeatedly see certain kinds of error committed while using
> them.
>

Right... Since these are user errors (at the developer level), I wouldn't try to fix them in the APIs. I would, instead, build more developer-proof add-on APIs on top of them.

> Is there a common set of activities an application programmer intends
> to perform 9 times out of 10 when using getaddrinfo+connect? I think
> there is, and it has the following functionality:
>
> Create a [stream] to one of the hosts satisfying [name] + [service]
> within [timeout] and return a [socket].
>

Seems reasonable, but ignores UDP. If we're going to do this, I think we should target a more complete solution to include a broader range of probabilities than just the most common TCP connect scenario.

> Does anybody disagree? Here's my reasoning:
>
> Better than 9 times out of 10 it's a stream, and usually a TCP stream at
> that. Connect also designates a receiver for a connectionless protocol
> like UDP, but its use for that has always been a little peculiar since
> the protocol doesn't actually connect. And indeed, sendto() can
> designate a different receiver for each packet sent through the
> socket.
>

Most applications using UDP that I have seen use sendto()/recvfrom() et al. NetFlow data would suggest that it's less than 9 out of 10 times for TCP, but, yes, I would agree it is the most common scenario.

> Name + Service. If TCP, a hostname and a port.
>
That would apply to UDP as well. Just the semantics of what you do once you have the file handle are different (and it's not really a stream, per se).

> Sometimes you want to start multiple connection attempts in parallel
> or have some not-quite-threaded process implement its own scheduler
> for dealing with multiple connections at once, but that's the
> exception. Usually the only reason for dealing with the connect() in
> non-blocking mode is that you want to implement sensible error recovery
> with timeouts.
>

Agreed.

> And the timeout - the direction that control should be returned to the
> caller no later than X. If it would take more than X to complete, then
> fail instead.
>

Actually, this is one thing I would like to see added to connect() and that could be done without breaking the existing API.

>
>
> Next item: how would this work under the hood?
>
> Well, you have two tasks: find a list of candidate endpoints from the
> name, and establish a connection to one of them.
>
> Find the candidates: ask all available name services in parallel
> (hosts, NIS, DNS, etc). Finished when:
>
> 1. All services have responded negative (failure)
>
> 2. You have a positive answer and all services which have not yet
> answered are at a lower priority (e.g. hosts answers, so you don't
> need to wait for NIS and DNS).
>
> 3. You have a positive answer from at least one name service and 1/2
> of the requested time out has expired.
>
> 4. The full time out has expired (failure).
>

I think the existing getaddrinfo() does this pretty well already.

I will note that the services you listed only apply to resolving the host name. Don't forget that you might also need to resolve the service to a port number. (An application should be looking up HTTP, not assuming it is 80, for example).

Conveniently, getaddrinfo simultaneously handles both of these lookups.

> Cache the knowledge somewhere along with TTLs (locally defined if the
> name service doesn't explicitly provide a TTL). This may well be the
> first of a series of connection requests for the same host. If cached
> and TTL valid knowledge was known for this name for a particular
> service, don't ask that service again.
>

I recommend against doing this above the level of getaddrinfo(). Just call getaddrinfo() again each time you need something. If it has cached data, it will return quickly and is cheap. If it doesn't return quickly, it will still work just as quickly as anything else most likely.

If getaddrinfo() on a particular system is not well behaved, we should seek to fix that implementation of getaddrinfo(), not write yet another replacement.

> Also need to let the app tell us to deprioritize a particular result
> later on. Why? Let's say I get an HTTP connection to a host but then
> that connection times out. If the app is managing the address list, it
> can try again to another address for the same name. We're now hiding
> that detail from the app, so we need a callback for the app to tell
> us, "when I try again, avoid giving me this answer because it didn't
> turn out to work."
>

I would suggest that instead of making this opaque and then complicating
it with these hints, we use a mechanism where we return a pointer to a
dynamically allocated result (similar to getaddrinfo), and if we get
called again with a pointer to that structure, we know to delete the
previously connected host from the list we try next time.

When the application is done with the struct, it should free it by calling an
appropriate free function exported by this new API.

>
> So, now we have a list of addresses with valid TTLs as of the start of
> our connection attempt. Next step: start the connection attempt.
>
> Pick the "first" address (chosen by whatever the ordering rules are)
> and send the connection request packet and let the OS do its normal
> retry schedule. Wait one second (system or sysctl configurable) or
> until the previous connection request was either accepted or rejected,
> whichever is shorter. If not connected yet, background it, pick the
> next address and send a connection request. Repeat until one
> connection request has been issued to every possible destination
> address for the name.
>
> Finished when:
>
> 1. Any of the pending connection requests completes (others are aborted).
>
> 2. The timeout is reached (all pending requests aborted).
>
> Once a connection is established, this should be cached alongside the
> address and its TTL so that next time around that address can be tried
> first.
>

Seems mostly reasonable. I would consider some form of inverse exponential backoff on the initial connection attempts: maybe wait 5 seconds for the first one before trying the second, then 2 seconds, then 1 second if the third hasn't connected, bottoming out somewhere around 500ms for the remainder.

>
>
>> Since you were hell bent on calling the existing mechanisms broken rather than
>> conceding the point that the current process is not broken, but, could stand some
>> improvements in the library
>
> I hold that if an architecture encourages a certain implementation
> mistake largely to the exclusion of correct implementations then that
> architecture is in some way broken. That error may be in a particular

I don't believe that the architecture encourages the implementation mistake.

Rather, I think human behavior, and our tendency not to seek a proper understanding of the theory of operation of various things before implementing things which depend on them, is more at fault. I suppose you can argue that the API should be built to avoid that, but we'll have to agree to disagree on that point.

I think that low-level APIs (and this is a low-level API) have to be able to rely on the engineers that use them making the effort to understand the theory of operation. I believe the fault here is the lack of a standardized higher-level API in some languages.

> component, but it could be that the components themselves are correct.
> There could be a missing component, or the components could be strung
> together in a way that doesn't work right. Regardless of the exact
> cause, there is an architecture-level mistake which is the root cause
> of the consistently broken implementations.
>

I suppose by your definition this constitutes a missing component. I don't see it that way. I see it as a complete and functional system for a low-level API. There are high-level APIs available; as you have noted, some better than others. A standardized, well-written high-level API would indeed be useful.

However, that does not make the low-level API broken just because it is common for poorly trained users to make improper use of it. It is common for people using hammers to hit their thumbs. This does not mean that hammers are architecturally broken or that they should be re-engineered to have elaborate thumb-protection mechanisms.

The fact that you can electrocute yourself by sticking a fork into a toaster while it is operating is likewise, not an indication that toasters are architecturally broken.

It is precisely this attitude that has significantly increased the overhead and unnecessary expense of many systems while making product liability lawyers quite wealthy.

Owen
