[catsoop-dev] Nicer Fragment Names Patch

adam j hartz hz at mit.edu
Sat Jun 15 14:03:27 EDT 2019


Hi Valerie,

Thanks for sending this!  I'm including some thoughts inline below.


> -        linknum = num.replace(".", "_")
> -        linkname = "catsoop_section_%s" % linknum
> +        linkname = i.text.replace(" ", "_")

Would you mind reworking this slightly so that both anchors remain
(i.e., so that we can link to a section either with #catsoop_section_X
or with #nice_section_name)?  This will help prevent some "link rot" by
making it more likely that links that were shared before this change
continue to work.


> The new fragment names are the section titles with spaces replaced
> with underscores. I don't believe there's a need to replace any other
> characters, like periods/punctuation.

I think this isn't _quite_ right, unfortunately.  Most things are safe,
but from a little bit of poking (and assuming I'm reading
https://tools.ietf.org/html/rfc3986#page-49 correctly), it looks like
there are some characters that aren't valid in a URI fragment.  For
example, the character "#" can't exist in a URI fragment.  There may
also be restrictions on what characters can exist in the name attribute
of an HTML tag, but I'm less sure about that.

One thing that I think would work (or come close to working, at least),
would be to percent-encode (with urllib.parse.quote) the fragment inside
of links we generate but not inside of the <a> tag we make.  This seems
to work fine from a little test on my own machine.  But it has the
downside that will produce somewhat uglier-than-necessary URLs by
encoding some characters that it doesn't actually _need_ to encode in
order to work properly.

A slightly nicer approach might be only to percent-encode those
characters that we know will cause problems, but to leave others as-is.
I _think_ the characters we would need to replace would be any character
that matches the following regex:
    r"[^\/\?A-Za-z0-9-\._~!\$&'\(\)\*\+,;=:@]"
so maybe the quoting can be done with a single call to re.sub(...),
replacing each match with its quoted equivalent.  Again, I _think_ this
change would only need to happen in the links we generate, not in the
<a> tag.

I definitely think replacing spaces with underscores is nice (rather
than percent-encoding them).  But maybe we want to replace any
contiguous sequence of whitespace characters with a single underscore?
I think this would make the URI fragments easier to glance at.  For the
same reason, I think I might also want to make everything lowercase in
the fragments.

Can you please also check to make sure that this will work OK with
titles that contain single- and double-quotes, in addition to the
characters mentioned above?  I'm pretty sure BeautifulSoup will do
something sensible when creating that tag, but I think that would be a
good thing to test.


My other big thought is that with this new method of linking, there is
the possibility that we run into duplicates.  I think it will be
important to add a mechanism by which we can notice and disambiguate
duplicates we make using this method (maybe by appending a number if
we've already seen a name before, or something like that).  I know, for
example, that many 6.01 labs have multiple sections with the same names;
it will be important to make sure those all get their own distinct
links.


> Also--none of the ways in which I tried installing Bugs Everywhere
> were successful. It seems to depend on Python 2.7. Is there a way to
> close the issue without BE?

Yes, that's right, unfortunately.  BE doesn't seem to be under active
development, and it doesn't support Python 3.  That's the main reason
I'm going to be moving the issue tracking to a different platform soon.

If you _want_ to install BE, what worked for me was downloading the
source from http://download.bugseverywhere.org/releases/be-1.1.1.tar.gz,
and then installing using Python 2.  Or, it's also possible to close the
bug by manually editing the files in .be/, since they're stored in a
plain-text format.

But again, I expect to be transitioning away from BE in the very near
future, so there's no need to worry about it; I'm happy to take care of
closing the issue when these changes get merged in.


Of course, I'm happy to chat further about any of the above.

Thanks again!
-Adam


More information about the catsoop-dev mailing list