Hello! If anyone knows of a way to get python-markdown to behave the way I’d like, or of an alternative way to do it, I’d love some help. My use case: I’m converting .md files made with Obsidian into HTML files. Obsidian tags are a pound sign followed by the tag name (like “#TagName”). When a tag is the first item on a line, the pound sign is mistaken for a heading marker, even though there is no space after it.

Is there a way to avoid this, so that a line is only read as a heading if there is a space between the pound sign and the next word? I’m even considering some kind of find/replace logic, run before the Markdown-to-HTML conversion, that swaps each tag for something like a link to a page listing all the pages with that tag.

Edit: The solution I’m going for is a regex find/replace. Currently the string pattern looks like "#[^\s#][^\s" + string.punctuation + "#]*" which can find tags but ignores headers. Since the ultimate goal is to have the tags link to a tag page anyway I can solve it all in one step by doing a replace with a relevant link.
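
A minimal sketch of how that pattern behaves, assuming it is compiled with Python's `re` module. The sample text and the variable names are made up for illustration; raw strings are used so that `\s` reaches the regex engine unchanged.

```python
import re
import string

# The pattern from the edit above: '#', then one character that is neither
# whitespace nor '#', then any run of characters that are not whitespace,
# punctuation, or '#'.
TAG_PATTERN = re.compile(r"#[^\s#][^\s" + string.punctuation + r"#]*")

# Hypothetical sample covering tags, headings, and back-to-back tags.
sample = (
    "#Homeworld\n"
    "# Heading\n"
    "## Subheading\n"
    "Find better tag than #Homeworld.\n"
    "#one#two\n"
)

tags = TAG_PATTERN.findall(sample)
print(tags)  # ['#Homeworld', '#Homeworld', '#one', '#two']
```

One thing to be aware of: `string.punctuation` includes `-` and `_`, so with this pattern a tag such as `#my-tag` is cut off at the hyphen.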

    • Ms. ArmoredThirteen@lemmy.zip (OP)
      20 hours ago

      I’m going to open source the tool once I get it working at a base level, but it’s going to be pretty bare bones and dedicated to my needs.

  • jbrains@sh.itjust.works
    4 days ago

    The most bulletproof way to do this seems to be to escape the # characters before running the document through Markdown. It might suffice to use a regex. (Insert regex and two problems joke here.) That seems promising because headings always match #\s+ whereas tags match #[^\s].

    I hope someone has an even better idea, but this ought to work.

      • jbrains@sh.itjust.works
        1 day ago

        I don’t mind at all. Beyond my explanation, you might like to try an online regular expression checker and explore small changes to the regex to see what it matches and why.

        Headings always match #\s+, because that’s the character # followed by whitespace (\s) one or more times (+). Other text can match this too, so not every match is a heading, but every heading matches. (You might have # blah in the middle of the text, which would match. If that’s a problem, then you can change the regex to ^#\s+, where ^ means “from the beginning of a line”.)

        Tags always match #[^\s], which means the character # followed by one non-whitespace character. Be careful: tags match this regex, but this regex doesn’t match the entire tag. It only says “there is a tag here”.
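
        A small sketch of the difference the anchor makes, assuming Python's `re` module. Note that in Python, ^ only matches at the start of every line when `re.MULTILINE` is passed; without that flag it matches only at the very start of the string. The sample text is made up for illustration.

```python
import re

# Hypothetical sample: a heading, a mid-line "# blah", and a tag at line start.
text = "# Title\nSome text with # blah in the middle.\n#Tag at line start\n"

# Without an anchor, the mid-line "# blah" matches as well as the heading.
unanchored = [m.start() for m in re.finditer(r"#\s+", text)]

# With ^ and re.MULTILINE, only a "# " at the start of a line matches.
anchored = [m.start() for m in re.finditer(r"^#\s+", text, flags=re.MULTILINE)]

print(len(unanchored), len(anchored))  # 2 1
```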

        Fortunately, that doesn’t hurt, because your Python code could match #[^\s] and then turn that # into \# and thereby successfully avoid escaping the #s at the beginning of headings. You could even use regex to do this by capturing the non-whitespace character at the beginning of the tag and “putting it back” using regex search and replace.

        Replace #([^\s]) with \#\1.

        The parentheses capture the matching characters (the first character of the tag) and \1 echoes back the captured characters. It would replace #a with \#a and so on.
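
        A minimal sketch of that search and replace with Python's `re.sub`, run on the text before it goes to the Markdown converter. The function name `escape_tags` and the sample text are hypothetical. In the replacement string, `\\` produces a literal backslash and `\1` echoes the captured character.

```python
import re

def escape_tags(text, pattern=r"#([^\s])"):
    # Replace '#' + captured non-whitespace character with '\#' + that character.
    return re.sub(pattern, r"\\#\1", text)

# Hypothetical sample: a tag at line start, a heading, and a mid-line tag.
doc = "#Tag first\n# Heading\nSee #other here\n"
escaped = escape_tags(doc)
print(escaped)
```

        Note that a line like ## Subheading also matches this regex, because the second # is itself a non-whitespace character. Excluding # from the class, as in #([^\s#]), leaves such headings alone.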

        I hope I explained this clearly enough. I see the other folks also tried, so I hope that together, you found an explanation that works well enough for you.

        Peace.

        • Ms. ArmoredThirteen@lemmy.zip (OP)
          13 hours ago

          I found a regex checker and it helped so much, thank you for the suggestion! I think I better understand what’s going on, and I was able to use that to modify the regex to work closer to how I want. Currently I have "#[^\s#][^\s" + string.punctuation + "#]*"

          So what I think is going on is: it looks for # followed by a character that is not whitespace or another # (before, it was matching on headers with multiple pound signs). Then it keeps going for as many characters as needed until it runs into whitespace, punctuation, or another # (in the case of multiple tags).

          My use case is to turn the tags into links to pages that list the pages including that tag. What I do is blindly replace the tag with a link to where a page should exist, log the tag, and later gather up all the found tags to make the pages with lists. The punctuation is there because I had some tags in weird places, like the end of a sentence, which added a period or comma to the tag and made a unique tag (like #Homeworld at the top of a file vs. Find better tag than #Homeworld. as a note).
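
          A rough sketch of that replace-and-log step, assuming Python's `re` module. The function name `link_tags`, the `tags/<name>.html` link target, and the sample text are hypothetical placeholders, not the actual tool's layout.

```python
import re
import string

# The pattern quoted above, written with raw strings.
TAG_PATTERN = re.compile(r"#[^\s#][^\s" + string.punctuation + r"#]*")

def link_tags(text, found_tags):
    """Replace each tag with a Markdown link and record the tag name."""
    def to_link(match):
        name = match.group(0)[1:]  # drop the leading '#'
        found_tags.add(name)       # log the tag for building list pages later
        # Hypothetical URL scheme: one page per tag under tags/.
        return f"[#{name}](tags/{name}.html)"
    return TAG_PATTERN.sub(to_link, text)

found = set()
converted = link_tags("#Homeworld\n\nFind better tag than #Homeworld.\n", found)
print(converted)
print(sorted(found))  # ['Homeworld']
```

          Because the trailing period is excluded from the match, both occurrences are logged as the same tag, and the replaced line no longer starts with #.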

          • jbrains@sh.itjust.works
            5 hours ago

            Excellent! Indeed, I’d completely forgotten about H2, H3, and so on, so I’m glad you found it comfortable to figure that out!

            I read Mastering Regular Expressions about 25 years ago and it’s one of the best and simplest investments I ever made in my own programming practice. Regex never goes out of style.

            Enjoy!

      • CosmicGiraffe@lemmy.world
        3 days ago

        #\s+ is:

        • #: a literal #

        • \s: any whitespace character (space, tab etc)

        • +: the previous thing (here the whitespace), one or more times

        In words: “a hash followed by at least one whitespace character”

        #[^\s]. is:

        • #: a literal #

        • [^\s] : a negated character class. This matches anything other than the set of characters after the ^. \s has the same meaning as before, any whitespace character

        • . : matches any single character

        In words: “a hash followed by any character other than a whitespace character, then any character”.

        https://regex101.com/ is really good for explaining regex

  • logging_strict@programming.dev
    4 days ago

    Learn Sphinx, which can mix .rst and .md files. myst-parser is the package that deals with .md files.

    Just up your game a bit and you’ll have variables similar to Obsidian tags that don’t cause problems when rendered into an HTML web site or a PDF file.

      • moonpiedumplings@programming.dev
        3 days ago

        I just did a quick test with quarto, which uses pandoc markdown and pandoc for conversions, and it looks like pandoc doesn’t recognize #nospace as a header (although this could be a quarto specific thing).

        From a quick look at the Python library OP is using, it seems that library is what they are using to convert to HTML, rather than pandoc.

  • ulterno@programming.dev
    3 days ago

    I like kramdown, though not sure if it fixes this particular problem. But it has another nice little way to add tags[1].

    Also, discount is lovely and small.

    1. Try converting your HTML with headings to a kramdown document and you will see what I am saying.