Data loss in titles, descriptions, etc.

There are two major problems concerning data loss. Presumably this is because there's some kind of data sanitization going on that's too aggressive and/or too intelligent for its own good.

First of all,

Anything between < and > is thrown away. Without warning or consent. After saving, just poof, gone. There's probably a mechanism in place to remove all HTML tags, and while this is a questionable thing to do in the first place, it is obviously too aggressive.

Consider for instance a title like "This rose < the red one > in a vase". It's a crappy title, but it serves the use case. After saving this title, what's left is "This rose in a vase".

This should be fixed. This falls under the category of data loss. There are no warnings, and the user is never asked for consent before it happens.

Ideally, just remove this nonsense mechanism and just *escape* angle brackets. That should be more than enough. YES I KNOW ABOUT XSS, but this is concerning data entered by an administrator, who will not XSS their own website I imagine.
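To make the difference concrete, here is a minimal sketch in Python (not the CMS's own code). The regex is only a guess at what an over-eager tag stripper might look like, not the actual mechanism; `html.escape` is the standard-library escaping function.

```python
import html
import re

title = "This rose < the red one > in a vase"

# Hypothetical stand-in for the stripper: drop everything between < and >.
stripped = re.sub(r"<[^>]*>", "", title)

# Escaping keeps every character and still makes the text safe inside HTML.
escaped = html.escape(title)

print(stripped)  # the "< the red one >" part is gone for good
print(escaped)   # angle brackets become &lt; and &gt;

# Escaping is reversible, stripping is not.
assert html.unescape(escaped) == title
```

The escaped form renders in a browser exactly as it was typed, and the original text can always be recovered from it.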

Secondly,

Anything, anywhere at all, any piece of text entered no matter where, with an emoji in it, will be cleared ENTIRELY. It's not that it won't be saved, it's overwritten with an empty string if there's an emoji anywhere in the string.

What nonsense is this?! What other characters will be causing this behaviour? Again, the category is data loss. This should be fixed first thing, as, well, it causes data loss!

I wonder what mechanism decides to destroy anything and everything that has an emoji in it?... Well, someone didn't do the pile of poo test, for sure:
https://mathiasbynens.be/notes/javascript-unicode#poo-test

Anyway, please fix these problems. They need not exist in any system, really. It's a simple matter of testing. Heck, even a simple unit test would catch problems like these.
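A rough sketch of such a test, written in Python for illustration. `InMemoryStore` and its `save_title`/`load_title` methods are hypothetical placeholders; a real test would call the actual save and load path of the CMS instead.

```python
import unittest


class InMemoryStore:
    """Hypothetical stand-in for the real save/load path of the CMS."""

    def __init__(self):
        self._rows = {}

    def save_title(self, item_id, title):
        self._rows[item_id] = title

    def load_title(self, item_id):
        return self._rows[item_id]


# Inputs taken from the two problems described above.
SAMPLES = [
    "This rose < the red one > in a vase",
    "Pile of poo \U0001F4A9 in a title",
    "plain ascii",
]


class RoundTripTest(unittest.TestCase):
    def test_titles_survive_save_and_load(self):
        store = InMemoryStore()
        for item_id, title in enumerate(SAMPLES):
            with self.subTest(title=title):
                store.save_title(item_id, title)
                # Whatever goes in must come out unchanged.
                self.assertEqual(store.load_title(item_id), title)
```

Run it with `python -m unittest`. Against a store that strips angle brackets or blanks out strings containing emoji, the first two samples would fail immediately.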

Comments

  • acrylian Administrator, Developer
    Sorry, I don't understand what data loss problem you are referring to. In titles html is indeed not allowed but in descriptions or Zenpage item content it is. Please look at Options > General > Allowed tags (yes, tags is the wrong term actually…). If there is nothing set it will be stripped of course. It should look like this:
    `
    a => (href =>() title =>() target=>() class=>() id=>())
    abbr =>(class=>() id=>() title =>())
    acronym =>(class=>() id=>() title =>())
    b => (class=>() id=>() )
    blockquote =>(class=>() id=>() cite =>())
    br => (class=>() id=>() )
    code => (class=>() id=>() )
    em => (class=>() id=>() )
    i => (class=>() id=>() )
    strike => (class=>() id=>() )
    strong => (class=>() id=>() )
    ul => (class=>() id=>())
    ol => (class=>() id=>())
    li => (class=>() id=>())
    p => (class=>() id=>() style=>())
    h1=>(class=>() id=>() style=>())
    h2=>(class=>() id=>() style=>())
    h3=>(class=>() id=>() style=>())
    h4=>(class=>() id=>() style=>())
    h5=>(class=>() id=>() style=>())
    h6=>(class=>() id=>() style=>())
    pre=>(class=>() id=>() style=>())
    address=>(class=>() id=>() style=>())
    span=>(class=>() id=>() style=>())
    div=>(class=>() id=>() style=>())
    img=>(class=>() id=>() style=>() src=>() title=>() alt=>() width=>() height=>())
    `
  • If you are using the Wysiwyg editor these characters will normally be replaced by their html entity codes. (This is an editor setting, so of course you can override it, but then you get what you deserve.)

    If you are not using the editor then you need to enter the entity codes yourself since there is no processing going on. In all cases the raw characters cannot be left in the text as they will break the HTML. Thus the tags processing above.
  • There is no html in "This rose < the red one > in a vase".

    Also why is html being stripped? Why not just escape it like any normal CMS does? It's a super simple function, one even I could write in 2 minutes flat.

    Also, why does the wysiwyg editor replace characters by html entities? Why can't it store the characters as-is in the database and then escape them while outputting? That way, you won't run into hellish problems when you're starting work on a REST API.
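A small sketch of the idea, in Python for illustration: keep the raw text in storage and pick the escaping per output channel. The function names `render_html` and `render_json` are made up for this example and are not part of any CMS.

```python
import html
import json

# Raw text, stored exactly as the user typed it.
stored = "This rose < the red one > in a vase"


def render_html(text):
    """Escape only at the moment the text is placed into an HTML page."""
    return "<h1>" + html.escape(text) + "</h1>"


def render_json(text):
    """A REST API needs JSON escaping instead, so no HTML entities here."""
    return json.dumps({"title": text})


print(render_html(stored))
print(render_json(stored))
```

Had the entities been stored in the database, the JSON consumer would receive `&lt;` and `&gt;` and would have to undo the HTML escaping first.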

    So. Can we get this fixed or not?

    And as for the emoji characters. How do you explain that?
  • acrylian Administrator, Developer
    HTML or html-like text is not allowed in titles. Within content fields (descriptions) html needs to be cleaned since something like `

    fdfsfsf < fsdfsdf > fsdfsdf

    ` creates broken html. How to decide on output what is html and what is wanted otherwise?
    Also, why does the wysiwyg editor replace characters by html entities?
    Ask the tinyMCE guys. Besides that, the editor does add html like paragraphs or line breaks, and if it didn't escape these characters entered as text it would break the html. How else should it know whether this is broken html or wanted characters?

    Regarding emojis: I don't see that something like `:simple_smile:` would be replaced or stripped actually.
  • Hmm.. I just edited one of my image descriptions and put a Smile emoji in it and when I hit apply, it erased the contents completely. Luckily I could refresh the meta information and got it back. I tried to put the emoji in the post here and it deleted all the content after the emoji. Hmmm..

    What I put in was the Unicode character for :smile: .

    After reading http://joomla.stackexchange.com/questions/13982/tinymce-is-deleting-emoji-and-everything-after I wonder if it's because the database needs utf8mb4 support? Might explain why some things happen in WordPress with some emojis as well.
  • acrylian Administrator, Developer
    If using unicode chars that might be it as unicode Emojis are quite new plus a TinyMCE configuration thing that might be stricter. Best see its documentation on that. Could also be our HTML cleanup.

    Not sure if the TinyMCE emoticon plugin is usable or the same as Emojis. Never used it (and I am not a fan of Emoji either, I admit). However, it should be enabled in the full configuration available.

    Zenphoto currently indeed uses just `utf8_unicode_ci` for table creation. But `utf8mb4_unicode_ci` (MySQL 5.5.3+) probably would work as well. I have not tried it but I guess changing this afterwards will work, too.
  • The emoji problem (and other non-BMP characters I suspect!) has nothing to do with tinyMCE. I'm not using tinyMCE anywhere and I still have this problem.

    As for the database tables: use proper utf-8. I don't know (or care, to be brutally honest) what magic it needs, but the fact of the matter is that it's the database's job to store the data exactly as-is. And as it so happens, emoji can be encoded to utf-8 as perfectly as a simple Chinese character or even a classic ascii character.
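For what it's worth, a quick sketch in Python of what sets emoji apart at the byte level. The helper `describe` is made up for this example. The relevant background is that MySQL's legacy `utf8` character set stores at most three bytes per character, while `utf8mb4` allows four. Whether that limit is really what triggers the behaviour here is an assumption.

```python
def describe(ch):
    """Return (UTF-8 byte length, True if the character lies outside the BMP)."""
    return len(ch.encode("utf-8")), ord(ch) > 0xFFFF


samples = {
    "ascii letter": "a",
    "Chinese character": "\u4e2d",
    "smile emoji": "\U0001F604",
}

for label, ch in samples.items():
    byte_len, non_bmp = describe(ch)
    print(label, byte_len, non_bmp)
    # All three are valid UTF-8 and decode back to the same character.
    assert ch.encode("utf-8").decode("utf-8") == ch
```

All three encode and decode cleanly. Only the emoji needs four bytes, which is more than a three-byte-per-character column can hold.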

    As for the html problem:

    If you enter "<p>fdfsfsf < fsdfsdf > fsdfsdf </p>" in a field, then I would expect it to display "<p>fdfsfsf < fsdfsdf > fsdfsdf </p>" on the website. All the angle brackets neatly escaped, because, well, I entered them, didn't I??

    Whatever tinyMCE does, that's not the issue here. I'm not using tinyMCE, and never will (because it's a horrible crap editor that has no place in any CMS, since it produces HTML, which does not belong in the database) so when I literally enter angle brackets, I want to see them outputted literally as well. Simple as that.

    Anything removed or changed I consider data loss and should be fixed.
  • acrylian Administrator, Developer
    I'm not using tinyMCE, and never will (because it's a horrible crap editor that has no place in any CMS, since it produces HTML, which does not belong in the database)

    Well, I don't agree. This is of course a developer view of things. The "normal user" is used to something like that which is probably why such editors are used widely.

    HTML certainly has its place there since it is meant to structure text. And wrong html is corrected/cleared, yes. But we don't need to agree on this I guess.