There are two major problems concerning data loss. Presumably this is because there's some kind of data sanitization going on that's too aggressive and/or too intelligent for its own good.
First of all,
Anything between < and > is thrown away, without warning or consent. After saving: just poof, gone. There's probably a mechanism in place to remove all HTML tags, and while that is a questionable thing to do in the first place, it is obviously too aggressive.
Consider for instance a title like "This rose < the red one > in a vase". It's a crappy title, but it illustrates the point. After saving this title, what's left is "This rose in a vase".
This should be fixed. It falls under the category of data loss: there are no warnings, and the user is never asked for consent before it happens.
Ideally, remove this nonsense mechanism and just *escape* angle brackets. That should be more than enough. YES I KNOW ABOUT XSS, but this concerns data entered by an administrator, who will not XSS their own website, I imagine.
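To make the difference between stripping and escaping concrete, here is a minimal sketch in Python (not Zenphoto's actual code, which I haven't looked at). `strip_tags_naive` is a hypothetical stand-in for whatever over-eager stripping is going on; `html.escape` is the standard library's escaping function.

```python
import html
import re


def strip_tags_naive(text: str) -> str:
    """Hypothetical stand-in for an over-eager tag stripper:
    drops everything between '<' and '>', whether it is a tag or not."""
    return re.sub(r"<[^>]*>", "", text)


def escape_for_html(text: str) -> str:
    """Escape the characters that are special in HTML; nothing is removed."""
    return html.escape(text)


title = "This rose < the red one > in a vase"

stripped = strip_tags_naive(title)  # the bracketed part is gone for good
escaped = escape_for_html(title)    # "This rose &lt; the red one &gt; in a vase"

# Escaping is reversible, stripping is not.
recovered = html.unescape(escaped)
```

The escaped version renders in a browser exactly as typed, and the original can always be recovered from it. The stripped version cannot be recovered at all.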
Secondly,
Anything, anywhere at all, any piece of text entered no matter where, with an emoji in it, will be cleared ENTIRELY. It's not that it won't be saved, it's overwritten with an empty string if there's an emoji anywhere in the string.
What nonsense is this?! What other characters will be causing this behaviour? Again, the category is data loss. This should be fixed first thing, as, well, it causes data loss!
I wonder what mechanism decides to destroy anything and everything that has an emoji in it?... Well, someone didn't do the pile of poo test, for sure:
https://mathiasbynens.be/notes/javascript-unicode#poo-test
Anyway, please fix these problems. They need not exist in any system, really. It's a simple matter of testing. Heck, even a simple unit test will be able to catch problems like these.
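For what it's worth, such a test could look roughly like the sketch below. It is written in Python rather than against the real code base, and `InMemoryStore` is a made-up placeholder for the actual save/load layer; the point is only the round-trip assertion.

```python
import unittest


class InMemoryStore:
    """Hypothetical placeholder for the CMS save/load layer."""

    def __init__(self):
        self._rows = {}

    def save(self, key, value):
        self._rows[key] = value

    def load(self, key):
        return self._rows[key]


# Inputs that must come back exactly as entered.
SAMPLES = [
    "This rose < the red one > in a vase",
    "Pile of poo: \U0001F4A9",
    "plain ascii",
    "\u6c49\u5b57",  # two Chinese characters
]


class RoundTripTest(unittest.TestCase):
    def test_text_survives_save_and_load(self):
        store = InMemoryStore()
        for sample in SAMPLES:
            with self.subTest(sample=sample):
                store.save("title", sample)
                self.assertEqual(store.load("title"), sample)


def run_suite():
    """Run the round-trip test and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RoundTripTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

Swap the placeholder store for the real persistence code and both bugs described above should make this test fail.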
Comments
`
a => (href =>() title =>() target=>() class=>() id=>())
abbr =>(class=>() id=>() title =>())
acronym =>(class=>() id=>() title =>())
b => (class=>() id=>() )
blockquote =>(class=>() id=>() cite =>())
br => (class=>() id=>() )
code => (class=>() id=>() )
em => (class=>() id=>() )
i => (class=>() id=>() )
strike => (class=>() id=>() )
strong => (class=>() id=>() )
ul => (class=>() id=>())
ol => (class=>() id=>())
li => (class=>() id=>())
p => (class=>() id=>() style=>())
h1=>(class=>() id=>() style=>())
h2=>(class=>() id=>() style=>())
h3=>(class=>() id=>() style=>())
h4=>(class=>() id=>() style=>())
h5=>(class=>() id=>() style=>())
h6=>(class=>() id=>() style=>())
pre=>(class=>() id=>() style=>())
address=>(class=>() id=>() style=>())
span=>(class=>() id=>() style=>())
div=>(class=>() id=>() style=>())
img=>(class=>() id=>() style=>() src=>() title=>() alt=>() width=>() height=>())
`
If you are not using the editor then you need to enter the entity codes yourself, since there is no processing going on. In all cases the raw characters cannot be left in the text as they will break the HTML. Thus the tag processing above.
Also, why is HTML being stripped? Why not just escape it like any normal CMS does? It's a super simple function, one that even I could write in 2 minutes flat.
Also, why does the WYSIWYG editor replace characters with HTML entities? Why can't it store the characters as-is in the database and then escape them while outputting? That way, you won't run into hellish problems when you're starting work on a REST API.
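A rough sketch of what I mean by "store as-is, escape on output", in Python for brevity. The function names and the sample title are made up for illustration; only `html.escape` and `json.dumps` are real standard-library calls.

```python
import html
import json

# What sits in the database: exactly what the user typed.
stored_title = "Rock & roll <live>"


def render_html(title: str) -> str:
    """Escape at the moment of HTML output, not before storage."""
    return f"<h1>{html.escape(title)}</h1>"


def render_json(title: str) -> str:
    """A REST API needs the raw text; JSON has its own escaping rules."""
    return json.dumps({"title": title}, ensure_ascii=False)


page = render_html(stored_title)     # <h1>Rock &amp; roll &lt;live&gt;</h1>
payload = render_json(stored_title)  # {"title": "Rock & roll <live>"}
```

If the entities were already baked into the stored value, the JSON consumer would receive `&amp;` and `&lt;` and have to undo them again.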
So. Can we get this fixed or not?
And as for the emoji characters. How do you explain that?
`fdfsfsf < fsdfsdf > fsdfsdf` creates broken HTML. How to decide on output what is HTML and what is wanted as literal text? Ask the TinyMCE guys. Besides that, the editor does add HTML like paragraphs or line breaks, and if it didn't escape these characters entered as text it would break the HTML. How should it otherwise know whether this is broken HTML or wanted characters?
Regarding emojis: I don't actually see something like `:simple_smile:` being replaced or stripped.
What I put in was the Unicode character for .
After reading http://joomla.stackexchange.com/questions/13982/tinymce-is-deleting-emoji-and-everything-after I wonder if it's because the database needs utf8mb4 support? Might explain why some things happen in WordPress with some emojis as well.
Not sure if the TinyMCE emoticon plugin is usable or the same as emojis. I have never used it (and I admit I am not a fan of emoji either). However, it should be enabled in the full configuration available.
Zenphoto indeed currently uses just `utf8_unicode_ci` for table creation. But `utf8mb4_unicode_ci` (MySQL 5.5.3+) probably would work as well. I have not tried it, but I guess changing this afterwards will work, too.
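Some background on why the collation matters: MySQL's `utf8` character set stores at most three bytes per character, while emoji need four bytes in UTF-8, which is what `utf8mb4` was added for. The Python sketch below only illustrates that byte-length difference; `needs_utf8mb4` is a made-up helper name, and what the server does with such a character (reject or truncate) depends on its settings.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane,
    i.e. takes four bytes in UTF-8 and does not fit MySQL's 3-byte utf8."""
    return any(ord(ch) > 0xFFFF for ch in text)


ascii_len = len("a".encode("utf-8"))             # 1 byte
cjk_len = len("\u6c49".encode("utf-8"))          # 3 bytes, fits 3-byte utf8
emoji_len = len("\U0001F4A9".encode("utf-8"))    # 4 bytes, needs utf8mb4

plain_ok = needs_utf8mb4("This rose in a vase")        # False
emoji_flagged = needs_utf8mb4("Nice \U0001F4A9 title")  # True
```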
As for the database tables: use proper UTF-8. I don't know (or care, to be brutally honest) what magic it needs, but the fact of the matter is that it's the database's job to store the data exactly as-is. And as it so happens, emoji can be encoded as UTF-8 just as perfectly as a simple Chinese character or even a classic ASCII character.
As for the html problem:
If you enter "<p>fdfsfsf < fsdfsdf > fsdfsdf </p>" in a field, then I would expect it to display "<p>fdfsfsf < fsdfsdf > fsdfsdf </p>" on the website. All the angle brackets neatly escaped, because, well, I entered them, didn't I??
Whatever tinyMCE does, that's not the issue here. I'm not using tinyMCE, and never will (because it's a horrible crap editor that has no place in any CMS, since it produces HTML, which does not belong in the database) so when I literally enter angle brackets, I want to see them outputted literally as well. Simple as that.
I consider anything removed or changed to be data loss, and it should be fixed.
HTML certainly has its place there, since it is meant to structure text. And wrong HTML is corrected/cleared, yes. But we don't need to agree on this, I guess.