Performance, performance ;)

Hi,

I'm just playing with zenphoto on the big tables and made some "improvements" for speed and also have some questions regarding the improvments.

I don't want to commit stuff to your repository as it is not tested/stable enough so I don't want to break things.

- Function compressedIDList() in class-search.php

I just shortened the whole stuff to:

`

protected function compressedIDList($idlist) {

asort($idlist);

return ' ``id` IN(' . implode(',',$idlist) . ')';

}

This allows the database to use the "shortest way" to get the information for you :). You can use the explain command in the shell to check that :).

- Index on albums and image tables

You should create an index over the default fields that one can select in settings -> picture -> default exif fields.

I just tested on about 300.000 pictures in the images table and it lowered the searchtime from 2 miuntes 23 secs to 7 seconds (yes, seven seconds).

- getSearchImages() in class-search.php

The foreach loop iterates over $search_results and queries the database for every result it gets. I created an array that holds the albumid's that have been checked already and only add each albumid once. When the loop is finished i just query the database with the array of albumids and let the database return one array with all information included, like:

`

$_albumvalues = query_full_aray("select id, title, folder, ``show`from ".prefix('albums') . " where ". $this->compressedIDList($_albumids),true,'id');

// after this i can just get the album name from the in the main foreach loop :)

$albumname = $_albumvalues[$albumid]['folder'];

Actually i'm thinking of changing the search engine to let the database do the work. When you search over a bigger table and get 5000 or more results you should be able to limit the search in the database already (let the database do the work).

It becomes really annoying if you have a search with 10.000 hits, and you want to go over to the second page of the search results and the whole stuff is done again. Running the same search as "SELECT bleh,blah,blubb order by foo asc limit " . ($number_of_pictures_per_page * $page_to_show) . "," . $number_of_pictures_per_page would help a lot. The Database does the work for you and you don't have to read 10.000 lines to memory and then get ten lines out of it for your slice of pictures … It also lowers the other queries that are done when the class checks for picture/album tags and stuff. This would speed things up a lot - I haven't implemented it yet so I can't tell you how much.

If you like I'll commit my work :)

Just let me know.

KR,

Grimeton

Comments

  • If you have commit priviledges, please do commit tested changes. But only to the DEV branch. If we find a problem we can always back out the changes.

    Some questions and observations:

    The default search fields can, of course, change. Do you the drop the old index and add a new one?

    The getSearchImages() method cache images for further use. If you do not retireve all the images the functions that rely on this will fail. Also, all that is stored in memory is the file and folder names. So maybe not so much savings? This could always be changed to use fetch associated if there is a base and limit, but that information is not currently passed to the function.

    If you take this approach I think you would need to do performance measurements on multiple different themes to see how they are effected over all. For instnace, a theme that counts the results of a search before showing results will probably not benefit at all and may actually have a performance hit.

    I don't think I understand what you have done with capturing the albums in the image loop. The concept sounds good, but the description does not sound correct. Why would you not just store the needed data when you capture the album on first reference and then use it if it already exists? What is the query at the end for? The folder name is stored with the image name in the images array and not used otherwise. So really all that is needed is to also store the folder name "permit/deny" state for use next time you see the album ID.
  • acrylian Administrator, Developer
    Grimeton, you can't commit to the repository at all. We only provide access to certain persons.

    You can however create a ticket and attach your changes there for review. As sbillard said be sure to work with the dev nightly build for this.
  • Hi,

    you're right. It's hard to get an idea of what I'm talking about when you don't have the whole picture. So here we go:

    First thing that should be changed is the indexing feature that indexes pictures on the fly while someone is watching the gallery. I'm not saying it should be completely gone, but there should be an option to disable it. It's one of the most performance killing things. Put a knob in the config section and then put things indexing images like in Album::getImages() into the garbageCollect() method so that getImages() only returns the known images from the database (and if OTF indexing is needed calls garbageCollect() ).

    One could create three settings:

    1 - always on (do indexing every time)
    2 - only after a number of queries/number of days passed/whatever fits here
    3 - off - to index you have to run a script (via cron maybe)

    Second thing is that only those parts of a class get populated which are really needed the moment they're needed. From the OOP point of view it's a good idea to have everything at hand, but not from the php point of view, as those are scripts where you can't reuse the results in a later connection/query again.

    Third thing is: let the database do most of the work. IMHO the stuff get's heavy and slow when more data is transferred between the database and the php script than really needed.

    E.g. if someone calls Album::getNumImages() you should do a "select count(1) from zp_images where albumid = $this->id;

    The database is ten times faster in counting the rows than the whole process is when you're selecting the rows, transferring them to the php script and then doing a sizeof() or looping over the returned array. You can even create callbacks that are executed if someone tries to access a value of the class like Album->$images and stuff.

    Same goes for the search I was talking about yesterday:

    In your code the following happens inside SearchEngine::getSearchImages() - I used nopaste for this, it's easier to read if the code is colored and formatted.

    So read on here: http://pastebin.com/p2e2636E I created comments in the code.

    There are many solutions to this problem. One could be to just create a select statement that joins the result id's of the pictures with the album information.

    like "select zp_images.id,zp_images.albumid,zp_images.`desc`from zp_images left join (zp_albums) on (zp_images.albumid=zp_albums.id) where $searchcolumn1 = '%search1%' and …. ; -- btw. the guy that had the idea to name the column for the description "desc" is a genius ;)

    This would give you a huge dataset that you can just loop on and create everything you need without bugging the database again and again…. You could also work with the "limit" statement here.

    Other way could be to just query for the album ids via distinct and then get the album information for all albums used in this search from the database. The bad thing about this is that you have to run the query twice - one time for the unique set of the album ids, second time to get your results so it's too expensive.

    The compromise could look like this:

    http://pastebin.com/xkK8J9n2

    It loops over the search results twice, but it only uses a small amount of memory as the information for each album is only queried and stored in memory once and not over and over again. One could also use the "limit" statement here. If the indexing OTF is not used you can also skip the tests if the album/file exists on the storage device and just return the array of images. That would speed things up a lot as you can just return an array that you got back from the database. Imagine this used together with LIMIT ;)

    To your questions:

    - Default search fields can change

    One way could be to drop the old index and create a new one. Another way could be to create a second index (maybe up to $number_defined_in_config). If the database was used for a few days with your default index you can run "analyze table zp_images" and check for the most used columns in a search and then create an index based on this result.

    = getSearchImages() caches images for further use.

    The moment you call getSearchImages() the SearchClass() only gets the results created by the sql query statement. And everything inside this class and other classes that use/rely on the SearchClass() are only seeing and working on the result set returned by the database query - so IMHO it doesn't make a difference for the classes if they work on those 20 or 10000 pictures - they're working with the stuff they get. But it's a lot faster because all the other things that come after the database query are working with a smaller result set …

    = Theme counting the results before going on

    While explaining the getSearchImages() changes I made, I already talked about this. There are multiple ways todo it - run a "select count(1) …" as the search query, or just count the rows returned. Speed will always be a problem at this particular point because you can run the same query two times or just count the returned values while creating the unique id set. The problem here is that you have to remember that the only "constant" thing in all this is the database. When the php script served the first page of the result set it dies and when you go to page two a new script is launched, initialized, doing it's work, returning the result set and dies again.

    So if you hit a big result set there are multiple ways to go:

    Just fyi: when I talk about temporary tables, I'm talking about static tables generated for temporary use. The temporary tables of MySQL are destroyed the moment the connection is dropped and are only visible to the connection that created them, so the data can't be saved across the calls of the script.

    To store a temporary search result into a "temp" table the following construct could be used:

    - We're running a cron job that indexes the files
    - We're using two tables:
    = The first one holds the search fields, their values, the temporary table name and the date the result set was created.
    = The second one is the table named in table one and created by a "select into".

    Something like this:

    create table temp_tables(
    id int auto_increment key,
    query_fields text,
    query_values text,
    query_done TIMESTAMP default now()
    );

    So a new search does something like this:

    insert into temp_tables(query_fields,query_values) values("desc,title,filename","blah,bleh,blubb");
    select last_insert_id();

    Btw: On a "real" DBMS the above would be "insert into temp_tables(query_fields,query_values) values("desc,title,filename","blah,blah,blubb") returning id;" ;)

    And then you create a new table like:

    create table tmp_$id_from_previous_insert select a,b,c from zp_images where …. $your_search_query_here…

    Then you get a table with your results and a timestamp to it. So if one does a search you check if the search already exists *AND* if the script indexing the pictures ran since the time the result was created and then just output the results you need (via limit) and you can get the number of results by an easy count(1) over the results table. One could use serialize() to put the query's columns and values into one table - could be easier - maybe. As storage is cheap and memory gets cheaper everyday it's a good way to use those tables and - if enough memory is available - create the temporary tables just in memory.

    If the indexing script runs, it just sets an option in zp_options to the timestamp the moment it starts running or just deletes all temporary tables after it ran (another option). So that you generate new result sets in the end.

    Just my $0.02

    KR,

    Grimeton
  • Hi,

    just another thing: if one uses serialize() over the array that holds the search fields and their values like array('desc' => 'bleh', 'title' => 'blubb'); then the search class can just "impersonate" as this search and has *everything* at hand after reading the serialized array.

    One could extend the array and just put stuff into it that was returned after the first time the search was executed, like the number of rows, or the columns and their names and stuff :)

    Cu
  • Clearly your knowledge of databases far excedes mine. I fear I am not up to the task of making the sweeping changes you describe.

    We welcome contributaions from our community. You can create a ticket and attach your changed files. (From my experience DIFF files are a waste of time--too often some compatibility issue makes them useless to me.)

    The areas of concern I have are all involved in user rights versus access and visibility. If you think those are all addressed in your proposal that is great. But remember that counting the images in the database does not give the same result as counting the images a particular user can see/access.
  • I guess there is one other area that might need attention--the beavior of software an image no longer exists.
  • acrylian Administrator, Developer
    It sounds all interessting. I see that using the database functions directly has speed benefits over doing the stuff with php afterwards.As my collegue said we are open for contributions so please indeed attach the files to a ticket. (the "desc" issue, well, maybe a little unfortunate but can be used if escaped...;-))

    I think the biggest problem with relying on the database preliminary is Zenphoto's feature being filesystem based which makes it so convenient to use. I think the static html cache could at least help with some symptoms.

    Of course you gallery is of quite exceptional size..;) Keep in mind that Zenphoto primarily is for personal gallery/portfolio type sites on especially shared hosts. So cron jobs might not be possible easily.
  • Hi,

    IMHO it doesn't make much sense if I patch the whole thing and then those patches aren't included into the main stuff.

    So let us agree to the following: I'll hack the PHP and Database stuff together and include it into the main line and you - whoever from your team - implements the option settings in the configuration menu :)

    I'm more the C/C++ programmer than a web designer so my html skills are from 2004 or something and a bit rusty ;)

    I hope this is ok :)

    KR,

    Grimeton
  • Unless you are willing to make a complete update and test, we will probably go nowhere with this.
  • So you're not interested in the changes?
  • So you are not intersted in doing a complete job?
  • I don't want to sit down and implement everything and then end up with my changes not included.

    This will take a few days to work everything out and to implement it, so I don't want to go that long road down without someone that adds the necessary config options to the config interface.

    As I wrote before - I'm not the HTML guy, nor do I know a lot about CSS and stuff...

    cu
  • Well, there are never guarantees in the world. So be it.
  • acrylian Administrator, Developer
    What my collegue actually means that these would be quite significant changes that will require a lot of tests so no other functionality breaks. And there are lots of places like user and access right management.

    So at no chance we will implement any changes officially at all before they have been tested and that will take quite some time as well. Otherwise we probably would sit here with the work to fix afterwards. So we can't guarantee that they or all will be included (my collegue already did implement some of your ideas btw as he told me).

    Also given that my collegue will be away soon for some time and I couldn't handle such a huge change myself alone at all that also means there will not be much core development until September so you could work on it and we all could test afterwards.

    For the functionality itself there should be no HTML or CSS knowledge needed. Options can be added later.
Sign In or Register to comment.