postgis/doc/performance_tips.xml

<?xml version="1.0" encoding="UTF-8"?>
<chapter>
  <title>Performance tips</title>

  <sect1>
    <title>Small tables of large geometries</title>

    <sect2>
      <title>Problem description</title>

      <para>Current PostgreSQL versions (including 8.0) suffer from a query
      optimizer weakness regarding TOAST tables. TOAST tables are a kind of
      "extension room" used to store large (in the sense of data size) values
      that do not fit into normal data pages (like long texts, images or
      complex geometries with lots of vertices), see
      <ulink url="http://www.postgresql.org/docs/current/static/storage-toast.html">the PostgreSQL Documentation for TOAST</ulink> for more
      information).</para>

      <para>The problem appears if you happen to have a table with rather
      large geometries, but not too much rows of them (like a table containing
      the boundaries of all European countries in high resolution). Then the
      table itself is small, but it uses lots of TOAST space. In our example
      case, the table itself had about 80 rows and used only 3 data pages, but
      the TOAST table used 8225 pages.</para>

      <para>Now issue a query where you use the geometry operator &amp;&amp;
      to search for a bounding box that matches only very few of those rows.
      Now the query optimizer sees that the table has only 3 pages and 80
      rows. He estimates that a sequential scan on such a small table is much
      faster than using an index. And so he decides to ignore the GIST index.
      Usually, this estimation is correct. But in our case, the &amp;&amp;
      operator has to fetch every geometry from disk to compare the bounding
      boxes, thus reading all TOAST pages, too.</para>

      <para>To see whether your suffer from this bug, use the "EXPLAIN
      ANALYZE" postgresql command. For more information and the technical
      details, you can read the thread on the postgres performance mailing
      list:
      http://archives.postgresql.org/pgsql-performance/2005-02/msg00030.php</para>
    </sect2>

    <sect2>
      <title>Workarounds</title>

      <para>The PostgreSQL people are trying to solve this issue by making the
      query estimation TOAST-aware. For now, here are two workarounds:</para>

      <para>The first workaround is to force the query planner to use the
      index. Send "SET enable_seqscan TO off;" to the server before issuing
      the query. This basically forces the query planner to avoid sequential
      scans whenever possible. So it uses the GIST index as usual. But this
      flag has to be set on every connection, and it causes the query planner
      to make misestimations in other cases, so you should "SET enable_seqscan
      TO on;" after the query.</para>

      <para>The second workaround is to make the sequential scan as fast as
      the query planner thinks. This can be achieved by creating an additional
      column that "caches" the bbox, and matching against this. In our
      example, the commands are like:</para>

      <programlisting>SELECT AddGeometryColumn('myschema','mytable','bbox','4326','GEOMETRY','2');
UPDATE mytable SET bbox = ST_Envelope(ST_Force_2d(the_geom));</programlisting>

      <para>Now change your query to use the &amp;&amp; operator against bbox
      instead of geom_column, like:</para>

      <programlisting>SELECT geom_column
FROM mytable
WHERE bbox &amp;&amp; ST_SetSRID('BOX3D(0 0,1 1)'::box3d,4326);</programlisting>

      <para>Of course, if you change or add rows to mytable, you have to keep
      the bbox "in sync". The most transparent way to do this would be
      triggers, but you also can modify your application to keep the bbox
      column current or run the UPDATE query above after every
      modification.</para>
    </sect2>
  </sect1>

  <sect1>
    <title>CLUSTERing on geometry indices</title>

    <para>For tables that are mostly read-only, and where a single index is
    used for the majority of queries, PostgreSQL offers the CLUSTER command.
    This command physically reorders all the data rows in the same order as
    the index criteria, yielding two performance advantages: First, for index
    range scans, the number of seeks on the data table is drastically reduced.
    Second, if your working set concentrates to some small intervals on the
    indices, you have a more efficient caching because the data rows are
    spread along fewer data pages. (Feel invited to read the CLUSTER command
    documentation from the PostgreSQL manual at this point.)</para>

    <para>However, currently PostgreSQL does not allow clustering on PostGIS
    GIST indices because GIST indices simply ignores NULL values, you get an
    error message like:</para>

    <programlisting>lwgeom=# CLUSTER my_geom_index ON my_table;
ERROR: cannot cluster when index access method does not handle null values
HINT: You may be able to work around this by marking column "the_geom" NOT NULL.</programlisting>

    <para>As the HINT message tells you, one can work around this deficiency
    by adding a "not null" constraint to the table:</para>

    <programlisting>lwgeom=# ALTER TABLE my_table ALTER COLUMN the_geom SET not null;
ALTER TABLE</programlisting>

    <para>Of course, this will not work if you in fact need NULL values in
    your geometry column. Additionally, you must use the above method to add
    the constraint, using a CHECK constraint like "ALTER TABLE blubb ADD CHECK
    (geometry is not null);" will not work.</para>
  </sect1>

  <sect1>
    <title>Avoiding dimension conversion</title>

    <para>Sometimes, you happen to have 3D or 4D data in your table, but
    always access it using OpenGIS compliant ST_AsText() or ST_AsBinary()
    functions that only output 2D geometries. They do this by internally
    calling the ST_Force_2d() function, which introduces a significant
    overhead for large geometries. To avoid this overhead, it may be feasible
    to pre-drop those additional dimensions once and forever:</para>

    <programlisting>UPDATE mytable SET the_geom = ST_Force_2d(the_geom);
VACUUM FULL ANALYZE mytable;</programlisting>

    <para>Note that if you added your geometry column using
    AddGeometryColumn() there'll be a constraint on geometry dimension. To
    bypass it you will need to drop the constraint. Remember to update the
    entry in the geometry_columns table and recreate the constraint
    afterwards.</para>

    <para>In case of large tables, it may be wise to divide this UPDATE into
    smaller portions by constraining the UPDATE to a part of the table via a
    WHERE clause and your primary key or another feasible criteria, and
    running a simple "VACUUM;" between your UPDATEs. This drastically reduces
    the need for temporary disk space. Additionally, if you have mixed
    dimension geometries, restricting the UPDATE by "WHERE
    dimension(the_geom)&gt;2" skips re-writing of geometries that already are
    in 2D.</para>

  </sect1>

  <sect1>
    <title>Tuning your configuration</title>

    <para>These tips are taken from Kevin Neufeld's presentation "Tips for the
    PostGIS Power User" at the FOSS4G 2007 conference.  Depending on your
    use of PostGIS (for example, static data and complex analysis vs frequently
    updated data and lots of users) these changes can provide significant
    speedups to your queries.</para>

    <para>For a more tips (and better formatting), the original presentation
        is at
        <ulink url="http://2007.foss4g.org/presentations/view.php?abstract_id=117">
        http://2007.foss4g.org/presentations/view.php?abstract_id=117</ulink>.
    </para>

      <sect2>
        <title>Startup</title>

        <para>
          These settings are configured in postgresql.conf:
        </para>

        <para>
          <ulink url="http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS">checkpoint_segments</ulink>
        </para>

        <itemizedlist>
          <listitem>
            <para>
              Maximum number of log file segments between automatic WAL checkpoints
              (each segment is normally 16MB); default is 3
            </para>
          </listitem>
          <listitem>
            <para>
              Set to at least 10 or 30 for databases with heavy write activity, or
              more for large database loads. Another article on the topic worth reading  <ulink url="http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm">Greg Smith: Checkpoint and Background writer</ulink>
            </para>
          </listitem>
          <listitem>
            <para>
              Possibly store the xlog on a separate disk device
            </para>
          </listitem>
        </itemizedlist>

        <para>
           <ulink url="http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION">constraint_exclusion</ulink>
        </para>

        <itemizedlist>
          <listitem>
            <para>
              Default: off (prior to PostgreSQL 8.4 and for PostgreSQL 8.4+ is set to partition)
            </para>
          </listitem>
          <listitem>
            <para>
              This is generally used for table partitioning.  If you are running PostgreSQL versions below 8.4, set to "on" to ensure the query planner will optimize as desired.
			  As of PostgreSQL 8.4, the default for this is set to "partition" which is ideal for PostgreSQL 8.4 and above since
			  it will force the planner to only analyze tables for constraint consideration if they are in an inherited hierarchy
			  and not pay the planner penalty otherwise.
            </para>
          </listitem>
        </itemizedlist>

        <para>
           <ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-SHARED-BUFFERS">shared_buffers</ulink>
        </para>

        <itemizedlist>
          <listitem>
            <para>
              Default: ~32MB
            </para>
          </listitem>
          <listitem>
            <para>
              Set to about 1/3 to 3/4 of available RAM
            </para>
          </listitem>
        </itemizedlist>
      </sect2>

      <sect2>
        <title>Runtime</title>

        <para>
          <ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-WORK-MEM">work_mem</ulink> (the memory used for sort operations and complex queries)
        </para>

        <itemizedlist>
          <listitem>
            <para>
              Default: 1MB
            </para>
          </listitem>
          <listitem>
            <para>
              Adjust up for large dbs, complex queries, lots of RAM
            </para>
          </listitem>
          <listitem>
            <para>
              Adjust down for many concurrent users or low RAM.
            </para>
          </listitem>
          <listitem>
            <para>
                If you have lots of RAM and few developers:
                <programlisting>
                    SET work_mem TO 1200000;
                </programlisting>
            </para>
          </listitem>
        </itemizedlist>

        <para>
          <ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM">maintenance_work_mem</ulink> (used for VACUUM, CREATE INDEX, etc.)
        </para>

        <itemizedlist>
          <listitem>
            <para>
              Default: 16MB
            </para>
          </listitem>
          <listitem>
            <para>
              Generally too low - ties up I/O, locks objects while swapping memory
            </para>
          </listitem>
          <listitem>
            <para>
              Recommend 32MB to 256MB on production servers w/lots of RAM, but depends
              on the # of concurrent users.  If you have lots of RAM and few developers:
                <programlisting>
                    SET maintainence_work_mem TO 1200000;
                </programlisting>
            </para>
          </listitem>
        </itemizedlist>

      </sect2>

  </sect1>
</chapter>