helion-prime
home about us blogs contacts

Posts Tagged ‘postgresql’

Full Text Search with several tables in PostgreSQL

Thursday, October 21st, 2010

Preamble

Quite often to overcome performance degradation in Full Text Search with several tables in PostgreSQL people use external full text search engines. And surely you should use them if you have really big amount of date.

It seems that currently there are 2 most popular engines:

Sphinx: http://sphinxsearch.com/
license: GPL version 2. Commercial licensing (eg. for embedded use) is available upon request.
native API implementations: PHP, Perl, Ruby, and Java.
written in: C++

Lucene: http://lucene.apache.org/
license: Apache License 2.0.
native API implementations: Delphi, Perl, C#, C++, Python, Ruby and PHP.
written in: Java

Basics

As all active users of PostgreSQL know support of full text search is integrated into the core database system since version 8.3.
If you don’t have basic understanding of full text search in PostgreSQL, please check:
http://www.postgresql.org/docs/8.4/interactive/textsearch-intro.html

Simple full text search query:

1
2
SELECT * FROM blog_post
WHERE to_tsvector('english', text || ' ') @@ plainto_tsquery('test')

As for any query to achieve acceptable performance using big amount of data we need to add DB index.

1
2
CREATE INDEX blog_post_ts_idx
ON blog_post USING gin(to_tsvector('english', text || ' '));

The problem arises when we need to perform search with several linked tables. The experiments show that even with necessary indexes on tables during full text search PostgreSQL uses only 1 index and performance degrades.

Solving of the issue with PostgreSQL

For example we have 2 tables:
table: Blog_post, field: text,
table: User, field: name

Queries with UNION
We need indexes on both tables:

1
2
CREATE INDEX blog_post_ts_idx ON blog_post USING gin(to_tsvector('english', text || ' '));
CREATE INDEX user_ts_idx ON USER USING gin(to_tsvector('english', name || ' '));

We execute full text search with composite query (UNION )

1
2
3
4
5
SELECT id FROM blog_post WHERE to_tsvector('english', text || ' ') @@ plainto_tsquery('test')
UNION
SELECT blog_post.id FROM blog_post
   LEFT JOIN USER ON USER.id = blog_post.user_id
   WHERE to_tsvector('english', name || ' ' ) @@ plainto_tsquery('test')

In that case PostgreSQL perform 2 separate query that accomplish search by different fields, and that solve issue of index using.

Using of materialized View

Materialized View – is a database object that contains the results of a query. And as it’s a real object we can add DB index to it. Unfortunately at this time Postgresql doesn’t support that type of DB view.
Therefore we need to create such object by itself. In the capacity of materialized view we will use regular table that consists of 2 fields:

1
2
3
CREATE TABLE blog_post_ts_keywords (
   blog_post_id INTEGER NOT NULL PRIMARY KEY,
   keywords TSVECTOR);

Then we can add DB index to it:

1
2
CREATE INDEX index_blog_post_ts_keywords ON blog_post_ts_keywords
   USING GIN(keywords);

Now we can execute regular full text search:

1
SELECT blog_post_id FROM blog_post_ts_keywords WHERE keywords @@ plainto_tsquery('test')

Main complication here is maintenance of that view in actual state.
We can use DB triggers: http://www.postgresql.org/docs/8.4/interactive/trigger-definition.html or simple update queries by some schedule

1
2
3
4
5
INSERT INTO blog_post_ts_keywords
SELECT blog_post.id AS blog_post_id,  
      (to_tsvector('english', blog_post.text ||  usr.name || ' ')) AS keywords
   FROM blog_post
   LEFT JOIN USER ON USER.id = blog_post.user_id

Here you can see simple query for materialized view filling. The query for data refreshing looking quite the same considering checking of changes. And your query for outdated data deleting should take into account deleted or not active records.

Tuning of Postgresql under OpenBSD

Thursday, February 25th, 2010

Preamble

I assume that you already made your best with help of your favorite programming language
and recommended postgresql performance tips: http://www.postgresql.org/docs/8.4/static/performance-tips.html

Postgresql resources

No doubt that standard postgresql configuration is far from modern production environments.
Therefore you need to spend enough time with following sources.

1. resource consumption documentation:
http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html

The most important parameters are:
work_mem (integer)
shared_buffers (integer)

2. Query Planning documentation:
http://www.postgresql.org/docs/current/static/runtime-config-query.html

The most important parameters are:
effective_cache_size (integer)
random_page_cost (floating point)

OpenBSD resources

The default sizes in the GENERIC kernel are insignificant also and waiting for your tuning as well.
Posgtresql doesn’t start without enough memory size so always know when you need to increase kern.shminfo.shmmax.

Setting that we can change in /etc/sysctl.conf
the maximum number of System V IPC system-wide semaphore sets (and identifiers) which can exist at any given time:
kern.seminfo.semmni

the maximum total individual System V IPC semaphores which can be assigned by applications:
kern.seminfo.semmns

the amount of shared memory available in the system (bytes):
kern.shminfo.shmmax

the maximum number of shared memory segments:
sysctl kern.shminfo.shmseg

Full list of setting you can see with:
# man sysctl

OpenBSD kernel parameters
So, there are set of parameters that can be tuned only with kernel rebuild.

You should tune them only if system works unstable with default values and you have:
kernel warnings: “uvm_mapent_alloc: out of static map entries”
or panics like: “panic: malloc: out of space in kmem_map”

NKMEMPAGES
This option defines number of pages in kernel kmem_map structure.

MAX_KMAPENT
It defines number of static entries in kernel kmem_map (kernel virtual memory).

They can be changed in:
/usr/src/sys/arch/conf/GENERIC

As start you need to recheck ‘Building the System from Source’ part of OpenBSD documentation:
http://openbsd.org/faq/faq5.htm

Usually administrators select these parameters using set of tests on dedicated testing box where
they emulate load of production servers.

Example

our test server: 1x Intel Quad core CPU, 2GB RAM
software: Ruby on Rails application, postgresql DB, memcached.
load: about 15.000 users/day, peak load: 10 users/sec.

postgresql_dir/data/postgresql.conf

1
2
3
4
5
6
7
8
9
# RESOURCE USAGE
shared_buffers = 738MB
max_prepared_transactions = 30
work_mem = 16MB
max_fsm_pages = 2000000

# QUERY TUNING
effective_cache_size = 512MB
random_page_cost = 1.7

/etc/sysctl.conf

1
2
3
kern.seminfo.semmni = 256
kern.seminfo.semmns = 2048
kern.shminfo.shmmax = 805306368    # Shared memory segment size is 768M

/usr/src/sys/arch/conf/GENERIC

1
2
3
## custom settings
option MAX_KMAPENT = 3072
option NKMEMPAGES = 32768
©2010 Helion-Prime Solutions Ltd.
Custom Software Development Agile Company.