Texana
Text Analysis






Description



A text analysis tool that provides Named Entity Recognition (NER), tokenization, stemming, lemmatization, word decompounding, and language detection.


Usage

Intranet Service

A texana server should be running on pc-4336. Named Entity Recognition (NER) with the Named Entity Classifier for Wikidata (NECKAr) dataset can be performed as follows:

curl --request POST \
  --url http://pc-4336.kl.dfki.de:8483/neckar/search \
  --header 'content-type: application/json' \
  --data '{
	"body": "Douglas Adams and George Carlin"
}'
See another example below.

Build

Building the projects produces a texana-server.jar in the server project.

Configure

Put a config.json file in the working directory of the server. In this file, the Finite State Transducers (FSTs) are configured in a JSON array. Every FST needs an id that identifies it. The file is read when the server starts.
{
    "fst": [
    ]
}

Named Entity Recognition (NER)

We use data from the Named Entity Classifier for Wikidata (NECKAr) to load named entities. Download WikidataNE_20170320_NECKAR_1_0.json_.gz and add the FST to config.json. This will also create a neckar.sqlite database that stores metadata about the loaded entities.

{
    "fst": [
        {
            "id": "neckar",
            "reader": "NECKArMultiFST",
            "path": "WikidataNE_20170320_NECKAR_1_0.json_.gz",
            "serializationFile": "WikidataNE_20170320_NECKAR_1_0.serial"
        }
    ]
}
| Option | Description |
| --- | --- |
| path | Location of the gzipped JSON file from NECKAr. |
| serializationFile | Location where the serialization file will be stored. This allows the FST to load faster. Delete this file to force a reload. |
| max | Maximum number of entities to read from the gzipped NECKAr JSON file. Use a small value for testing (e.g. 5000). |
| bulkSize | Bulk size for inserting into the database. |
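
The options above can also be written programmatically. The following is a minimal sketch that generates a config.json with a small `max` for a test run. It assumes that `max` and `bulkSize` are given as JSON numbers in the same FST object as `path`; the README does not show them in an example, so treat that placement as an assumption. The helper name `neckar_config` is hypothetical.

```python
import json
from pathlib import Path


def neckar_config(path, serialization_file, max_entities=None, bulk_size=None):
    """Build a config dict with a single NECKAr FST entry."""
    fst = {
        "id": "neckar",
        "reader": "NECKArMultiFST",
        "path": path,
        "serializationFile": serialization_file,
    }
    # Assumption: optional settings sit next to "path" as JSON numbers.
    if max_entities is not None:
        fst["max"] = max_entities
    if bulk_size is not None:
        fst["bulkSize"] = bulk_size
    return {"fst": [fst]}


config = neckar_config(
    "WikidataNE_20170320_NECKAR_1_0.json_.gz",
    "WikidataNE_20170320_NECKAR_1_0.serial",
    max_entities=5000,  # small value for testing, as suggested in the table
)

if __name__ == "__main__":
    # Write the file into the current directory, which should be the
    # working directory of the server.
    Path("config.json").write_text(json.dumps(config, indent=4), encoding="utf-8")
```

Remember that an existing serialization file is reused on start-up, so it may need to be deleted after changing `max`.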

Run

Start the server with the following command:

java -jar texana-server.jar

The server listens at http://localhost:8483. The config.json file is read on start-up. Before sending requests, wait until the console outputs "[id] is ready", where [id] is the id of a configured FST.
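
If the server is started from a script, waiting for the ready message can be automated. The sketch below scans console lines until every configured id has been reported. The exact log format is an assumption: it accepts both `neckar is ready` and `[neckar] is ready`, since the README does not show a literal log line. The function names are hypothetical.

```python
import subprocess


def wait_until_ready(lines, fst_ids):
    """Return True once every id in fst_ids was reported ready, else False."""
    pending = set(fst_ids)
    for line in lines:
        for fst_id in list(pending):
            # Assumption: the ready message contains one of these two forms.
            if f"{fst_id} is ready" in line or f"[{fst_id}] is ready" in line:
                pending.discard(fst_id)
        if not pending:
            return True
    # The line source ended before all FSTs were ready.
    return False


def start_server(fst_ids=("neckar",)):
    """Start texana-server.jar and block until all FSTs are ready."""
    proc = subprocess.Popen(
        ["java", "-jar", "texana-server.jar"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if not wait_until_ready(proc.stdout, fst_ids):
        raise RuntimeError("server exited before all FSTs were ready")
    return proc
```

After `start_server()` returns, the caller should keep draining `proc.stdout` (or redirect it) so the server does not block on a full pipe.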


Perform Named Entity Recognition (NER)

curl --request POST \
  --url http://localhost:8483/neckar/search \
  --header 'content-type: application/json' \
  --data '{
	"body": "Bill Maher is an american stand-up comedian"
}'
| Option | Description |
| --- | --- |
| body | The text in which named entities are searched. |
| metadata | If true, adds metadata from Wikidata to the result, such as occupation, gender, and alias. |
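
The same request can be sent from Python using only the standard library. This is a sketch, not an official client: it assumes that the URL follows the pattern `/<id>/search` for the FST id from config.json (generalized from `/neckar/search`) and that `metadata` is sent as a JSON boolean next to `body`. The function names are hypothetical.

```python
import json
import urllib.request


def build_search_request(text, fst_id="neckar",
                         base_url="http://localhost:8483", metadata=False):
    """Build the POST request for the search endpoint without sending it."""
    payload = {"body": text}
    if metadata:
        # Assumption: the option is a JSON boolean in the request object.
        payload["metadata"] = True
    return urllib.request.Request(
        url=f"{base_url}/{fst_id}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def search(text, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server with the "neckar" FST loaded.
    print(search("Bill Maher is an american stand-up comedian", metadata=True))
```

To query the intranet service instead, pass `base_url="http://pc-4336.kl.dfki.de:8483"`.
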
Output
{
  "size": 1,
  "resources": [
    {
      "coveredText": "Bill Maher",
      "from": 0,
      "id": 489,
      "to": 10,
      "lang": "de",
      "type": "PER"
    }
  ],
  "body": "Bill Maher is an american stand-up comedian"
}
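
A short sketch of how such a response might be processed. It assumes that `from` and `to` are character offsets into `body`, with `from` inclusive and `to` exclusive; this matches the sample above (0 to 10 covers "Bill Maher") but is not stated explicitly. The function name `extract_entities` is hypothetical.

```python
import json

SAMPLE_RESPONSE = """
{
  "size": 1,
  "resources": [
    {
      "coveredText": "Bill Maher",
      "from": 0,
      "id": 489,
      "to": 10,
      "lang": "de",
      "type": "PER"
    }
  ],
  "body": "Bill Maher is an american stand-up comedian"
}
"""


def extract_entities(response):
    """Return (text, type, id) tuples for all resources in a response."""
    body = response["body"]
    entities = []
    for resource in response["resources"]:
        # Assumption: half-open character range [from, to) into body.
        span = body[resource["from"]:resource["to"]]
        if span != resource["coveredText"]:
            raise ValueError(f"offsets do not match coveredText: {resource}")
        entities.append((span, resource["type"], resource["id"]))
    return entities


entities = extract_entities(json.loads(SAMPLE_RESPONSE))
for text, entity_type, entity_id in entities:
    print(f"{text} ({entity_type}), id {entity_id}")
```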


Code



Paper