A text analysis tool that provides Named Entity Recognition (NER), tokenization, stemming, lemmatization, word decompounding and language detection.
A texana server should be running at pc-4336. Named Entity Recognition (NER) with the Named Entity Classifier for Wikidata (NECKAr) dataset can be performed as follows.
curl --request POST \
--url http://pc-4336.kl.dfki.de:8483/neckar/search \
--header 'content-type: application/json' \
--data '{
"body": "Douglas Adams and George Carlin"
}'
See another example below.
After building the projects, there is a texana-server.jar in the server project.
The server expects a config.json file in its working directory.
In this file, the Finite State Transducers (FSTs) are configured in a JSON array.
Every FST should have an id to identify it.
The file is read when the server starts.
{
"fst": [
]
}
We use data from Named Entity Classifier for Wikidata (NECKAr) to load named entities.
Download WikidataNE_20170320_NECKAR_1_0.json_.gz.
Add the FST to the config.json. This will also create a neckar.sqlite database to store metadata about the loaded entities.
{
"fst": [
{
"id": "neckar",
"reader": "NECKArMultiFST",
"path": "WikidataNE_20170320_NECKAR_1_0.json_.gz",
"serializationFile": "WikidataNE_20170320_NECKAR_1_0.serial"
}
]
}
Option | Description |
---|---|
path | Location of the json gz file from NECKAr. |
serializationFile | Location where the serialization file will be stored. This allows a faster loading time for the FST. Delete this file to force a reloading. |
max | Maximum number of entities to read from the json gz NECKAr file. Use a small value for testing (e.g. 5000). |
bulkSize | Bulk size for inserting into database. |
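The configuration above can also be generated from a script, which is handy when switching between a small test load and a full load. The following is a minimal sketch, not part of texana itself: the helper name `build_config` is hypothetical, and it assumes that `max` and `bulkSize` are numeric values placed next to `path` in the FST entry.

```python
import json


def build_config(path, serialization_file, max_entities=None, bulk_size=None):
    """Return a config dict with a single NECKAr FST entry (hypothetical helper)."""
    fst = {
        "id": "neckar",
        "reader": "NECKArMultiFST",
        "path": path,
        "serializationFile": serialization_file,
    }
    # Assumption: optional settings are numbers at the same level as "path".
    # They are only written when given, so the server's own defaults apply otherwise.
    if max_entities is not None:
        fst["max"] = max_entities
    if bulk_size is not None:
        fst["bulkSize"] = bulk_size
    return {"fst": [fst]}


if __name__ == "__main__":
    config = build_config(
        "WikidataNE_20170320_NECKAR_1_0.json_.gz",
        "WikidataNE_20170320_NECKAR_1_0.serial",
        max_entities=5000,  # small value for testing, as suggested in the table
    )
    # Print the JSON; redirect it into config.json in the server's working directory.
    print(json.dumps(config, indent=2))
```

Running the script and redirecting its output to config.json yields a file with the same shape as the example above, plus the `max` entry.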
Start the server with the following command:
java -jar texana-server.jar
The server is running at http://localhost:8483.
The config.json file is read on start-up.
Wait until the console outputs [id] is ready.
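When the server is started from a script, the wait for the ready message can be automated. The sketch below is one possible approach, under the assumption that the console line contains the FST id followed by the words "is ready" (for example "neckar is ready"); the exact log format is not confirmed here, and the function names are hypothetical.

```python
import subprocess


def wait_until_ready(lines, fst_id):
    """Consume lines until one reports that fst_id is ready.

    Returns True if such a line was seen, False if the input ended first.
    Assumption: the ready line contains the id and the text "is ready".
    """
    for line in lines:
        if fst_id in line and "is ready" in line:
            return True
    return False


def start_server(jar="texana-server.jar", fst_id="neckar"):
    """Start the server and block until the given FST reports ready."""
    process = subprocess.Popen(
        ["java", "-jar", jar],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no log line is missed
        text=True,
    )
    if not wait_until_ready(process.stdout, fst_id):
        process.wait()
        raise RuntimeError("server exited before reporting that the FST is ready")
    return process
```

Note that after `start_server` returns, nothing reads the server's output any more. For a long-running server, keep draining `process.stdout` in a background thread or redirect the output to a file so the pipe cannot fill up.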
curl --request POST \
--url http://localhost:8483/neckar/search \
--header 'content-type: application/json' \
--data '{
"body": "Bill Maher is an american stand-up comedian"
}'
Option | Description |
---|---|
body | The text in which named entities are searched. |
metadata | If true, adds metadata from Wikidata to the result, such as occupation, gender, alias, etc. |
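The same request can be sent from Python using only the standard library. This is a minimal sketch: the function names are hypothetical, it assumes that the path segment before `/search` is the FST id from config.json, and it assumes that `metadata` is passed as a JSON boolean.

```python
import json
import urllib.request


def build_search_request(text, metadata=False,
                         base_url="http://localhost:8483", fst_id="neckar"):
    """Build (but do not send) a POST request for the search endpoint."""
    payload = {"body": text}
    if metadata:
        # Assumption: the option is a JSON boolean in the request body.
        payload["metadata"] = True
    return urllib.request.Request(
        url=f"{base_url}/{fst_id}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, metadata=False):
    """Send the request to a running server and return the decoded JSON result."""
    request = build_search_request(text, metadata)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running locally, `search("Bill Maher is an american stand-up comedian", metadata=True)` would send the curl request above with the `metadata` option added.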
{
"size": 1,
"resources": [
{
"coveredText": "Bill Maher",
"from": 0,
"id": 489,
"to": 10,
"lang": "de",
"type": "PER"
}
],
"body": "Bill Maher is an american stand-up comedian"
}
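In the result, `from` and `to` appear to be character offsets into `body`, with `to` exclusive: in the example, characters 0 to 10 cover "Bill Maher". The sketch below reads the entities out of such a result and checks that interpretation against `coveredText`. The `Entity` class and `extract_entities` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    type: str
    start: int
    end: int
    entity_id: int


def extract_entities(result):
    """Convert the "resources" list of a search result into Entity objects.

    Raises ValueError if an offset pair does not match its coveredText,
    which would mean the offsets are not plain character indices into "body".
    """
    body = result["body"]
    entities = []
    for resource in result["resources"]:
        start, end = resource["from"], resource["to"]
        if body[start:end] != resource["coveredText"]:
            raise ValueError(f"offsets {start}-{end} do not match coveredText")
        entities.append(
            Entity(resource["coveredText"], resource["type"], start, end, resource["id"])
        )
    return entities


if __name__ == "__main__":
    example = {
        "size": 1,
        "resources": [
            {"coveredText": "Bill Maher", "from": 0, "id": 489,
             "to": 10, "lang": "de", "type": "PER"}
        ],
        "body": "Bill Maher is an american stand-up comedian",
    }
    for entity in extract_entities(example):
        print(entity.text, entity.type, entity.start, entity.end)
```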