Input and Output Data¶
Input Data¶
The input data for uploading items to a dataset includes manifest, CSV, and TXT files.
These files are used during the upload process, specifically when the Select Manifest/CSV/TXT option is selected.
For a detailed view of the dataset types and their compatible file types, refer to the section Upload files to Dataset.
During dataset-item upload, the manifest file must follow a specific format to be processed successfully. Additionally, metadata can be included in both manifest and CSV files.
Use an input manifest file
The following is an example of a manifest file for files stored in an Amazon S3 bucket:
{"Sr_No":11,"source":"s3://EXAMPLE-BUCKET/example1.tiff","presigned":"http://abc.com"}
{"Sr_No":12,"source":"s3://EXAMPLE-BUCKET/example2.pdf","presigned":"http://abd.com"}
Use an input CSV file
The following is an example of a CSV file for files stored in an Amazon S3 bucket:
id,File,Batch,Pages,Source
1,tif,1,1,s3://objectways-ergo-poc/input_documents/MLK_F4BKRD3R00FI2OA.tif
2,tif,2,2,s3://objectways-ergo-poc/input_documents/MLK_REZ_multipage.tif
3,tif,3,1,s3://objectways-ergo-poc/input_documents/AmbA_J3Y61NDT0021050L2.tif
The following is an example of a CSV file for Text as source:
origin text
wikipedia ChatGPT[a] is an artificial intelligence (AI) chatbot developed by OpenAI and released in November 2022. It is built on top of OpenAI's GPT-3.5 and GPT-4 foundational large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques.
wikipedia ChatGPT launched as a prototype on November 30, 2022, and garnered attention for its detailed responses and articulate answers across many domains of knowledge.[3] Its propensity, at times, to confidently provide factually incorrect responses, however, has been identified as a significant drawback.[4] In 2023, following the release of ChatGPT, OpenAI's valuation was estimated at US$29 billion.[5] The advent of the chatbot has increased competition within the space, motivating the creation of Google's Bard and Meta's LLaMA.
wikipedia The original release of ChatGPT was based on GPT-3.5. A version based on GPT-4, the newest OpenAI model, was released on March 14, 2023, and is available for paid subscribers on a limited basis.
wikipedia ChatGPT is a member of the generative pre-trained transformer (GPT) family of language models. It was fine-tuned over an improved version of OpenAI's GPT-3 known as "GPT-3.5".[6]
wikipedia The fine-tuning process leveraged both supervised learning as well as reinforcement learning in a process called reinforcement learning from human feedback (RLHF).[7][8] Both approaches use human trainers to improve the model's performance. In the case of supervised learning, the model was provided with conversations in which the trainers played both sides: the user and the AI assistant. In the reinforcement learning step, human trainers first ranked responses that the model had created in a previous conversation.[9] These rankings were used to create "reward models" that were used to fine-tune the model further by using several iterations of Proximal Policy Optimization (PPO).[7][10]
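Because the text column in this layout can itself contain commas, a CSV writer that quotes fields is the safest way to produce such a file. A minimal sketch using Python's standard csv module, with shortened versions of the example rows:

```python
import csv

# Shortened versions of the example rows above; the first text deliberately
# contains a comma to show why quoting matters.
rows = [
    {"origin": "wikipedia", "text": "ChatGPT is an AI chatbot developed by OpenAI, released in November 2022."},
    {"origin": "wikipedia", "text": "The original release of ChatGPT was based on GPT-3.5."},
]

# csv.DictWriter quotes any field that contains commas or newlines,
# so free-form text survives a round trip through the file.
with open("text_source.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["origin", "text"])
    writer.writeheader()
    writer.writerows(rows)
```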
Use an input TXT file
The following is an example of a TXT file for files stored in an Amazon S3 bucket:
s3://EXAMPLE-BUCKET/example1.pdf
s3://EXAMPLE-BUCKET/example2.pdf
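A TXT input file is simply one S3 URI per line. A small illustrative sketch (the bucket name and keys follow the example above):

```python
# Build a TXT input file: one S3 URI per line.
# Bucket name and keys are illustrative.
bucket = "EXAMPLE-BUCKET"
keys = ["example1.pdf", "example2.pdf"]

with open("input_files.txt", "w") as f:
    for key in keys:
        f.write(f"s3://{bucket}/{key}\n")
```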
Output Data¶
Dataset exports¶
There are three export formats available for datasets:
Dataset Export: This format allows you to export the dataset in its original form, containing the raw data without any specific model run or ground truth annotations.
Dataset Export with Model Run: This format includes the dataset along with the results of a specific model run. It captures the model’s predictions, classifications, or other outputs generated by applying the trained model to the dataset.
Dataset Export with Ground Truth Project: This format exports the dataset along with the annotations created in a ground truth project. A ground truth project involves manual annotation or labeling of data by human annotators. The export includes both the original dataset and the annotations, providing valuable labeled data that can be used for training or validating machine learning models.
1. Scanned (OCR) Dataset¶
1. Dataset Export¶
{
"source": "https://********/presigned/92d1f5a0b8667c883e797608571c8616.pdf?sig=62cc37fc5ec6d91864ae062e4da9f6ad81dda083b2207b0b12517f6d5d37a1be4400dd381a64d25ec5c4fda7c6a76103ca54f8434687b948b0a6175007fc82d3:6ee530eca4b69d17633726d4ad1220b2:64cddaf9:1243ecda3ebadc9a1ce6fa1fea8f3808",
"name": "ABSTRACT - Axia.tiff",
"itemId": "4611f8e6331908278b5160ca",
"datasetId": "e3773b85655ea8646005158a",
"type": "application/pdf",
"tags": [
"invoice tag"
],
"metadata": {
"ocr_model": "Textract (default)",
"use-textract-only": true,
"source_ref": "/uploads/e3773b85655ea8646005158a/4611f8e6331908278b5160ca",
"document_id": "4611f8e6331908278b5160ca"
},
"active": true,
"ext": "pdf"
}
Scanned (OCR) Dataset Export Summary:

| Field Names | Type | Description |
|---|---|---|
| source | str | The presigned URL or S3 path of the data source |
| name | str | The name of the dataset item |
| itemId | str | The Id of the dataset item |
| datasetId | str | The Id of the dataset |
| type | str | The type of the dataset |
| tags | list | List of tags associated with the dataset item |
| metadata | dict | The metadata associated with the dataset and dataset item |
| ocr_model | str | The OCR model used for processing |
| use-textract-only | bool | Indicates if only Textract is used for processing |
| source_ref | str | Reference to the source of the dataset item |
| document_id | str | The Id of the document |
| active | bool | Indicates whether the dataset item is currently active |
| ext (local files) | str | Extension of local files, if any |
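As an illustration of consuming this export, the sketch below parses a record like the one above and indexes it by itemId; the JSON literal is an abbreviated version of the example (the presigned source URL is omitted):

```python
import json

# Abbreviated version of the export record shown above; the presigned
# source URL is omitted for brevity.
export_json = """
{
  "name": "ABSTRACT - Axia.tiff",
  "itemId": "4611f8e6331908278b5160ca",
  "datasetId": "e3773b85655ea8646005158a",
  "type": "application/pdf",
  "tags": ["invoice tag"],
  "metadata": {"ocr_model": "Textract (default)", "use-textract-only": true},
  "active": true,
  "ext": "pdf"
}
"""

item = json.loads(export_json)

# A typical first step: keep only active items and index them by itemId.
index = {item["itemId"]: item} if item["active"] else {}
print(index["4611f8e6331908278b5160ca"]["name"])  # ABSTRACT - Axia.tiff
```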
2. With Model Run¶
{
"source": "https://sandboxdocuments.tensoract.com/presigned/92d1f5a0b8667c883e797608571c8616.pdf?sig=7da49bf64299dc09cb3405ff33cb0444ed6d310208f13655ae773407d405b9d003db5d7f980679f85f0949151c4f9904c837f90eb434a5ba2bfbd2520f1d43b9:4887b3566f7f75b587ad0ea9ebe0e6dc:64cdc5b3:37f85a3533b89761e7282654214ba4bf",
"name": "ABSTRACT - Axia.tiff",
"itemId": "4611f8e6331908278b5160ca",
"datasetId": "e3773b85655ea8646005158a",
"type": "application/pdf",
"tags": [],
"metadata": {
"ocr_model": "Textract (default)",
"use-textract-only": true,
"source_ref": "/uploads/e3773b85655ea8646005158a/4611f8e6331908278b5160ca",
"document_id": "4611f8e6331908278b5160ca"
},
"active": true,
"modelRuns": [
{
"modelRunId": "2023-08-04T03:44:10.5248871",
"tags": [
{
"type": "Organization Entity",
"text": "Women's Health",
"page": 1,
"boxes": [
[
0.861730694770813,
0.08322431892156601,
0.937999427318573,
0.09277199301868677
],
[
0.06499018520116806,
0.11739349365234375,
0.07347860559821129,
0.12546881940215826
]
],
"kv_type": "value"
},
{
"type": "Organization Entity",
"text": "Womens Health",
"page": 1,
"boxes": [
[
0.0935770571231842,
0.11707708239555359,
0.11941905505955219,
0.1253887191414833
],
[
0.12276646494865417,
0.11710146069526672,
0.17684946581721306,
0.1254600789397955
]
],
"kv_type": "value"
},
{
"type": "Organization Entity",
"text": "HP Main Line LLC",
"page": 1,
"boxes": [
[
0.5065937042236328,
0.11725140362977982,
0.5151590080931783,
0.1253813849762082
],
[
0.5505043864250183,
0.11691059917211533,
0.7601913809776306,
0.1277432944625616
],
[
0.5505043864250183,
0.1170012354850769,
0.6015823110938072,
0.1273022061213851
],
[
0.6051791906356812,
0.11715823411941528,
0.6570645309984684,
0.12562930211424828
]
],
"kv_type": "value"
},
{
"type": "Location Entity",
"text": "Laurel Road",
"page": 1,
"boxes": [
[
0.06458062678575516,
0.13079734146595,
0.0742951761931181,
0.1387380100786686
],
[
0.09427980333566666,
0.13045595586299896,
0.19984688609838486,
0.1387722697108984
]
],
"kv_type": "value"
},
{
"type": "Location Entity",
"text": "Bryn Mawr",
"page": 1,
"boxes": [
[
0.5502040982246399,
0.13049453496932983,
0.5714401658624411,
0.13857773877680302
],
[
0.5759334564208984,
0.13051286339759827,
0.6116651147603989,
0.13872544467449188
]
],
"kv_type": "value"
},
{
"type": "Organization Entity",
"text": "Regional Womens Health Management",
"page": 1,
"boxes": [
[
0.6013859510421753,
0.14393498003482819,
0.6286550257354975,
0.15307497046887875
],
[
0.6331984400749207,
0.14391060173511505,
0.6625380869954824,
0.1522916592657566
],
[
0.6665995121002197,
0.14416542649269104,
0.6881919391453266,
0.1522371843457222
],
[
0.06526166200637817,
0.15757058560848236,
0.07337938901036978,
0.16564789321273565
]
],
"kv_type": "value"
},
{
"type": "Organization Entity",
"text": "ABA",
"page": 1,
"boxes": [
[
0.5443362593650818,
0.23810118436813354,
0.574140515178442,
0.2483070008456707
]
],
"kv_type": "value"
},
{
"type": "Organization Entity",
"text": "ABA",
"page": 1,
"boxes": [
[
0.22924329340457916,
0.2950398325920105,
0.28950661048293114,
0.3033293457701802
]
],
"kv_type": "value"
}
]
}
],
"ext": "pdf"
}
Scanned (OCR) Dataset Export with Model Run Summary:

| Field Names | Type | Description |
|---|---|---|
| source | str | The presigned URL or S3 path of the data source |
| name | str | The name of the dataset item |
| itemId | str | The Id of the dataset item |
| datasetId | str | The Id of the dataset |
| type | str | The type of the dataset |
| tags | list | List of tags associated with the dataset item |
| metadata | dict | Metadata associated with the dataset and dataset item |
| ocr_model | str | The OCR model used for processing |
| use-textract-only | bool | Indicates if only Textract is used for processing |
| source_ref | str | Reference to the source of the dataset item |
| document_id | str | The Id of the document |
| active | bool | Indicates whether the dataset item is currently active |
| modelRuns | list | List of dictionaries containing details of predicted labels |
| modelRunId | str | The Id of the model run |
| tags | list | List of dictionaries containing the predicted labels |
| type | str | The type of the label |
| text | str | The text selected for prediction |
| page | int | The page number associated with the text |
| boxes | list | List of bounding box coordinates for the OCRed words |
| kv_type | str | Flag indicating whether the tag is a key or a value |
| ext (local files) | str | Extension of local files, if any |
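The box values in the example are fractions of the page size, which suggests they can be mapped back to pixel coordinates once the page dimensions are known. The sketch below groups predicted tags by entity type and converts one box to pixels; it assumes the coordinate order is [x_min, y_min, x_max, y_max] and borrows the 1275x1650 page size from the ground-truth example's dimensions field, so both should be verified against your own exports:

```python
from collections import defaultdict

# Two predicted tags condensed from the model-run example above
# (box values truncated to four decimals).
tags = [
    {"type": "Organization Entity", "text": "Women's Health", "page": 1,
     "boxes": [[0.8617, 0.0832, 0.9380, 0.0928]], "kv_type": "value"},
    {"type": "Location Entity", "text": "Laurel Road", "page": 1,
     "boxes": [[0.0646, 0.1308, 0.0743, 0.1387]], "kv_type": "value"},
]

# Group predictions by entity type.
by_type = defaultdict(list)
for tag in tags:
    by_type[tag["type"]].append(tag["text"])

# Convert one normalized box to pixels. ASSUMPTION: coordinates are
# [x_min, y_min, x_max, y_max] fractions of the page width/height; the
# 1275x1650 page size is borrowed from the ground-truth example's
# "dimensions" field.
width, height = 1275, 1650
x0, y0, x1, y1 = tags[0]["boxes"][0]
pixel_box = (round(x0 * width), round(y0 * height),
             round(x1 * width), round(y1 * height))
```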
3. With GroundTruth Project¶
{
"source": "https://**********/presigned/a01f5c95d843b4fd4f890570e5cac51c.pdf?sig=838fefa7e55ab214cfa71b70d36d19ee3a263b5c750f49d8ddb105d90f81b82668548ecec76dc79f3df8195c45a7e2702e543611f7f210e761755db7a6c1ea86:3c4f5271ef6c34813cb136a93ba8e7bd:64cdea6d:ae53c052c424a59ee74995c52cc94222",
"name": "ABSTRACT - Axia.tiff",
"itemId": "9cebea4c95edc877ca6f2603",
"datasetId": "e3773b85655ea8646005158a",
"type": "application/pdf",
"tags": [
"invoice tag"
],
"metadata": {
"ocr_model": "Textract (default)",
"use-textract-only": true,
"source_ref": "/uploads/e3773b85655ea8646005158a/9cebea4c95edc877ca6f2603",
"document_id": "9cebea4c95edc877ca6f2603"
},
"active": true,
"project": "7b3020dd437ce2a30bae1c5a",
"taskId": "0931952ce4a27f53a3678cfe",
"annotations": [
{
"email": "q1@qc.com",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 14,
"date": "2023-08-04T06:21:00.589Z",
"content": {
"pdf_fingerprint": "c04f692d342c06d433f751ac32c6d8b1",
"metadata": {
"File": "ABSTRACT - Axia.tiff",
"TaskId": "0931952ce4a27f53a3678cfe",
"ocr_model": "Textract (default)",
"use-textract-only": true,
"source_ref": "/uploads/e3773b85655ea8646005158a/9cebea4c95edc877ca6f2603",
"document_id": "9cebea4c95edc877ca6f2603",
"Type of Project": "OCR"
},
"tags": [
{
"page": 1,
"text": "N A M E",
"id": 1,
"type": "Name",
"kv_type": "key",
"words": [
"N",
"A",
"M",
"E"
],
"boxes": [
[
0.06499018520116806,
0.11739349365234375,
0.07347860559821129,
0.12546881940215826
],
[
0.06458062678575516,
0.13079734146595,
0.0742951761931181,
0.1387380100786686
],
[
0.06520503759384155,
0.14403623342514038,
0.07536023296415806,
0.15211013052612543
],
[
0.06526166200637817,
0.15757058560848236,
0.07337938901036978,
0.16564789321273565
]
],
"range": [
[
71,
72
],
[
126,
127
],
[
165,
166
],
[
194,
195
]
]
},
{
"page": 1,
"text": "Axia Women's Health",
"id": 2,
"type": "Name",
"textAdjust": "Axia Women's",
"kv_type": "value",
"words": [
"Axia",
"Women's",
"Health"
],
"boxes": [
[
0.0935770571231842,
0.11707708239555359,
0.11941905505955219,
0.1253887191414833
],
[
0.12276646494865417,
0.11710146069526672,
0.17684946581721306,
0.1254600789397955
],
[
0.18119750916957855,
0.11732043325901031,
0.21823260188102722,
0.12542327493429184
]
],
"range": [
[
73,
77
],
[
78,
85
],
[
86,
92
]
]
},
{
"page": 1,
"text": "BILL TO",
"id": 3,
"type": "Name",
"rawBox": true,
"kv_type": "key",
"words": [
"BILL TO"
],
"boxes": [
[
0.4980276134122288,
0.10967250571210967,
0.5374753451676528,
0.1706016755521706
]
],
"range": []
},
{
"page": 1,
"text": "Regional Womens Health",
"id": 4,
"type": "Name",
"rotate": 24,
"rawBox": true,
"kv_type": "value",
"words": [
"Regional Womens Health"
],
"boxes": [
[
0.5473372781065089,
0.11119573495811119,
0.7682445759368837,
0.12795125666412796
]
],
"range": []
},
{
"page": 1,
"text": "Cat.",
"id": 5,
"type": "Name",
"table": {
"id": 4,
"x": 0,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Cat."
],
"boxes": [
[
0.39583876729011536,
0.3084534704685211,
0.4190108198672533,
0.31684120278805494
]
],
"range": [
[
543,
547
]
]
},
{
"page": 1,
"text": "Cat.",
"id": 6,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 0,
"y": 1,
"cell": true
},
"words": [
"Cat."
],
"boxes": [
[
0.39583876729011536,
0.3084534704685211,
0.4190108198672533,
0.31684120278805494
]
],
"range": [
[
543,
547
]
]
},
{
"page": 1,
"text": "Description",
"id": 7,
"type": "Name",
"table": {
"id": 4,
"x": 1,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Description"
],
"boxes": [
[
0.4328092038631439,
0.3084268271923065,
0.49752890318632126,
0.3184952298179269
]
],
"range": [
[
548,
559
]
]
},
{
"page": 1,
"text": "Description",
"id": 8,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 1,
"y": 1,
"cell": true
},
"words": [
"Description"
],
"boxes": [
[
0.4328092038631439,
0.3084268271923065,
0.49752890318632126,
0.3184952298179269
]
],
"range": [
[
548,
559
]
]
},
{
"page": 1,
"text": "Effective",
"id": 9,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 3,
"y": 0,
"cell": true
},
"words": [
"Effective"
],
"boxes": [
[
0.6239141225814819,
0.2947663366794586,
0.6735980845987797,
0.30344805866479874
]
],
"range": [
[
476,
485
]
]
},
{
"page": 1,
"text": "Sqft.",
"id": 10,
"type": "Name",
"table": {
"id": 4,
"x": 2,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Sqft."
],
"boxes": [
[
0.5750880241394043,
0.30830204486846924,
0.6010445598512888,
0.3183623990043998
]
],
"range": [
[
560,
565
]
]
},
{
"page": 1,
"text": "Sqft.",
"id": 11,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 2,
"y": 1,
"cell": true
},
"words": [
"Sqft."
],
"boxes": [
[
0.5750880241394043,
0.30830204486846924,
0.6010445598512888,
0.3183623990043998
]
],
"range": [
[
560,
565
]
]
},
{
"page": 1,
"text": "ABA",
"id": 12,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 0,
"y": 2,
"cell": true
},
"words": [
"ABA"
],
"boxes": [
[
0.3953396677970886,
0.3291471600532532,
0.42196371778845787,
0.3373938351869583
]
],
"range": [
[
626,
629
]
]
},
{
"page": 1,
"text": "Date",
"id": 13,
"type": "Name",
"table": {
"id": 4,
"x": 3,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Date"
],
"boxes": [
[
0.6240901350975037,
0.3085164725780487,
0.6510729901492596,
0.31685456447303295
]
],
"range": [
[
566,
570
]
]
},
{
"page": 1,
"text": "Date",
"id": 14,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 3,
"y": 1,
"cell": true
},
"words": [
"Date"
],
"boxes": [
[
0.6240901350975037,
0.3085164725780487,
0.6510729901492596,
0.31685456447303295
]
],
"range": [
[
566,
570
]
]
},
{
"page": 1,
"text": "Rent Abatements/Cor",
"id": 15,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 1,
"y": 2,
"cell": true
},
"words": [
"Rent",
"Abatements/Cor"
],
"boxes": [
[
0.4329037368297577,
0.3290809392929077,
0.4603371527045965,
0.3374354373663664
],
[
0.46285462379455566,
0.32896438241004944,
0.5594801902770996,
0.3374544633552432
]
],
"range": [
[
630,
634
],
[
635,
649
]
]
},
{
"page": 1,
"text": "4,850",
"id": 16,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 2,
"y": 2,
"cell": true
},
"words": [
"4,850"
],
"boxes": [
[
0.5759893655776978,
0.3291241228580475,
0.6087189093232155,
0.3381931884214282
]
],
"range": [
[
650,
655
]
]
},
{
"page": 1,
"text": "6/15/2021",
"id": 17,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 3,
"y": 2,
"cell": true
},
"words": [
"6/15/2021"
],
"boxes": [
[
0.6162644028663635,
0.32898813486099243,
0.6728598773479462,
0.3374910345301032
]
],
"range": [
[
656,
665
]
]
}
],
"pageOffsets": [
0,
3355,
5983
],
"links": [
{
"page": 1,
"id1": 1,
"id2": 2,
"relationship": "key-pair"
},
{
"page": 1,
"id1": 3,
"id2": 4,
"relationship": "key-pair"
}
],
"attributes": {
"Is document damaged": "No"
},
"pageAttributes": [
{
"Is page damaged?": "No"
}
],
"tables": [
{
"x": [
0.3953396677970886,
0.4273864608258009,
0.567284107208252,
0.6124916560947895,
0.6735980845987797
],
"y": [
0.2947663366794586,
0.305875051766634,
0.32372980611398816,
0.3381931884214282
],
"rows": 3,
"cols": 4,
"box": [
0.3953396677970886,
0.2947663366794586,
0.6735980845987797,
0.3381931884214282
],
"id": 4,
"page": 1,
"description": "Table 1"
}
],
"plainText": {
"1": "Lease Id: PR0001 - 000222 Lease Profile Master Occupant Id: 00000162-1 N Axia Women's Health B Regional Womens Health Managem A HP Main Line LLC I T 227 Laurel Road M L o Echelon One, Suite 300 E Bryn Mawr PA 19010 L Voorhees NJ 08043 Legal Name: Regional Womens Health Management Tenant Id: Contact Name: Jenni Witters Tenant Type Id: Phone No: SIC Group: Fax No: NAICS Code Lease Stop: No Suite Information Current Recurring Charges Building Id: PR0001 Execution: 3/15/2021 Effective Monthly Annual Amount Suite Id: 401 Beginning: 6/15/2021 Cat. Description Sqft. Date Amount Amount PSF Lease Id: 000222 Occupancy: 9/1/2021 ABA Rent Abatements/Cor 4,850 6/15/2021 -12,125.00 -145,500.00 -30.00 Leased Sqft: 4,850 Rent Start: 6/15/2021 ABA Rent Abatements/Cor 4,850 12/1/2021 0.00 0.00 0.00 Pro-Rata Share: 0.17 Expiration: 9/30/2028 ROF Base Rent Office 4,850 6/15/2021 12,125.00 145,500.00 30.00 Ann. Mkt. Rent PSF: 0.00 Vacate: TIC Tenant Improvement 4,850 11/1/2021 3,059.54 36,714.48 7.57 UTI Utility Reimbursement 4,850 6/15/2021 808.33 9,699.96 2.00 Occupancy Status: Current Rate Change Schedule Effective Monthly Annual Amount Cat. Description Sqft. 
Date Amount Amount PSF ABA Rent Abatements/Con 4,850 11/1/2021 -2,575.00 -30,900.00 -6.37 ROF Base Rent Office 4,850 7/1/2022 12,367.50 148,410.00 30.60 ROF Base Rent Office 4,850 7/1/2023 12,614.04 151,368.48 31.21 ROF Base Rent Office 4,850 7/1/2024 12,868.67 154,424.04 31.84 ROF Base Rent Office 4,850 7/1/2025 13,123.29 157,479.48 32.47 ROF Base Rent Office 4,850 7/1/2026 13,386.00 160,632.00 33.12 ROF Base Rent Office 4,850 7/1/2027 13,652.75 163,833.00 33.78 ROF Base Rent - Office 4,850 7/1/2028 13,927.58 167,130.96 34.46 Lease Notes Effective Date Ref 1 Ref 2 Note 3/15/2021 ALTERTN Article 8 of Lease Landlord's consent required for any alterations, other than cosmetic Alterations which do not cost more than $1,000 per alteration and which do not affect (i) the structural portions or roof of the Premises or the 3/15/2021 ASGNSUB Article 9 Landlord consent required for any assignment/sublease. Landlord has 30 days after receipt of notice from Tenant to either approve assignment/sublease, not approve assignment/sublease, recapture the Premises 3/15/2021 DEFAULT Article 18 of Lease 1. If Tenant does not make payment within 5 days after date due, provided that, Landlord shall not more than 1 time per 12 full calendar month period of the term, deliver written notice to Tenant with respect to 3/15/2021 ESTOPEL Article 17 of Lease Estoppel required to be provided within 10 days after request. In the form set forth in Exhibit D 3/15/2021 HOLDOVR Section 19 (b) of Lease Landlord may either (i) increase Rent to 200% of the highest monthly aggregate Fixed Rent and additional 3/15/2021 INS Article 11 - Landlord responsible for repairs to all plumbing and other fixtures, equipment and systems (including replacement, if necessary) in or serving the Premises. Landlord to provide janitorial services (Exhibit E) and pest control as needed. 
3/15/2021 LATECHG Article 3 of Lease Tenant shall pay Landlord a service and handling charge equal to five percent (5%) of any Rent not paid within five (5) days after the date first due, which shall apply cumulatively each month with respect to Report Id WEBX_PROFILE Database HAVERFORD Reported by Joe Staugaard 1/7/2022 11:50 Page 1"
},
"dimensions": [
{
"width": 1275,
"height": 1650
},
{
"width": 1275,
"height": 1650
}
],
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "61685a5eb492d0845eb5e6b4"
},
"jobStart": 1691128396,
"sessionTime": 14,
"elapsedTime": 86,
"updateTime": 1691130059,
"selectBoundingBox": true,
"lastUpdate": 1691130060583
}
}
],
"ext": "pdf"
}
Scanned (OCR) Dataset Export with GroundTruth Project Summary:

| Field Names | Type | Description |
|---|---|---|
| source | str | The presigned URL or S3 path of the data source |
| name | str | The name of the dataset item |
| itemId | str | The Id of the dataset item |
| datasetId | str | The Id of the dataset |
| type | str | The type of the dataset |
| tags | list | List of tags associated with the dataset item |
| metadata | dict | Metadata associated with the dataset and dataset item |
| ocr_model | str | The OCR model used for processing |
| use-textract-only | bool | Indicates if only Textract is used for processing |
| source_ref | str | Reference to the source of the dataset item |
| document_id | str | The Id of the document |
| active | bool | Indicates whether the dataset item is currently active |
| project | str | The project associated with the dataset item |
| taskId | str | The Id of the task |
| annotations | list | List of dictionaries containing details of annotations |
| email | str | The email address associated with the annotator |
| messages | list | The messages associated with the user |
| role | str | The role associated with the user |
| elapsedTime | int | The elapsed time of the annotation |
| date | str | The date of the annotation |
| content | dict | The content of the annotation |
| pdf_fingerprint | str | The fingerprint of the document |
| metadata | dict | The metadata associated with the task and project |
| File | str | The name of the file |
| TaskId | str | The Id of the task |
| ocr_model | str | The OCR model used for processing |
| use-textract-only | bool | Indicates if only Textract is used for processing |
| source_ref | str | Reference to the source of the dataset item |
| document_id | str | The Id of the document |
| tags | list | List of dictionaries containing the annotated tags |
| page | int | The page number of the selected text |
| text | str | The selected text for annotation |
| id | int | The Id of the selected text for annotation |
| type | str | The type of the label |
| kv_type | str | Flag indicating whether the tag is a key or a value |
| words | list | The words in the selected text |
| boxes | list | List of bounding box coordinates for the OCRed words |
| range | list | Start and end character offsets of the selected words within the plain text |
| textAdjust | str | Modified OCRed text |
| rawBox | bool | Flag indicating that the bounding box was drawn manually |
| rotate | int | The angle of bounding box rotation, in degrees |
| table | dict | The table information for the tag |
| id | int | The Id of the table |
| x | int | The vertical grid coordinate |
| y | int | The horizontal grid coordinate |
| cell | bool | Flag indicating that the current object is a cell of the table |
| pageOffsets | list | The list of page offsets into the plain text |
| links | list | The list of relationships between tags |
| page | int | The page number associated with the key and value fields |
| id1 | int | The Id of the key field |
| id2 | int | The Id of the value field |
| relationship | str | The name of the relationship |
| attributes | dict | The document-level attributes associated with the task |
| pageAttributes | list | List of dictionaries containing the attributes for each page |
| tables | list | List of dictionaries containing table information |
| x | list | The vertical grid coordinates of the table columns |
| y | list | The horizontal grid coordinates of the table rows |
| rows | int | The number of rows in the table |
| cols | int | The number of columns in the table |
| box | list | The bounding box coordinates of the table |
| id | int | The Id of the table |
| page | int | The page number of the table |
| description | str | The title of the table |
| plainText | dict | Dictionary mapping page numbers to the plain text extracted from the file |
| dimensions | list | The dimensions of the pages |
| width | int | The width of the page |
| height | int | The height of the page |
| review | dict | The review details |
| rate | str | The rating given by the reviewer |
| note | str | The note associated with the reviewer |
| reviewerId | str | The Id of the reviewer |
| jobStart | int | The start time of the annotation (Unix timestamp) |
| sessionTime | int | The session time of the annotation, in seconds |
| elapsedTime | int | The elapsed time of the annotation, in seconds |
| updateTime | int | The update time of the annotation (Unix timestamp) |
| lastUpdate | int | The last update time (Unix timestamp, in milliseconds) |
| ext | str | The extension of the local file |
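In the ground-truth export above, the links array pairs key and value tags through their id fields. A sketch that resolves those links into a key-text to value-text mapping (tag ids and texts are taken from the example):

```python
# Minimal subset of the ground-truth annotation content shown above.
tags = [
    {"id": 1, "text": "N A M E", "kv_type": "key"},
    {"id": 2, "text": "Axia Women's Health", "kv_type": "value"},
    {"id": 3, "text": "BILL TO", "kv_type": "key"},
    {"id": 4, "text": "Regional Womens Health", "kv_type": "value"},
]
links = [
    {"page": 1, "id1": 1, "id2": 2, "relationship": "key-pair"},
    {"page": 1, "id1": 3, "id2": 4, "relationship": "key-pair"},
]

# Resolve each "key-pair" link into a key-text -> value-text mapping:
# id1 points at the key tag, id2 at the value tag.
by_id = {tag["id"]: tag for tag in tags}
pairs = {
    by_id[link["id1"]]["text"]: by_id[link["id2"]]["text"]
    for link in links
    if link["relationship"] == "key-pair"
}
print(pairs)  # {'N A M E': "Axia Women's Health", 'BILL TO': 'Regional Womens Health'}
```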
2. PDF Dataset¶
1. Dataset Export¶
{
"source": "s3://EXAMPLE-BUCKET/testna.pdf",
"name": "testna.pdf",
"itemId": "aec104ce48aa0eece0a94c1b",
"datasetId": "8d9736f30411ae81fa4983d4",
"type": "application/pdf",
"tags": [],
"metadata": {
"xxx": 14,
"presigned": "http://aaa.com"
},
"active": true
}
PDF Dataset Export Summary:

| Field Names | Type | Description |
|---|---|---|
| source | str | The presigned URL or S3 path of the data source |
| name | str | The name of the dataset item |
| itemId | str | The Id of the dataset item |
| datasetId | str | The Id of the dataset |
| type | str | The type of the dataset |
| tags | list | List of tags associated with the dataset item |
| metadata | dict | Metadata associated with the dataset and dataset item |
| active | bool | Indicates whether the dataset item is currently active |
| ext (local files) | str | Extension of local files, if any |
2. With GroundTruth Project¶
{
"source": "https://sandboxdocuments.tensoract.com/presigned/33e268b66cb90138b84cc627a501afa2.pdf?sig=cc753891da92d55d769969ebf280f7aabaa8847de2ff31141c7b1869900a6c84f3b09f5fb2f4d32e27dc442f9f2841dfc94983f1e42df8569b849cb9153c866a:9ad06a28916bab71cf5140fedd06ae74:64b65760:d5e5b898249da98bf428147b361c0094",
"name": "1810.04805.pdf",
"itemId": "0ed98ab31666242a417504f9",
"datasetId": "8d9736f30411ae81fa4983d4",
"type": "application/pdf",
"tags": [
"dataset tag 1"
],
"metadata": {
"Dataset": "PDF"
},
"active": true,
"project": "866ad732042bde9b94929cc3",
"taskId": "d6aae2114d0947b1bfe5dcd3",
"annotations": [
{
"email": "yannevarsha6@gmail.com",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 18,
"date": "2023-07-17T09:11:08.530Z",
"content": {
"pdf_fingerprint": "dccb9bc542f22b2bdd94110918c68f96",
"metadata": {
"File": "1810.04805.pdf",
"TaskId": "d6aae2114d0947b1bfe5dcd3",
"Type of Project": "NER"
},
"tags": [
{
"page": 1,
"text": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
"id": 1,
"type": "DATE",
"box": [
0.1957394553114858,
0.08355623157419612,
0.8080743211552288,
0.11953028994286674
]
},
{
"page": 1,
"text": "Jacob Devlin",
"id": 2,
"type": "PERSON",
"box": [
0.20464120844784606,
0.15506947083348188,
0.31506005550366556,
0.16926990081839677
]
},
{
"page": 1,
"text": "Ming-Wei Chang",
"id": 3,
"type": "PERSON",
"box": [
0.34016437686048157,
0.15506947083348188,
0.48781795335273054,
0.16926990081839677
]
},
{
"page": 1,
"text": "2018a",
"id": 4,
"type": "DATE",
"box": [
0.3736872717865327,
0.3484841506610129,
0.4145903056733348,
0.36031776312819985
]
},
{
"page": 2,
"text": "(2018a)",
"id": 5,
"type": "DATE",
"box": [
0.3769863661562031,
0.3271071821734426,
0.4339806024432365,
0.3400650507786048
]
}
],
"pageOffsets": [
0,
3988,
8509,
12206,
17069,
20918,
25368,
29080,
33539,
37641,
42160,
46926,
50816,
54525,
58589,
60965,
64088
],
"links": [
{
"page": 1,
"id1": 2,
"id2": 3,
"relationship": "Precede"
},
{
"page": 1,
"id1": 4,
"id2": 5,
"relationship": "Precede"
}
],
"attributes": {
"tags": [],
"links": [],
"Doc Ok?": "Yes"
},
"pageAttributes": [
{
"Page OK?": null
},
{
"Page OK?": "Yes"
}
],
"boxes": [
{
"page": 1,
"box": [
0.6285714285714286,
0.1505226480836237,
0.8216748768472907,
0.178397212543554
],
"label": "Bounding_box"
},
{
"page": 2,
"box": [
0.10246305418719212,
0.3797909407665505,
0.49064039408866994,
0.4961672473867596
],
"label": "Bounding_box",
"rotate": 22
}
],
"plainText": {
"1": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com Abstract We introduce a new language representa- tion model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language repre- sentation models (Peters et al., 2018a; Rad- ford et al., 2018), BERT is designed to pre- train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a re- sult, the pre-trained BERT model can be fine- tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task- specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art re- sults on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answer- ing Test F1 to 93.2 (1.5 point absolute im- provement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). 1 Introduction Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). 
These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the re- lationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, wheremodels are required to produce fine-grained output at the token level (Tjong Kim Sang and DeMeulder, 2003; Rajpurkar et al., 2016). There are two existing strategies for apply- ing pre-trained language representations to down- stream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as addi- tional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre- trained parameters. The two approaches share the same objective function during pre-training,where they use unidirectional language models to learn general language representations. We argue that current techniques restrict the power of the pre-trained representations, espe- cially for the fine-tuning approaches. The ma- jor limitation is that standard language models are unidirectional, and this limits the choice of archi- tectures that can be used during pre-training. For example, inOpenAIGPT, the authors use a left-to- right architecture, where every token can only at- tend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such re- strictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine- tuning based approaches to token-level tasks such as question answering, where it is crucial to incor- porate context from both directions. 
In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidi- rectionality constraint by using a “masked lan- guage model” (MLM) pre-training objective, in- spired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked a r X i v : 1 8 1 0 . 0 4 8 0 5 v 2 [ c s . C L ] 2 4 M a y 2 0 1 9",
"2": "word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- train a deep bidirectional Transformer. In addi- tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: • We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered task- specific architectures. BERT is the first fine- tuning based representationmodel that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-specific architectures. • BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. 2 RelatedWork There is a long history of pre-training general lan- guage representations, and we briefly review the most widely-used approaches in this section. 2.1 Unsupervised Feature-based Approaches Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering significant improvements over embeddings learned from scratch (Turian et al., 2010). 
To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le andMikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for severalmajor NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. 
2.2 Unsupervised Fine-tuning Approaches As with the feature-based approaches, the first works in this direction only pre-trained word em- bedding parameters from unlabeled text (Col- lobert andWeston, 2008). More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language model-"
},
"dimensions": [
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
}
],
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "61685a5eb492d0845eb5e6b4"
},
"jobStart": 1689583831,
"sessionTime": 18,
"elapsedTime": 31,
"updateTime": 1689585066,
"lastUpdate": 1689585068525
}
}
],
"ext": "pdf"
}
- table:PDF Dataset Export With GroundTruth Project Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
project | str | The project associated with the dataset item |
taskId | str | The Id of the task |
annotations | list | List of dictionaries containing details of annotations |
email | str | The email associated with the user |
messages | list | The messages associated with the user |
role | str | The role associated with the user |
elapsedTime | int | The elapsed time of the annotation, in seconds |
date | str | The date of the annotation |
content | dict | The content of the annotation |
pdf_fingerprint | str | The fingerprint of the document |
metadata | dict | The metadata associated with the task and project |
File | str | The name of the file |
TaskId | str | The Id of the task |
Type of Project | str | The metadata added in the advanced settings of the project |
tags | list | List of dictionaries containing the annotated tags |
pages | int | The page number of the selected text |
text | str | The selected text for annotation |
id | str | The Id of the selected text for annotation |
type | str | The type of the label |
box | list | The annotation bounding box |
pageOffsets | list | List of page offsets |
links | list | The list of relationships |
attributes | dict | The document attributes associated with the task |
pageAttributes | list | List of dictionaries containing the attributes for each page |
plainText | dict | Dictionary containing page numbers and the corresponding plain text extracted from the file |
dimensions | list | The dimensions of the pages |
width | float | The width of the page |
height | float | The height of the page |
review | dict | The review details |
rate | str | The rating of the review |
note | str | The note associated with the reviewer |
reviewerId | str | The Id of the reviewer |
jobStart | int | The start time of the annotation (Unix timestamp) |
sessionTime | int | The session time of the annotation, in seconds |
elapsedTime | int | The elapsed time of the annotation, in seconds |
updateTime | int | The update time of the annotation (Unix timestamp) |
lastUpdate | int | The last update time (Unix timestamp, in milliseconds) |
ext | str | The extension of the local file, if any |
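The export records above nest the useful annotation details (tags, review ratings) several levels deep. As a minimal post-processing sketch — assuming exports are stored one JSON object per line, and using the hypothetical helper names `summarize_record`/`summarize_export` — they can be flattened like this:

```python
import json

def summarize_record(record: dict) -> dict:
    """Collect the annotated tags and review ratings from one export record."""
    summary = {"name": record.get("name"), "tags": [], "reviews": []}
    for annotation in record.get("annotations", []):
        content = annotation.get("content", {})
        for tag in content.get("tags", []):
            summary["tags"].append((tag.get("text"), tag.get("type")))
        review = content.get("review", {})
        if review:
            summary["reviews"].append(review.get("rate"))
    return summary

def summarize_export(path: str) -> list:
    # Assumes one JSON record per line; adjust if the export is a single array.
    with open(path) as f:
        return [summarize_record(json.loads(line)) for line in f if line.strip()]
```

The `.get()` calls keep the sketch tolerant of records that lack optional fields, such as items that were never annotated.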
3. TXT Dataset¶
1. Dataset Export¶
{
"source": "https://********/presigned/b423bc857fcb780860add83807e61316.txt?sig=e61dcf794cd5bf4dadcca9e964de63f8b9a4f07f57ce1a65628ba45a976ab99c759c8b7fe315002b58911afddc00ff7b3e2ea51169a5389901b15c9c850f5d7f:fcc4c466036fd1ca58bfa36f53ea4507:64b761d1:6d5745dabe02fc0123cf535b1fd5cb9c",
"name": "ca_newspapers_en_ab_the_calgary_herald_1950_05_29_issue1_page_0008.txt",
"itemId": "214fb51145ff6524a7c5fa23",
"datasetId": "414855a6e615c76816fba51f",
"type": "text/plain",
"tags": [
"dataset tag 1"
],
"metadata": {
"Dataset": "TXT"
},
"active": true,
"ext": "txt"
}
- table:TXT Dataset Export Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
ext (local files) | str | The extension of the local file, if any |
2. With GroundTruth NER Project¶
{
"source": "https://********/presigned/ffd1f1fd01f23a07150051eb3a0ba3ed.txt?sig=fa417407505cfdc13f08a4144f12c7a42c2794470913d6f21d4f3a9ce71f92d4714be51eed1f21f2865f6d2947fe90f4fad19710e7d91a5604a931fb1f4d064b:8daf874146dea01203e9966672b452af:64b76a36:365a5794f4451398d62e10cb08f82929",
"name": "ca_newspapers_en_ab_edmonton_journal_1928_02_16_issue1_page_0010.txt",
"itemId": "17a66e34546c77d6a6ed095a",
"datasetId": "414855a6e615c76816fba51f",
"type": "text/plain",
"tags": [],
"metadata": {
"Dataset": "TXT"
},
"active": true,
"project": "866ad732042bde9b94929cc3",
"taskId": "52755d415dd68822fbdafc20",
"annotations": [
{
"email": "q1@qc.com",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 8,
"date": "2023-07-18T04:40:32.847Z",
"content": {
"metadata": {
"File": "ca_newspapers_en_ab_edmonton_journal_1928_02_16_issue1_page_0010.txt",
"TaskId": "52755d415dd68822fbdafc20",
"Type of Project": "NER"
},
"absoluteOffsets": true,
"tags": [
{
"page": 1,
"text": "BRITAIN WILL HONOR",
"id": 1,
"type": "PERSON"
},
{
"page": 1,
"text": "Grent Britain",
"id": 2,
"type": "PERSON"
},
{
"page": 1,
"text": "February 21",
"id": 3,
"type": "DATE"
},
{
"page": 1,
"text": "Earl of Oxford",
"id": 4,
"type": "ORGANIZATION"
},
{
"page": 1,
"text": "SUTTON COURTENAY",
"id": 5,
"type": "ORGANIZATION"
}
],
"pageOffsets": [
0
],
"links": [
{
"page": 1,
"id1": 1,
"id2": 2,
"relationship": "Precede"
},
{
"page": 1,
"id1": 4,
"id2": 5,
"relationship": "Precede"
}
],
"attributes": {
"tags": [],
"links": [],
"Doc Ok?": "Yes"
},
"pageAttributes": [
{
"Page OK?": null
},
{
"Page OK?": "Yes"
}
],
"plainText": {
"1": "BRITAIN WILL HONOR EARL AT ABBEY SERVICE ! But Great British Statesman Is to Be Buried Privately SUTTON COURTENAT. England, Feb. eminent men and the press of Grent Britain praised the Earl of Oxford's life of service ed mourned his death, the body of the aged state man, who died at his home here early yesterday, was carried last night to the parish church of Sutton Courtenny. The early will be buried privately and not in Il'estminster Abbey. Tals announcement was made last night by the family. and the decision was in accordance with the special wish expressed by Lord Oxford some time ago Memorial Service A memorial service for the former premier, however, will be held In the abbey at noon February 21. A simple service Tor the family w! be held In the parish church Saturday morning. Praise of the Earl of Oxford and Asquith as a great parliamentarian, a forceful, gracious debater and an the selfish servant of the nation's welfare is contained in thousands of messages of condolence published and received my his widow. All recall his activities In the early days of the war. when. as ========== man Is to Be Buried Privately SUTTON COURTENAY. England, Feb. 16. -Wl'hlle eminent men and the press of Grent Britain praised the Earl of Oxford's life of service ed mourned his death, the body of the aged states man, who died at his home here early yesterday, was carried last night to the parish church of Sutton Courtenay, The early will be burled privately and not in l'estminster Abbey. Tals announcement was made last night by the family. and the decision was in accordance with the special wish is. pressed by Lord Oxford some time ago Memorial Service A memorial service for the former premier, however, will be held In the abbey at noon February 21. A simple service Tor the family n! be held in the parish church Saturday morning. 
Praise of the Earl of Oxford and Asquith as a great parliament, a forceful, gracious debater and an the selfish servant of the nation's welfare is contained in thousands of messages of condolence published and received my his widow. All recall his activities In the early days of the war. when. prime minister, he breathed the Britisn ---------- Recall Declaration Many proudly remember his declaration In the face of Germany's seemingly irresistible advance when the \"* We shall never sheathe the sword which we have not lightly drawn until Beiglum recovers in full measure all and more than all. she had sacrificed. until France is adequately secured against the menace of aggression ; until the rights of the stiller nationalists of Europe are placed upon an unassailable foundation, and until the mill ward domination of Prussia Is wholly and finally destroyed \" ========== Ottawa House Pays Mead of Tribute OTTAWA, Feb. 16. --Tho prime minister at the opening of the house or commons yesterday afternoon rose to suggest that the house should pause In the midst* of its duties to pay, tribute to the memory of Lord Oxtord and Asquith, Mr. King reminded the house that Lord Oxford's career cox tended over the greater part of half a century and that he had held the post of prime minister continuously for 0 longer period than any who had over held that office. As to his part in the war Premier King stated that the burden of responsibility undoubtedly affected the constitution of the former prime of Britaln and hastened his death. \" was fitting that members of the Can'1dian committee should join with the members of l'estminster in extending sympathy to the people of Great Britain for the great old that bad been created. ---------- Bennett Adds Word Hon. R. L. Bennett, leader of the opposition, said that it fell to the leader of the house, the prime minister to extend the sympathy of the people. 
Un behalf of those who sat in opposition he desired to Join In the sympathy that had been expressed. The prime minister of Canada, Mr. Bennett said, might feel that he was u worthy disciple of Mr. Asquith because the latter had held office for some time with the aid of conflicting groups in the house of commons. Mr. Asquith had been a great scholar, n great orator, and had well maintained the noble traditions of parliament. The empire had lost a very fine citizen but he had left behind him n most Inspiring legacy. Robert Gardiner (U. F / Aendin) speaking on behalf of his Grotius joined In the tribute to a man who would he best remembered r. s. the man who had \" at heart the Interests of the common people \""
},
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "62de356a2f027ab62a00bef1"
},
"jobStart": 1689655223,
"sessionTime": 8,
"elapsedTime": 8,
"updateTime": 1689655231,
"lastUpdate": 1689655232841
}
}
],
"ext": "txt"
}
- table:TXT Dataset Export With GroundTruth NER Project Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
project | str | The project associated with the dataset item |
taskId | str | The Id of the task |
annotations | list | List of dictionaries containing details of annotations |
email | str | The email associated with the user |
messages | list | The messages associated with the user |
role | str | The role associated with the user |
elapsedTime | int | The elapsed time of the annotation, in seconds |
date | str | The date of the annotation |
content | dict | Dictionary containing all the details of the task, such as metadata, tags, pageOffsets, etc. |
metadata | dict | The metadata associated with the task and project |
File | str | The name of the file |
TaskId | str | The Id of the task |
Type of Project | str | The metadata added in the advanced settings of the project |
absoluteOffsets | bool | Indicates the annotation format uses absolute entity offsets |
tags | list | List of dictionaries containing the annotated tags |
page | int | The page number of the selected text |
text | str | The selected text for annotation |
id | int | The Id of the selected text for annotation |
type | str | The type of the label |
links | list | The list of relationships |
attributes | dict | The document attributes associated with the task |
pageAttributes | list | List of dictionaries containing the attributes for each page |
plainText | dict | Dictionary containing page numbers and the corresponding plain text extracted from the file |
dimensions | list | The dimensions of the pages |
width | float | The width of the page |
height | float | The height of the page |
review | dict | The review details |
rate | str | The rating of the review |
note | str | The note associated with the reviewer |
reviewerId | str | The Id of the reviewer |
jobStart | int | The start time of the annotation (Unix timestamp) |
sessionTime | int | The session time of the annotation, in seconds |
elapsedTime | int | The elapsed time of the annotation, in seconds |
updateTime | int | The update time of the annotation (Unix timestamp) |
lastUpdate | int | The last update time (Unix timestamp, in milliseconds) |
ext | str | The extension of the local file, if any |
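In the NER export, `links` reference entities by their `id1`/`id2` values, so relationships have to be resolved against the `tags` list. A minimal sketch of that resolution, assuming the record structure shown above (the function name `extract_entities_and_links` is hypothetical):

```python
def extract_entities_and_links(record: dict):
    """Resolve NER tags and links into (entity, relation) lists."""
    entities, relations = [], []
    for annotation in record.get("annotations", []):
        content = annotation.get("content", {})
        # Index tags by id so links can be resolved to their text spans.
        by_id = {tag["id"]: tag for tag in content.get("tags", [])}
        entities.extend((tag["text"], tag["type"]) for tag in by_id.values())
        for link in content.get("links", []):
            src, dst = by_id.get(link["id1"]), by_id.get(link["id2"])
            if src and dst:
                relations.append((src["text"], link["relationship"], dst["text"]))
    return entities, relations
```

Links whose endpoints cannot be found in the tag list are skipped rather than raising, which keeps the sketch robust to partially edited annotations.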
3. With GroundTruth Classification Project¶
{
"source": "https://sandboxdocuments.tensoract.com/presigned/3c9a446180a44faa43ab2464d45633c7.txt?sig=e5128796abbbc2f33ba9b83af2a755207d9b247a3a624c7992bf8c70a5af621d95cd95c7b63b940d14832d295814310e2b7fff4622944e9ac9815d52ed507311:2e41603c382003a3c456e47b2768981f:64b7848f:2650c588fc2cced10e6118802086d776",
"name": "business_2.txt",
"itemId": "2d2020aa8e3deb383fb7c74f",
"datasetId": "60974d4e9e7759842cdff3be",
"type": "text/plain",
"tags": [
"dataset tag"
],
"metadata": {
"Dataset Type": "TXT"
},
"active": true,
"project": "591351b938d008ca0745510a",
"taskId": "8c939483b594e0de5d5efb54",
"annotations": [
{
"email": "johndoe@me.com",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 8,
"date": "2023-07-18T06:03:46.657Z",
"content": {
"metadata": {
"File": "business_2.txt",
"TaskId": "8c939483b594e0de5d5efb54",
"Type of Project": "Classification"
},
"classificationTypes": {
"Select Type of Document": "select",
"Type of Documents": "multi",
"Put a note": "text"
},
"classifications": {
"Select Type of Document": [
"Technology"
],
"Type of Documents": [
"Graphics",
"Bussiness"
],
"Put a note": [
"Multi-type document"
]
},
"plainText": {
"1": "Japanese growth grinds to a halt Growth in Japan evaporated in the three months to September, sparking renewed concern about an economy not long out of a decade-long trough. Output in the period grew just 0.1%, an annual rate of 0.3%. Exports - the usual engine of recovery - faltered, while domestic demand stayed subdued and corporate investment also fell short. The growth falls well short of expectations, but does mark a sixth straight quarter of expansion. The economy had stagnated throughout the 1990s, experiencing only brief spurts of expansion amid long periods in the doldrums. One result was deflation - prices falling rather than rising - which made Japanese shoppers cautious and kept them from spending. The effect was to leave the economy more dependent than ever on exports for its recent recovery. But high oil prices have knocked 0.2% off the growth rate, while the falling dollar means products shipped to the US are becoming relatively more expensive. The performance for the third quarter marks a sharp downturn from earlier in the year. The first quarter showed annual growth of 6.3%, with the second showing 1.1%, and economists had been predicting as much as 2% this time around. \"Exports slowed while capital spending became weaker,\" said Hiromichi Shirakawa, chief economist at UBS Securities in Tokyo. \"Personal consumption looks good, but it was mainly due to temporary factors such as the Olympics. \"The amber light is flashing.\" The government may now find it more difficult to raise taxes, a policy it will have to implement when the economy picks up to help deal with Japan's massive public debt. "
},
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "614b55be8af65dcf41da535b"
},
"jobStart": 1689660217,
"sessionTime": 8,
"elapsedTime": 8,
"updateTime": 1689660225,
"pageOffsets": [
0
],
"lastUpdate": 1689660226654
}
}
],
"ext": "txt"
}
- table:TXT Dataset Export With GroundTruth Classification Project Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
project | str | The project associated with the dataset item |
taskId | str | The Id of the task |
annotations | list | List of dictionaries containing details of annotations |
email | str | The email associated with the user |
messages | list | The messages associated with the user |
role | str | The role associated with the user |
elapsedTime | int | The elapsed time of the annotation, in seconds |
date | str | The date of the annotation |
content | dict | The content of the annotation |
metadata | dict | The metadata associated with the task and project |
File | str | The name of the file |
TaskId | str | The Id of the task |
Type of Project | str | The metadata added in the advanced settings of the project |
classificationTypes | dict | Dictionary containing the labels defined in the project |
Select Type of Document | str | Single-select label |
Type of Documents | str | Multi-select label |
Put a note | str | Plain-text label |
classifications | dict | Dictionary containing the classification labels in the task |
plainText | dict | Dictionary containing page numbers and the corresponding plain text extracted from the file |
review | dict | The review details |
rate | str | The rating of the review |
note | str | The note associated with the reviewer |
reviewerId | str | The Id of the reviewer |
jobStart | int | The start time of the annotation (Unix timestamp) |
sessionTime | int | The session time of the annotation, in seconds |
elapsedTime | int | The elapsed time of the annotation, in seconds |
updateTime | int | The update time of the annotation (Unix timestamp) |
lastUpdate | int | The last update time (Unix timestamp, in milliseconds) |
ext | str | The extension of the local file, if any |
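`classificationTypes` declares each label's widget type (select, multi, text) while `classifications` holds the values actually chosen, keyed by the same label names. A minimal sketch that pairs the two, assuming the `content` structure shown above (the helper name `flatten_classifications` is hypothetical):

```python
def flatten_classifications(content: dict) -> dict:
    """Pair each classification label with its declared type and selected values."""
    types = content.get("classificationTypes", {})
    values = content.get("classifications", {})
    return {
        label: {"type": types.get(label), "values": values.get(label, [])}
        for label in types
    }
```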
4. Image Dataset¶
1. Dataset Export¶
{
"source": "s3://newton-ai-internal-share/files/car1.jpeg",
"name": "car1.jpeg",
"itemId": "9a0498002663227e1e7d5e14",
"datasetId": "25987f46e5febb50484e8497",
"type": "image/jpeg",
"tags": [
"dataset tag"
],
"metadata": {
"Dataset Type": "Image",
"xxx": 12,
"presigned": "http://aaa.com"
},
"active": true
}
- table:Image Dataset Export:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
ext (local files) | str | The extension of the local file, if any |
2. With GroundTruth Bulk Image Classification Project¶
{
"source": "https://**********/presigned/b2b6d70656c53b10b0a194296a3598b5.tiff?sig=d3c6212eb811deb8c325f38e829d6f542a32e8a171c062d7be0e2125ff7c62628e05b3e66d996edf089745ee35699c48bc7fb7c3db86ada01667cb3f18c54990:6d4aa6c56aa19b8ac45d6630fb465e4d:64b7b5c5:ea2ae71fa9cc9950931e3335ced82926",
"name": "cyan.tiff",
"itemId": "93007aa49a9288b9b460528b",
"datasetId": "25987f46e5febb50484e8497",
"type": "image/tiff",
"tags": [
"color images"
],
"metadata": {
"Dataset Type": "Image",
"color": "cyan"
},
"active": true,
"project": "e3c9b4a1dd6df1c4c7091895",
"taskId": "ff5e6111d67ec09cb7578577",
"annotations": [],
"classification": "Cyan",
"ext": "tiff"
}
- table:Image Dataset Export With GroundTruth Bulk Image Classification Project Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
project | str | The project associated with the dataset item |
taskId | str | The Id of the task |
classification | str | The classified label of the task |
ext | str | The extension of the local file, if any |
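Because bulk classification exports carry the assigned label as a top-level `classification` field, grouping items by label is a one-pass operation. A minimal sketch, assuming a list of records in the format shown above (the function name `group_by_label` is hypothetical):

```python
from collections import defaultdict

def group_by_label(records: list) -> dict:
    """Map each classification label to the item names that received it."""
    groups = defaultdict(list)
    for record in records:
        label = record.get("classification")
        if label is not None:  # Skip items that were never classified.
            groups[label].append(record["name"])
    return dict(groups)
```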
3. With GroundTruth Object Detection Project¶
{
"source": "https://sandboxdocuments.tensoract.com/presigned/2ba176b5957d922c0b5867cf99a6895a.jpg?sig=6ef9b9444985b2d637dc53869e894b6b64779f65dc5dfa0a1d23129ed89621c2859f32988ca8fb76ed3d292a441c167dd85dd9987b27ed6997e90e4cd1051584:59e6f8bc53146080d740b01e67328244:64b8a815:f5df07dd0c032aa261ffa1470b799829",
"name": "im00.jpg",
"itemId": "9c8b14f93c1fbc5b2fe996ae",
"datasetId": "abd204685e5c074b282d6744",
"type": "image/jpeg",
"tags": [
"dataset tag 1"
],
"metadata": {
"Dataset": "Image Dataset"
},
"active": true,
"project": "0461637f62c18082f3c14cc3",
"taskId": "dff56fb67e79f0cb887263cb",
"annotations": [
{
"email": "jdoeqa@acme.org",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 13,
"date": "2023-06-17T10:10:29.789Z",
"content": {
"url": "https://sandboxdocuments.tensoract.com/presigned/2ba176b5957d922c0b5867cf99a6895a.jpeg?sig=35d229ffb09200bbd28eb9c0ab00d7c2a446f0c85daa6b6204e0b1f043229c1e0b25b6633b477b2145d77d96a0ddd6cb7922e104e72df00583df7c9bed233058:1f88e35afe77cfb705c7dfabce067f20:648ed802:fc875ac8ba21580b86f20874bafee1d9",
"imageWidth": 720,
"imageHeight": 1280,
"selected": null,
"boxes": [
{
"x1": 310.7026,
"y1": 468.443,
"x2": 507.9414,
"y2": 1021.1236,
"id": "b0",
"type": "box",
"oid": "b1",
"outside_image": {},
"occluded": {},
"invisible": false,
"attrs": {},
"title": "",
"label": "Group 1",
"sub_labels": [
{
"x1": 357.9577,
"y1": 495.1525,
"id": "b1",
"type": "keypoint",
"oid": "b2",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Top of head",
"sub_labels": []
}
]
}
],
"image_attrs": {},
"review": {
"rate": "Rejected",
"note": ""
},
"jobStart": 1686996615,
"sessionTime": 13,
"elapsedTime": 13,
"tsSeconds": true,
"updateTime": 1686996628,
"lastUpdate": 1686996629784
}
}
],
"ext": "jpg"
}
- table:Image Dataset Export With GroundTruth Object Detection Project Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | Metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
project | str | The project associated with the dataset item |
taskId | str | The Id of the task |
annotations | list | List of dictionaries representing the annotations |
email | str | The email associated with the user |
messages | list | The messages associated with the user |
role | str | The role associated with the user |
elapsedTime | int | The elapsed time of the annotation, in seconds |
date | str | The date of the annotation |
content | dict | Dictionary containing all the details of the task, such as boxes, image attributes, etc. |
url | str | The presigned URL or S3 path of the task |
imageWidth | int | The width of the image in pixels |
imageHeight | int | The height of the image in pixels |
boxes | list | List of bounding boxes drawn around objects in the image |
x1 | float | The x-coordinate of the top-left corner of the bounding box |
y1 | float | The y-coordinate of the top-left corner of the bounding box |
x2 | float | The x-coordinate of the bottom-right corner of the bounding box |
y2 | float | The y-coordinate of the bottom-right corner of the bounding box |
id | str | The Id of the bounding box |
type | str | Flag to indicate whether it is a box or a keypoint |
outside_image | dict | Indicates whether the object extends beyond the boundaries of the image |
occluded | dict | Indicates whether the object is occluded or partially hidden |
attrs | dict | Any additional attributes or properties associated with the object |
label | str | The label assigned to the bounding box |
sub_labels | list | Any sub-labels or sub-categories associated with the object |
x1 | float | The x-coordinate of the keypoint |
y1 | float | The y-coordinate of the keypoint |
id | str | The Id of the keypoint |
type | str | Flag to indicate whether it is a box or a keypoint |
outside_image | dict | Indicates whether the object extends beyond the boundaries of the image |
occluded | dict | Indicates whether the object is occluded or partially hidden |
label | str | The label or category assigned to the keypoint |
sub_labels | list | Any sub-labels associated with the object |
image_attrs | dict | The image attributes associated with the task |
review | dict | The review details |
rate | str | The rating of the review |
note | str | The note associated with the reviewer |
elapsedTime | int | The elapsed time of the annotation, in seconds |
updateTime | int | The update time of the annotation (Unix timestamp) |
lastUpdate | int | The last update time (Unix timestamp, in milliseconds) |
ext | str | Extension of local files, if any |
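The boxes in this export use corner coordinates (`x1`, `y1`, `x2`, `y2`), while many training pipelines expect `[x, y, width, height]`. A minimal conversion sketch, assuming the `content` structure shown above (the function name `boxes_to_xywh` is hypothetical):

```python
def boxes_to_xywh(content: dict) -> list:
    """Convert corner-based boxes (x1, y1, x2, y2) to [x, y, width, height]."""
    results = []
    for box in content.get("boxes", []):
        if box.get("type") != "box":
            continue  # Skip keypoints, which carry only x1/y1.
        results.append({
            "label": box.get("label"),
            "bbox": [box["x1"], box["y1"],
                     box["x2"] - box["x1"], box["y2"] - box["y1"]],
        })
    return results
```

Nested `sub_labels` (such as keypoints attached to a box) are ignored here; a fuller converter would recurse into them.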
5. Video Dataset¶
1. Dataset Export¶
{
"source": "s3://test-pocs/mira640.mp4",
"name": "mira640.mp4",
"itemId": "a90bf16c2ba7b6e9ac1a6d9d",
"datasetId": "84431d90c6497d6ab6425dfc",
"type": "video/mp4",
"tags": [
"dataset tag"
],
"metadata": {
"xxx": 11,
"presigned": "http://aaa.com"
},
"active": true
}
- table:Video Dataset Export Summary:
Field Names | Type | Description |
---|---|---|
source | str | The presigned URL or S3 path of the data source |
name | str | The name of the dataset item |
itemId | str | The Id of the dataset item |
datasetId | str | The Id of the dataset |
type | str | The type of the dataset |
tags | list | List of tags associated with the dataset item |
metadata | dict | The metadata associated with the dataset and dataset item |
active | bool | Indicates whether the dataset item is currently active |
ext (local files) | str | The extension of the local file, if any |
2. With GroundTruth Media Transcription Project¶
{
"source": "https://sandboxdocuments.tensoract.com/presigned/da00a38a91f62483658e2126e789f63e.mp4?sig=6ab60805f723d8565a15b6dfacc057e8953f99fbb21d39ff30a8a0da2f39b1cbefb9f28139b8eedc55545d6cd0fadaf56ed14a450ead8f79a48b1d233083ba63:f116065b73cb90ba8f217d0bfee72ab5:64b8acc9:3ab540299e282e90e6ece2f127f7d15a",
"name": "Video 1.mp4",
"itemId": "0b9ff34daa6a2f6f95c59bb3",
"datasetId": "1ac8fb72573008ce5626bbfb",
"type": "video/mp4",
"tags": [
"dataset tag1"
],
"metadata": {
"Dataset Type": "Video"
},
"active": true,
"project": "c859cc0a92b7bd4d6d166707",
"taskId": "c53ca1cace7d7e784705b631",
"annotations": [
{
"email": "jdoeqa@acme.org",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 62,
"date": "2023-06-17T12:16:38.937Z",
"content": {
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "63ca81cd31d698c1825328f3"
},
"videoSource": "https://sandboxdocuments.tensoract.com/presigned/da00a38a91f62483658e2126e789f63e.mp4?sig=9f13caf4ac583fbc9c074b91770e579120196b3f08e192ad641290c246a7add58f6d6f40ce1aa63e031ada96ff8c996038e4ed46516ffce1f56262cc6a435eeb:062d59d190dc03e090dcd5ee5ff17faa:648ef563:ac1a510bd2fb6c0ab328a3202eb9c846",
"streams": {
"Transcription": [
{
"start": 0.025000260441083017,
"end": 0.9500098967611547,
"confidence": 1,
"text": "Bonjoi"
},
{
"start": 0.9500098967611547,
"end": 2.0812716817201613,
"confidence": 1,
"text": "Tava Tuti"
},
{
"start": 2.1187720723817858,
"end": 3.156282880686731,
"confidence": 1,
"text": "Hello"
},
{
"start": 3.156282880686731,
"end": 4.13629310905087,
"confidence": 1,
"text": "Ola"
},
{
"start": 4.165043351337061,
"end": 4.608797974166285,
"confidence": 1,
"text": "Tutu beng"
},
{
"start": 10.453858750849383,
"end": 12.07887567951978,
"confidence": 1,
"text": "oye tutubeng"
}
],
"Language Segmentation": [
{
"start": 0.018750195330812264,
"end": 0.5687559250346386,
"confidence": 1,
"tag": "French"
},
{
"start": 0.5937561854757216,
"end": 2.0937718119407025,
"confidence": 1,
"tag": "German"
},
{
"start": 2.0937718119407025,
"end": 3.1650329694568993,
"confidence": 1,
"tag": "English"
},
{
"start": 3.2437837922305217,
"end": 4.156293298330052,
"confidence": 1,
"tag": "French"
},
{
"start": 4.250044274984113,
"end": 6.318815826483732,
"confidence": 1,
"tag": "Russian"
},
{
"start": 6.8338213441595235,
"end": 8.190085473088276,
"confidence": 1,
"tag": "French"
},
{
"start": 8.321336840403962,
"end": 9.865102922640839,
"confidence": 1,
"tag": "English"
},
{
"start": 9.915103443523005,
"end": 11.071365488923096,
"confidence": 1,
"tag": "German"
},
{
"start": 11.233867181790135,
"end": 12.6088815060497,
"confidence": 1,
"tag": "Arabic"
}
]
},
"mediaAttributes": {
"Is Video Clear?": "Yes",
"Aditional Notes": ""
},
"jobStart": 1687003443,
"sessionTime": 62,
"elapsedTime": 93.075,
"tsSeconds": true,
"updateTime": 1687004194,
"metadata": {
"File": "Video 1.mp4",
"TaskId": "c53ca1cace7d7e784705b631",
"Type": "Media Transcription"
},
"lastUpdate": 1687004198934
}
}
],
"ext": "mp4"
}
- table:Video Dataset Export With GroundTruth Media Transcription Project Summary:
Field Names |
Type |
Description |
---|---|---|
source |
str |
The presigned URL or S3 path of the data source |
name |
str |
The name of the dataset item |
itemId |
str |
The Id of the dataset item |
datasetId |
str |
The Id of the dataset |
type |
str |
The type of the dataset |
tags |
list |
List of tags associated with the dataset item |
metadata |
dict |
The metadata associated with the dataset and dataset item |
active |
bool |
Indicates whether the dataset item is currently active |
project |
str |
The project associated with the dataset item |
taskId |
str |
The Id of the task |
annotations |
list |
List of dictionaries representing the annotations |
email |
str |
The email associated with the user |
messages |
list |
The messages associated with the user |
role |
str |
The role associated with user |
elapsedTime |
int |
The elapsed time of the annotation |
date |
str |
The date of the annotation |
content |
dict |
Dictionary containing the annotation content, such as the review, streams, and media attributes |
review |
dict |
The review details |
rate |
str |
The rate of the review |
note |
str |
The note associated with the reviewer |
reviewerId |
str |
The Id of the reviewer |
videoSource |
str |
The presigned URL or S3 path of the task |
streams |
dict |
Dictionary of different streams within the video, each containing specific information |
Transcription |
list |
The stream containing the transcribed text segments |
start |
float |
The starting timestamp (in seconds) of the transcribed text segment |
end |
float |
The ending timestamp (in seconds) of the transcribed text segment |
confidence |
int |
Indicates the confidence level |
text |
str |
The actual transcribed text for the corresponding segment |
Language Segmentation |
list |
The stream containing information about the segmentations in the video |
start |
float |
The starting timestamp of the segment |
end |
float |
The ending timestamp of the segment |
confidence |
int |
Indicates the confidence level |
tag |
str |
The tag in the corresponding segment |
mediaAttributes |
dict |
The media attributes associated with the task |
jobStart |
int |
The start time of the annotation |
sessionTime |
int |
The session time of the annotation |
elapsedTime |
int |
The elapsed time of the annotation |
tsSeconds |
bool |
Indicates whether the timestamps are expressed in seconds |
updateTime |
int |
The update time of the annotation |
metadata |
dict |
The metadata associated with the annotation |
lastUpdate |
int |
The last update time |
ext (local files) |
str |
Extension of local files, if any |
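Each stream in `content["streams"]` is a list of timestamped entries, so it flattens naturally into tuples. A hedged sketch; `content` is trimmed to the streams, with values copied from the example export above:

```python
# Hedged sketch: `content` mirrors the annotation "content" documented above,
# trimmed to the streams; values are copied from the example export.
content = {
    "streams": {
        "Transcription": [
            {"start": 0.025, "end": 0.950, "confidence": 1, "text": "Bonjoi"},
            {"start": 0.950, "end": 2.081, "confidence": 1, "text": "Tava Tuti"},
        ],
        "Language Segmentation": [
            {"start": 0.019, "end": 0.569, "confidence": 1, "tag": "French"},
        ],
    }
}

# Flatten each stream into (start, end, payload) triples
segments = [(s["start"], s["end"], s["text"])
            for s in content["streams"].get("Transcription", [])]
languages = [(s["start"], s["end"], s["tag"])
             for s in content["streams"].get("Language Segmentation", [])]
print(segments)
print(languages)
```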
6. Audio Dataset¶
1. Dataset Export¶
{
    "source": "s3://test-pocs/mira640.mp3",
    "name": "mira640.mp3",
    "itemId": "4012452ecf60608061c2baed",
    "datasetId": "a38d3e9d800fc55e079a3b1d",
    "type": "audio/mpeg",
    "tags": [
        "dataset tag 1"
    ],
    "metadata": {
        "DATASET Type": "Audio",
        "xxx": 11,
        "presigned": "http://aaa.com"
    },
    "active": true
}
- table:Audio Dataset Export Summary:
Field Names |
Type |
Description |
---|---|---|
source |
str |
The presigned URL or S3 path of the data source |
name |
str |
The name of the dataset item |
itemId |
str |
The Id of the dataset item |
datasetId |
str |
The Id of the dataset |
type |
str |
Type of the dataset |
tags |
list |
List of tags associated with the dataset item |
metadata |
dict |
The metadata associated with the dataset and dataset item |
active |
bool |
Indicates whether the dataset item is currently active |
ext (local files) |
str |
Extension of local files, if any |
2. With GroundTruth Media Transcription Project¶
{
"source": "s3://test-pocs/mira640.mp3",
"name": "mira640.mp3",
"itemId": "1c944e058058dbe48c980ead",
"datasetId": "edd9c7d7ae5c1b0c2bc73643",
"type": "audio/mpeg",
"tags": [
"dataset tag"
],
"metadata": {
"Dataset Type": "Audio"
},
"active": true,
"project": "c859cc0a92b7bd4d6d166707",
"taskId": "eba777874d6d268ece56b33a",
"annotations": [
{
"email": "johndoe@me.com",
"messages": [],
"role": "nlp_qc",
"elapsedTime": 6,
"date": "2023-07-19T04:01:20.036Z",
"content": {
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "614b55be8af65dcf41da535b"
},
"audioSource": "https://test-pocs.s3.amazonaws.com/mira640.mp3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUD4REC47DTY4PF7A%2F20230719%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230719T040111Z&X-Amz-Expires=7200&X-Amz-Signature=10048222e39f53cd9dbbb2b97b8aa210e323d8fc544a9ea18ff46e5922974f44&X-Amz-SignedHeaders=host",
"streams": {
"Transcription": [
{
"start": 0.018749917353218702,
"end": 0.6312472175583629,
"confidence": 1,
"text": "Bonjoi"
},
{
"start": 0.7187468318733835,
"end": 1.881241707772943,
"confidence": 1,
"text": "Tava Tuti"
},
{
"start": 2.0124911292454737,
"end": 2.4874890355270143,
"confidence": 1,
"text": "Hello"
}
],
"Language Segmentation": [
{
"start": 0.006249972451072901,
"end": 0.4437480440261759,
"confidence": 1,
"tag": "French"
},
{
"start": 0.5437476032433424,
"end": 1.8687417628707974,
"confidence": 1,
"tag": "German"
},
{
"start": 1.9187415424793806,
"end": 2.6687382366081285,
"confidence": 1,
"tag": "English"
}
]
},
"mediaAttributes": {
"Is Video Clear?": "Yes",
"Aditional Notes": ""
},
"jobStart": 1689739272,
"sessionTime": 6,
"elapsedTime": 6,
"tsSeconds": true,
"updateTime": 1689739278,
"lastUpdate": 1689739280033
}
}
]
}
- table:Audio Dataset Export With GroundTruth Media Transcription Project Summary:
Field Names |
Type |
Description |
---|---|---|
source |
str |
The presigned URL or S3 path of the data source |
name |
str |
The name of the dataset item |
itemId |
str |
The Id of the dataset item |
datasetId |
str |
The Id of the dataset |
type |
str |
The type of the dataset |
tags |
list |
List of tags associated with the dataset item |
metadata |
dict |
The metadata associated with the dataset and dataset item |
active |
bool |
Indicates whether the dataset item is currently active |
project |
str |
The project associated with the dataset item |
taskId |
str |
The Id of the task |
annotations |
list |
List of dictionaries representing the annotations |
email |
str |
The email associated with the user |
messages |
list |
The messages associated with the user |
role |
str |
The role associated with user |
elapsedTime |
int |
The elapsed time of the annotation |
date |
str |
The date of the annotation |
content |
dict |
Dictionary containing the annotation content, such as the review, streams, and media attributes |
review |
dict |
The review details |
rate |
str |
The rate of the review |
note |
str |
The note associated with the reviewer |
reviewerId |
str |
The Id of the reviewer |
audioSource |
str |
The presigned URL or S3 path of the task |
streams |
dict |
Dictionary of different streams within the audio file, each containing specific information |
Transcription |
list |
The stream containing the transcribed text segments |
start |
float |
The starting timestamp (in seconds) of the transcribed text segment |
end |
float |
The ending timestamp (in seconds) of the transcribed text segment |
confidence |
int |
Indicates the confidence level |
text |
str |
The actual transcribed text for the corresponding segment |
Language Segmentation |
list |
The stream containing information about the segmentations in the audio |
start |
float |
The starting timestamp of the segment |
end |
float |
The ending timestamp of the segment |
confidence |
int |
Indicates the confidence level |
tag |
str |
The tag in the corresponding segment |
mediaAttributes |
dict |
The media attributes associated with the task |
jobStart |
int |
The start time of the annotation |
sessionTime |
int |
The session time of the annotation |
elapsedTime |
int |
The elapsed time of the annotation |
updateTime |
int |
The update time of the annotation |
metadata |
dict |
The metadata associated with the annotation |
lastUpdate |
int |
The last update time |
ext (local files) |
str |
Extension of local files, if any |
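Because every `Transcription` entry carries `start`/`end` timestamps in seconds, the stream can be rewritten into a simple SRT-style subtitle block. A sketch under those assumptions; the segment values are copied from the audio example above:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def stream_to_srt(transcription):
    """Render a Transcription stream (list of start/end/text dicts) as SRT."""
    blocks = []
    for i, seg in enumerate(transcription, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

# Values mirror the audio example above
transcription = [
    {"start": 0.0187, "end": 0.6312, "confidence": 1, "text": "Bonjoi"},
    {"start": 0.7187, "end": 1.8812, "confidence": 1, "text": "Tava Tuti"},
]
print(stream_to_srt(transcription))
```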
Project Exports¶
1. OCR Project¶
{
"project_id": "7b3020dd437ce2a30bae1c5a",
"project_name": "Test-OCR-Project-1",
"project_type": "OCR",
"datasetId": "e3773b85655ea8646005158a",
"itemId": "9cebea4c95edc877ca6f2603",
"file_name": "ABSTRACT - Axia.tiff",
"file_type": "application/pdf",
"source": "https://sandboxdocuments.tensoract.com/presigned/a01f5c95d843b4fd4f890570e5cac51c.pdf?sig=fc8cd601e4b6d76de180378b1663c1b8b1ac21c2a82fd7909bf959bce43964344830f137475dd53602d458cbd780b08626379b8329f5dea96f7bdf78b727d5f2:60847ae533c350f801adca47a54b6cfb:64cdee0b:bad3e8fb52c965453a0f9fd8ffde6c9e",
"state": 4,
"task_id": "0931952ce4a27f53a3678cfe",
"state_description": "Approved",
"annotations": [
{
"email": "yannevarsha6@gmail.com",
"messages": [],
"role": "Reviewer",
"elapsedTime": 14,
"date": "2023-08-04T06:21:00.589Z",
"content": {
"pdf_fingerprint": "c04f692d342c06d433f751ac32c6d8b1",
"metadata": {
"ocr_model": "Textract (default)",
"use-textract-only": true,
"source_ref": "/uploads/e3773b85655ea8646005158a/9cebea4c95edc877ca6f2603",
"document_id": "9cebea4c95edc877ca6f2603",
"Type of Project": "OCR"
},
"tags": [
{
"page": 1,
"text": "N A M E",
"id": 1,
"type": "Name",
"kv_type": "key",
"words": [
"N",
"A",
"M",
"E"
],
"boxes": [
[
0.06499018520116806,
0.11739349365234375,
0.07347860559821129,
0.12546881940215826
],
[
0.06458062678575516,
0.13079734146595,
0.0742951761931181,
0.1387380100786686
],
[
0.06520503759384155,
0.14403623342514038,
0.07536023296415806,
0.15211013052612543
],
[
0.06526166200637817,
0.15757058560848236,
0.07337938901036978,
0.16564789321273565
]
],
"range": [
[
71,
72
],
[
126,
127
],
[
165,
166
],
[
194,
195
]
]
},
{
"page": 1,
"text": "Axia Women's Health",
"id": 2,
"type": "Name",
"textAdjust": "Axia Women's",
"kv_type": "value",
"words": [
"Axia",
"Women's",
"Health"
],
"boxes": [
[
0.0935770571231842,
0.11707708239555359,
0.11941905505955219,
0.1253887191414833
],
[
0.12276646494865417,
0.11710146069526672,
0.17684946581721306,
0.1254600789397955
],
[
0.18119750916957855,
0.11732043325901031,
0.21823260188102722,
0.12542327493429184
]
],
"range": [
[
73,
77
],
[
78,
85
],
[
86,
92
]
]
},
{
"page": 1,
"text": "BILL TO",
"id": 3,
"type": "Name",
"rawBox": true,
"kv_type": "key",
"words": [
"BILL TO"
],
"boxes": [
[
0.4980276134122288,
0.10967250571210967,
0.5374753451676528,
0.1706016755521706
]
],
"range": []
},
{
"page": 1,
"text": "Regional Womens Health",
"id": 4,
"type": "Name",
"rotate": 24,
"rawBox": true,
"kv_type": "value",
"words": [
"Regional Womens Health"
],
"boxes": [
[
0.5473372781065089,
0.11119573495811119,
0.7682445759368837,
0.12795125666412796
]
],
"range": []
},
{
"page": 1,
"text": "Cat.",
"id": 5,
"type": "Name",
"table": {
"id": 4,
"x": 0,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Cat."
],
"boxes": [
[
0.39583876729011536,
0.3084534704685211,
0.4190108198672533,
0.31684120278805494
]
],
"range": [
[
543,
547
]
]
},
{
"page": 1,
"text": "Cat.",
"id": 6,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 0,
"y": 1,
"cell": true
},
"words": [
"Cat."
],
"boxes": [
[
0.39583876729011536,
0.3084534704685211,
0.4190108198672533,
0.31684120278805494
]
],
"range": [
[
543,
547
]
]
},
{
"page": 1,
"text": "Description",
"id": 7,
"type": "Name",
"table": {
"id": 4,
"x": 1,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Description"
],
"boxes": [
[
0.4328092038631439,
0.3084268271923065,
0.49752890318632126,
0.3184952298179269
]
],
"range": [
[
548,
559
]
]
},
{
"page": 1,
"text": "Description",
"id": 8,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 1,
"y": 1,
"cell": true
},
"words": [
"Description"
],
"boxes": [
[
0.4328092038631439,
0.3084268271923065,
0.49752890318632126,
0.3184952298179269
]
],
"range": [
[
548,
559
]
]
},
{
"page": 1,
"text": "Effective",
"id": 9,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 3,
"y": 0,
"cell": true
},
"words": [
"Effective"
],
"boxes": [
[
0.6239141225814819,
0.2947663366794586,
0.6735980845987797,
0.30344805866479874
]
],
"range": [
[
476,
485
]
]
},
{
"page": 1,
"text": "Sqft.",
"id": 10,
"type": "Name",
"table": {
"id": 4,
"x": 2,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Sqft."
],
"boxes": [
[
0.5750880241394043,
0.30830204486846924,
0.6010445598512888,
0.3183623990043998
]
],
"range": [
[
560,
565
]
]
},
{
"page": 1,
"text": "Sqft.",
"id": 11,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 2,
"y": 1,
"cell": true
},
"words": [
"Sqft."
],
"boxes": [
[
0.5750880241394043,
0.30830204486846924,
0.6010445598512888,
0.3183623990043998
]
],
"range": [
[
560,
565
]
]
},
{
"page": 1,
"text": "ABA",
"id": 12,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 0,
"y": 2,
"cell": true
},
"words": [
"ABA"
],
"boxes": [
[
0.3953396677970886,
0.3291471600532532,
0.42196371778845787,
0.3373938351869583
]
],
"range": [
[
626,
629
]
]
},
{
"page": 1,
"text": "Date",
"id": 13,
"type": "Name",
"table": {
"id": 4,
"x": 3,
"y": 1,
"cell": true
},
"kv_type": "key",
"words": [
"Date"
],
"boxes": [
[
0.6240901350975037,
0.3085164725780487,
0.6510729901492596,
0.31685456447303295
]
],
"range": [
[
566,
570
]
]
},
{
"page": 1,
"text": "Date",
"id": 14,
"type": "TABLEHEADER",
"table": {
"id": 4,
"x": 3,
"y": 1,
"cell": true
},
"words": [
"Date"
],
"boxes": [
[
0.6240901350975037,
0.3085164725780487,
0.6510729901492596,
0.31685456447303295
]
],
"range": [
[
566,
570
]
]
},
{
"page": 1,
"text": "Rent Abatements/Cor",
"id": 15,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 1,
"y": 2,
"cell": true
},
"words": [
"Rent",
"Abatements/Cor"
],
"boxes": [
[
0.4329037368297577,
0.3290809392929077,
0.4603371527045965,
0.3374354373663664
],
[
0.46285462379455566,
0.32896438241004944,
0.5594801902770996,
0.3374544633552432
]
],
"range": [
[
630,
634
],
[
635,
649
]
]
},
{
"page": 1,
"text": "4,850",
"id": 16,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 2,
"y": 2,
"cell": true
},
"words": [
"4,850"
],
"boxes": [
[
0.5759893655776978,
0.3291241228580475,
0.6087189093232155,
0.3381931884214282
]
],
"range": [
[
650,
655
]
]
},
{
"page": 1,
"text": "6/15/2021",
"id": 17,
"type": "TABLECELL",
"table": {
"id": 4,
"x": 3,
"y": 2,
"cell": true
},
"words": [
"6/15/2021"
],
"boxes": [
[
0.6162644028663635,
0.32898813486099243,
0.6728598773479462,
0.3374910345301032
]
],
"range": [
[
656,
665
]
]
}
],
"pageOffsets": [
0,
3355,
5983
],
"links": [
{
"page": 1,
"id1": 1,
"id2": 2,
"relationship": "key-pair"
},
{
"page": 1,
"id1": 3,
"id2": 4,
"relationship": "key-pair"
}
],
"attributes": {
"Is document damaged": "No"
},
"pageAttributes": [
{
"Is page damaged?": "No"
}
],
"tables": [
{
"x": [
0.3953396677970886,
0.4273864608258009,
0.567284107208252,
0.6124916560947895,
0.6735980845987797
],
"y": [
0.2947663366794586,
0.305875051766634,
0.32372980611398816,
0.3381931884214282
],
"rows": 3,
"cols": 4,
"box": [
0.3953396677970886,
0.2947663366794586,
0.6735980845987797,
0.3381931884214282
],
"id": 4,
"page": 1,
"mergedList": null,
"description": "Table 1"
}
],
"plainText": {
"1": "Lease Id: PR0001 - 000222 Lease Profile Master Occupant Id: 00000162-1 N Axia Women's Health B Regional Womens Health Managem A HP Main Line LLC I T 227 Laurel Road M L o Echelon One, Suite 300 E Bryn Mawr PA 19010 L Voorhees NJ 08043 Legal Name: Regional Womens Health Management Tenant Id: Contact Name: Jenni Witters Tenant Type Id: Phone No: SIC Group: Fax No: NAICS Code Lease Stop: No Suite Information Current Recurring Charges Building Id: PR0001 Execution: 3/15/2021 Effective Monthly Annual Amount Suite Id: 401 Beginning: 6/15/2021 Cat. Description Sqft. Date Amount Amount PSF Lease Id: 000222 Occupancy: 9/1/2021 ABA Rent Abatements/Cor 4,850 6/15/2021 -12,125.00 -145,500.00 -30.00 Leased Sqft: 4,850 Rent Start: 6/15/2021 ABA Rent Abatements/Cor 4,850 12/1/2021 0.00 0.00 0.00 Pro-Rata Share: 0.17 Expiration: 9/30/2028 ROF Base Rent Office 4,850 6/15/2021 12,125.00 145,500.00 30.00 Ann. Mkt. Rent PSF: 0.00 Vacate: TIC Tenant Improvement 4,850 11/1/2021 3,059.54 36,714.48 7.57 UTI Utility Reimbursement 4,850 6/15/2021 808.33 9,699.96 2.00 Occupancy Status: Current Rate Change Schedule Effective Monthly Annual Amount Cat. Description Sqft. 
Date Amount Amount PSF ABA Rent Abatements/Con 4,850 11/1/2021 -2,575.00 -30,900.00 -6.37 ROF Base Rent Office 4,850 7/1/2022 12,367.50 148,410.00 30.60 ROF Base Rent Office 4,850 7/1/2023 12,614.04 151,368.48 31.21 ROF Base Rent Office 4,850 7/1/2024 12,868.67 154,424.04 31.84 ROF Base Rent Office 4,850 7/1/2025 13,123.29 157,479.48 32.47 ROF Base Rent Office 4,850 7/1/2026 13,386.00 160,632.00 33.12 ROF Base Rent Office 4,850 7/1/2027 13,652.75 163,833.00 33.78 ROF Base Rent - Office 4,850 7/1/2028 13,927.58 167,130.96 34.46 Lease Notes Effective Date Ref 1 Ref 2 Note 3/15/2021 ALTERTN Article 8 of Lease Landlord's consent required for any alterations, other than cosmetic Alterations which do not cost more than $1,000 per alteration and which do not affect (i) the structural portions or roof of the Premises or the 3/15/2021 ASGNSUB Article 9 Landlord consent required for any assignment/sublease. Landlord has 30 days after receipt of notice from Tenant to either approve assignment/sublease, not approve assignment/sublease, recapture the Premises 3/15/2021 DEFAULT Article 18 of Lease 1. If Tenant does not make payment within 5 days after date due, provided that, Landlord shall not more than 1 time per 12 full calendar month period of the term, deliver written notice to Tenant with respect to 3/15/2021 ESTOPEL Article 17 of Lease Estoppel required to be provided within 10 days after request. In the form set forth in Exhibit D 3/15/2021 HOLDOVR Section 19 (b) of Lease Landlord may either (i) increase Rent to 200% of the highest monthly aggregate Fixed Rent and additional 3/15/2021 INS Article 11 - Landlord responsible for repairs to all plumbing and other fixtures, equipment and systems (including replacement, if necessary) in or serving the Premises. Landlord to provide janitorial services (Exhibit E) and pest control as needed. 
3/15/2021 LATECHG Article 3 of Lease Tenant shall pay Landlord a service and handling charge equal to five percent (5%) of any Rent not paid within five (5) days after the date first due, which shall apply cumulatively each month with respect to Report Id WEBX_PROFILE Database HAVERFORD Reported by Joe Staugaard 1/7/2022 11:50 Page 1"
},
"dimensions": [
{
"width": 1275,
"height": 1650
},
{
"width": 1275,
"height": 1650
}
],
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "61685a5eb492d0845eb5e6b4"
},
"jobStart": 1691128396,
"sessionTime": 14,
"elapsedTime": 86,
"updateTime": 1691130059,
"selectBoundingBox": true,
"lastUpdate": 1691130060583
}
}
]
}
- table:OCR-Project-Manifest:
Field Names |
Type |
Description |
---|---|---|
project_id |
str |
The Id of the project |
project_name |
str |
The name of the project |
project_type |
str |
The type of the project |
datasetId |
str |
The Id of the dataset |
itemId |
str |
The Id of the dataset item |
file_name |
str |
The name of the file |
file_type |
str |
The type of the file |
source |
str |
The presigned URL or S3 path of the data source |
state |
int |
The state of the task |
task_id |
str |
The Id of the task |
state_description |
str |
The state description of the task |
annotations |
list |
List of dictionaries representing the annotations |
email |
str |
The email associated with the user |
messages |
list |
The messages associated with the user |
role |
str |
The role associated with user |
elapsedTime |
int |
The elapsed time of the annotation |
date |
str |
The date of the annotation |
content |
dict |
Dictionary containing all the details of the task, such as metadata, tags, and page offsets |
pdf_fingerprint |
str |
The fingerprint of the document |
metadata |
dict |
Dictionary containing metadata of the task and the project |
ocr_model |
str |
The OCR model used for processing |
use-textract-only |
bool |
Indicates whether only Textract is used for processing |
source_ref |
str |
The reference to the source of the document |
document_id |
str |
The Id of the document |
tags |
list |
List of dictionaries containing the tags added in the task |
page |
int |
The page number for selected text |
text |
str |
The selected text for annotation |
id |
int |
The Id of selected text for annotation |
type |
str |
The type of the label |
kv_type |
str |
Flag to indicate whether the tag is a key or a value (key/value) |
words |
list |
The words in the selected text |
boxes |
list |
List of bounding box coordinates for OCRed words |
range |
list |
List of [start, end] offsets of the selected text within the plain text |
textAdjust |
str |
Modified OCRed text |
rawBox |
bool |
Flag to indicate if bounding box is created manually |
rotate |
int |
The angle of bounding box rotation (degrees) |
table |
dict |
The table information |
id |
int |
The id of the table |
x |
int |
The vertical grid coordinates |
y |
int |
The horizontal grid coordinates |
cell |
bool |
Flag to indicate if the current object is a cell of the table |
pageOffsets |
list |
List of page offsets |
links |
list |
List containing the relationships added in the task |
page |
int |
The page number associated with key and value field |
id1 |
int |
The Id of the key field |
id2 |
int |
The Id of the value field |
relationship |
str |
The name of the relationship |
attributes |
dict |
The document attributes associated with the task |
pageAttributes |
list |
List of dictionaries containing the attributes for each page |
tables |
list |
List of dictionaries containing table information |
x |
int |
The vertical grid coordinates |
y |
int |
The horizontal grid coordinates |
rows |
int |
The number of rows in the table |
cols |
int |
The number of columns in the table |
box |
list |
List of bounding box coordinates of the table |
id |
int |
The Id of the table |
page |
int |
The page number of the table |
description |
str |
The title of the table |
plainText |
dict |
Dictionary containing page numbers and the corresponding plain text extracted from the file |
dimensions |
list |
List of dictionaries containing dimensions of pages in the task |
width |
float |
The width of the page |
height |
float |
The height of the page |
review |
dict |
The review details |
rate |
str |
The rate of the review |
note |
str |
The note associated with the reviewer |
reviewerId |
str |
The Id of the reviewer |
jobStart |
int |
The start time of the annotation |
sessionTime |
int |
The session time of the annotation |
elapsedTime |
int |
The elapsed time of the annotation |
updateTime |
int |
The update time of the annotation |
lastUpdate |
int |
The last update time |
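The links list joins a key tag (id1) to its value tag (id2). A hedged sketch of resolving those pairs back to text via the tag ids; the tag dicts are trimmed to the fields needed, with values copied from the OCR example above:

```python
# Minimal tag/link shapes copied from the OCR example above
tags = [
    {"id": 1, "text": "N A M E", "kv_type": "key"},
    {"id": 2, "text": "Axia Women's Health", "kv_type": "value"},
    {"id": 3, "text": "BILL TO", "kv_type": "key"},
    {"id": 4, "text": "Regional Womens Health", "kv_type": "value"},
]
links = [
    {"page": 1, "id1": 1, "id2": 2, "relationship": "key-pair"},
    {"page": 1, "id1": 3, "id2": 4, "relationship": "key-pair"},
]

# Index the tags by id, then resolve each key-pair link to (key text, value text)
by_id = {t["id"]: t for t in tags}
pairs = {by_id[l["id1"]]["text"]: by_id[l["id2"]]["text"]
         for l in links if l["relationship"] == "key-pair"}
print(pairs)
# {'N A M E': "Axia Women's Health", 'BILL TO': 'Regional Womens Health'}
```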
2. NER Project¶
{
"project_id": "866ad732042bde9b94929cc3",
"project_name": "NER-Project-DB",
"project_type": "NER",
"datasetId": "8d9736f30411ae81fa4983d4",
"itemId": "0ed98ab31666242a417504f9",
"file_name": "1810.04805.pdf",
"file_type": "application/pdf",
"source": "https://sandboxdocuments.tensoract.com/presigned/33e268b66cb90138b84cc627a501afa2.pdf?sig=64e0f921a163164ebdac2b74a35f80c4dee52434a405f990a9163ea306ebb99cb1ee12cb6fba3d313a531f39c0f9195083dbb2582d9d397a00553ea403d7cc4e:102e036d864eee6141450c9ad545cf66:64b8d6cf:1e215800c51838b6308d8fb24fc60adc",
"state": 4,
"task_id": "d6aae2114d0947b1bfe5dcd3",
"state_description": "Approved",
"annotations": [
{
"email": "q1@qc.com",
"messages": [],
"role": "Reviewer",
"elapsedTime": 18,
"date": "2023-07-17T09:11:08.530Z",
"content": {
"pdf_fingerprint": "dccb9bc542f22b2bdd94110918c68f96",
"metadata": {
"File": "1810.04805.pdf",
"TaskId": "d6aae2114d0947b1bfe5dcd3",
"Type of Project": "NER"
},
"tags": [
{
"page": 1,
"range": [
0,
80
],
"text": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
"id": 1,
"type": "DATE",
"box": [
0.1957394553114858,
0.08355623157419612,
0.8080743211552288,
0.11953028994286674
]
},
{
"page": 1,
"range": [
81,
93
],
"text": "Jacob Devlin",
"id": 2,
"type": "PERSON",
"box": [
0.20464120844784606,
0.15506947083348188,
0.31506005550366556,
0.16926990081839677
]
},
{
"page": 1,
"range": [
94,
108
],
"text": "Ming-Wei Chang",
"id": 3,
"type": "PERSON",
"box": [
0.34016437686048157,
0.15506947083348188,
0.48781795335273054,
0.16926990081839677
]
},
{
"page": 1,
"range": [
423,
428
],
"text": "2018a",
"id": 4,
"type": "DATE",
"box": [
0.3736872717865327,
0.3484841506610129,
0.4145903056733348,
0.36031776312819985
]
},
{
"page": 2,
"range": [
743,
750
],
"text": "(2018a)",
"id": 5,
"type": "DATE",
"box": [
0.3769863661562031,
0.3271071821734426,
0.4339806024432365,
0.3400650507786048
]
}
],
"pageOffsets": [
0,
3988,
8509,
12206,
17069,
20918,
25368,
29080,
33539,
37641,
42160,
46926,
50816,
54525,
58589,
60965,
64088
],
"links": [
{
"page": 1,
"id1": 2,
"id2": 3,
"relationship": "Precede"
},
{
"page": 1,
"id1": 4,
"id2": 5,
"relationship": "Precede"
}
],
"attributes": {
"tags": [],
"links": [],
"Doc Ok?": "Yes"
},
"pageAttributes": [
{
"Page OK?": null
},
{
"Page OK?": "Yes"
}
],
"boxes": [
{
"page": 1,
"box": [
0.6285714285714286,
0.1505226480836237,
0.8216748768472907,
0.178397212543554
],
"label": "Bounding_box"
},
{
"page": 2,
"box": [
0.10246305418719212,
0.3797909407665505,
0.49064039408866994,
0.4961672473867596
],
"label": "Bounding_box",
"rotate": 22
}
],
"plainText": {
"1": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com Abstract We introduce a new language representa- tion model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language repre- sentation models (Peters et al., 2018a; Rad- ford et al., 2018), BERT is designed to pre- train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a re- sult, the pre-trained BERT model can be fine- tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task- specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art re- sults on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answer- ing Test F1 to 93.2 (1.5 point absolute im- provement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). 1 Introduction Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). 
These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the re- lationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, wheremodels are required to produce fine-grained output at the token level (Tjong Kim Sang and DeMeulder, 2003; Rajpurkar et al., 2016). There are two existing strategies for apply- ing pre-trained language representations to down- stream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as addi- tional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre- trained parameters. The two approaches share the same objective function during pre-training,where they use unidirectional language models to learn general language representations. We argue that current techniques restrict the power of the pre-trained representations, espe- cially for the fine-tuning approaches. The ma- jor limitation is that standard language models are unidirectional, and this limits the choice of archi- tectures that can be used during pre-training. For example, inOpenAIGPT, the authors use a left-to- right architecture, where every token can only at- tend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such re- strictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine- tuning based approaches to token-level tasks such as question answering, where it is crucial to incor- porate context from both directions. 
In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidi- rectionality constraint by using a “masked lan- guage model” (MLM) pre-training objective, in- spired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked a r X i v : 1 8 1 0 . 0 4 8 0 5 v 2 [ c s . C L ] 2 4 M a y 2 0 1 9",
"2": "word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- train a deep bidirectional Transformer. In addi- tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: • We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered task- specific architectures. BERT is the first fine- tuning based representationmodel that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-specific architectures. • BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. 2 RelatedWork There is a long history of pre-training general lan- guage representations, and we briefly review the most widely-used approaches in this section. 2.1 Unsupervised Feature-based Approaches Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering significant improvements over embeddings learned from scratch (Turian et al., 2010). 
To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le andMikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for severalmajor NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. 
2.2 Unsupervised Fine-tuning Approaches As with the feature-based approaches, the first works in this direction only pre-trained word em- bedding parameters from unlabeled text (Col- lobert andWeston, 2008). More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language model-"
},
"dimensions": [
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
},
{
"width": 595.276,
"height": 841.89
}
],
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "61685a5eb492d0845eb5e6b4"
},
"jobStart": 1689583831,
"sessionTime": 18,
"elapsedTime": 31,
"updateTime": 1689585066,
"lastUpdate": 1689585068525
}
}
]
}
- table:NER-Project-Manifest:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing all the details of the task, such as metadata, tags, and page offsets
pdf_fingerprint | str | The fingerprint of the document
metadata | dict | Dictionary containing metadata of the task and the project
File | str | The name of the file
TaskId | str | The Id of the task
Type of Project | str | The metadata added in the advanced settings of the project
tags | list | List of dictionaries containing the tags added in the task
page | int | The page number of the selected text
range | list | Start and end offsets of the selected text within the plain text
text | str | The selected text for annotation
id | int | The id of the selected text annotation
type | str | The type of the label
box | list | The annotation bounding box
pageoffsets | list | List of page offsets
link | list | List containing the relationships added in the task
id1 | number | The Id of the first annotation field
id2 | number | The Id of the second annotation field
relationship | str | The name of the relationship
attributes | dict | The document attributes associated with the task
pageattributes | list | List of dictionaries containing the attributes for each page
boxes | list | List of dictionaries containing details of the bounding box
page | int | The page number on which the bounding box is created
box | list | The annotation bounding box
labels | str | The type of label
rotate | int | The angle of bounding box rotation (degrees)
plaintext | dict | Dictionary mapping page numbers to the plain text extracted from the file
dimensions | list | List of dictionaries containing the dimensions of the pages in the task
width | float | The width of the page
height | float | The height of the page
review | dict | The review details
rate | str | The rating given in the review
note | str | The note left by the reviewer
reviewerId | str | The Id of the reviewer
jobStart | int | The start time of the annotation
sessionTime | int | The session time of the annotation
elapsedTime | int | The elapsed time of the annotation
updateTime | int | The update time of the annotation
lastUpdate | int | The last update time
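Once exported, an NER manifest record can be post-processed with a few lines of Python. The sketch below is illustrative only (the `extract_entities` helper and the trimmed sample record are not part of any SDK); it uses the field names from the example above to collect each tagged span as a (text, label, page) triple:

```python
import json

def extract_entities(record: dict) -> list[tuple[str, str, int]]:
    # Walk every annotation pass and collect its tagged spans.
    # Field names follow the NER manifest example above; adjust
    # them if your export differs.
    triples = []
    for annotation in record.get("annotations", []):
        for tag in annotation.get("content", {}).get("tags", []):
            triples.append((tag["text"], tag["type"], tag["page"]))
    return triples

# A trimmed, hypothetical record for illustration.
record = json.loads("""
{
  "task_id": "8c939483b594e0de5d5efb54",
  "annotations": [
    {"content": {"tags": [
      {"page": 1, "range": [18, 31], "text": "Michael Smith", "type": "Person"}
    ]}}
  ]
}
""")
print(extract_entities(record))  # [('Michael Smith', 'Person', 1)]
```

The same loop extends naturally to `link` relationships or `boxes` if your project uses them.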
3. Classification Project¶
{
"project_id": "591351b938d008ca0745510a",
"project_name": "Classification-Project-1",
"project_type": "Classification",
"datasetId": "60974d4e9e7759842cdff3be",
"itemId": "2d2020aa8e3deb383fb7c74f",
"file_name": "business_2.txt",
"file_type": "text/plain",
"source": "https://sandboxdocuments.tensoract.com/presigned/3c9a446180a44faa43ab2464d45633c7.txt?sig=bc0af1e30e3bec94f7780fe1638f77b7d7080b5508545d891bc2a229d3e2c20e8179c3b4690e2ac7a19af4a978c363af312b570f2e89efa03ae41171e066c65d:670927f13e9c689722e1e85fe5649347:64b78055:aef5974f70f486095f3cd5f4cc922486",
"state": 4,
"task_id": "8c939483b594e0de5d5efb54",
"state_description": "Approved",
"annotations": [
{
"email": "johndoe@me.com",
"messages": [],
"role": "Reviewer",
"elapsedTime": 8,
"date": "2023-07-18T06:03:46.657Z",
"content": {
"metadata": {
"File": "business_2.txt",
"TaskId": "8c939483b594e0de5d5efb54",
"Type of Project": "Classification"
},
"classificationTypes": {
"Select Type of Document": "select",
"Type of Documents": "multi",
"Put a note": "text"
},
"classifications": {
"Select Type of Document": [
"Technology"
],
"Type of Documents": [
"Graphics",
"Bussiness"
],
"Put a note": [
"Multi-type document"
]
},
"plainText": {
"1": "Japanese growth grinds to a halt Growth in Japan evaporated in the three months to September, sparking renewed concern about an economy not long out of a decade-long trough. Output in the period grew just 0.1%, an annual rate of 0.3%. Exports - the usual engine of recovery - faltered, while domestic demand stayed subdued and corporate investment also fell short. The growth falls well short of expectations, but does mark a sixth straight quarter of expansion. The economy had stagnated throughout the 1990s, experiencing only brief spurts of expansion amid long periods in the doldrums. One result was deflation - prices falling rather than rising - which made Japanese shoppers cautious and kept them from spending. The effect was to leave the economy more dependent than ever on exports for its recent recovery. But high oil prices have knocked 0.2% off the growth rate, while the falling dollar means products shipped to the US are becoming relatively more expensive. The performance for the third quarter marks a sharp downturn from earlier in the year. The first quarter showed annual growth of 6.3%, with the second showing 1.1%, and economists had been predicting as much as 2% this time around. \"Exports slowed while capital spending became weaker,\" said Hiromichi Shirakawa, chief economist at UBS Securities in Tokyo. \"Personal consumption looks good, but it was mainly due to temporary factors such as the Olympics. \"The amber light is flashing.\" The government may now find it more difficult to raise taxes, a policy it will have to implement when the economy picks up to help deal with Japan's massive public debt. "
},
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "614b55be8af65dcf41da535b"
},
"jobStart": 1689660217,
"sessionTime": 8,
"elapsedTime": 8,
"updateTime": 1689660225,
"pageOffsets": [
0
],
"lastUpdate": 1689660226654
}
}
]
}
- table:Classification Project Manifest Summary:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing all the details of the task, such as metadata, classifications, and page offsets
metadata | dict | Dictionary containing metadata of the task and the project
File | str | The name of the file
TaskId | str | The Id of the task
Type of Project | str | The metadata added in the advanced settings of the project
classificationTypes | dict | Dictionary containing the labels defined in the project
Select Type of Document | str | Single-select label
Type of Documents | str | Multi-select label
Put a note | str | Plain-text label
classifications | dict | Dictionary containing the classification labels in the task
plainText | dict | Dictionary mapping page numbers to the plain text extracted from the file
review | dict | The review details
rate | str | The rating given in the review
note | str | The note left by the reviewer
reviewerId | str | The Id of the reviewer
jobStart | int | The start time of the annotation
sessionTime | int | The session time of the annotation
elapsedTime | int | The elapsed time of the annotation
updateTime | int | The update time of the annotation
lastUpdate | int | The last update time
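The nested `classifications` dicts are straightforward to flatten for downstream use. A minimal sketch, assuming records shaped like the example above (the `collect_classifications` helper is illustrative, not part of any SDK):

```python
def collect_classifications(record: dict) -> dict[str, list[str]]:
    # Merge the `classifications` dict from every annotation pass
    # into one label -> values mapping.
    merged: dict[str, list[str]] = {}
    for annotation in record.get("annotations", []):
        content = annotation.get("content", {})
        for label, values in content.get("classifications", {}).items():
            merged.setdefault(label, []).extend(values)
    return merged

# Trimmed sample record, following the export example above.
record = {
    "file_name": "business_2.txt",
    "annotations": [{"content": {"classifications": {
        "Select Type of Document": ["Technology"],
        "Type of Documents": ["Graphics", "Bussiness"],
    }}}],
}
print(collect_classifications(record)["Select Type of Document"])  # ['Technology']
```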
4. Bulk Image Classification Project¶
{
"project_id": "e3c9b4a1dd6df1c4c7091895",
"project_name": "Bulk Image Classification DB",
"project_type": "Bulk Image Classification",
"datasetId": "25987f46e5febb50484e8497",
"itemId": "1431d151a693edeb3baade14",
"file_name": "green.tiff",
"file_type": "image/tiff",
"source": "https://sandboxdocuments.tensoract.com/presigned/6cc95fbb44dccfacbacc923fbd24091e.tiff?sig=7748be40ae97b0c4559e0c9de0016e925d5e49a89bc1415ae03370f986b067ac42fc06ba2b89848f1245f5aacbf37c44a9d69ea7ff4d952e6d1ebb2cc7bff2e8:188babf42f5690cdb9fee99f58ee209f:64b8db5e:73a9d14256d5247bb72703427484eab4",
"state": 4,
"task_id": "b5ac251d8209079a300868f1",
"state_description": "Approved",
"annotations": [
{
"email": "q1@qc.com",
"messages": [],
"role": "Reviewer",
"elapsedTime": 4.333333333333333,
"date": "2023-07-18T09:08:27.935Z",
"content": {
"redOffset": 1,
"greenOffset": 1,
"brightness": 1,
"selected": false,
"classification": "Green",
"review": {
"rate": "Ok"
},
"elapsedTime": 4.333333333333333,
"updateTime": 1689671308,
"lastUpdate": 1689671307935,
"metadata": {
"color": "green"
}
}
}
],
"dataset_id": "25987f46e5febb50484e8497",
"item_id": "1431d151a693edeb3baade14",
"item_metadata": {
"color": "green"
},
"project_metada": {
"Type of Project": "Bulk"
},
"classification": "Green"
}
- table:Bulk Image Classification Project Manifest Summary:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing the annotation details and metadata
redOffset | int | The offset or adjustment applied to the red color channel
greenOffset | int | The offset or adjustment applied to the green color channel
brightness | int | The overall brightness adjustment applied to the image
classification | str | The classified label
review | dict | The review details
rate | str | The rating given in the review
note | str | The note left by the reviewer
elapsedTime | int | The elapsed time of the annotation
updateTime | int | The update time of the annotation
lastUpdate | int | The last update time
metadata | dict | The metadata of the task
dataset_id | str | The Id of the dataset
item_id | str | The Id of the dataset item
item_metadata | dict | The metadata of the dataset item
project_metada | dict | The metadata of the project
classification | str | The classified label
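For bulk projects it is often useful to summarize the reviewed labels across all exported records. A small sketch, assuming the export is a list of records shaped like the example above (`label_counts` is an illustrative helper, not part of any SDK):

```python
from collections import Counter

def label_counts(records: list[dict]) -> Counter:
    # Tally the top-level `classification` label across exported records;
    # items without one are counted as "Unlabeled".
    return Counter(r.get("classification", "Unlabeled") for r in records)

# Hypothetical trimmed records for illustration.
records = [
    {"file_name": "green.tiff", "classification": "Green"},
    {"file_name": "teal.tiff", "classification": "Green"},
    {"file_name": "red.tiff", "classification": "Red"},
]
print(label_counts(records))  # Counter({'Green': 2, 'Red': 1})
```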
5. Object Detection Project¶
{
"project_id": "0461637f62c18082f3c14cc3",
"project_name": "Object-Detection-Project-2",
"project_type": "Pose Estimation",
"datasetId": "abd204685e5c074b282d6744",
"itemId": "ab22598223c9ad06a7cb7fbc",
"file_name": "im05.jpg",
"file_type": "image/jpeg",
"source": "https://sandboxdocuments.tensoract.com/presigned/eec676782f87ae20ff1ce9d282043b55.jpg?sig=4bb851b3ea1bee96136f64ec7a0a23aa514068356434dfddb3b9dbbdefb3568d8b6d246a5de7112e3afe8c261ab8647bff8db558f52b5502df86a0984f4d666b:d9838174235badf2412366a794eafbe4:64b7dd9e:54f78a7b1d2c36e89b2f8a43898db7fd",
"state": 4,
"task_id": "528cb44ce376c753590c3b07",
"state_description": "Approved",
"annotations": [
{
"email": "jdoeqa@acme.org",
"messages": [],
"role": "Reviewer",
"elapsedTime": 29,
"date": "2023-06-17T10:08:34.324Z",
"content": {
"url": "https://sandboxdocuments.tensoract.com/presigned/eec676782f87ae20ff1ce9d282043b55.jpeg?sig=a4ecbc5cdcc4745f78f23897c8179a4f0fcd2398a1bf7b0aaf8cd40df2c6a44781ad2222a108af1865c87c7f0027f6d930646c83714f22f94f1f1d4b479e59b6:17fd9429b9bd1ee881c403895971765e:648ed781:84eaba2d2ce4610c8ec1989f2a3ef0fa",
"imageWidth": 720,
"imageHeight": 1280,
"selected": null,
"boxes": [
{
"x1": 97.0271,
"y1": 242.4398,
"x2": 715.4532,
"y2": 1280,
"id": "b0",
"type": "box",
"oid": "b24",
"outside_image": {},
"occluded": {},
"invisible": false,
"attrs": {},
"title": "",
"label": "Group 1",
"sub_labels": [
{
"x1": 329.1937,
"y1": 289.695,
"id": "b1",
"type": "keypoint",
"oid": "b27",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Top of head",
"sub_labels": []
},
{
"x1": 360.0123,
"y1": 357.496,
"id": "b2",
"type": "keypoint",
"oid": "b28",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Nose",
"sub_labels": []
},
{
"x1": 345.6303,
"y1": 421.1878,
"id": "b3",
"type": "keypoint",
"oid": "b29",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Chin",
"sub_labels": []
},
{
"x1": 300.4297,
"y1": 421.1878,
"id": "b4",
"type": "keypoint",
"oid": "b30",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Neck",
"sub_labels": []
},
{
"x1": 454.5226,
"y1": 454.061,
"id": "b5",
"type": "keypoint",
"oid": "b31",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Left Shoulder",
"sub_labels": []
},
{
"x1": 191.5374,
"y1": 521.862,
"id": "b6",
"type": "keypoint",
"oid": "b32",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Right Shoulder",
"sub_labels": []
},
{
"x1": 575.7423,
"y1": 509.5345,
"id": "b7",
"type": "keypoint",
"oid": "b33",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Left Elbow",
"sub_labels": []
},
{
"x1": 197.7011,
"y1": 597.8812,
"id": "b8",
"type": "keypoint",
"oid": "b34",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Right Elbow",
"sub_labels": []
},
{
"x1": 296.3206,
"y1": 667.7368,
"id": "b9",
"type": "keypoint",
"oid": "b36",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Right Wrist",
"sub_labels": []
},
{
"x1": 339.4666,
"y1": 673.9005,
"id": "b10",
"type": "keypoint",
"oid": "b37",
"outside_image": {},
"occluded": {},
"invisible": false,
"title": "",
"label": "Right Hand",
"sub_labels": []
}
]
}
],
"image_attrs": {
"Is Image clear?": "Yes"
},
"review": {
"rate": "Ok",
"note": ""
},
"jobStart": 1686996484,
"sessionTime": 29,
"elapsedTime": 29,
"tsSeconds": true,
"updateTime": 1686996513,
"lastUpdate": 1686996514320,
"metadata": {}
}
}
]
}
- table:Object Detection Project Manifest Summary:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing the annotation details and metadata
url | str | The presigned URL or S3 path of the task
imageWidth | int | The width of the image in pixels
imageHeight | int | The height of the image in pixels
boxes | list | List of bounding boxes drawn around objects in the image
x1 | float | The x-coordinate of the top-left corner of the bounding box
y1 | float | The y-coordinate of the top-left corner of the bounding box
x2 | float | The x-coordinate of the bottom-right corner of the bounding box
y2 | float | The y-coordinate of the bottom-right corner of the bounding box
id | str | The Id of the bounding box
type | str | Flag indicating whether the annotation is a box or a keypoint
outside_image | dict | Indicates whether the object extends beyond the boundaries of the image
occluded | dict | Indicates whether the object is occluded or partially hidden
attrs | dict | Any additional attributes or properties associated with the object
label | str | The label or category assigned to the object
sub_labels | list | Any sub-labels or sub-categories associated with the object
x1 | float | The x-coordinate of the keypoint
y1 | float | The y-coordinate of the keypoint
id | str | The Id of the keypoint
type | str | Flag indicating whether the annotation is a box or a keypoint
outside_image | dict | Indicates whether the keypoint lies beyond the boundaries of the image
occluded | dict | Indicates whether the keypoint is occluded or partially hidden
label | str | The label or category assigned to the keypoint
sub_labels | list | Any sub-labels associated with the keypoint
image_attrs | dict | The image attributes associated with the task
review | dict | The review details
rate | str | The rating given in the review
note | str | The note left by the reviewer
jobStart | int | The start time of the annotation
sessionTime | int | The session time of the annotation
elapsedTime | int | The elapsed time of the annotation
updateTime | int | The update time of the annotation
metadata | dict | The metadata of the task and project
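Because keypoints are nested under their parent box as `sub_labels`, one quick consistency check is to confirm that each keypoint's pixel coordinates fall inside the box. A minimal sketch using the coordinate fields from the example above (`keypoints_in_box` is an illustrative helper, not part of any SDK):

```python
def keypoints_in_box(box: dict) -> list[tuple[str, bool]]:
    # For each keypoint in `sub_labels`, report whether its (x1, y1)
    # position lies inside the parent box. Coordinates are pixels,
    # as in the export example above.
    results = []
    for kp in box.get("sub_labels", []):
        inside = (box["x1"] <= kp["x1"] <= box["x2"]
                  and box["y1"] <= kp["y1"] <= box["y2"])
        results.append((kp["label"], inside))
    return results

# Trimmed, hypothetical box with one valid and one stray keypoint.
box = {
    "x1": 97.0, "y1": 242.4, "x2": 715.5, "y2": 1280.0,
    "sub_labels": [
        {"label": "Nose", "x1": 360.0, "y1": 357.5},
        {"label": "Stray", "x1": 900.0, "y1": 100.0},
    ],
}
print(keypoints_in_box(box))  # [('Nose', True), ('Stray', False)]
```

Keypoints flagged `outside_image` are expected to fail this check, so filter those first if your export uses that field.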
6. Media Transcription Project¶
Video Files
{
"project_id": "c859cc0a92b7bd4d6d166707",
"project_name": "Video-Project-3",
"project_type": "Media Transcription",
"datasetId": "1ac8fb72573008ce5626bbfb",
"itemId": "0b9ff34daa6a2f6f95c59bb3",
"file_name": "Video 1.mp4",
"file_type": "video/mp4",
"source": "https://sandboxdocuments.tensoract.com/presigned/da00a38a91f62483658e2126e789f63e.mp4?sig=41de6d036aafbad36d075616c0d0b8d56a460c26f769a5128a57f411d3a47c0562f33c3ae0ec77e1adb8f6c51c46a6b11863d9b0bcfb9a8ec48b9c02a6f1d220:5f547425b01d3a41ecc069ae0dc15acc:64b7ddb9:f0dbc85b6ebaf920717d096daa954cc2",
"state": 4,
"task_id": "c53ca1cace7d7e784705b631",
"state_description": "Approved",
"annotations": [
{
"email": "jdoeqa@acme.org",
"messages": [],
"role": "Reviewer",
"elapsedTime": 62,
"date": "2023-06-17T12:16:38.937Z",
"content": {
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "63ca81cd31d698c1825328f3"
},
"videoSource": "https://sandboxdocuments.tensoract.com/presigned/da00a38a91f62483658e2126e789f63e.mp4?sig=9f13caf4ac583fbc9c074b91770e579120196b3f08e192ad641290c246a7add58f6d6f40ce1aa63e031ada96ff8c996038e4ed46516ffce1f56262cc6a435eeb:062d59d190dc03e090dcd5ee5ff17faa:648ef563:ac1a510bd2fb6c0ab328a3202eb9c846",
"streams": {
"Transcription": [
{
"start": 0.025000260441083017,
"end": 0.9500098967611547,
"confidence": 1,
"text": "Bonjoi"
},
{
"start": 0.9500098967611547,
"end": 2.0812716817201613,
"confidence": 1,
"text": "Tava Tuti"
},
{
"start": 2.1187720723817858,
"end": 3.156282880686731,
"confidence": 1,
"text": "Hello"
},
{
"start": 3.156282880686731,
"end": 4.13629310905087,
"confidence": 1,
"text": "Ola"
},
{
"start": 4.165043351337061,
"end": 4.608797974166285,
"confidence": 1,
"text": "Tutu beng"
},
{
"start": 10.453858750849383,
"end": 12.07887567951978,
"confidence": 1,
"text": "oye tutubeng"
}
],
"Language Segmentation": [
{
"start": 0.018750195330812264,
"end": 0.5687559250346386,
"confidence": 1,
"tag": "French"
},
{
"start": 0.5937561854757216,
"end": 2.0937718119407025,
"confidence": 1,
"tag": "German"
},
{
"start": 2.0937718119407025,
"end": 3.1650329694568993,
"confidence": 1,
"tag": "English"
},
{
"start": 3.2437837922305217,
"end": 4.156293298330052,
"confidence": 1,
"tag": "French"
},
{
"start": 4.250044274984113,
"end": 6.318815826483732,
"confidence": 1,
"tag": "Russian"
},
{
"start": 6.8338213441595235,
"end": 8.190085473088276,
"confidence": 1,
"tag": "French"
},
{
"start": 8.321336840403962,
"end": 9.865102922640839,
"confidence": 1,
"tag": "English"
},
{
"start": 9.915103443523005,
"end": 11.071365488923096,
"confidence": 1,
"tag": "German"
},
{
"start": 11.233867181790135,
"end": 12.6088815060497,
"confidence": 1,
"tag": "Arabic"
}
]
},
"mediaAttributes": {
"Is Video Clear?": "Yes",
"Aditional Notes": ""
},
"jobStart": 1687003443,
"sessionTime": 62,
"elapsedTime": 93.075,
"tsSeconds": true,
"updateTime": 1687004194,
"metadata": {
"File": "Video 1.mp4",
"TaskId": "c53ca1cace7d7e784705b631",
"Type": "Media Transcription"
},
"lastUpdate": 1687004198934
}
}
]
}
- table:Media Transcription Project Manifest Summary:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing the annotation details and metadata
review | dict | The review details
rate | str | The rating given in the review
reviewerId | str | The Id of the reviewer
videoSource | str | The presigned URL or S3 path of the task
streams | dict | Dictionary of the streams within the video, each containing specific information
Transcription | list | The stream containing text transcribed from the video
start | float | The starting timestamp (in seconds) of the transcribed text segment
end | float | The ending timestamp (in seconds) of the transcribed text segment
confidence | int | Indicates the confidence level
text | str | The transcribed text for the corresponding segment
Language Segmentation | list | The stream containing the segmentations in the video
start | float | The starting timestamp of the segment
end | float | The ending timestamp of the segment
confidence | int | Indicates the confidence level
tag | str | The tag for the corresponding segment
mediaAttributes | dict | The media attributes associated with the task
jobStart | int | The start time of the annotation
sessionTime | int | The session time of the annotation
elapsedTime | int | The elapsed time of the annotation
tsSeconds | bool | Flag indicating that timestamps are expressed in seconds
updateTime | int | The update time of the annotation
lastUpdate | int | The last update time
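The `Transcription` stream maps naturally onto subtitle formats. A minimal sketch, assuming segments shaped like the example above with `start`/`end` in seconds (the `to_srt` helper is illustrative, not part of any SDK):

```python
def to_srt(segments: list[dict]) -> str:
    # Render a `Transcription` stream (start/end in seconds) as SubRip text.

    def stamp(t: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm.
        ms = round(t * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text']}\n")
    return "\n".join(blocks)

# First two segments from the video example above, rounded.
segments = [
    {"start": 0.025, "end": 0.95, "text": "Bonjoi"},
    {"start": 0.95, "end": 2.081, "text": "Tava Tuti"},
]
print(to_srt(segments))
```

The `Language Segmentation` stream can be rendered the same way by substituting `tag` for `text`.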
Audio Files
{
"project_id": "c859cc0a92b7bd4d6d166707",
"project_name": "Video-Project-3",
"project_type": "Media Transcription",
"datasetId": "edd9c7d7ae5c1b0c2bc73643",
"itemId": "1c944e058058dbe48c980ead",
"file_name": "mira640.mp3",
"file_type": "audio/mpeg",
"source": "s3://test-pocs/mira640.mp3",
"state": 4,
"task_id": "eba777874d6d268ece56b33a",
"state_description": "Approved",
"annotations": [
{
"email": "johndoe@me.com",
"messages": [],
"role": "Reviewer",
"elapsedTime": 6,
"date": "2023-07-19T04:01:20.036Z",
"content": {
"review": {
"rate": "Ok",
"note": "",
"reviewerId": "614b55be8af65dcf41da535b"
},
"audioSource": "https://test-pocs.s3.amazonaws.com/mira640.mp3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUD4REC47DTY4PF7A%2F20230719%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230719T040111Z&X-Amz-Expires=7200&X-Amz-Signature=10048222e39f53cd9dbbb2b97b8aa210e323d8fc544a9ea18ff46e5922974f44&X-Amz-SignedHeaders=host",
"streams": {
"Transcription": [
{
"start": 0.018749917353218702,
"end": 0.6312472175583629,
"confidence": 1,
"text": "Bonjoi"
},
{
"start": 0.7187468318733835,
"end": 1.881241707772943,
"confidence": 1,
"text": "Tava Tuti"
},
{
"start": 2.0124911292454737,
"end": 2.4874890355270143,
"confidence": 1,
"text": "Hello"
}
],
"Language Segmentation": [
{
"start": 0.006249972451072901,
"end": 0.4437480440261759,
"confidence": 1,
"tag": "French"
},
{
"start": 0.5437476032433424,
"end": 1.8687417628707974,
"confidence": 1,
"tag": "German"
},
{
"start": 1.9187415424793806,
"end": 2.6687382366081285,
"confidence": 1,
"tag": "English"
}
]
},
"mediaAttributes": {
"Is Video Clear?": "Yes",
"Aditional Notes": ""
},
"jobStart": 1689739272,
"sessionTime": 6,
"elapsedTime": 6,
"tsSeconds": true,
"updateTime": 1689739278,
"lastUpdate": 1689739280033
}
}
]
}
- table:Media Transcription Project Manifest Summary:
Field Names | Type | Description
---|---|---
project_id | str | The Id of the project
project_name | str | The name of the project
project_type | str | The type of the project
datasetId | str | The Id of the dataset
itemId | str | The Id of the dataset item
file_name | str | The name of the file
file_type | str | The type of the file
source | str | Internal source file reference on local storage disk
state | int | The state of the task
task_id | str | The Id of the task
state_description | str | The state description of the task
annotations | list | List of dictionaries representing the annotations
email | str | The email associated with the user
messages | list | The messages associated with the user
role | str | The role associated with the user
elapsed_time | str | The elapsed time of the annotation
date | str | The date of the annotation
content | dict | Dictionary containing the annotation details and metadata
review | dict | The review details
rate | str | The rating given in the review
reviewerId | str | The Id of the reviewer
audioSource | str | The presigned URL or S3 path of the task
streams | dict | Dictionary of the streams within the audio file, each containing specific information
Transcription | list | The stream containing text transcribed from the audio
start | float | The starting timestamp (in seconds) of the transcribed text segment
end | float | The ending timestamp (in seconds) of the transcribed text segment
confidence | int | Indicates the confidence level
text | str | The transcribed text for the corresponding segment
Language Segmentation | list | The stream containing the segmentations in the audio
start | float | The starting timestamp of the segment
end | float | The ending timestamp of the segment
confidence | int | Indicates the confidence level
tag | str | The tag for the corresponding segment
mediaAttributes | dict | The media attributes associated with the task
jobStart | int | The start time of the annotation
sessionTime | int | The session time of the annotation
elapsedTime | int | The elapsed time of the annotation
tsSeconds | bool | Flag indicating that timestamps are expressed in seconds
updateTime | int | The update time of the annotation
lastUpdate | int | The last update time
Model Integration: Request Payloads and Responses¶
1. NER Labeling¶
An NER (Named Entity Recognition) labeling model automatically identifies and classifies named entities (such as names of people, organizations, and locations) in text.
Request payload
{
"text": {
"1": "This is a text by Michael Smith",
"2": "A paper from Oxford University"
}
}
- table:NER Labeling Request Payload:
Field Names | Type | Description
---|---|---
text | dict | Dictionary containing the text. Each key is a page number, and the corresponding value is that page's text
1 | str | The text of page 1
2 | str | The text of page 2
Response
{
"entities": {
"1": [
{
"type": "Person Entity",
"text": "Michael Smith",
"range": [
18,
31
]
}
],
"2": [
{
"type": "Organization Entity",
"text": "Oxford University",
"range": [
13,
30
]
}
]
}
}
- table:NER Labeling Response:
Field Names | Type | Description
---|---|---
entities | dict | A dictionary of entity annotations, keyed by page number
1 | list | The entities identified on page 1
type | str | The type of the label
text | str | The selected text for labeling
range | list | The start and end offsets of the selected text within the plain text
2 | list | The entities identified on page 2
type | str | The type of the label
text | str | The selected text for labeling
range | list | The start and end offsets of the selected text within the plain text
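The `range` offsets in the response index directly into the page text from the request; in the sample payloads the offsets behave as half-open `[start, end)` slices. A minimal consistency check, using only the example data above:

```python
# Sample request text and response entities, taken from the examples above.
request_text = {
    "1": "This is a text by Michael Smith",
    "2": "A paper from Oxford University",
}
response_entities = {
    "1": [{"type": "Person Entity", "text": "Michael Smith", "range": [18, 31]}],
    "2": [{"type": "Organization Entity", "text": "Oxford University", "range": [13, 30]}],
}

# For each page, slicing the page text with the entity's [start, end)
# offsets reproduces the entity's "text" field.
for page, entities in response_entities.items():
    for entity in entities:
        start, end = entity["range"]
        assert request_text[page][start:end] == entity["text"]

print("offsets match the entity text")
```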
2. OCR (Tesseract) Model¶
An OCR (Optical Character Recognition) model extracts text from images, such as scanned documents. It processes the visual content to recognize characters and converts them into editable and searchable text.
Request payload
{
"source": "https://sandbox.tensoract.com/testfiles/test_text.png"
}
- table:OCR Model Request Payload:
Field Names | Type | Description
---|---|---
source | str | The URL pointing to the source image (e.g., a scanned document) for OCR extraction
Response
{
"pages": [
{
"page": 1,
"dimentions": {
"width": 2484,
"height": 3509
},
"words": [
{
"box": [
0.12198067632850242,
0.09204901681390709,
0.20128824476650564,
0.11456255343402678
],
"text": "This"
},
{
"box": [
0.2177938808373591,
0.09204901681390709,
0.24476650563607086,
0.11456255343402678
],
"text": "is"
},
{
"box": [
0.26006441223832527,
0.09803362781419207,
0.2805958132045089,
0.11456255343402678
],
"text": "a"
},
{
"box": [
0.29589371980676327,
0.09290396124251923,
0.3647342995169082,
0.11456255343402678
],
"text": "test"
},
{
"box": [
0.3788244766505636,
0.09204901681390709,
0.539049919484702,
0.11456255343402678
],
"text": "scanned"
},
{
"box": [
0.5559581320450886,
0.09204901681390709,
0.7455716586151369,
0.11456255343402678
],
"text": "document"
},
{
"box": [
0.12198067632850242,
0.14990025648332858,
0.17149758454106281,
0.15958962667426618
],
"text": "Lorem"
},
{
"box": [
0.17914653784219,
0.14990025648332858,
0.22584541062801933,
0.16186947848389854
],
"text": "ipsum"
},
{
"box": [
0.23309178743961353,
0.14990025648332858,
0.27375201288244766,
0.15958962667426618
],
"text": "dolor"
},
{
"box": [
0.27938808373590984,
0.14990025648332858,
0.2966988727858293,
0.15958962667426618
],
"text": "sit"
},
{
"box": [
0.3027375201288245,
0.15018523795953262,
0.3466183574879227,
0.1610145340552864
],
"text": "amet,"
},
{
"box": [
0.3538647342995169,
0.15018523795953262,
0.44887278582930756,
0.15958962667426618
],
"text": "consectetur"
},
{
"box": [
0.45450885668276975,
0.14990025648332858,
0.534219001610306,
0.16215445996010258
],
"text": "adipiscing"
},
{
"box": [
0.5414653784219001,
0.14990025648332858,
0.5680354267310789,
0.1610145340552864
],
"text": "elit,"
},
{
"box": [
0.5752818035426731,
0.14990025648332858,
0.6030595813204509,
0.15958962667426618
],
"text": "sed"
},
{
"box": [
0.6103059581320451,
0.14990025648332858,
0.6292270531400966,
0.15958962667426618
],
"text": "do"
},
{
"box": [
0.6356682769726248,
0.14990025648332858,
0.7033011272141707,
0.15958962667426618
],
"text": "eiusmod"
},
{
"box": [
0.7101449275362319,
0.15018523795953262,
0.7677133655394525,
0.16186947848389854
],
"text": "tempor"
},
{
"box": [
0.7737520128824477,
0.14990025648332858,
0.8498389694041868,
0.15958962667426618
],
"text": "incididunt"
},
{
"box": [
0.856682769726248,
0.15018523795953262,
0.8703703703703703,
0.15958962667426618
],
"text": "ut"
},
{
"box": [
0.12198067632850242,
0.1669991450555714,
0.1710950080515298,
0.17668851524650897
],
"text": "labore"
},
{
"box": [
0.177938808373591,
0.16728412653177543,
0.19202898550724637,
0.17668851524650897
],
"text": "et"
},
{
"box": [
0.19806763285024154,
0.1669991450555714,
0.24798711755233493,
0.17668851524650897
],
"text": "dolore"
},
{
"box": [
0.25523349436392917,
0.1695639783414078,
0.30917874396135264,
0.1792533485323454
],
"text": "magna"
},
{
"box": [
0.31602254428341386,
0.1669991450555714,
0.36835748792270534,
0.17896836705614136
],
"text": "aliqua."
},
{
"box": [
0.37640901771336555,
0.1669991450555714,
0.4396135265700483,
0.17668851524650897
],
"text": "Porttitor"
},
{
"box": [
0.44565217391304346,
0.1669991450555714,
0.5092592592592593,
0.17668851524650897
],
"text": "rhoncus"
},
{
"box": [
0.5161030595813204,
0.1669991450555714,
0.5563607085346216,
0.17668851524650897
],
"text": "dolor"
},
{
"box": [
0.5623993558776168,
0.1695639783414078,
0.606682769726248,
0.17896836705614136
],
"text": "purus"
},
{
"box": [
0.6139291465378421,
0.1695639783414078,
0.6421095008051529,
0.17668851524650897
],
"text": "non"
},
{
"box": [
0.6493558776167472,
0.1669991450555714,
0.6920289855072463,
0.17668851524650897
],
"text": "enim."
},
{
"box": [
0.7000805152979066,
0.1669991450555714,
0.7801932367149759,
0.17668851524650897
],
"text": "Habitasse"
},
{
"box": [
0.7870370370370371,
0.1669991450555714,
0.8349436392914654,
0.17896836705614136
],
"text": "platea"
},
{
"box": [
0.1215780998389694,
0.18438301510401825,
0.1888083735909823,
0.19407238529495582
],
"text": "dictumst"
},
{
"box": [
0.19524959742351047,
0.18438301510401825,
0.2584541062801932,
0.1963522371045882
],
"text": "quisque"
},
{
"box": [
0.2648953301127214,
0.18438301510401825,
0.32085346215780997,
0.19663721858079225
],
"text": "sagittis"
},
{
"box": [
0.3276972624798712,
0.18694784838985465,
0.3719806763285024,
0.1963522371045882
],
"text": "purus"
},
{
"box": [
0.3784219001610306,
0.18438301510401825,
0.3961352657004831,
0.19407238529495582
],
"text": "sit"
},
{
"box": [
0.40217391304347827,
0.1846679965802223,
0.4420289855072464,
0.19407238529495582
],
"text": "amet"
},
{
"box": [
0.44806763285024154,
0.18438301510401825,
0.5116747181964574,
0.1963522371045882
],
"text": "volutpat"
},
{
"box": [
0.5181159420289855,
0.1846679965802223,
0.605877616747182,
0.1963522371045882
],
"text": "consequat."
},
{
"box": [
0.6139291465378421,
0.18438301510401825,
0.6501610305958132,
0.19663721858079225
],
"text": "Eget"
},
{
"box": [
0.6561996779388084,
0.1846679965802223,
0.6799516908212561,
0.19407238529495582
],
"text": "est"
},
{
"box": [
0.6863929146537843,
0.18438301510401825,
0.7302737520128825,
0.19407238529495582
],
"text": "lorem"
},
{
"box": [
0.7379227053140096,
0.18438301510401825,
0.784621578099839,
0.1963522371045882
],
"text": "ipsum"
},
{
"box": [
0.7914653784219001,
0.18438301510401825,
0.8321256038647343,
0.19407238529495582
],
"text": "dolor"
},
{
"box": [
0.8377616747181964,
0.18438301510401825,
0.855072463768116,
0.19407238529495582
],
"text": "sit"
},
{
"box": [
0.1215780998389694,
0.20205186662866914,
0.16143317230273752,
0.21145625534340268
],
"text": "amet"
},
{
"box": [
0.16747181964573268,
0.20205186662866914,
0.26247987117552335,
0.21145625534340268
],
"text": "consectetur"
},
{
"box": [
0.26811594202898553,
0.2017668851524651,
0.3526570048309179,
0.2140210886292391
],
"text": "adipiscing."
},
{
"box": [
0.3603059581320451,
0.2017668851524651,
0.4355877616747182,
0.21145625534340268
],
"text": "Senectus"
},
{
"box": [
0.4420289855072464,
0.20205186662866914,
0.45652173913043476,
0.21145625534340268
],
"text": "et"
},
{
"box": [
0.46296296296296297,
0.20205186662866914,
0.5060386473429952,
0.21145625534340268
],
"text": "netus"
},
{
"box": [
0.5128824476650563,
0.20205186662866914,
0.5273752012882448,
0.21145625534340268
],
"text": "et"
},
{
"box": [
0.533816425120773,
0.2017668851524651,
0.6219806763285024,
0.21145625534340268
],
"text": "malesuada"
},
{
"box": [
0.6280193236714976,
0.2017668851524651,
0.677536231884058,
0.21145625534340268
],
"text": "fames"
},
{
"box": [
0.6839774557165862,
0.2043317184383015,
0.7024959742351047,
0.21145625534340268
],
"text": "ac"
},
{
"box": [
0.7081320450885669,
0.2017668851524651,
0.7564412238325282,
0.21373610715303507
],
"text": "turpis."
},
{
"box": [
0.7640901771336553,
0.2017668851524651,
0.8268921095008052,
0.21145625534340268
],
"text": "Gravida"
},
{
"box": [
0.8337359098228664,
0.2043317184383015,
0.8663446054750402,
0.21145625534340268
],
"text": "cum"
},
{
"box": [
0.1215780998389694,
0.2188657737247079,
0.16586151368760063,
0.2285551439156455
],
"text": "sociis"
},
{
"box": [
0.17310789049919484,
0.21915075520091193,
0.23792270531400966,
0.23083499572527785
],
"text": "natoque"
},
{
"box": [
0.24476650563607086,
0.2188657737247079,
0.32286634460547503,
0.23083499572527785
],
"text": "penatibus"
},
{
"box": [
0.32930756843800324,
0.21915075520091193,
0.34782608695652173,
0.2285551439156455
],
"text": "et."
},
{
"box": [
0.35426731078904994,
0.2188657737247079,
0.41626409017713367,
0.2285551439156455
],
"text": "Aenean"
},
{
"box": [
0.4243156199677939,
0.2188657737247079,
0.49074074074074076,
0.23083499572527785
],
"text": "pharetra"
},
{
"box": [
0.49798711755233493,
0.22143060701054432,
0.5523349436392915,
0.2311199772014819
],
"text": "magna"
},
{
"box": [
0.5591787439613527,
0.22143060701054432,
0.5772946859903382,
0.2285551439156455
],
"text": "ac"
},
{
"box": [
0.5841384863123994,
0.2188657737247079,
0.6481481481481481,
0.23083499572527785
],
"text": "placerat"
},
{
"box": [
0.6541867954911433,
0.2188657737247079,
0.7451690821256038,
0.2285551439156455
],
"text": "vestibulum."
},
{
"box": [
0.7536231884057971,
0.2188657737247079,
0.8132045088566827,
0.2311199772014819
],
"text": "Feugiat"
},
{
"box": [
0.8192431561996779,
0.2188657737247079,
0.8470209339774557,
0.2285551439156455
],
"text": "sed"
},
{
"box": [
0.12198067632850242,
0.23624964377315474,
0.1678743961352657,
0.24593901396409235
],
"text": "lectus"
},
{
"box": [
0.17431561996779388,
0.23624964377315474,
0.2608695652173913,
0.24593901396409235
],
"text": "vestibulum"
},
{
"box": [
0.26851851851851855,
0.23624964377315474,
0.31561996779388085,
0.24593901396409235
],
"text": "mattis"
},
{
"box": [
0.322463768115942,
0.23624964377315474,
0.42028985507246375,
0.2482188657737247
],
"text": "ullamcorper."
}
]
}
]
}
- table:OCR Model Response Payload:
Field Names | Type | Description
---|---|---
pages | list | A list of page objects, each representing a page in the scanned document.
page | int | The page number within the document.
dimentions | dict | A dictionary containing the dimensions (width and height) of the page in pixels.
width | int | The width of the page in pixels.
height | int | The height of the page in pixels.
words | list | A list of word objects, each representing a word found on the page.
box | list[float] | The bounding-box coordinates of the OCRed word, normalized to the page dimensions.
text | str | The text content of the word.
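The `box` values in the sample response are fractions of the page size, so converting a word's box to pixel coordinates only needs the page's `dimentions`. A small sketch (the field names and the sample box are taken from the response above):

```python
def box_to_pixels(box, width, height):
    """Convert a normalized [x0, y0, x1, y1] box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))

# The box for the word "This" from the sample response, on a 2484 x 3509 page.
box = [0.12198067632850242, 0.09204901681390709,
       0.20128824476650564, 0.11456255343402678]
print(box_to_pixels(box, 2484, 3509))
# -> (303, 323, 500, 402)
```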